
Defect Detection and OCR on Steel


Jakob Grönlund and Angelina Johansson

LiTH-ISY-EX--19/5220--SE

Supervisors: Karl Holmquist, ISY, Linköpings universitet
             Alexander Poole, SICK IVP Linköping

Examiner: Per-Erik Forssén, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden

Abstract

In large scale productions of metal sheets, it is important to maintain an effective way to continuously inspect the products passing through the production line. The inspection mainly consists of detection of defects and tracking of ID numbers. This thesis investigates the possibilities to create an automatic inspection system by evaluating different machine learning algorithms for defect detection and optical character recognition (OCR) on metal sheet data. Digit recognition and defect detection are solved separately: the former compares the object detection algorithm Faster R-CNN with the classical machine learning algorithm NCGF, while the latter is based on unsupervised learning using a convolutional autoencoder (CAE).

The advantage of the feature extraction method is that it only needs a couple of samples to be able to classify new digits, which is desirable in this case due to the lack of training data. Faster R-CNN, on the other hand, needs much more training data to solve the same problem. NCGF does however fail to classify noisy images and images of metal sheets containing an alloy, while Faster R-CNN seems to be a more promising solution with a final mean average precision of 98.59%. The CAE approach for defect detection showed promising results. The algorithm learned how to reconstruct only images without defects, resulting in reconstruction errors whenever a defect appears. The errors are initially classified using a basic thresholding approach, resulting in 98.9% accuracy. However, this classifier requires supervised learning, which is why the clustering algorithm Gaussian mixture model (GMM) is investigated as well. The results show that it should be possible to use GMM, but that it requires a lot of GPU resources in an end-to-end solution with a CAE.

Acknowledgments

To begin with, we would like to express our many thanks to Alexander Poole, our supervisor at SICK, who always supported us at setbacks as well as celebrated our successes. We would also like to thank the whole deep learning initiative team, who lent us their precious computing servers. A special thanks goes to John Stynsberg, who not only patiently helped us with the training framework, but also gave us valuable input regarding the thesis and has been a great sounding board in general deep learning matters. We would also like to show our appreciation to Wayne Way for sharing his expertise within the field.

We also want to thank our academic supervisor Karl Holmquist for guiding us through some quite difficult obstacles, and for all the valuable feedback on the thesis. Our final thanks goes to our examiner Per-Erik Forssén for being a great advisor from the very beginning of the project and for making this thesis work possible.

Linköping, June 2019
Jakob Grönlund and Angelina Johansson


Contents

1 Introduction
   1.1 Background
   1.2 Purpose
   1.3 Problem formulation
   1.4 Strategy
   1.5 Delimitation
   1.6 Attribution

2 Theory/Related work
   2.1 Preprocessing
   2.2 Convolutional Neural Networks
   2.3 Optimizers and loss functions
   2.4 OCR
      2.4.1 Normalization-Cooperated Gradient Feature Extraction
      2.4.2 OCR with object detection algorithms
   2.5 Defect detection
      2.5.1 Autoencoders
      2.5.2 Convolutional Autoencoders
      2.5.3 Anomaly detection using CAE
      2.5.4 Clustering with Gaussian Mixture Model
   2.6 Evaluation metrics
      2.6.1 Accuracy
      2.6.2 Fβ-score
      2.6.3 Intersection over Union
      2.6.4 Mean Average Precision

3 Method
   3.1 Data collection
   3.2 Data labeling
   3.3 OCR
      3.3.1 Faster R-CNN
      3.3.2 Feature extraction method
   3.4 Defect detection
   3.5 Classification of anomalies

4 Results
   4.1 OCR
      4.1.1 Faster R-CNN
      4.1.2 Feature extraction method
   4.2 Defect detection
      4.2.1 Convolutional Autoencoder
      4.2.2 Convolutional Autoencoder with GMM

5 Discussion
   5.1 Result
      5.1.1 Results for Faster R-CNN
      5.1.2 Results for feature extraction method
      5.1.3 Results for CAE
      5.1.4 Results for Convolutional Autoencoder with GMM
   5.2 Method
      5.2.1 Data collection
      5.2.2 OCR
      5.2.3 CAE and CAEGMM
      5.2.4 Ethical and societal aspects

6 Conclusion
   6.1 Answers to the research questions
   6.2 Future work


1 Introduction

1.1 Background

In large scale productions of metal sheets, it is important to maintain an effective way to continuously inspect the products passing through the production line. If a metal sheet has large defects, it must be removed and disposed of or recycled, since it does not meet the set requirements. Moreover, to link damaged products to a specific customer order, each metal sheet is given an ID number made of small dimples with a depth of 0.1 millimeter. The sheets are currently inspected manually, which is time consuming, expensive and in some cases uncertain, since the defects and ID numbers are difficult to find in the production environment. However, recent advances in computer vision provide great possibilities to solve this with deep learning and real-time data. With deep learning, it is possible to train deep neural networks (DNNs) to perform different types of complex tasks, such as anomaly detection and Optical Character Recognition (OCR). They are therefore highly suited for automatic detection and classification systems that constantly analyze images of the metal sheets in the production line. This thesis describes how deep learning can be used to automatically detect anomalies and read ID numbers on metal sheets. These two problems are, however, solved independently of each other.

1.2 Purpose

The purpose is to find a good strategy for automatic digit recognition of dotted ID numbers on metal sheets and to investigate whether unsupervised anomaly detection has the potential to distinguish defects from non-defects.


1.3 Problem formulation

The main problems are to detect and classify ID numbers and defects in intensity images and corresponding height values of metal sheets. These problems can be further divided into the following subproblems.

• The data is captured in a noisy environment where the metal sheets might vibrate and can be slightly rotated. Can these issues be handled using image processing methods, or are stricter constraints needed when sampling data?

• Are depth values and grayscale images sufficient for defect detection and digit classification?

• What is a good strategy for detection of anomalies in metal sheets?

• An anomaly is not necessarily a defect; how can anomalies be classified as defects or non-defects?

• If the anomalies form an ID-number, is it possible to determine its position and create a bounding box of the area?

• Given a limited area of the image containing digits with a specific font and size, how can the digits be read and classified?

1.4 Strategy

The character recognition part will be evaluated with two methods: the object detection algorithm Faster R-CNN and a classic machine learning algorithm called Normalization-Cooperated Gradient Feature (NCGF) extraction. Both methods are trained on images of metal sheets that contain dotted digits, collected at a steel production company. A small portion of the data set will be exempted from the training data and only used for evaluation of the trained models. The major reason why both methods will be evaluated is that they require different amounts of data. NCGF only needs a couple of samples to be able to classify characters, while Faster R-CNN, like any neural network, requires much more data. NCGF is a method for classifying characters and cannot detect their position in an image. Two approaches will therefore be evaluated: one where Faster R-CNN has been trained on 10 classes ('0' to '9') and one where it has been trained on a single class 'digit'. The output from the network trained on a single class will then be classified by NCGF.

The anomaly detection, on the other hand, is based on convolutional autoencoders (CAE) with unsupervised training. Unsupervised anomaly detection is a hot topic in many areas, such as cybersecurity, complex system management and medical care [32]. It is however a challenging task to create a robust implementation when the dimensionality of the problem is high. It is thus desirable to reduce the dimensionality before estimating whether the data contains defects or not [32]. This is what the autoencoder (AE) is for: it learns how to reconstruct data in normal conditions, resulting in a large reconstruction error when anomalies appear. Using the reconstruction error as the only classification feature reduces the problem to one dimension. However, since the defect classification most certainly is not as easy as defining an error threshold, a Deep Autoencoding Gaussian mixture model (DAEGMM) will be evaluated as well. DAEGMM increases the dimensionality of the classification problem by adding features derived from the convolutional autoencoder, making it easier to separate anomalies that are defects from those that are not.

1.5 Delimitation

A delimitation in this project is that only dimples are counted as defects, and the goal is to know whether there is a defect in the image or not, not where in the image it is. The type and size of the defects will not be considered; they will only be classified as defects or non-defects.

The two sub parts, digit recognition and anomaly detection, will not be integrated into a complete system. This work will only show how well the proposed methods perform separately. ID numbers that are split between two images will be ignored in this study. It is however an issue that needs to be considered when implementing data acquisition and data preprocessing in the full system.

1.6 Attribution

Angelina Johansson's main responsibilities have been implementing and writing about digit recognition, while Jakob Grönlund has mainly been responsible for implementing and writing about defect detection. This means that Angelina wrote sections 2.4, 3.3, 4.1, 5.1.1, 5.1.2 and 5.2.2. Jakob wrote sections 2.5, 3.4, 4.2, 5.1.3, 5.1.4 and 5.2.3. The remaining sections not mentioned here were written by both group members.


2 Theory/Related work

This chapter describes the underlying theory for this master's thesis. Section 2.1 mentions some preprocessing methods that are used to alter the data. Sections 2.2 and 2.3 describe the core fundamentals of convolutional neural networks and how the weights of the networks are optimized. Section 2.4 explains the theory behind the two methods Normalization-Cooperated Gradient Feature extraction and Faster R-CNN, and section 2.5 describes how anomalies in the sheets can be found.

2.1 Preprocessing

Images usually need to be preprocessed to reduce the complexity and increase the accuracy of subsequent algorithms. In this case, the data contains oscillations that may complicate calculations. One possible way to remove oscillations in range data is to calculate the average range of the data and then subtract the average value from all pixels.
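A minimal NumPy sketch of this operation (the function name is illustrative):

```python
import numpy as np

def remove_oscillations(range_image: np.ndarray) -> np.ndarray:
    # Subtract the average range from every pixel, flattening slow
    # height oscillations while keeping local deviations such as dimples.
    return range_image - range_image.mean()
```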

A way to enlarge the edges and minimize gaps in features of a binary image is to apply morphological dilation. This operation takes two sets, A and B, in the 2D integer space Z^2. B is first reflected about its origin, and the reflected set translated by z is denoted (\hat{B})_z. The dilation of A by B, denoted by A ⊕ B, is the set of all displacements z such that \hat{B} and A overlap by at least one element, see equation (2.1). Here ∩ refers to the intersection between two sets and ∅ denotes the empty set [26].

A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\}    (2.1)

Opening, denoted as A ◦ B, is another morphological process that removes small objects in an image while preserving the shape and size of bigger objects. It is defined as

A \circ B = (A \ominus B) \oplus B.    (2.2)

Opening A by B is the erosion of A by B, A \ominus B, followed by a dilation of the result by B. Erosion is defined by equation (2.3): the set of all points z such that B translated by z is contained in A. If not all points of the translated B are contained in A, the element is eroded [26]. The effect of dilation and opening can be seen in figure 2.1.

A \ominus B = \{\, z \mid (B)_z \subseteq A \,\}    (2.3)

Figure 2.1: The result of applying morphology on a binary edge-detected image. (a) Edge image with no applied morphology. (b) Edge image with opening applied. (c) Edge image with dilation applied. (d) Both dilation and opening applied.
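A minimal OpenCV sketch of these operations as they are later used in section 3.3.2; the input file name and the Canny thresholds are placeholders:

```python
import cv2
import numpy as np

image = cv2.imread("sheet.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
edges = cv2.Canny(image, 50, 150)                      # binary edge image

kernel = np.ones((5, 5), np.uint8)                     # 5 x 5 kernel, as in section 3.3.2
dilated = cv2.dilate(edges, kernel)                    # A ⊕ B: connects the dots of a digit
opened = cv2.morphologyEx(dilated, cv2.MORPH_OPEN, kernel)  # (A ⊖ B) ⊕ B: removes small noise
```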

2.2 Convolutional Neural Networks

Convolutional Neural Networks, or CNNs, are commonly used in different deep learning algorithms, and will appear in some of the methods described below. This section will thus give a basic understanding of how and when to use them. CNNs are designed to process data with locally dependent information, for example 2D images, audio spectrograms or 3D volumes [15]. CNNs have the ability to capture properties of natural signals by applying local connections, shared weights, pooling and multiple layers. The structure of different CNN architectures can differ greatly, since there are many ways to implement and combine convolutional layers [15].

Figure 2.2: Example of a convolutional neural network.

In a convolutional layer, the input feature map is filtered with a set of weighted kernels that slide over the input and produce a new set of feature maps. The number of output feature maps is equal to the number of filter kernels. The result is then passed through a non-linear activation function to make sure that the feature map values are within a specific interval. The weights can then be trained by back-propagating gradients in the same way as in a regular deep neural network, resulting in a set of filters that extract different features from the input image [15]. In many cases, it is beneficial to downsample the feature maps, and a common way to do that is by using a pooling layer. The level of downsampling is specified by a so-called stride, which is the number of pixels to skip when sliding over the image. A pooling layer also comes with different options for feature extraction, where max-pooling is commonly used. Max-pooling slides a filter kernel over the input data, with the given stride as step length, and outputs the maximum value for each subregion. Figure 2.2 above illustrates an example of a CNN composed of one convolutional layer and one pooling layer. Using pooling layers in the CNN architecture adds invariance to distortions and small shifts [15].
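A minimal tf.keras sketch of the CNN in figure 2.2, with one convolutional and one max-pooling layer; the input shape and filter count are illustrative, not values from the thesis:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu",
                           input_shape=(64, 64, 1)),      # 16 filter kernels -> 16 feature maps
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),  # downsample with stride 2
])
```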

2.3 Optimizers and loss functions

A loss function is a function that evaluates a candidate solution, where the goal is to find the solution that minimizes the loss function. Two common loss functions are ℓ1 loss and ℓ2 loss. ℓ1 loss is the sum of absolute differences between the target and the predicted values, and ℓ2 loss is the sum of squared distances between the target and the predicted values. Smooth ℓ1 loss is another loss function, often used in bounding box regression. It is a combination of ℓ1 and ℓ2 loss, which makes it less sensitive to outliers [9]: it behaves as ℓ1 when the absolute value of the loss is high and as ℓ2 otherwise. Finding the solution that minimizes the loss function requires some kind of optimizer. An optimizer tries to find the optimal step for updating the weights in the neural network, thus navigating the network towards the minimum value of the loss function. A commonly used optimization method is the so-called Adam. Adam is a method for efficient stochastic optimization with a low memory requirement. It computes individual adaptive learning rates for all weights in the network using estimates of previous gradients. This can be seen as keeping the momentum from earlier steps, and is one of the reasons why the Adam optimizer converges relatively fast [13].
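A minimal NumPy sketch of smooth ℓ1 loss in the form used by Fast/Faster R-CNN [9]; the switch point |x| = 1 follows that paper:

```python
import numpy as np

def smooth_l1(pred: np.ndarray, target: np.ndarray) -> float:
    x = np.abs(pred - target)
    # Quadratic (l2-like) for small errors, linear (l1-like) for large ones,
    # which makes the loss less sensitive to outliers.
    per_element = np.where(x < 1.0, 0.5 * x**2, x - 0.5)
    return float(per_element.sum())
```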

2.4 OCR

Optical Character Recognition is the conversion of typed, handwritten or printed characters into a machine-encoded format [25]. Below follow different approaches to solving OCR.

2.4.1 Normalization-Cooperated Gradient Feature Extraction

There are several papers that evaluate normalization and feature extraction on Chinese characters. Character recognition systems generally consist of three major components: character normalization, feature extraction and classification [19]. The normalization transforms images to the same dimensionality and reduces the within-class shape variation [18]. One of the feature extraction techniques is NCGF, which together with the normalization method 'Line Density Projection Interpolation' (LDPI) has been shown to yield the lowest error rate on the CASIA database containing binary images of handwritten Chinese characters [16]. 'Fisher Linear Discriminant Analysis' (FLDA), described below, then takes the output from NCGF and reduces the dimensionality of the features, and the reduced vector is finally classified by the 'Modified Quadratic Discriminant Function' (MQDF). A further explanation of how these methods cooperate can be found in section 3.3.2.

Line Density Projection Interpolation

Pseudo-two-dimensional (P2D) normalization techniques have been proven to improve either the computation efficiency or the recognition accuracy of character recognition. These normalization methods form 2D coordinate functions by combining 1D coordinate functions row-wise or column-wise [16]. One of these techniques is LDPI, as described in [17]. It partitions the line density map d(x, y) into three soft strips, see figure 2.3, in both the vertical and horizontal directions. This is done by using weight functions in both directions; see equation (2.4) for the partitioning of the horizontal density map d_x(x, y) using weight functions along the y axis.

Figure 2.3: Three strips of the character image. For visual convenience, the character image is displayed instead of the line density map.

d_{xi}(x, y) = w_i(y) \, d_x(x, y), \quad i = 1, 2, 3    (2.4)

The corresponding weights, w_i(y), can be seen in equation (2.5), where y_c is the vertical centroid and H_1 is the height of the input image. The vertical density map d_y(x, y) is computed similarly.

w_1(y) = \frac{y_c - y}{y_c}, \quad y < y_c
w_2(y) = 1 - w_1(y), \quad y < y_c
w_2(y) = 1 - w_3(y), \quad y \geq y_c
w_3(y) = \frac{y - y_c}{H_1 - y_c}, \quad y \geq y_c    (2.5)

The density values are then projected onto the x and y axes respectively; see equation (2.6) for the projection onto the x axis, p_{xi}(x). These projections are normalized to unity sum, h_x(x) in equation (2.7), accumulated to give 1D coordinate mapping functions, x' in equation (2.8), and finally interpolated to generate 2D coordinate functions by equation (2.9). The vertical density functions d_{yi}(x, y), i = 1, 2, 3, are equalized and interpolated similarly [18]. LDPI will not only normalize the image size, but also the shape of the character.

p_{xi}(x) = \sum_y d_{xi}(x, y), \quad i = 1, 2, 3    (2.6)

h_x(x) = \frac{p_x(x)}{\sum_x p_x(x)}    (2.7)

x' = W_2 \sum_{u=0}^{x} h_x(u)    (2.8)


x'(x, y) =
\begin{cases}
w_1(y)\, x'_1(x) + w_2(y)\, x'_2(x), & y < y_c \\
w_3(y)\, x'_3(x) + w_2(y)\, x'_2(x), & y \geq y_c
\end{cases}    (2.9)

Normalization-Cooperated Gradient Feature

When the input image has been normalized, it is time to apply a feature extraction model such as NCGF, which is proposed in [16]. This is a directional decomposition method that is applied directly on the original image, and the gradient elements are then mapped to direction maps. There are studies where similar methods are applied on a normalized image, but according to [19] and [16], NCGF outperforms the other methods both in accuracy and performance on Chinese and Japanese handwritten characters.

By applying NCGF, the normalized gradient can be computed from the original image without explicitly generating a normalized image. The method views both the original and the normalized image as functions in continuous 2D space and associates them by coordinate normalization (which in this case is performed by LDPI). The original image can be seen as many unit squares, where each unit square is mapped to a quadrilateral (polygon with four edges) in the normalized plane. By assigning the character image to the pixels of the normalized image that overlap with the corresponding quadrilateral, the normalized image is obtained.

The gradient g in the original image is computed by Sobel operators for each pixel. The gradient is then assigned to eight direction planes. Each contour pixel is said to have eight chaincodes, see figure 2.4, and this decomposition has, according to [17], better recognition accuracy than using four orientations. If a gradient lies between two directions, it is decomposed into two components, as shown in figure 2.5. The gradient direction of the original image at each pixel is directly assigned to direction planes. In each direction plane, the pixels overlapping with the corresponding quadrilateral are assigned values equal to the product of the overlapping area, the gradient component length and ||g'||/||g||, where g' is the normalized gradient. This results in direction planes where the direction of the original plane and the magnitude of the normalized gradient are saved [16].

Figure 2.4: The eight directions of chaincodes.

Figure 2.5: The direction of a gradient is decomposed into two of the standard chaincode directions.

Fisher Linear Discriminant Analysis

Data with high dimensionality can lead to high computational cost and lower performance. FLDA is a method that reduces the dimensionality of the data. The main idea behind it is to maximize a function that gives a large separation between classes while also minimizing the variance within each class, thereby minimizing the class overlap. Equation (2.10) shows the function that is maximized, where W is a weight vector that represents the desired transformation. The generalization of the within-class covariance for K classes is given by S_W in (2.11), which is the sum of the per-class covariance matrices S_k in (2.12). S_B in (2.13) is the covariance between classes and is calculated by taking the outer product of the difference between the local class mean m_k and the global mean m, times the total number of data points N_k in class k. W is then given by the eigenvectors belonging to the D' largest eigenvalues in (2.14) [3].

J(W) = \frac{W^T S_B W}{W^T S_W W}    (2.10)

S_W = \sum_{k=1}^{K} S_k    (2.11)

S_k = \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^T    (2.12)

S_B = \sum_{k=1}^{K} N_k (m_k - m)(m_k - m)^T    (2.13)

W = \max_{D'}\left( \mathrm{eig}(S_W^{-1} S_B) \right)    (2.14)
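A minimal NumPy sketch of equations (2.10)–(2.14); the function and argument names are illustrative:

```python
import numpy as np

def flda(X: np.ndarray, y: np.ndarray, d_out: int) -> np.ndarray:
    """Project X (N x D) to d_out dimensions with Fisher LDA."""
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)             # S_k, equation (2.12), summed as in (2.11)
        S_B += len(Xk) * np.outer(mk - m, mk - m)  # equation (2.13)
    # Eigenvectors of S_W^{-1} S_B with the D' largest eigenvalues, equation (2.14)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:d_out]].real
    return X @ W                                   # reduced features
```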


Modified Quadratic Discriminant Function

A classifier's main function is to map input data to a class. When working with feature extraction of handwritten characters, the Modified Quadratic Discriminant Function (MQDF) is one of the most commonly used classifiers [16].

g_2(x, \omega_i) = \sum_{j=1}^{k} \frac{1}{\lambda_{ij}} \left[ (x - \mu_i)^T \phi_{ij} \right]^2 + \frac{1}{\delta_i} \left( \| x - \mu_i \|^2 - \sum_{j=1}^{k} \left[ (x - \mu_i)^T \phi_{ij} \right]^2 \right) + \sum_{j=1}^{k} \log \lambda_{ij} + (d - k) \log \delta_i    (2.15)

MQDF is a nonlinear classifier that is suitable for high-dimensional features and a large number of classes. It reparametrizes the covariance matrix into eigenvalues and eigenvectors; the small eigenvalues are then truncated to denoise the unstable estimation [29]. MQDF is computed by equation (2.15), where d is the dimension of the feature vector (after dimensionality reduction), denoted by x, and µ_i is the mean vector of class ω_i. λ_{ij} and φ_{ij}, j = 1, ..., d, are the eigenvalues and eigenvectors of the covariance matrix of class ω_i. The eigenvalues are sorted in decreasing order and the respective eigenvectors are sorted accordingly. The minor eigenvalues are replaced with the constant δ_i, which is proportional to the average feature variance, and k denotes the number of principal axes [17].

2.4.2 OCR with object detection algorithms

Object detection in this thesis refers to methods for detecting and classifying different objects in images. An object can for example be a specific digit, where the object detection algorithm learns to detect it and classify it as one of the possible digit classes 0 to 9.

R-CNN and Fast R-CNN

R-CNN aims to generate a bounding box around the main objects in an image. The first step of the process is to create a large number of region proposals using selective search. Selective search is a search algorithm that uses the image structure to guide the sampling process [31]. In R-CNN it is used to pairwise compare and group regions based on their similarity, such as color and texture. The regions are then warped to a square and passed through a CNN with five convolutional and two fully connected layers [10]. A more thorough explanation of the mentioned CNN is found in [14]. The classification is finally done on the CNN output using a Support Vector Machine (SVM). To tighten the bounding boxes around the objects, they are fed into a linear regression model that generates the final output of R-CNN [10].

Fast R-CNN speeds this process up with a technique called RoI pooling. It splits the input feature map into a fixed number of regions, k, where the regions are approximately of the same size. Max-pooling is then applied to each region, which contributes to a fixed size feature map for each RoI; these are then passed on to a fully connected layer. The output consists of two sibling layers, one that produces softmax probability estimates and one that outputs four values representing the coordinates of the bounding box. The architecture of Fast R-CNN is shown in figure 2.6. The difference from R-CNN is that Fast R-CNN trains the CNN, classifier and regressor in a single model instead of using different models for each part, and instead of passing each region proposal through the CNN one by one, the whole image is passed once [9].

Figure 2.6: The architecture of Fast R-CNN. It takes an entire image and a set of region proposals as input. For each object proposal, a RoI pooling layer is applied, feeding a classifier and a bounding box regressor.

Faster R-CNN

Faster R-CNN is built upon the same idea as Fast R-CNN, but introduces a region proposer that uses the feature maps from the CNN instead of a separate selective search algorithm, making the region proposal step nearly cost free. The object detection system is thus composed of two modules that share convolutional layers: one that produces the proposed regions and one that uses the proposed regions to detect and classify objects (Fast R-CNN) [28]. The different components of Faster R-CNN are illustrated in figure 2.7 below.


Figure 2.7: The architecture of Faster R-CNN. The region proposal network (RPN) and the object detection network share the same convolutional layers. This makes the technique faster than the previously mentioned R-CNN and Fast R-CNN.

The Region Proposal Network (RPN) outputs a set of rectangular boxes around proposed objects, where each box comes with a certainty score. The proposals are generated by sliding a small network over the feature map from the last shared layer of the CNN. The outputs from this mini network are then used as inputs to a fully connected regression layer and a fully connected classification layer [28]. Consequently, the fully connected layers are shared across all spatial locations. The algorithm proposes several regions simultaneously at every sliding-window location, with a maximum of k proposals per location. The regression layer thus outputs the coordinates of the boxes (center coordinates, width and height) and the classification layer outputs a probability of 'object' or 'not object' for each box [28]. The proposals are parameterized relative to k anchor boxes. An anchor box is associated with a scale and an aspect ratio, where k = scales × ratios, and is centered at the sliding window. Using different scales results in a pyramid of anchors with different ratios, which is cost-efficient in comparison to other methods [28]. This approach is translation invariant, meaning that if an object is translated, the proposal is as well, and the prediction function should manage to predict the proposal in both locations.

The loss function for training the RPN is based on a binary class label, 'object' or 'not object', assigned to each anchor box [28]. There are two cases for assigning a positive label: if the anchor has the highest IoU overlap with a ground truth box, or if it has an IoU overlap greater than 0.7 with any ground truth box. Non-positive anchors with an IoU overlap lower than 0.3 for all ground truth boxes are assigned a negative label; anchors that are neither positive nor negative do not contribute to the training [28]. The loss function is defined as

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),    (2.16)

where i is the anchor index in a mini-batch, p_i is the probability of being an object and p_i^* is the ground-truth label, which is 1 for positive anchors and 0 for negative anchors. t_i is a 4D vector representing the parameterized coordinates of the predicted bounding box and t_i^* is the ground truth. L_{cls} is the log loss over the classes and L_{reg} = R(t_i - t_i^*), where R is the smooth ℓ1 loss function. The two terms are weighted and normalized by N_{cls} and N_{reg}. Each batch in Faster R-CNN processes one image at a time; N_{cls} is the mini-batch size and N_{reg} is the number of anchor locations [28].

The bounding box regressors are applied on features of the spatial size of the feature map, where there is one regressor per anchor and all regressors have their own weights. This makes it possible to predict boxes of different sizes even though the features have the same size [28]. It is possible to use different feature extractors, but according to [28], ResNet-101 yields the highest accuracy; see [11] for the architecture of ResNet networks.

2.5 Defect detection

Anomaly detection algorithms aim to detect anything that differs from normal conditions. This section covers how this can be done using convolutional autoencoders.

2.5.1 Autoencoders

The basic concept of autoencoders is that they perform dimension reduction of the input using an encoder module, and then try to reconstruct the input again using a decoder module. This makes the target of the autoencoder the input itself. An autoencoder can thus be characterized by two primary features: an auto-associative feature, where it tries to copy the input to the output, and a bottleneck feature, where it compresses the information to a low dimensional space, a so-called latent space [2, 22]. The vanilla architecture of an autoencoder can be described as follows:

• An input layer with the size of the input data x ∈ R^d.

• One or multiple hidden layers forming a bottleneck of the system by encoding the input. The output values of the hidden units are given by f_W(x) = F(W_1 x + w_1) ≡ h, where W_1 is the weight matrix (input to latent space), w_1 is the bias vector and F is a nonlinear function that operates component-wise [5].

• A decoder that attempts to generate the output features by reconstructing the input. The output is given by g_U(h) = F(W_2 h + w_2) ≡ y, where W_2 is the weight matrix (hidden to output) and w_2 is the bias vector [5].

Since the data cannot be compressed with small errors for every x, the goal is to train the encoder to capture the main factors of variation. The training will however only teach the encoder to compress data similar to the training data [2]. This also suggests that training on a homogeneous dataset will give an autoencoder that performs with high accuracy on that specific type of data, but results in large errors for data that varies from normal conditions. In some applications, where a more generalized model is required, that might be a drawback. It is however a desirable quality in defect detection. By only training the autoencoder on data without anomalies, the resulting model will give large errors whenever an anomaly appears [30]. This is actually one of the greatest advantages of autoencoders: by only training on data that represents the normal appearance, unsupervised training is applicable and no segmentation or labeling is required [6, 30].

2.5.2 Convolutional Autoencoders

Fully connected autoencoders ignore the 2D structure of an image, which not only means that they lack contextual information, but also forces each feature to be global. Convolutional autoencoders (CAE), on the other hand, consist of convolutional layers and are thus able to capture localized features in the image [21]. The architecture of a convolutional autoencoder is intuitively similar to the conventional autoencoder, except that the weights are shared among all locations in the image [21]. As in fully connected autoencoders, the input is downsampled in an encoder and reconstructed in a decoder. The encoder consists of one or multiple pooling layers, where each layer downsamples the image with a given stride. The decoder performs the opposite operations and uses backwards strided convolution (deconvolution) to generate an output of the same size as the input [20]. A general encoder in a CAE can be defined as

f_W(x, \theta_1) \equiv h    (2.17)

where θ_1 denotes the weights and biases of the encoder. A decoder can then generally be defined as

g_U(h, \theta_2) \equiv y    (2.18)

where θ_2 denotes the weights and biases of the decoder [21]. Using only one bias per input channel or feature map makes the filters specialize on features of the whole input, decreasing the degrees of freedom [21]. Figure 2.8 below illustrates the structure of a CAE, where the image is compressed to the latent space and then reconstructed again.


Figure 2.8: CAE with a convolution and pooling layer in the encoder and a convolution and deconvolution layer in the decoder.
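A minimal tf.keras sketch of the structure in figure 2.8; the layer widths and loss are illustrative, not the architecture used later in the thesis:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(512, 600, 1))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)                       # encoder: downsample with stride 2
latent = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same",
                           activation="relu")(latent)  # decoder: deconvolution
outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

cae = tf.keras.Model(inputs, outputs)
cae.compile(optimizer="adam", loss="mse")           # l2 reconstruction loss
```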

2.5.3 Anomaly detection using CAE

A study from 2018 showed that defect detection and recognition can be performed on metallic surfaces using CAEs [30]. They implemented a cascade of two CAEs sharing the same structure, where the output of the first acts as input to the second. The final output of the architecture was a prediction map representing the defect probability for each pixel. The prediction map was subsequently passed to a dense CNN for classification [30]. This solution does however require data labeling, and since the appearance of anomalies in some cases has a high inter-class variance, a solution like this might need an extensive number of labeled defects.

Other non-metallic related studies have shown that this can be done with a slightly different approach. Instead of trying to predict whether each pixel is a defect or not, they simply taught the CAE to reconstruct data in normal conditions, expecting large reconstruction errors for data with anomalies [7]. This way, the appearance of the anomaly is no longer important and unsupervised training can be used to train the CAE. Using the reconstruction error when classifying defects reduces the dimensionality of the problem to one. However, it is sometimes impossible to separate the classes in only one dimension. In these cases, it is necessary to use additional features for classification, such as other error measurements or features derived from the latent space [32].

2.5.4 Clustering with Gaussian Mixture Model

There are various algorithms available for classifying data; some require supervised training and some do not. A Gaussian mixture model (GMM) is a method based on unsupervised training that aims to find the optimal mixture of Gaussian distributions to model the input data [32]. In other words, GMM is a probabilistic model that assumes that all data points are generated from a mixture of Gaussian distributions, and aims to cluster the points using information about the covariance structure [24]. A great advantage of GMM is its ability to cluster relatively complex distributions, in comparison to k-means, which is more restricted [32].


The parameters of a K-component GMM can be denoted θ_K = {ω_k, µ_k, Σ_k}, where ω_k is the mixture weight, µ_k the mean vector and Σ_k the covariance of Gaussian mixture component k = 1, 2, ..., K [12]. A common way to estimate these parameters, in order to find the optimal Gaussian mixture, is the Expectation-Maximization (EM) algorithm. The EM algorithm can be described as follows: given N training samples X = {x_1, x_2, ..., x_N}, the expectation step calculates the posterior probability p(k|x_n, θ_K) for each Gaussian mixture component k = 1, 2, ..., K using the Gaussian probability p(x_n|µ_k, Σ_k), which is the probability that x_n belongs to mixture component k, see equation (2.19) below [12].

p(k \mid x_n, \theta_K) = \frac{\omega_k \, p(x_n \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \omega_{k'} \, p(x_n \mid \mu_{k'}, \Sigma_{k'})}    (2.19)

Equations (2.20) to (2.22) below show the maximization step, where the parameters ω_k, µ_k and Σ_k are estimated for each mixture component [12, 32]. Here ω_{kn} denotes the posterior p(k|x_n, θ_K) from the expectation step.

\omega_k = \frac{1}{N} \sum_{n=1}^{N} p(k \mid x_n, \theta_K)    (2.20)

\mu_k = \frac{\sum_{n=1}^{N} \omega_{kn} x_n}{\sum_{n=1}^{N} \omega_{kn}}    (2.21)

\Sigma_k = \frac{\sum_{n=1}^{N} \omega_{kn} (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \omega_{kn}}    (2.22)

Even though a GMM has the ability to cluster more complex distributions, applying it directly on high dimensional data is still a very difficult task. If the dimensionality of the data is too high, any sample could be a rare event with low probability to observe. It is thus a good idea to extract low dimensional features from the data. However, since a GMM usually is learned by algorithms such as EM, it is not trivial to perform an end-to-end optimization of dimensionality reduction and density estimation that favors GMM learning. This challenge is addressed in [32], where the paper proposes an algorithm called 'Deep Autoencoding Gaussian Mixture Model' (DAEGMM). DAEGMM utilizes an estimation network that takes the low dimensional space from the AE and the reconstruction error as parameters and outputs mixture membership predictions for each sample. Figure 2.9 below shows an example of a Convolutional Autoencoding Gaussian mixture model (CAEGMM), where the difference from DAEGMM is that it uses a CAE instead of an AE.

With the predictions from the estimation network and the low dimensional representation, it is possible to estimate the GMM parameters and evaluate the sample log-likelihood/energy in one end-to-end solution [32].


Figure 2.9: Example of a CAEGMM, where z_r is the reconstruction error and z_c the low dimensional space.

The sample energy can be calculated as

E(z) = -\log \left( \sum_{k=1}^{K} \omega_k \, \frac{\exp\!\left( -\tfrac{1}{2} (z - \mu_k)^T \Sigma_k^{-1} (z - \mu_k) \right)}{\sqrt{|2\pi \Sigma_k|}} \right),    (2.23)

where ω_k is derived from the predictions of the estimation network and z is a vector containing the low dimensional representation of the data together with the reconstruction error [32]. The loss function is then given by

\mathrm{Loss} = \frac{1}{N} \sum_{i=1}^{N} L(x_i, x_i') + \frac{\lambda_1}{N} \sum_{i=1}^{N} E(z_i) + \lambda_2 P(\Sigma).    (2.24)

The loss function in (2.24) is composed of three components. L(x_i, x_i') is the reconstruction error generated by the CAE. E(z_i) denotes the sample energy, and minimizing the energy maximizes the likelihood that the sample exists in the Gaussian mixture. Finally, to avoid trivial solutions where the diagonal entries of the covariance matrices degenerate to 0, small values are penalized by P(Σ). The parameters λ_1 and λ_2 are used as weights for the sample energy and P(Σ) [32].
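A minimal NumPy sketch of the sample energy in equation (2.23); the function and argument names are illustrative:

```python
import numpy as np

def sample_energy(z, weights, means, covs):
    """Negative log-likelihood of feature vector z under the Gaussian mixture."""
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = z - mu
        mahalanobis = diff @ np.linalg.solve(cov, diff)     # (z - mu)^T Sigma^-1 (z - mu)
        norm = np.sqrt(np.linalg.det(2 * np.pi * cov))      # sqrt(|2 pi Sigma|)
        total += w * np.exp(-0.5 * mahalanobis) / norm
    return -np.log(total)
```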

2.6 Evaluation metrics

To evaluate the performance of each model, a number of evaluation techniques will be applied. The following section describes the techniques that are relevant for this thesis.


2.6.1 Accuracy

As described in [4], the performance of the text recognition model can be evaluated by calculating the accuracy, see equation (2.25). The correctly identified numbers, i.e. the true positive (TP) and true negative (TN) results, are divided by the total number of tests, which is the sum of all true positive, true negative, false positive (FP) and false negative (FN) results. This gives the percentage of digits that are correctly predicted, but does not provide information about how wrong the incorrect predictions are.

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}    (2.25)

2.6.2 Fβ-score

In defect detection, the most important part is to make sure no defects are missed. However, it is also important not to detect too many false defects, since the system then fails to optimize the process. The Fβ-score considers both precision and recall when computing the score of the test, see equation (2.28) [8, 24]. Precision is the number of true positive results divided by the sum of true positive and false positive results, see equation (2.26), whereas recall is the number of true positive results divided by the sum of true positives and false negatives, see equation (2.27). Since it is important that as few defects as possible go undetected, it can be a good idea to weight the Fβ-score towards recall. In the F2-score the β value is set to 2, which weights recall higher than precision by placing more emphasis on false negatives [8].

\mathrm{Precision} = \frac{TP}{TP + FP}    (2.26)

\mathrm{Recall} = \frac{TP}{TP + FN}    (2.27)

F_\beta = \frac{(\beta^2 + 1) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}    (2.28)
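A minimal sketch of equations (2.26)–(2.28); β = 2 matches the F2-score used for the defect classifiers in this thesis:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # beta > 1 places more emphasis on false negatives (recall)
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
```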

2.6.3 Intersection over Union

A way to evaluate bounding boxes is to use Intersection over Union (IoU) [27]. IoU measures the similarity between the predicted region, PM, and the ground-truth region, GT, by dividing their overlapping area by their union area, see equation (2.29).

\mathrm{IoU}(GT, PM) = \frac{\mathrm{Area}(GT \cap PM)}{\mathrm{Area}(GT \cup PM)}    (2.29)
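A minimal sketch of equation (2.29) for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)      # overlapping area
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union
```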


2.6.4 Mean Average Precision

Mean average precision (mAP) is an evaluation metric that can be used to evaluate Faster R-CNN [28]. If the IoU between a predicted and a ground truth box is ≥ 0.5, the prediction is rated as a true positive. If more than one detection overlaps one ground truth box, the detection with the highest IoU is kept and the others are counted as false positives. All true and false positive detections are then ordered by their confidence, i.e. the softmax output from Faster R-CNN, and precision and recall are calculated for each detection. The precision is plotted against recall, creating a precision-recall curve (PR curve); this will most probably produce a curve with a zigzag pattern, since precision falls with false positives and goes up with true positives. The zigzag pattern is then smoothed using interpolation, see figure 2.10. The area under the curve is defined as average precision (AP), see equation (2.30), where p is the interpolated precision and r is the recall. AP is computed per class and the mean of the results yields mAP [23].

AP = \int_0^1 p(r) \, dr    (2.30)

Figure 2.10: The interpolation of a precision-recall curve can be seen as a red dotted line in the figure.


3 Method

3.1 Data collection

Data was collected using SICK IVP's laser triangulation camera Ranger 3. The camera sampled both depth values and intensity with a horizontal pixel density of 51.2 pixels/cm. If the images are sampled with a higher density, more cameras are needed to cover the short side of the metal sheet, which increases the cost for the user. It is therefore preferable to use a lower pixel density, but as a consequence this leads to a lower resolution.

Two cameras and a laser were mounted above the metal sheet transportation line as shown in figure 3.1 below. To decrease vibration artefacts, the laser was placed above one of the transportation wheels, where the sheets experience minimal oscillations.

Figure 3.1: Data collection setup.

In order to only sample when a metal sheet was in view of the cameras, a photo sensor was placed under the transportation line together with a decoder that measured the size of each image. Data was then sampled for approximately 24 hours, resulting in 13574 images with the dimensions 2560 × 2000. A few metal sheets with known defects were also put on the transportation line to be sure to capture at least some defects. Some of the collected sheets had an alloy that made the surface coarse, see figure 3.2; this could potentially complicate both classification and detection of digits and anomalies.

Figure 3.2: An intensity image of a metal sheet with an alloy; these sheets are not counted as defective.

3.2 Data labeling

Data labeling for text recognition was done using an open source tool called VGG Image Annotator (VIA) [1]. The software lets the user define regions in an image and classify them. In this case, each digit in the images was annotated by drawing a box around it and classifying it as a number between 0 and 9. The result was a JSON file that stored the coordinates of every bounding box; more specifically, it stored the coordinates of the left corner together with the width and height. The JSON file was further parsed to fit the training of Faster R-CNN.

Two separate text files were created from this JSON file: one where each bounding box was tagged with a class of 0-9 and one where every bounding box had the class 'digit'.
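A minimal sketch of such a parsing step. The JSON field names follow the VIA 2.x export format and the region attribute name "digit" is an assumption; the output file format is illustrative, not the exact one used in the thesis:

```python
import json

with open("via_annotations.json") as f:
    project = json.load(f)

with open("boxes_per_digit.txt", "w") as per_digit, \
     open("boxes_single_class.txt", "w") as single:
    for image in project.values():
        for region in image["regions"]:
            shape = region["shape_attributes"]            # left corner + width, height
            x, y = shape["x"], shape["y"]
            w, h = shape["width"], shape["height"]
            label = region["region_attributes"]["digit"]  # assumed attribute name
            line = f'{image["filename"]},{x},{y},{x + w},{y + h}'
            per_digit.write(f"{line},{label}\n")          # classes 0-9
            single.write(f"{line},digit\n")               # single class 'digit'
```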


3.3 OCR

Faster R-CNN is a network that both detects and classifies objects in images. Since some classes had few training samples, see table 3.1, it was interesting to compare whether Faster R-CNN was able to classify each digit with a class of 0 to 9, or whether it performed better with the single class 'digit'. The second option resulted in 10 times more samples for the sole class and may therefore be more accurate. The bounding boxes were then classified with a feature extraction method.

A total of 519 images containing digits were saved, corresponding to 6108 samples of digits; 26 of these images were used for evaluation, corresponding to 333 digits. The same evaluation data was used both for Faster R-CNN, see section 2.4.2, and for the feature extraction method, see section 2.4.1. Table 3.1 shows the number of samples per digit that were used in the training process of Faster R-CNN. As can be seen in the table, some digits occurred considerably more often than others.

Class   Number of samples
0       883
1       694
2       871
3       882
4       577
5       508
6       269
7       303
8       491
9       321

Table 3.1: Number of samples per class in the existing data set.

3.3.1 Faster R-CNN

A Keras implementation of Faster R-CNN was found at https://github.com/you359/Keras-FasterRCNN. This code was cloned and altered to fit the data set for this thesis, where the input was intensity images of the metal sheets. The reason why intensity data was chosen was that the dimples have a characteristic look when sampled with laser triangulation, which could possibly help the network to find the digits. TensorFlow was used as backend and the architecture of ResNet-50 was applied. A pretrained model for ResNet-50, named resnet50_weights_tf_dim_ordering_tf_kernels.h5, was used to speed up the training. The images had to be rescaled to 2304 × 1800 pixels, since larger images would make the GPU run out of memory. This was done with an interpolation method called INTER_CUBIC. Optimally, no downsampling should be applied, since it removes information from the image. One possible solution to this problem would be to split an image into smaller images, but this would most likely result in broken digits, which would complicate the training. The network trained for 250 epochs. The architecture and parameters were the same for both networks. The stride and the number of RoIs were the same as in the paper, see section 2.4.2, which was 16 and 300 respectively.

The digits in the dataset were always upside down, and since their directions only differed by a few degrees, an augmentation that rotates the digits by -5, 0 or 5 degrees was implemented and applied prior to the rescaling. The anchor box sizes, see section 2.4.2, and the box ratios were also altered to fit the existing data. The anchor box sizes were changed to [45, 64, 128] and the box ratios were set to [[1, 2], [1, 1.5], [1, 1]]. The anchor box sizes and scales were fitted to the size of the digits; the smallest annotated width was approximately 50 pixels and the largest was approximately 71 pixels. Since the anchors were applied after the resizing of the images, the anchor sizes were modified with the same scale as the image (50 × 0.9 = 45 and 71 × 0.9 ≈ 64). The size 128 was added in case there were some abnormal digits that fit better in a bigger anchor; since the anchor is just an object proposal, the box regression model could still predict a smaller bounding box from the bigger anchor. Multiplying the ratios [[1, 2], [1, 1.5], [1, 1]] with the proposed anchor sizes gave boxes that were close to the size of the annotated boxes in the data and could be used for training.
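Collected as a configuration snippet for reference; only the values come from the thesis, while the attribute names are assumptions based on typical Keras Faster R-CNN ports:

```python
# Hypothetical config object for the cloned Keras-FasterRCNN code.
class Config:
    anchor_box_scales = [45, 64, 128]               # digit widths 50-71 px, scaled by 0.9
    anchor_box_ratios = [[1, 2], [1, 1.5], [1, 1]]
    rpn_stride = 16                                 # as in the Faster R-CNN paper
    num_rois = 300
    resized_image_shape = (2304, 1800)              # rescaled input resolution
```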

3.3.2 Feature extraction method

The feature extraction was applied on the bounding boxes received from Faster R-CNN, and only on the bounding boxes from the network that had been trained on the single class 'digit'. The input images were first preprocessed: the vibrations were removed and an edge detection algorithm with a kernel size of 5 × 5 was applied. This resulted in a binary image, where the perforations in the metal sheet appear as white dots. A morphological dilation with a kernel size of 5 × 5 was then applied to connect the dots in each digit, followed by a morphological opening to remove noise. Figure 2.1 shows how the edge image was affected by morphology: figure 2.1a is without dilation and opening, figure 2.1b shows when only opening is applied, figure 2.1c was processed with dilation only, and figure 2.1d is the result of both dilation and opening. The image with no applied morphology has visible digits, but the resulting training samples contain much noise, while the image with both dilation and opening results in samples that are more similar to the training data in the articles [16-19]. Therefore, the morphological dilation was followed by a morphological opening on the binary images. Figure 3.3 describes the workflow for digit recognition using the feature extraction method. There were two main stages: training and recognition. The training data was collected by extracting blobs of a digit from the preprocessed binary image. Each digit was then placed in a corresponding folder, e.g. all images with the label '3' were put into a folder called '3'.

(35)

Figure 3.3:Workflow for the baseline solution.

The next step was to train on the collected samples. The features of the characters were extracted by 'Normalization-Cooperated Gradient Feature' (NCGF) together with the character normalization method 'Line Density Projection Interpolation' (LDPI), see section 2.4.1. This resulted in eight direction planes with a size of 64 × 64 pixels. From each direction plane, 8 × 8 feature values were extracted by Gaussian blurring, see figure 3.4, which gave a feature vector of dimension 512. The dimension was then reduced to 160 by 'Fisher Linear Discriminant Analysis' (FLDA), see section 2.4.1. This was done to reduce the classifier complexity and improve the recognition accuracy. With features available for each sample, the average vector of each class was computed. The parameters of the classifier 'Modified Quadratic Discriminant Function' (MQDF), see section 2.4.1, were then calculated and saved in a '.net' file. Several numbers of eigenvectors were tested to see which gave the best accuracy by 5-fold cross-validation.

Figure 3.4: 8 × 8 feature values.

The recognition step took a blob as input and extracted its features. The trained MQDF then evaluated the features, and the character that gave the minimum distance was the recognition result.

3.4 Defect detection

The autoencoder was implemented in several sizes and trained on a smaller dataset in order to find the CAE architectures that seemed most promising. Two final architectures were then trained on a larger dataset: one smaller architecture that downsampled the images to 16 × 20 pixels, and one larger architecture that downsampled them to 8 × 10 pixels. The smaller architecture consisted of 11 layers of convolution and pooling in the encoder module, and 11 layers of convolution and deconvolution in the decoder module. The larger version contained an additional downsampling layer in the encoder and an additional upsampling layer in the decoder. All layers used a 3 × 3 filter kernel and the max-pooling layers used stride 2 in both directions. A TensorFlow implementation of a ReLU function was used as activation function in all layers. To reduce the loss of semantic information, the number of feature maps was doubled after each max-pooling layer in the encoder, resulting in 512 feature maps for the smaller network and 1024 feature maps for the larger. The number of feature maps was then halved again after each deconvolution layer in the decoder, resulting in an output with the same dimensions as the input. A detailed description of the smaller architecture is shown in figure 3.5 below.

Before the training phase was initiated, anomaly-free data was collected by sorting out data with a large amount of edges, where the edges were found by applying an edge detector on the height map. The resulting dataset, containing only height values, was further divided into 16 patches, padded to fit an image size of 512 × 600 pixels and normalized between 0 and 1. The reason for only using depth values was that anomalies only count as defects if they appear as dimples in the metal; by only using depth values, the detection algorithm will not be sensitive to other types of anomalies, such as discolorations. The patches were then stored as numpy files and finally converted into TFRecords. A TFRecord, or TensorFlow record, is a format for storing a sequence of binary records. This resulted in a training dataset of 107424 images, where 10% was used for validation. For evaluation, a separate dataset was used containing 11083 images without defects and 205 images with defects. The defects were mainly gathered by adding metal sheets with known defects on the production line, which makes the real defect/non-defect relation unknown.

Figure 3.5: Illustration of the output feature maps at each CAE layer. The output thickness at each layer represents the number of feature maps, and the width and height their dimensions.

In the training phase, two loss functions were compared: ℓ2 loss and ℓ1 loss. The Adam optimizer was used. The learning rate was initialized to 0.0001 with a staircase decay of 0.9 every 30000 steps. No rotation augmentation was applied, since it was important that the autoencoder learned how to reconstruct the vibrations, which always appeared in a specific direction. However, since the vibrations always propagated along the image y-axis, flipping the image around the y-axis would not affect the reconstruction, and could thus be used to double the size of the dataset.
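A minimal TensorFlow sketch of this augmentation; the function name is illustrative:

```python
import tensorflow as tf

def augment(patch: tf.Tensor) -> tf.Tensor:
    # Mirror the patch around the image y-axis with probability 0.5; rotation
    # is deliberately omitted so the CAE learns the vibration direction.
    return tf.image.random_flip_left_right(patch)
```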

3.5 Classification of anomalies

The initial approach was a straightforward pixel-wise thresholding of the reconstruction error image, which is the difference between the input image and the reconstructed image. Since the output image was not always perfectly reconstructed, even in areas without defects, the first step was to filter the error image with an averaging filter of size n × n to remove high frequency noise. The pixel-wise threshold was then obtained by multiplying the mean value of the error image with a factor t. All pixels with values below that threshold were set to zero and all above were set to 1, resulting in a binary image where all pixels are classified as defects (1) or non-defects (0). To classify the image, the filtered and thresholded image was summed over all pixels, and if the sum was above a final threshold, f, the image was classified as defect. The final threshold was applied to sort out anomalies that were too small to count as defects. The CAEs were then evaluated using a grid search on all combinations of t = 3, 4, ..., 15, n = 3, 5, ..., 13 and f = 3, 6, ..., 99 on the test dataset. The set of parameters with the highest F2-score was selected for each CAE separately, using both ℓ1 and ℓ2 loss.
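A minimal NumPy/SciPy sketch of this thresholding classifier; the default values of t, n and f are arbitrary picks from the grid-searched ranges:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def classify_patch(x: np.ndarray, x_hat: np.ndarray,
                   t: float = 6, n: int = 5, f: int = 12) -> bool:
    error = np.abs(x - x_hat)                  # reconstruction error image
    error = uniform_filter(error, size=n)      # n x n averaging filter
    threshold = t * error.mean()               # pixel-wise threshold
    binary = (error > threshold).astype(int)   # 1 = defect pixel, 0 = non-defect
    return binary.sum() > f                    # final threshold on defect area
```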

The initial approach required a very accurate reconstruction to work properly. The second approach was thus implemented to allow the CAE to make some mistakes. The idea was to study not only the reconstruction error, but also the latent space representation, and to train an estimation network that for each sample estimates the probability of belonging to a certain mixture of Gaussian distributions. This allows more complex sample distributions and increases the dimensionality of the classification problem. Additionally, instead of representing the reconstruction error with only one feature (as in the first approach), both the relative Euclidean distance and the cosine similarity were used. Figure 3.6 below shows the sample distribution of the two error measurements and the principal component of the latent space representation.

Figure 3.6: Sample distribution of the reconstruction error and the principal component of the latent space representation, where the blue dots are samples without defects and the red dots are samples with defects.
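The two error measurements can be computed as in the small sketch below, where the exact feature definitions are assumed to follow the DAGMM-style formulation in [32].

import numpy as np

def reconstruction_features(x, x_hat, eps=1e-8):
    # Relative Euclidean distance and cosine similarity between the
    # input x and its reconstruction x_hat (both flattened).
    x, x_hat = x.ravel(), x_hat.ravel()
    rel_euclidean = np.linalg.norm(x - x_hat) / (np.linalg.norm(x) + eps)
    cosine_sim = np.dot(x, x_hat) / (
        np.linalg.norm(x) * np.linalg.norm(x_hat) + eps)
    return rel_euclidean, cosine_sim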

As described in section 2.5.4, a small Gaussian component estimation network was implemented. The network was composed of 5 dense layers with outputs of 10, 64, 32, 16 and k features respectively, with a ReLU activation function after each layer except for the last, which was followed by a softmax activation function. The network took an n + 2 dimensional vector as input, containing the n-dimensional latent space representation, the relative Euclidean distance and the cosine similarity. The latent space representation was reduced to the number of feature maps in the latent space by calculating the mean value of every feature map. This reduced the size of the feature vector from 328194 to 514, which was done to reduce the complexity of the GMM estimation. However, a 514-dimensional feature vector was still very large, which is why a CAE with an 8 × 10 × 64 dimensional latent space was tested as well. A final attempt to reduce the dimensionality was to apply principal component analysis (PCA) and then only use the primary component as feature. This reduced the feature vector without compromising the size of the latent space for the CAE, since the reduction of the latent space representation was done after the compression step. The estimation network was then optimized by finding the GMM parameters that minimize the sample energy. The GMM parameters were calculated using all samples in the batch, while the sample energy was calculated for each sample. The loss function was implemented as equation (2.24) in section 2.5.4. The GMM implementation resulted in a large set of new hyperparameters that needed tuning. Figure 3.6 does, however, show that the samples without defects (blue dots) form three clear clusters, which is why the estimation network was only tested using 2, 3, 4 and 5 Gaussian mixture components. The other parameters were initialized based on the results in [32].
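A minimal Keras sketch of such an estimation network is shown below, with the layer sizes taken from the description above; the mean-pooling of the latent feature maps is included for completeness, and the function names are hypothetical.

import tensorflow as tf

def latent_to_features(z):
    # Reduce an (h, w, c) latent representation to c features by taking
    # the mean over every feature map.
    return tf.reduce_mean(z, axis=[0, 1])

def build_estimation_net(n, k):
    # Five dense layers with 10, 64, 32, 16 and k outputs; ReLU after every
    # layer except the last, which uses softmax over the k mixture components.
    inputs = tf.keras.Input(shape=(n + 2,))  # n latent features + 2 error features
    h = tf.keras.layers.Dense(10, activation='relu')(inputs)
    h = tf.keras.layers.Dense(64, activation='relu')(h)
    h = tf.keras.layers.Dense(32, activation='relu')(h)
    h = tf.keras.layers.Dense(16, activation='relu')(h)
    gamma = tf.keras.layers.Dense(k, activation='softmax')(h)
    return tf.keras.Model(inputs, gamma)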

Since the Convolutional Autoencoding Gaussian mixture model (CAEGMM) described above was limited to estimating the GMM using only 50 samples, a final approach was evaluated where the GMM and the CAE were separated. This was done by storing the feature values for each sample in an array, where the features were extracted using the PCA approach mentioned above. The array was then fed to a separate GMM implementation provided by scikit-learn [24]. The difference from the previous approach was that the GMM was estimated using all samples, and that the CAE and the GMM did not share a loss function and were not optimized as an end-to-end solution.
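A sketch of this separated approach using scikit-learn could look as follows. The input arrays are placeholders, the choice of three mixture components is an assumption motivated by the clusters in figure 3.6, and the percentile threshold is purely illustrative rather than the thesis method.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Placeholders for the stored per-sample features (hypothetical shapes):
latent_means = np.random.rand(1000, 64)    # mean-pooled latent feature maps
error_features = np.random.rand(1000, 2)   # rel. Euclidean dist. and cosine sim.

# Keep only the primary component of the latent representation.
latent_pc = PCA(n_components=1).fit_transform(latent_means)
features = np.hstack([latent_pc, error_features])  # shape (num_samples, 3)

# Fit the GMM on all samples at once, unlike the batch-limited CAEGMM.
gmm = GaussianMixture(n_components=3, covariance_type='full').fit(features)

# Samples with a low log-likelihood under the fitted mixture are flagged
# as potential defects.
scores = gmm.score_samples(features)
anomalies = scores < np.percentile(scores, 2)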


4 Results

The results of the thesis are described in this chapter. The outcome of the ID-number detection and classification is presented first in section 4.1, and section 4.2 describes the results of the defect detection.

4.1 OCR

Figure 4.1a shows a 3D representation of one metal sheet, which is a portrayal of both intensity and range data. The vibrations that the sheet experienced when transported through the production line can be seen as dark and light stripes. Figure 4.1b shows the intensity data of the same sheet; the left column is the original data and the right one has been processed with vibration removal. As can be seen, there is no difference between the two, and therefore it does not matter which one is used in training. Figure 4.1c shows that the vibration removal process does indeed improve the data that is used in feature extraction.

4.1.1 Faster R-CNN

The final average precision (AP), see section 2.6.4, for the network that predicts the class 'digit' was 94.69% and the accuracy was 94.35%; 317 true and 7 false positives were found, and 12 digits were not found at all. Since only one class was evaluated in this case, the mean average precision (mAP) is the same as the AP.

Table 4.1 shows the result for Faster R-CNN using the classes 0 to 9. In two cases, the network predicted two '1's next to each other when there was only one, see figure 4.2. The false positive '3' and '5' were also results of a double detection of the same digit. The false positive '6's were actually the letter 'S', which in some sheets looks very similar to the digit '6', see figure 4.3.

Figure 4.1: The result of applying morphology on an intensity image and a binary edge-detected image. (a) 3D representation of a metal sheet containing both range and intensity data. (b) Intensity representation of a metal sheet and the result of vibration removal. (c) Metal sheets preprocessed with an edge detection filter.

(43)

Figure 4.2:Digit ’1’ is in some cases predicted as two numbers. This can be seen as two yellow rectangles around the digit.

Figure 4.3: The letter ’S’ is present in a couple of evaluation images and is in two cases predicted to be ’6’.

Class  AP       Accuracy  TP   FP  FN
0      97.67%   97.67%    42   0   1
1      96.23%   96.97%    32   2   1
2      97.37%   97.37%    37   0   1
3      99.96%   97.96%    48   1   0
4      97.73%   97.73%    43   0   1
5      97.22%   94.59%    35   1   1
6      99.67%   89.47%    17   2   0
7      100.00%  100.00%   17   0   0
8      100.00%  100.00%   32   0   0
9      100.00%  100.00%   25   0   0
Tot    98.59%   97.62%    328  6   5

Table 4.1: The AP, accuracy, true positives (TP), false positives (FP) and false negatives (FN) for each class for the final Faster R-CNN.

Apart from this, there were no other misclassified digits, see the confusion matrix in table 4.2. Five digits were false negatives; in all cases except one, those digits were the ones closest to the edge of the image.

Figure 4.4 shows some result images for both networks, zoomed in on areas with digits. The third row shows images with digits that are split by the image border, which is why the box with the class prediction is not visible. Figures 4.5 and 4.6 show how the training loss for the two networks changed over the epochs.


Figure 4.4: Cutouts from three evaluation images. The left column shows the result from Faster R-CNN trained on only the class 'digit' and the right column shows the bounding boxes from Faster R-CNN trained on the classes '0' to '9'. Each bounding box has a square that shows the predicted class and the probability of belonging to that class. The contrast and brightness are altered to simplify the interpretation of the result.

Figure 4.5: The total training loss of Faster R-CNN trained on the class 'digit'.

Figure 4.6: The total training loss of Faster R-CNN trained on the classes '0' to '9'.


Actual\Predicted   0   1   2   3   4   5   6   7   8   9   S
0                 42   0   0   0   0   0   0   0   0   0   0
1                  0  32   0   0   0   0   0   0   0   0   0
2                  0   0  37   0   0   0   0   0   0   0   0
3                  0   0   0  49   0   0   0   0   0   0   0
4                  0   0   0   0  43   0   0   0   0   0   0
5                  0   0   0   0   0  36   0   0   0   0   0
6                  0   0   0   0   0   0  17   0   0   0   2
7                  0   0   0   0   0   0   0  17   0   0   0
8                  0   0   0   0   0   0   0   0  32   0   0
9                  0   0   0   0   0   0   0   0   0  25   0

Table 4.2: Confusion matrix for the predictions of Faster R-CNN, where the columns represent the predicted class and each row represents the actual class.

4.1.2 Feature extraction method

Table 4.3 shows the accuracy for classifiers using different numbers of eigenvalues. More eigenvalues than shown in the table were tested; the accuracy started to decline at 14 eigenvalues because of the estimation error of the covariance matrix. Since 6 to 13 eigenvalues gave the same accuracy of 99.6%, the mean value of 10 ((6 + 13)/2, rounded up) eigenvalues was chosen for MQDF, see section 2.4.1. Afterwards, two separate feature extraction models were trained, one with digits from 5 images and one with digits from 15 images, and evaluated on the sheet in figure 4.7. Both models could classify the sheet correctly, and it would therefore be enough to use training samples from 5 images. However, to also include samples from noisy sheets, the model trained on 15 images was chosen to be used further on. This resulted in 20-30 samples per class, containing both noisy and noise-free characters.

Number of eigenvalues  1      5      10     15     40
Accuracy               98.7%  99.1%  99.6%  99.1%  2.56%

Table 4.3: The accuracy of the feature extraction method for different numbers of eigenvalues.


Figure 4.7:A preprocessed metal sheet.

Table 4.4 shows the AP for each class (0 to 9), the mAP and the number of true positives, false positives and false negatives when NCGF was applied on the bounding boxes from Faster R-CNN. The confusion matrix in table 4.5 shows correctly and falsely classified digits when using NCGF. The confusion matrix was not affected by the precision and recall of Faster R-CNN, that is, no false negative or false positive bounding boxes were evaluated, while the results in table 4.4 were. The column of false negatives in table 4.4 includes both digits that were missed by Faster R-CNN and incorrectly classified digits.

Class  AP      Accuracy  TP   FP  FN
0      67.63%  66.67%    30   2   13
1      55.90%  59.18%    29   17  3
2      62.55%  60.00%    24   2   14
3      65.40%  67.86%    38   8   10
4      67.31%  64.52%    40   18  4
5      90.38%  86.84%    33   2   3
6      35.03%  38.10%    8    3   10
7      94.12%  88.89%    16   1   1
8      60.07%  64.29%    27   10  5
9      61.88%  61.54%    16   1   9
Tot    66.93%  65.74%    261  64  72

Table 4.4: The AP for each class when applying NCGF on the predicted bounding boxes.


Actual\Predicted   0   1   2   3   4   5   6   7   8   9
0                 30   1   1   1   8   0   1   0   0   0
1                  0  29   0   0   0   0   0   0   0   0
2                  0   5  24   6   0   0   0   1   1   0
3                  0   4   0  38   2   0   0   0   1   0
4                  0   2   0   0  40   0   0   0   0   0
5                  0   1   0   0   0  33   0   0   0   0
6                  0   0   0   0   0   1   8   0   7   0
7                  0   1   0   0   0   0   0  16   0   0
8                  1   2   0   0   1   0   1   0  27   0
9                  1   1   0   0   5   0   1   0   0  16

Table 4.5: Confusion matrix for the predictions of NCGF, where the columns represent the predicted class and each row represents the actual class.

As predicted in section 3.1, the sheets containing an alloy complicated the reading with NCGF, since the edge detection filter found many edges and resulted in noisy images, see figure 4.8. It can also be seen that two 'A's and one 'S' were falsely detected in the second row, which contributed to the false positives in the result. NCGF classified the 'A's as '4' and the 'S's as '8', which is understandable since those letters are similar to the predicted digits.

Figure 4.9 shows the bounding boxes applied on a preprocessed metal sheet that originally had less visible dimples. As can be seen to the right in the picture, Faster R-CNN predicted that there were digits there, but since no edges were found, no real digits are visible. In these cases, NCGF often predicted the digit '1', since the features of a single dot are more similar to the features of that digit than to any other.

NCGF performed well on images such as figure 4.7, where no noise was present and the digits were clearly visible. As can be seen in table 4.4, the digit '1' is often a false positive, which, as described before, is because NCGF often classified strange-looking digits as '1'. The three false negative samples of that class are digits that were missed by Faster R-CNN rather than misses by NCGF. The reason why the class '6' has more false negatives than positives is that most of the sixes were misclassified as '8'. Faster R-CNN also misclassified several 'S' letters as digits, but since these were almost only present in metal sheets containing an alloy, they were not classified as '6' by NCGF.

4.2 Defect detection

This section presents the results from the convolutional autoencoder (CAE) and the convolutional autoencoder with GMM (CAEGMM), where the loss functions ℓ1 and ℓ2 are compared.
