Detection of Non-Ferrous Materials with Computer Vision


Master of Science Thesis in Computer Science

Department of Electrical Engineering, Linköping University, 2020

Detection of Non-Ferrous Materials with Computer Vision


Fredrik Almin LiTH-ISY-EX--20/5321--SE

Supervisor: Karl Holmquist

ISY, Linköpings universitet

Fredrik Noring

Combitech AB

Examiner: Maria Magnusson

ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2020 Fredrik Almin


Abstract

In one of the facilities at the Stena Recycling plant in Halmstad, Sweden, about 300 tonnes of metallic waste is processed each day with the aim of sorting out all non-ferrous material. At the end of this process, non-ferrous materials are manually sorted out from the ferrous materials. This thesis investigates a computer vision based approach to identify and localize the non-ferrous materials and eventually automate the sorting.

Images were captured of ferrous and non-ferrous materials. The images are processed and segmented to be used as annotation data for a deep convolutional neural segmentation network. Network models have been trained on different kinds and amounts of data. The resulting models are evaluated and tested in accordance with different evaluation metrics. Methods of creating advanced training data by merging imaging information were tested. Experiments with using classifier prediction confidence to identify objects of unknown classes were performed.

This thesis shows that it is possible to discern ferrous from non-ferrous material with a purely vision based system. The thesis also shows that it is possible to automatically create annotated training data. It becomes evident that it is possible to create better training data, tailored for the task at hand, by merging image data. A segmentation network trained on more than two classes yields lower prediction confidence for objects unknown to the classifier.

Substituting manual sorting with a purely vision based system seems like a viable approach. Before a substitution is considered, the automatic system needs to be evaluated in comparison to the manual sorting.



Acknowledgements

I want to extend much gratitude toward my supervisors, Karl Holmquist and Fredrik Noring. The guidance provided by your knowledge and input has led me to avoid many pitfalls and instead focus my efforts on more fruit-bearing projects. I would also like to extend sincere thanks to the examiner of this thesis project, Maria Magnusson. The significant commitment you made to helping finish this thesis was very appreciated. I want to direct another thank you toward the machine vision team at Combitech for their support.

I would also like to thank Malin Lindberg for letting me perform the thesis work at Combitech.


Contents

Notation

1 Introduction
1.1 Background
1.2 Purpose and aim
1.3 Problem formulation
1.4 Limitations
1.5 Context

2 Theory
2.1 Machine learning
2.1.1 Neural networks
2.1.2 Deep learning
2.2 Computer vision
2.2.1 Background modelling
2.2.2 Semantic segmentation
2.3 Model evaluation

3 Method
3.1 Data collection
3.2 Image preprocessing
3.2.1 White balance correction
3.2.2 Background modelling
3.2.3 Noise removal
3.2.4 Downsampling
3.3 Merged dataset
3.4 Data split
3.5 Network implementations
3.6 Network evaluation
3.7 Network certainty scores
3.8 Manual annotation

4 Results
4.1 Automatic annotation
4.2 Neural Network evaluation
4.2.1 Model 1
4.2.2 Model 2
4.2.3 Model 3
4.2.4 Model 4
4.2.5 Model 5
4.2.6 Test results
4.3 Confidence maps
4.3.1 Binary classifier
4.3.2 Multi-class classifier
4.4 Addition of classes
4.5 Testing a model on real data

5 Discussion
5.1 Manual and automatic annotation
5.2 Three-class neural networks
5.2.1 Data split
5.3 Confidence maps
5.3.1 Binary classifier
5.3.2 Multi-class classifier
5.4 Additional classes
5.5 Quality assurance
5.6 Evaluation
5.7 Method
5.8 Source criticism
5.9 Concluding discussion

6 Conclusions
6.1 Research questions
6.2 Future work
6.2.1 Comparison to manual sorting
6.2.2 Evaluation
6.2.3 Uncommon objects
6.2.4 Network architectures
6.2.5 Data collection


Notation

Abbreviations

Abbreviation Meaning

CNN Convolutional neural network

FN False negative

FP False positive

GPU Graphics Processing Unit

HRNet High resolution Network

IoU Intersection over Union

MoG Mixture of Gaussians

NLLloss Negative Log Likelihood loss

PCB Printed Circuit Board

ReLU Rectified Linear Unit

RGB Red Green Blue

SGD Stochastic Gradient Descent

TN True negative

TP True positive


1

Introduction

The use of deep convolutional neural networks (CNNs) has in the last decade proven to provide results that exceed any previously used methods in image classification problems, as seen and made popular by Krizhevsky et al. [14]. Deep CNNs have since been used for all sorts of image-related classification and detection purposes [15]. One such problem is semantic segmentation, the task of segmenting an image into different parts and then assigning a class to each segmented image part [23]. When such systems reach a certain accuracy, a plethora of possibilities for automation is created. An example of such a possibility is the automation of the, currently manual, sorting of trash in recycling facilities. If cameras and image data can be used to detect which objects need to go where, recycling can be both streamlined and automated.

This thesis was conducted at Combitech AB as part of The Circular Initiative, an initiative from Stena Recycling. It aims to use computer vision and machine learning in order to classify which objects in a stream of materials are made of ferrous material and which are not.

1.1

Background

Machine learning is the school of making algorithms improve themselves by training on data and optimizing a function to perform a certain task. With sufficient data to train on, the algorithm will hopefully reach performance such that it can be generalized to work on data that was not trained on. Machine learning is useful for processing large amounts of data, and if such amounts of data exist it can be used to substitute advanced algorithms and often provide a more robust resulting function. The use of machine learning applications has spiked in recent years, mainly due to advances in computational power and the efficiency of deep learning in areas such as computer vision [8]. The computational capacity is only a limiting factor during training. The machine learning models can, once trained, be run with much lower computational demands.

Computer vision is the field of extracting information from images as well as creating models of the real world from images [11]. There are many ways in which to apply deep learning to computer vision problems, such as image detection and classification. If certain objects are to be detected, such as cars, bikes, cats, dogs and people, all these cars, bikes, cats, dogs and people need to be labelled (annotated) in their respective images. This annotation can be done in different ways: boxes can be drawn around the objects and used as data [18]. Another way is semantic segmentation, where a class is assigned to each pixel [23]. If a machine learning algorithm is to learn a specific task, data to train on is required. The amount of data is preferably rather large, or the algorithm may not be able to detect the underlying patterns such that it can generalize to previously unseen data. While images usually are easy to come by, annotated images are not [25]. Manual annotation is labor-heavy and expensive, which creates a demand for automated annotation based on image content [12]. If training data can be generated automatically from a set of source images, cheap possibilities are created for machine learning algorithms to be implemented.

If data is sparse or homogeneous, a machine learning algorithm may be overtrained on the small amount of data and not able to generalize to data that does not compare with the homogeneity of the data that was trained on. Data augmentation, the school of modifying original data, can be used to create a larger variance in the training data [20]. If, for example, images are rotated or have their brightness level changed, it can create invariance in the model toward rotations and lighting changes.

1.2

Purpose and aim

The work in this thesis focuses on automatic detection of non-ferrous materials that are mixed with ferrous materials on an assembly line. The non-ferrous materials are currently sorted out manually. The basis of this thesis is the idea of using cameras, computer vision and machine learning to localize the non-ferrous materials in order to eventually automate the sorting process.

The aim is to deliver a proof of concept as to whether or not it is possible for a deep learning based system to complete the task described above with reasonable performance, as well as deliver insights regarding problems and implementation strategies in a real world scenario at the actual recycling plant. Yet another aim is to investigate whether or not the process of annotating image data can be sufficiently automated.

1.3

Problem formulation


1. Can a neural network be trained to separate non-ferrous materials from ferrous materials in images?

2. How well can the process of creating annotation images for a segmentation network be automated?

3. How much data and training, and what kind of data, is needed to produce a neural network that can perform image classification of ferrous and non-ferrous material?

4. How much can augmented data improve the performance of a segmentation network?

5. How to handle object types that do not belong to a class recognized by the classifier?

1.4

Limitations

The thesis work was conducted with the following limitations:

• Only iron and iron-like objects were used as ferrous material

• Only copper, copper-like objects and cables were used as non-ferrous material

• Training images were captured of only one type of object at a time

1.5

Context

This section covers information about Stena and the process which the thesis aims to help automate in order to place the work in its context. The images and information presented in this section were taken from a study visit where the author and other representatives from Combitech went to the Stena Recycling plant in Halmstad, Sweden.

Stena Recycling is a market leader in collecting, refining and recycling all kinds of scrap. About 300 tonnes of metallic waste goes through the recycling plant each day, where most of it comes from industrial waste. Producing new raw material out of scrap gives large environmental benefits. Recycling one tonne of iron saves one tonne of carbon dioxide emissions.

The process of sorting out the non-ferrous materials from the stream of ferrous material consists of several steps. Most of the steps are automatic methods, such as magnetic sorting. Not all non-ferrous materials are automatically sortable, however. Therefore, four employees are stationed in a hut to manually sort out the remainder of the non-ferrous materials as the last step in the sorting process. Images of the hut are presented in figure 1.1.


(a) The sorting hut. (b) Inside the sorting hut.

Figure 1.1: Images of the hut in which the manual sorting is done.

The non-ferrous materials that make their way to this last step may for example be pieces of cables or printed circuit boards (PCBs), but mostly it is copper wire. Examples of the in- and out-flow are presented in figure 1.2.

(a) Scrap on the assembly line before sorting. (b) Scrap on the assembly line after sorting.

Figure 1.2: Images of the assembly line going into and out of the sorting hut.

Once non-ferrous materials have been sorted out, the ferrous materials are melted into an alloy. If pieces of copper make their way through, the properties of the resulting alloy may be severely impaired. Good recycled materials produced from properly sorted scrap can be sold as high quality raw materials to for example steel mills, smelters and foundries.

Ideally only ferrous materials are left on the assembly line after manual sorting. These ferrous materials are then dropped in an assembly area, as seen in figure 1.3.

Figure 1.3: The author collecting garbage.

The idea behind this thesis is using camera information to localize the non-ferrous material in order to automate the sorting. If exact locations of unwanted materials are known, these materials could be removed by mechanical means such as a robotic arm or broom.


2

Theory

The following chapter aims to provide the needed information behind the work in this thesis. The basic ideas of machine learning and deep learning are presented, as well as a few image processing techniques, in order to give the reader an understanding of these research fields.

2.1

Machine learning

An algorithm is a sequence of instructions to be carried out to transform an input to an output [3]. For certain complicated tasks, defining this sequence of instructions to reach a desired output can be difficult. For certain other tasks there simply is no algorithm that can consistently provide good results, as it is bound by rigid instructions. In machine learning, the aim is to substitute these sequences of instructions with a general model with many parameters [3]. This model is not designed to perform just one task, but can perform many tasks depending on how its parameters are set. The ”learning” part represents how the values of these parameters are adjusted in order for the model to ”learn” how to perform the designated task [3]. The parameters in this case correspond to the instructions that are used in a regular algorithm. Learning is achieved by letting the model train on data, produce an output and compare that output to what the output is supposed to be. With the information of whether or not the output was correct, the model parameters can be adjusted in order to produce a more accurate result in the future. With sufficient data and training, the originally general model eventually becomes specialized to produce desired outputs in accordance with the training data. This model can then be used to substitute an algorithm for the task at hand [3]. In short, machine learning can be described as a set of methods to detect patterns in data and use these patterns to make decisions under uncertainty. Unlike a regular algorithm that gives an output depending on how the input went through predefined instructions, a machine learning algorithm will give an idea of what the output of an algorithm is most likely to be based on patterns in the data [16]. In even shorter terms, machine learning uses applied statistics and probability theory to substitute algorithms. Machine learning is widely used with huge application in, for example, computer vision areas such as face and handwriting recognition [16].

The most widely used form of machine learning is what is known as supervised learning, which is also the form that was described above [16]. This means to learn a mapping from input x to output z by comparing with already defined input-output pairs (which is the ”supervised” part, as opposed to unsupervised learning where there are no correct answers to compare to). The input is made up of a set of features that represent what the actual data represents, and the output contains some sort of label or labels which the input is supposed to be transformed into [16]. If the goal of the model is to perform some sort of classification, for example ”does this picture portray a dog or a cat?”, the inputs are the features of the picture and the output is either the label ”dog” or the label ”cat”.

2.1.1

Neural networks

Neural networks are machine learning models that were originally inspired by the way that neurons are connected in a human brain [9]. The artificial neurons in an artificial neural network are structured in layers. All of the artificial neurons in one layer are connected with all of the artificial neurons in the subsequent layer by a set of connecting arcs, as seen in figure 2.1 [9]. The figure portrays a simple architecture of a fully connected feedforward neural network with one input layer, one hidden layer and one output layer. The input features are numbers represented by x_1 through x_n and the output labels are represented by z_1 through z_m.

Each neuron in a fully connected feedforward neural network has connections to all neurons in the previous and the subsequent layer. These connections are also represented with numbers referred to as weights. The input to a neuron is the sum of all values calculated in earlier neurons, or the original input, each multiplied with the weight of the connection and finally an added bias, see figure 2.2. The figure illustrates what happens for every single neuron. After the summation has been done, an activation function is applied to the resulting sum, as described below.

Figure 2.2: Illustration of a neuron in a neural network.

The mathematics of the neuron can be expressed as

y = \varphi\Big(b + \sum_{i=1}^{n} w_i x_i\Big),    (2.1)

where y represents the output of the neuron, b denotes the bias, \varphi denotes the activation function and w_i x_i denotes the input multiplied by the weight from incoming connection i.
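As a concrete illustration of (2.1), the following sketch (not from the thesis) computes the output of a single neuron in NumPy. The input, weight and bias values are made-up examples, and the ReLU function defined in the next subsection is used as the activation.

import numpy as np

def neuron_output(x, w, b, activation):
    # Equation (2.1): y = activation(b + sum_i w_i * x_i)
    return activation(b + np.dot(w, x))

# Made-up example values; ReLU (equation (2.2)) as activation.
x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.8, 0.1, -0.4])   # connection weights
b = 0.2                          # bias
y = neuron_output(x, w, b, activation=lambda s: np.maximum(s, 0.0))
print(y)  # 0.4 - 0.12 - 1.2 + 0.2 = -0.72, clamped to 0.0 by ReLU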

It is important to note that not all neural networks follow this exact structure. There are other types of networks that include other techniques such as feedback connections and skip connections or networks where all neurons of neighboring layers are not connected [8].

Activation function

The activation function of a neuron in a neural network transforms the output of the summation node in order to clamp values and introduce non-linearity to the model. The most commonly used activation function in deep models is the Rectified Linear Unit (ReLU) activation function [17]. The ReLU activation function is defined as

f(y) = \begin{cases} y & \text{if } y \geq 0, \\ 0 & \text{if } y < 0, \end{cases}    (2.2)

i.e. positive values of y are kept, whereas negative values are set to zero. Other

common activation functions include for example the sigmoid function and the

hyperbolic tangent function [13].

The ReLU function comes in different versions as well. ReLU may suffer from the fact that its gradient is zero for negative values. This can be solved by using what is known as the Leaky ReLU function, which replaces the second row in (2.2) with 0.001y. This introduces a small positive gradient for values where y < 0 [13].

Another variant is the Exponential Linear Unit (ELU), which is defined as

f(y) = \begin{cases} y & \text{if } y \geq 0, \\ a(e^{y} - 1) & \text{if } y < 0, \end{cases}    (2.3)

where a is a hyperparameter with the constraint a ≥ 0. This function moves mean activations toward zero [6].
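As a small illustration (not code from the thesis), the three activation functions above can be written directly from (2.2) and (2.3). The Leaky ReLU slope 0.001 follows the text, while the ELU parameter a is set to an arbitrary example value.

import numpy as np

def relu(y):
    # Equation (2.2): keep positive values, set negative values to zero.
    return np.where(y >= 0, y, 0.0)

def leaky_relu(y, slope=0.001):
    # Leaky ReLU: replaces the zero branch of (2.2) with a small positive slope.
    return np.where(y >= 0, y, slope * y)

def elu(y, a=1.0):
    # Equation (2.3): exponential linear unit with hyperparameter a >= 0.
    return np.where(y >= 0, y, a * (np.exp(y) - 1.0))

y = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(y))
print(leaky_relu(y))
print(elu(y))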

Loss function and optimization

In order for the neural network to learn, its weights must be updated in accordance with its performance. This is done by an optimizer that minimizes an error measure. This error measure is calculated by what is referred to as a loss function. The loss function aims to represent how good the network prediction is. For supervised training, this loss represents how well the prediction compares to the ground truth data. In this thesis, a function referred to as Negative Log Likelihood loss (NLLloss) was used. NLLloss is a form of cross-entropy loss and is represented as

l(x, y) = -\log\left(\frac{e^{x[y]}}{\sum_{j} e^{x[j]}}\right),    (2.4)

where y denotes the correct target class and x[j] denotes the probability of the prediction x belonging to class j [1].

The function inside the logarithm in (2.4) is known as a Softmax function.

Applying softmax for each class results in a probability distribution representing the conditional probabilities for the input belonging to each class [1].
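The combination of a log-softmax output and NLLloss described above corresponds to how this loss can be set up in PyTorch, which is assumed in the sketch below; the class count and tensor shapes are example values, not taken from the thesis.

import torch
import torch.nn as nn

num_classes = 3
log_softmax = nn.LogSoftmax(dim=1)   # turns raw scores x[j] into log-probabilities
criterion = nn.NLLLoss()             # negative log likelihood, equation (2.4)

scores = torch.randn(4, num_classes)            # raw network outputs for 4 samples
targets = torch.tensor([0, 2, 1, 0])            # correct target class y per sample
loss = criterion(log_softmax(scores), targets)  # scalar loss averaged over samples
print(loss.item())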

When the loss has been specified, it needs to be minimized. This is the ”learning” part and is accomplished by backpropagation and an optimization algorithm [8]. Backpropagation is the process of calculating the derivatives of the losses in order to find a minimal value and update the weights accordingly. When a minimal loss has been found, optimality has hopefully been reached. It is the task of the optimization function to find that minimal value. The most common optimization algorithm in machine learning is Stochastic Gradient Descent (SGD) [8]. SGD aims to use data samples from a data set and iteratively searches for a minimum in the loss function by calculating their gradients. With known gradients the algorithm can iteratively look for a minimum by moving in the steepest direction. The process of iteratively updating the weights is mathematically described as

w_{k+1} = w_k - \eta \frac{\partial l}{\partial w_k},    (2.5)

where w_k denotes the weight for iteration k, \frac{\partial l}{\partial w_k} denotes the derivative of the loss with respect to the weight and η represents the learning rate. The learning rate is a hyperparameter that is used to control how much effect one iteration may have on the weight update.

Regular gradient descent uses all available data to calculate the gradients. SGD updates the weight vector for a random or new data sample, or several data samples in what is referred to as a minibatch. Using a random subset of samples instead of all data samples drastically increases the speed of the algorithm, making SGD much cheaper to employ than regular gradient descent.
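A minimal sketch of the update rule (2.5) on random minibatches, using made-up data and a squared-error loss instead of the NLLloss used in the thesis:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
z = X @ w_true + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                              # initial weights
eta = 0.05                                   # learning rate
for k in range(200):
    idx = rng.choice(len(X), size=10)        # random minibatch of 10 samples
    grad = 2 * X[idx].T @ (X[idx] @ w - z[idx]) / len(idx)  # gradient of the loss
    w = w - eta * grad                       # equation (2.5)
print(w)                                     # approaches w_true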

Regularization

If training data is homogeneous (such as structurally similar images) there is a risk that a neural network may not be able to generalize to data it was not trained on. This is known as overfitting and is usually spotted by a model performing much better on training data than on test data. There are techniques to prevent overfitting such as data augmentation and batch normalization [8]. Applying such methods is known as regularization.

Data augmentation is the school of modifying the original training data [20]. If a neural network was to be trained on 1000 training images over 10000 iterations, each image would be used 10 times to train the network. If the images are instead modified before being fed to the network, the risk of training the network on identical images is decreased. Normal techniques for augmenting data include rotation by a random angle, cropping a random part of the image and flipping the images either horizontally or vertically [20]. The ambition is to, for example, create invariance toward objects being rotated instead of overtraining a model to only recognize objects that are placed at a certain angle.

When feeding data to a neural network, it is common to feed data in groups instead of using one data sample at a time. These groups are referred to as batches [8]. Batch normalization normalizes the output of an activation layer by subtracting the mean of the batch and dividing by the batch standard deviation [8]. This is mathematically described as

H' = \frac{H - \mu}{\sigma},    (2.6)

where H' is the normalized batch, H denotes the batch data, µ denotes the mean of the batch and σ denotes the standard deviation. Batch normalization was introduced in 2015 by Ioffe and Szegedy [10] as a way to accelerate training of deep neural networks. Originally it was assumed that batch normalization mitigated problems with internal covariate shift, while others have since argued for alternative explanations. While the exact workings are under debate, the actual effect of batch normalization is evident and empirically proven; it does accelerate training and prevents overfitting.
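Equation (2.6) can be written out directly; the sketch below normalizes a batch of activations per feature. The learnable scale and shift parameters of a full batch-normalization layer are omitted here for brevity.

import numpy as np

def batch_normalize(H, eps=1e-5):
    # Equation (2.6): subtract the batch mean and divide by the batch standard
    # deviation. H has shape (batch_size, num_features); statistics are computed
    # per feature, and eps avoids division by zero for constant features.
    mu = H.mean(axis=0)
    sigma = H.std(axis=0)
    return (H - mu) / (sigma + eps)

H = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=(8, 4))
H_norm = batch_normalize(H)
print(H_norm.mean(axis=0))  # approximately zeros
print(H_norm.std(axis=0))   # approximately ones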

2.1.2

Deep learning

When stacking large numbers of hidden layers in a neural network, the network gains ”depth” and machine learning with such networks is referred to as deep learning [8]. The idea of deep learning has been around for almost 50 years but has been reignited in later years with the evolution of big data and computational capacity [8].

Convolutional layers and pooling

Deep neural networks commonly include convolutional layers, which are in turn split into two parts. The first part is the actual convolution step. Convolutional layers substitute large matrix multiplications from fully connected layers with local convolutions, which has proven to provide good results for, among other types of data, images [8]. This is due to the spatial invariance provided by convolutional layers. Convolving two-dimensional data is performed by defining a convolutional matrix kernel, rotating it 180 degrees, and computing an element-wise multiplication for each data point where the kernel can be applied. This operation without rotating the kernel is what is known as correlation, which is actually what is used in most convolutional layers. The convolutional rotation is irrelevant for learned coefficients, and the process is most often referred to as a convolution anyway [8].

The output of the convolution stage is then subject to an activation function such as the ReLU function in order to introduce nonlinearity. This is called the detector stage [8].

It is common to apply a pooling function after a convolutional layer. Pooling considers a local region and applies a function to extract a value from the values in this region. Common examples include taking the maximum value (MaxPooling) or the mean value (AvgPooling) from the local region, see figure 2.3. If pooling is applied to regions such as the one in figure 2.3 and subsequent regions without overlapping earlier regions, the data size would be reduced by a factor of two per spatial dimension. Applying a function to every other data point like this is referred to as applying the function with a stride of two. Pooling is commonly used to reduce data size while preserving the important statistics in the data [8].

The convolutional layers come with some advantages such as the fact that they employ more sparse interaction than fully connected layers [8]. The convolutional kernel is generally smaller than the input, resulting in local areas being used to feed forward in the network, instead of for example an entire image. The same idea applies to the pooling function. Another advantage is that convolutional layers share weights, reducing the size of the model [8]. The shared weights and sparsity of interactions result in fewer parameters and operations being used and performed, which in turn makes the method cheaper to employ.
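A small sketch (not from the thesis) of MaxPooling and AvgPooling over 2x2 regions with a stride of two, as in figure 2.3:

import numpy as np

def pool2x2(image, mode="max"):
    # Apply 2x2 pooling with stride 2, halving each spatial dimension.
    h, w = image.shape
    blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 6],
                  [2, 2, 7, 8]], dtype=float)
print(pool2x2(image, "max"))   # [[4. 2.] [2. 8.]]
print(pool2x2(image, "avg"))   # [[2.5 1.] [1.25 6.5]]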


Figure 2.3: Illustration of MaxPooling and AvgPooling applied to a local region.

Convolutional neural networks

A deep neural network with one or more convolutional layers is referred to as a Convolutional Neural Network (CNN) [8]. A simple architecture of a CNN is illustrated in figure 2.4, which is supposed to perform digit recognition. The input features are the image pixels and the output is a probability distribution representing the probabilities of the input image displaying the numbers zero through nine.

The model includes two convolutional layers with ReLU activation functions and MaxPooling, as described in section 2.1.2. These layers apply different filters to the images and transform the data into a higher dimensional feature space. For an image classification network where the input is an image, the feature extraction usually results in many significantly smaller images of only a few pixels. These new features are then fed to the next part of the network, which consists of two fully connected layers as explained in section 2.1.1. The output of the fully connected layers is finally fed to a softmax layer as described in section 2.1.1.


Figure 2.4: Illustration of a simple CNN for handwriting recognition. The input images are 28x28 grayscale images of handwritten numbers. The output is a vector of 10 elements denoting the conditional probabilities of the input representing the numbers zero through nine. Image source: [7].

2.2

Computer vision

The goal of a computer vision (often also referred to as machine vision) system is to recover useful information about a scene from its two-dimensional projections [11]. Images are most often two-dimensional projections of a three-dimensional world. In other words, machine vision is the school of extracting desired information from image data.

2.2.1

Background modelling

One area of computer vision is to extract what is background and what is not (denoted as foreground). One way of doing this is by applying a probabilistic model to distinguish background pixels from foreground pixels in a steady scene [22]. The probabilistic model used in this thesis is an adaptive Mixture of Gaussians (MoG) as proposed by Stauffer and Grimson [22]. The MoG model assigns a number of Gaussian distributions to each pixel position in an image sequence. This sequence can be iterated through and the Gaussian distribution parameters are updated so that they center around the most probable values for each pixel position in the image sequence. Consider footage from a security camera of a pedestrian crossing a street for example. The most probable values for each pixel position in the image sequence will then represent background. If the value of a pixel suddenly deviates from its expected value, it is likely that this pixel suddenly belongs to the pedestrian instead of the street. Pixels that exceed a certain threshold of deviation from their expected value will be considered foreground. This way, images can be segmented into background and foreground.

2.2.2

Semantic segmentation

Semantic segmentation is one of the holy grails in computer vision, as stated by Zhou et al. [26]. The task in semantic segmentation is to cluster parts of images together which belong to the same object class [23]. This type of functionality has many applications such as detecting road signs, tumors and medical instruments in operations.

Semantic segmentation networks

One way of performing semantic segmentation is by applying deep CNNs trained to do just that [23]. Unlike image classification, as shown in the example presented in figure 2.4 where the output is a label of numbers zero through nine, semantic segmentation requires a classification for each pixel in an image. To handle the fact that the output thus must be of the same size as the input, semantic segmentation networks are commonly split into two parts: an encoder and a decoder [26]. The encoder extracts features as described in section 2.1.2, resulting in smaller images. After encoding, the images are fed through a decoder that uses upsampling layers in order to blow up the images to their original size. The process is illustrated in figure 2.5.

Figure 2.5: Example of a semantic segmentation network. This is not a complete map of a working architecture, but a conceptual example to highlight the encoding and decoding parts of the network. Illustration inspired by figures in the thesis report by Estgren [7].


One such network has been developed by the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT and uses, among other architectures, an encoding architecture called High resolution Network Version 2 (HrNetV2) [24, 26].

Pretrained networks

When training networks to perform tasks such as semantic segmentation, the encoder may be trained to recognize for example edges between different regions in images. This is done by amplifying the weights toward filters that highlight edges. Tasks such as edge detection, and other significant features that may be of help for semantic segmentation, may not necessarily be highly dependent on the data it is applied on. If an encoding architecture has been trained to segment an image into regions with good results, the encoder may provide promising results on other unknown data as well.

If a network is to be trained to segment images, the values of the weights will have to be initialized at some value. Training will then be required to amplify the weights toward filters that find features which are good for segmenting images. It may save significant training time to initialize the weights with values from a network that has already been trained to find such features. Using pretrained weights is a common way of saving training time and many network implementations are built upon backbone networks that are pretrained.

HrNetV2 has been trained on a dataset called the ADE20K dataset with 25000 images of varying indoor and outdoor scenes which contain a total of 150 classes [26]. The images average 19.5 instances (different objects/regions) and 10.5 object classes per image [26].

A pretrained HrNetV2 encoder and a C1 decoder were used in this thesis.

For more information on these implementations and architectures, see papers by Wang et al. [24] and Zhou et al. [26].

2.3

Model evaluation

The performance of a neural network can be evaluated in several ways. In this section, a few evaluation methods are defined. It is also good to keep in mind that visual inspection of segmentation results yields useful information as well.

A common way to evaluate any machine learning classifier is to use a confusion matrix. A confusion matrix for a binary classification problem, such as ”does this picture portray a dog?”, is displayed in figure 2.6a. A correctly classified dog will be sorted as a true positive (TP). An image that is predicted to portray a dog but doesn’t will be sorted as a false positive (FP). An image that is predicted to not portray a dog but does will be sorted as a false negative (FN). Images that do not portray dogs and are predicted to not portray dogs will be sorted as true negatives (TN) [21].

The structure of the confusion matrix is similar in the multi-class case, see figure 2.6b. However, the true and false positives and negatives must be calculated for each class. Consider a classifier with the task of discerning whether or not an image displays a dog (class 1), fish (class 2) or bird (class 3). The true positives for the dog class will still be the images that were correctly classified as dogs. However, the false positives will be the images that were classified as dogs but actually portray either a fish or a bird. The false negatives will be images of dogs that were classified as fishes or birds. The true negatives will be all of the images of fishes and birds that were not predicted to portray dogs. The TP, FP, FN and TN measures for class one in a three-class confusion matrix are illustrated in figure 2.6c [21].

(a) Confusion matrix for a binary classifier. (b) Multi-class confusion matrix for three classes. (c) TP, FP, FN and TN for class 1 in a 3-class confusion matrix.

Figure 2.6: Illustration of confusion matrices.

Arguably the most common metric of classification performance is accuracy. The accuracy is simply all of the correctly classified samples divided by all of the samples,

accuracy = \frac{TP + TN}{TP + TN + FP + FN}.    (2.7)

In the multi-class case, the overall accuracy does not have to be calculated for every class. In figure 2.6b for example, the sum of the green diagonal divided by the sum of all samples will yield the accuracy.

Two other commonly used evaluation metrics are precision and recall. Precision yields the proportion of how many positive predictions are truly positive, while recall gives the proportion of how many actual positives are correctly classified. Unlike accuracy, precision and recall have to be calculated individually for each class in the multi-class case. The formulae for precision and recall are [21]:

precision = \frac{TP}{TP + FP},    (2.8)

recall = \frac{TP}{TP + FN}.    (2.9)

While comparing two models, there is a risk that one model may have a better precision score and the other model may have a better recall score. In order to compare their relative performance, a single metric is useful. The F1-score [21], also referred to as Dice-score or the Dice-coefficient, is a way to combine precision and recall into a single number by calculating their harmonic mean. The formula for the F1-score can be expressed as:

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.    (2.10)

Yet another common evaluation metric when detecting objects in images is the IoU (Intersection over Union). If objects are to be detected by drawing boxes around them, the IoU is calculated by dividing the area of the intersection of the boxes by the area of their union, as portrayed in figure 2.7. The same theory applies for semantic segmentation tasks, where each correctly classified pixel contributes to the intersection between prediction and target.
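All of the metrics above follow from the TP, FP, FN and TN counts. The sketch below (an illustration, not the evaluation code used in the thesis) computes them per class from a predicted and a target label image, using IoU = TP / (TP + FP + FN) for the pixelwise case.

import numpy as np

def per_class_metrics(pred, target, num_classes):
    # Compute precision, recall, F1 and IoU per class, plus overall accuracy,
    # from pixelwise class labels.
    results = {}
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        precision = tp / (tp + fp) if tp + fp else 0.0        # equation (2.8)
        recall = tp / (tp + fn) if tp + fn else 0.0           # equation (2.9)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)                 # equation (2.10)
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        results[c] = {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
    results["accuracy"] = float(np.mean(pred == target))      # equation (2.7)
    return results

pred = np.array([[0, 1, 1], [2, 2, 0]])
target = np.array([[0, 1, 2], [2, 2, 0]])
print(per_class_metrics(pred, target, num_classes=3))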

Figure 2.7: Illustrational example of how to calculate intersection over union.

3

Method

The following chapter describes the collection of data, image preprocessing and neural network implementations as well as how the results were evaluated.

3.1

Data collection

Images were captured in a lab at Combitech using a Dalsa Genie Nano camera. Objects were placed on a rubber mat under the lighting of two industrial floodlights. Each training image was captured in a steady scene and contains either only ferrous material or only non-ferrous material. The ferrous material consisted only of iron and iron-like objects and the non-ferrous material consisted only of copper and copper-like objects, with the exception of a small data set of cables.

Originally, images were taken of four separate objects at a time in order to apply the preprocessing described in section 3.2. This preprocessing consisted of a few different steps. One of those steps was to segment the images as described in section 3.2.2. Another step was to remove noise that the segmented images contained, as described in section 3.2.3. This was done by extracting the four largest connected components in the segmented image, as images were captured of four objects at a time, and setting the remaining pixels to zero. If this noise removal were not applied, the constraint of having to place a fixed number of objects in the scene would be removed. This means that images may be taken of as many objects as possible, but these images will contain noise from the segmentation process.

Another 30 images were captured that contained various amounts of both iron and copper with various degrees of clutter. Lastly, 30 images of as many mixed object types as possible were captured.

Table 3.1 describes the different data sets that were collected. The table also describes whether or not the data sets were annotated. This annotation was done according to the steps described in section 3.2, with the exception of one data set that was manually annotated. In this annotation process, the noise removal technique mentioned above was applied. This technique was not applied to all data sets however, which is also presented in table 3.1. Data sets that did not employ noise removal were introduced to create more advanced training images of cluttered objects. Example images are presented in figure 3.1.

Table 3.1: The different sets of data that were collected.

Data set Object type Images Noise Removal Annotations

Set 1 Copper 139 Yes Yes

Set 2 Copper 100 Yes Yes

Set 3 Copper 59 Yes Yes

Set 4 Copper 133 No Yes

Set 5 Iron 139 Yes Yes

Set 6 Iron 164 Yes Yes

Set 7 Iron 180 Yes Yes

Set 8 Iron 216 No Yes

Set 9 Cables 15 Yes Yes

Set 10 Copper & Iron 30 No Yes (manual)

Set 11 Mixed 30 No No

(a) Example of an image with iron components. (b) Example of an image with copper components. (c) Example of an image with cable components. (d) Example of an image with iron and copper components.

Figure 3.1: Example images from the collected data sets.


3.2

Image preprocessing

In order to feed the images to the neural network, training and annotation images had to be created and processed.

3.2.1

White balance correction

The raw camera data contained distortions in the form of shifts in the RGB-channels. Something that is supposed to be completely gray should have the same value in all RGB-channels in the image. Nevertheless, the camera misinterpreted the colors so that the resulting image did not represent the color gray with the same value in all RGB-channels. To counteract this, a small patch known to be completely gray was placed in the upper left corner of the scene. By adjusting the RGB-channels for the entire image according to the RGB-deviations in the gray patch, these distortions were removed.
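A minimal sketch of the gray-patch correction described above; the patch location and image size are illustrative assumptions, not the values used in the thesis.

import numpy as np

def white_balance(image, patch_slice=(slice(0, 20), slice(0, 20))):
    # Scale each RGB channel so that a patch known to be completely gray
    # gets the same mean value in all channels.
    patch = image[patch_slice]                      # pixels of the gray patch
    channel_means = patch.reshape(-1, 3).mean(axis=0)
    gain = channel_means.mean() / channel_means     # per-channel correction factor
    return np.clip(image * gain, 0.0, 255.0)

raw = np.random.default_rng(2).uniform(0, 255, size=(480, 640, 3))  # placeholder image
corrected = white_balance(raw)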

3.2.2

Background modelling

In order to create annotation images for the neural network, information is needed to discern what is background, what is ferrous and what is non-ferrous. This was solved by capturing images of either only ferrous or only non-ferrous material on a rubber mat and then applying a mixture of Gaussians background model to these images. The mixture of Gaussians background model iterated through each image in the set a number of times and estimated a number of Gaussian distributions centered around the most probable values for each pixel [22]. If the value of a certain pixel deviated enough from the most probable values, it was classified as foreground [22]. This way the rubber mat was classified as background and the objects placed on the rubber mat were classified as foreground. To make this work as well as possible, objects were placed in as many different places as possible in the scene and the scene was kept as still as possible. To make it work even better, images of empty scenes were captured at regular intervals. The mixture components were estimated from all of the images. It is theoretically sound to estimate the mixture components from only the empty images. However, the scene may be moved slightly when placing objects on the rubber mat. Using the captured images with empty images in between can create invariance to small changes in the scene.

Most of the images contained mostly background and it would not be of very much use to have images of only background in the data set. All empty images were removed by checking if the largest connected foreground components were smaller than a certain threshold. If the largest components were too small to be considered objects, the image was classified as empty and removed.

The background modelling functionality in OpenCV is the one that was implemented [4]. Examples of an image and its background modelled counterpart are presented in figure 3.2. It is evident in the figure that shadows were detectable.
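The thesis states that OpenCV's background modelling functionality was used [4]; a sketch along those lines with cv2.createBackgroundSubtractorMOG2 is given below. The parameter values and folder name are illustrative assumptions, not the settings used in the thesis. With shadow detection enabled, shadows are given a separate gray label in the output mask, matching the gray regions in figure 3.2b.

import glob
import cv2

# Adaptive mixture-of-Gaussians background model with shadow detection.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=32,
                                                detectShadows=True)

for path in sorted(glob.glob("captured_images/*.png")):   # hypothetical image folder
    frame = cv2.imread(path)
    mask = subtractor.apply(frame)   # 0 = background, 127 = shadow, 255 = foreground
    foreground = (mask == 255).astype("uint8") * 255
    cv2.imwrite(path.replace(".png", "_mask.png"), foreground)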


(a) Example of an image before background modelling. (b) Example of the same image segmented by the background model. Black represents background, white represents objects and gray represents shadows.

Figure 3.2: Example of the background model applied on an image.

3.2.3

Noise removal

Small and subtle changes in the scene and the lighting can cause the pixel values of the background to change, resulting in noise in the raw background model images. Certain morphological operations that close holes and bridge gaps were applied to the raw background model images, after which the four objects could be extracted as the four largest connected components. This resulted in a binary mask of the areas in which the objects were found. Performing element-wise multiplication of this binary mask with the raw background model image yielded a resulting image displaying the raw background model image without noise outside of the four areas in the binary mask. This process is illustrated in figure 3.3. Note that this technique was only applicable on images with a prespecified amount of objects in them.

Figure 3.3: Visual representation of the noise removal process. The leftmost image portrays a raw background model image with noise. The middle image portrays the binary mask. The right image portrays the resulting image.
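A sketch of the noise removal step, assuming OpenCV: close holes and bridge gaps morphologically, keep the four largest connected components, and zero out everything else. The kernel size and the specific morphological operation are illustrative assumptions.

import cv2
import numpy as np

def keep_largest_components(raw_mask, num_objects=4):
    # Return raw_mask with everything outside the num_objects largest blobs removed.
    kernel = np.ones((5, 5), np.uint8)
    closed = cv2.morphologyEx(raw_mask, cv2.MORPH_CLOSE, kernel)   # close holes, bridge gaps
    binary = (closed > 0).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    # Sort component labels by area, skipping label 0 (the background).
    order = np.argsort(stats[1:, cv2.CC_STAT_AREA])[::-1] + 1
    keep = np.isin(labels, order[:num_objects])
    return raw_mask * keep.astype(raw_mask.dtype)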


3.2.4

Downsampling

In a segmentation network, different areas are to be detected and classified. This area detection does not necessarily require pixelwise accuracy. The fact that such accuracy is not needed, paired with the fact that the captured images were quite large, means that a lot of computation and storage space is saved by downsampling the images and training on smaller images. Hence, images were downsampled by a factor of four in both directions.

3.3

Merged dataset

This section presents how a dataset composed of merged images was created. The aim of this process was to create data that compares better with the actual data at Stena Recycling. This was achieved by extracting components from one image and pasting them onto another image.

This image pasting function was implemented with help of the noise removal functionality described in section 3.2.3. The resulting images from the noise removal process contain zero noise outside of the areas in which the objects are found. These images also contain raw background model data inside these areas. This means that a segmentation of the objects has been done in these local areas, resulting in a segmentation mask which likely portrays the shapes of the objects relatively well. This was also used as the annotation image for the neural network.

If the objects of an image, image2, are to be pasted onto another image, image1, the above mentioned segmentation mask is useful. The segmentation mask of image2 can be multiplied with its original RGB-image (image2) to extract the image content in only the relevant areas. It can also be inverted and then multiplied with the original RGB-image (image1) to remove image content in the relevant areas in image1. Adding these images together results in addition of the extracted components from image2 onto removed content in image1. The components from image2 have been pasted onto image1. This is described in algorithm 2.

As mentioned, the goal was to create data that resembles actual data from the recycling plant. This was achieved by applying the above method to three or five images at a time, iterating through all of the data in data sets 1 through 8, as described in table 3.1. Since this was done on data sets 4 and 8, which did not contain noise removal, some of the resulting images have noise pasted on top of other images. As seen in figure 1.2, this data is mostly iron with some copper on top.


To best recreate the scenario at the recycling plant, an algorithm was implemented to iterate through the copper and iron data sets and create a data set of images with mixed object types. A random choice between one and two was made that represents how many copper images were to be used for one output image. One more iron image than copper image was used for each output image. First, the components of one iron image were pasted onto another iron image. Then, the components of a copper image were pasted onto the already merged image. If the random choice gave an output of one, the image is finished. If the random choice gave an output of two, the components of another iron image and then another copper image were pasted onto the already merged image. This results in a data set of images with mixed object types in various degrees of clutter. These steps are outlined in algorithm 1. An example of an image produced by algorithms 1 and 2 is shown in figure 3.4.

Algorithm 1: Create merged dataset

Result: Data set of merged images
Input: Sets of copper images, sets of iron images
while images left in sets do
    number_of_copper_images = random_choice(1, 2)
    number_of_iron_images = number_of_copper_images + 1
    Add number_of_copper_images copper images to list
    Add number_of_iron_images iron images to list
    Stitch iron image 1 onto iron image 2
    for remaining images in lists do
        if index even then
            Stitch next copper image onto earlier image
        else
            Stitch next iron image onto earlier image
        end
    end
    Save resulting image
    Remove used images from sets
end

Algorithm 2: Image merging

Result: Components of image 2 stitched onto image 1
Input: img1, annot1, img2, annot2
for images do
    img_out = img1 ◦ (annot2 == 0) + img2 ◦ annot2
end
for annotations do
    annot_out = annot1 ◦ (annot2 == 0) + annot2
end
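Algorithm 2 maps directly to element-wise array operations (the ◦ symbol denotes element-wise multiplication). A NumPy sketch of the merging step is given below; broadcasting the single-channel annotation mask over the three RGB channels is an implementation detail assumed here.

import numpy as np

def merge_images(img1, annot1, img2, annot2):
    # Paste the objects of img2 (where annot2 != 0) on top of img1 (Algorithm 2).
    keep1 = (annot2 == 0)                                        # pixels where image 1 remains
    img_out = img1 * keep1[..., None] + img2 * (~keep1)[..., None]
    annot_out = annot1 * keep1 + annot2
    return img_out, annot_out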


(a) Example of an image with components stitched on top of each other. (b) Example of an annotation image with components stitched on top of each other. White represents copper, light gray represents iron and dark gray represents background.

Figure 3.4: Illustration of an image with merged data. The image contains background, iron and copper.

3.4

Data split

From data sets one through nine, as described in table 3.1, a total of 17 images were chosen from all data sets to be used as validation data. Another 13 images from data set ten were used as validation data. The remaining 17 images from data set ten were used as test data. The remainder of the images from data sets one through nine were used as training data. The images from data set eleven were used to test the prediction confidences, as described in section 3.7, for different types of objects. The total amount of data is presented in table 3.2. The total amount of images is smaller than the amount of images in table 3.1, which is due to the fact that the empty images have been removed.

3.5

Network implementations

The neural network was implemented following the work done by CSAIL at MIT and the paper by Zhou et al. [26]. The implemented network uses a version of HrNetV2 as encoder and a C1 decoder [26]. The framework developed by CSAIL required multiple GPUs to run, which was not available with the equipment at hand. Hence, a training function tailored to the available equipment was developed. Pretrained weights were used for the encoder and only the decoder was trained.

The network was trained by minimizing the NLLloss by stochastic gradient descent, see section 2.1.1.


Table 3.2: The data that was used to train, validate and test the neural networks.

Data set Object type Images

Training Iron 549

Training Copper 432

Training Cables 6

Training (Merged data) Iron and copper 229

Validation Iron 9

Validation Copper 8

Validation Iron and copper 13

Validation Cables 1

Test Iron and copper 17

Test Cables 1

Data augmentation was included in the implementation. When loading images with data augmentation activated there is a probability that they may be flipped, rotated, cropped or have their brightness, contrast and RGB-values shifted. A library called Albumentations was used to implement the data augmentation, with the use of functions such as RGBShift, RandomBrightnessContrast, RandomCrop, Rotate and Flip [5].
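A sketch of an augmentation pipeline built from the Albumentations functions named above; the probabilities and crop size are illustrative assumptions, and the annotation mask is passed alongside the image so that both receive the same geometric transform.

import numpy as np
import albumentations as A

augment = A.Compose([
    A.RGBShift(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.RandomCrop(height=512, width=512, p=1.0),
    A.Rotate(limit=45, p=0.5),
    A.Flip(p=0.5),
])

image = np.zeros((1024, 1224, 3), dtype=np.uint8)    # placeholder training image
mask = np.zeros((1024, 1224), dtype=np.uint8)        # placeholder annotation image
augmented = augment(image=image, mask=mask)           # mask follows the image transform
image_aug, mask_aug = augmented["image"], augmented["mask"]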

A total of seven different models were trained, with specifications presented in table 3.3. The data sets used for training are numbered according to table 3.1, and the merged data set is denoted as M. The first five models were trained on background, iron and copper. The sixth model was trained only on background and iron. The seventh network was trained on all available data to distinguish between background, iron, copper and cables. Not all validation and test data is applicable to each model. The copper validation images, for example, were not applicable to the model that was trained on background and iron. All applicable validation and test data was used to test and validate each model.

3.6

Network evaluation

Evaluation of the neural networks was made in three steps: training, validation and testing.

The networks were trained over several training epochs. For each epoch, the network model was saved. During training, the aim is to minimize the loss function. In the implementation used in this thesis, loss and accuracy were saved for each training epoch.

After training, the models were run on the validation data set. Loss, accuracy and IoU were saved for each model. The best model was chosen by calculating the arithmetic mean of the accuracy and the IoU. Note that validation may also be run on each model in parallel with the training.


Table 3.3: The different network models that were trained.

Model                      Epochs   Iterations per epoch   Data sets used         Data augmentation
1 (Small)                  50       80                     1,2,3,5,6,7            No
2 (Medium)                 50       200                    1,2,3,5,6,7            Yes
3 (Big)                    50       2000                   1,2,3,4,5,6,7,8        Yes
4 (Big with merged data)   50       2000                   1,2,3,4,5,6,7,8,M      Yes
5 (Only merged data)       50       500                    M                      Yes
6 (Only iron)              50       500                    4,5,6,7                Yes
7 (Big with cables)        50       2000                   1,2,3,4,5,6,7,8,9,M    Yes

In order to acquire unbiased results of the performance of the best model, it was run on the test set. To compare test results for different models further, their F1-scores were calculated.

3.7

Network certainty scores

In the network classifier the output went through a softmax layer, see section 2.1.1. This output came in the shape of a probability distribution for a certain pixel belonging to a certain class. If, for example, a network was to classify a pixel into one of three classes the softmax layer may output a probability of 0.8 for the pixel belonging to class one, 0.15 for the pixel belonging to class two and 0.05 for the pixel belonging to class three. The probabilities will sum up to one.

In order to find out the total prediction confidence for each pixel, the sum of squares of the probabilities may be used. Maximum certainty for, for example, a three-class classifier will then be

1^2 + 0^2 + 0^2 = 1,    (3.1)

whereas the lowest certainty will be

\frac{1^2}{3^2} + \frac{1^2}{3^2} + \frac{1^2}{3^2} = \frac{3}{9} = \frac{1}{3}.    (3.2)

Plotting these values will yield an image where high values represent classifi-cations that have been made with high certainty and vice versa.
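The confidence measure in (3.1) and (3.2) is simply the sum of squared class probabilities per pixel; a sketch assuming the softmax output is stored as an array of shape (num_classes, height, width):

import numpy as np

def certainty_map(probabilities):
    # Sum of squared softmax probabilities per pixel. The input has shape
    # (num_classes, height, width) and sums to 1 over the class axis. The result
    # lies between 1/num_classes (maximal uncertainty) and 1 (full certainty).
    return np.sum(probabilities ** 2, axis=0)

probs = np.full((3, 2, 2), 1.0 / 3.0)   # maximally uncertain three-class output
print(certainty_map(probs))             # every pixel equals 1/3, as in (3.2)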


3.8

Manual annotation

The manual annotation was done with the VGG Image Annotator [2] tool by manually drawing polygons around the objects. An example is shown in figure 3.5.


4

Results

This chapter presents the results acquired from the methods described above in order to help answer the problem statements presented in the introduction. A few comments are made on the results as well, in order to highlight information that is covered in the discussion. The evaluation and results of the automatic annotation are presented. The results of five network models with different data and training processes are presented and compared. Results of experiments with adding more classes and plotting prediction confidence are also presented.

4.1

Automatic annotation

In order to evaluate the automatic annotation process, a set of ten images were manually annotated. The intersection over union between the automatic and manual annotation images was calculated, as described in section 2.3 and figure 2.7. This resulted in a mean IoU value of 0.8143 for all of the classes. Visual results of manual and automatic annotation are displayed in figure 4.1. Images that highlight where the annotations differ are presented in figure 4.2.


(a) Original iron image. (b) Manually annotated iron image. (c) Automatically annotated iron image. (d) Original copper image. (e) Manually annotated copper image. (f) Automatically annotated copper image.

Figure 4.1: Comparisons between manual and automatic annotation.

(a) Iron difference image. (b) Copper difference image.

Figure 4.2: Difference images that highlight where the annotation methods differ. Red pixels represent areas where the annotation methods produced different values.


4.2

Neural Network evaluation

In the following sections, the evaluation and results of the network models are presented. The networks are numbered according to table 3.3. The best model for each network was chosen by calculating the mean IoU and the accuracy.

Examples of resulting inference images run on the test set (the iron and copper test set in table 3.2, taken from set 10 in table 3.1) are presented as well for each network model. The inference images are split into three parts:

1. Left: Original image.

2. Middle: Annotation image (i.e. the ”ground truth”).

3. Right: Inference results. The output of the model applied on the original (left) image.

In the middle and right images objects are color coded: gray represents background, red represents ferrous material and blue represents non-ferrous material (in these cases copper).


4.2.1 Model 1

Model 1 was trained on a small amount of data over the smallest number of training iterations. It was the only model that did not employ data augmentation.

Quantitative evaluation

The evaluation results are presented in figure 4.3. Note that the validation IoU began to decrease after a certain number of training epochs, as seen in figure 4.3b. This was a result of overfitting. The best model was found after 23 epochs.

(a) NLL loss for network 1.
(b) Accuracy and IoU for network 1.

Figure 4.3: Training and validation plots for network 1.

Qualitative evaluation

Examples of inference results are presented in figure 4.4. It is evident that the copper pieces have been found and distinguished from the iron pieces. However, the contours and shapes of objects were not clearly discerned, and there were some significant misclassifications, as seen in figure 4.4b.


(a) Easier inference test example for network 1.
(b) Harder inference test example for network 1.

Figure 4.4: Inference test examples for network 1.


4.2.2 Model 2

Model 2 was trained on a small amount of data (same as model 1) over more training iterations than model 1. Model 2 and subsequent models employed data augmentation.

Quantitative evaluation

The evaluation results are presented in figure 4.5. The best model was found after 13 epochs. The addition of data augmentation resulted in a model that was not as prone to overfitting as model 1. Note also that the data augmentation introduced more noise in the training loss compared to model 1, as seen in figures 4.3a and 4.5a.

(a) NLL loss for network 2.
(b) Accuracy and IoU for network 2.

Figure 4.5: Training and validation plots for network 2.

Qualitative evaluation

Examples of inference results are presented in figure 4.6. The copper and iron parts have been discerned from each other, with some misclassifications. The contours and shapes of piled objects were still not clearly resolved.


(a) Easier inference test example for network 2.
(b) Harder inference test example for network 2.

Figure 4.6: Inference test examples for network 2.


4.2.3 Model 3

Model 3 was trained on all unmixed iron and copper data. Unlike models 1 and 2, this model included data sets 4 and 8 in table 3.1, which contained piled objects and some noise.

Quantitative evaluation

The evaluation results are presented in figure 4.7. The best model was found after three training epochs. The model was trained over 50 training epochs with 2000 iterations per epoch. The fact that the best model was reached after 3 out of 50 training epochs showed that, for this model, large amounts of training were not a critical factor. The validation IoU and accuracy also began at higher values compared to the earlier models. This was because the first model was evaluated after 2000 iterations, compared to 80 for model 1 and 200 for model 2.

(a) NLL loss for network 3.
(b) Accuracy and IoU for network 3.

Figure 4.7: Training and validation plots for network 3.

Qualitative evaluation

Examples of inference results are presented in figure 4.8. Iron and copper parts have been correctly identified for the most part. Contours and shapes were more distinguishable than for the earlier models. However, the inference images were messy and the network seemed to force regions apart where they should be connected.


(a) Easier inference test example for network 3.
(b) Harder inference test example for network 3.

Figure 4.8: Inference test examples for network 3.


4.2.4 Model 4

Model 4 was trained on all copper and iron data, including the mixed merged data set.

Quantitative evaluation

The evaluation results are presented in figure 4.9. The best model was reached after 25 training epochs. The IoU was significantly higher than for earlier models as seen in figure 4.9b. This model was also trained over 50 training epochs with 2000 iterations per epoch. The higher variance in the data resulted in a network where the optimal model was reached after 25 out of 50 total training epochs. It is noteworthy that the validation accuracy and IoU reached values similar to those of the best epoch after only a few epochs.

(a) NLL loss for network 4.
(b) Accuracy and IoU for network 4.

Figure 4.9: Training and validation plots for network 4.

Qualitative evaluation

Examples of inference results are presented in figure 4.10. Copper and iron parts have been correctly classified for the most part. The shapes and contours of the piles and objects were significantly improved in comparison to the earlier models, as seen in figure 4.10b.


(a) Easier inference test example for network 4.
(b) Harder inference test example for network 4.

Figure 4.10: Inference test examples for network 4.


4.2.5 Model 5

Model 5 was trained only on the merged data set.

Quantitative evaluation

The evaluation results are presented in figure 4.11. As mentioned in section 3.3, some of the images from the merged data set contained noise. In this case, background may have been correctly classified as background but mislabeled as iron or copper in the training data. The images in the validation set were mostly from the automatically annotated data set where noise removal was applied, or manually annotated. This caused the validation loss to be mostly lower than the training loss, as seen in figure 4.11a. The validation IoU and accuracy were very similar to those of model 4.

(a) NLL loss for network 5.
(b) Accuracy and IoU for network 5.

Figure 4.11: Training and validation plots for network 5.

Qualitative evaluation

Examples of inference results are presented in figure 4.12. Iron and copper parts have been identified in more detail than for model 4, including smaller copper parts, as seen in figure 4.12b. Whether the shapes and contours were better or worse than for model 4 is debatable.


(a) Easier inference test example for network 5.
(b) Harder inference test example for network 5.

Figure 4.12: Inference test examples for network 5.


4.2.6 Test results

The resulting metrics of the best model for each network are presented below in table 4.1. These metrics were acquired by running the models on the iron and copper test set, as presented in table 3.2. It is evident that models 4 and 5 performed the best.

Table 4.1: Results of the best model for each network applied on the test set.

Network   Epoch   Accuracy   Mean IoU   NLL loss   F1-score
   1        23     95.54%     0.7963    0.09168     0.8767
   2        13     95.76%     0.8028    0.07338     0.8841
   3         3     96.03%     0.7938    0.06514     0.8751
   4        25     96.70%     0.8415    0.06110     0.9116
   5        42     96.71%     0.8423    0.06005     0.9129
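As a reference for how the F1-scores in the table can be obtained from label images, a sketch of a macro-averaged (per-class) F1 computation is given below; the exact averaging used in the thesis may differ, so this should be read as an illustration rather than the actual evaluation code.

```python
import numpy as np

def macro_f1(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Macro-averaged F1-score over all classes for integer label images."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        if 2 * tp + fp + fn == 0:
            continue  # class absent from both images; skip it
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))
```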

Evaluation metrics for each of the optimal network models are presented in figure 4.13.

(a) Accuracies for models one through five.
(b) Mean IoU for models one through five.
(c) NLL loss measures for models one through five.
(d) F1-scores for models one through five.

Figure 4.13: Evaluation metrics for models one through five. Note the Y-axis


4.3 Confidence maps

Examples and results of the experiments with prediction confidence are presented below.

4.3.1 Binary classifier

Model 6 from table 3.3 was trained to only discern iron from background. An example of a confidence map for predictions made by model 6 is presented in figure 4.14. The image on which it was applied contained one piece of iron, one piece of copper wire, one heap of cables and one PCB. The figure is split into four parts, from the left:

1. Original image.
2. Prediction confidence for class 1 (background).
3. Prediction confidence for class 2 (iron).
4. Sum of squared prediction probabilities (total prediction confidence).

White represents high confidence, black represents low confidence.

Figure 4.14: Confidence scores for a model trained to only discern iron from background.

All of the objects have been predicted to be "iron" with relatively high confidence.

(54)

4.3.2 Multi-class classifier

An example of the same confidence map for model 5, which was trained to distinguish between background, iron and copper, is presented in figure 4.15. The image on which the model was applied is the same as in figure 4.14. The figure is split into five parts, from the left:

1. Original image.
2. Prediction confidence for class 1 (background).
3. Prediction confidence for class 2 (iron).
4. Prediction confidence for class 3 (copper).
5. Sum of squared prediction probabilities (total prediction confidence).

White represents high confidence, black represents low confidence.

Figure 4.15: Confidence scores for a model trained to distinguish between background, iron and copper.

The iron and copper pieces have been classified with high certainty. The PCB and the heap of cables have been classified with lower certainty as seen in the rightmost part of figure 4.15.
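This suggests that objects of classes unknown to the classifier could be flagged by thresholding the total prediction confidence. A minimal sketch of such a rule is shown below; the threshold value is purely illustrative and not taken from the thesis.

```python
import numpy as np

def flag_unknown(softmax_probs: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Boolean mask marking pixels whose total prediction confidence
    (sum of squared class probabilities) falls below a threshold."""
    confidence = np.sum(softmax_probs ** 2, axis=0)
    return confidence < threshold
```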

4.4 Addition of classes

A final model (model 7) was trained on data including cables as well as background, iron and copper. Resulting inference examples are presented in figure 4.16. A confidence map is also presented in figure 4.17. Just as in section 4.3, the original image is to the left, the total prediction confidence is to the right and the individual prediction scores for each class are presented in the middle images.


(a) Inference of model 7 on an easier image.
(b) Inference of model 7 on a harder image.
(c) Inference of model 7 on a cluttered image.

Figure 4.16: Inference examples for model 7. Original images are to the left,


(a) Original image and confidence scores for classes one and two.
(b) Confidence scores for classes three and four as well as total prediction confidence.

Figure 4.17: Confidence scores for a model trained to distinguish between background, iron, copper and cables.

4.5 Testing a model on real data

Examples of running inference with model 5 on actual footage from the assembly line at Stena Recycling are presented in figure 4.18. The copper and iron parts have been identified, with some errors in the shapes and contours.


(a) Inference example one.
(b) Inference example two.

Figure 4.18: Inference examples for model 5 applied on actual footage from Stena Recycling. Original images are to the left, inference results are to the right. There is no annotated ground truth.


References
