Semantic Segmentation of Iron Ore Pellets with Neural Networks

Terese Svensson

Space Engineering, master's level

2019

Luleå University of Technology


Department of Computer Science, Electrical and Space Engineering

MASTER'S THESIS

Semantic Segmentation of Iron Ore Pellets

with Neural Networks

Terese Svensson

Examiner: Dr. Anita Enmark, Luleå University of Technology

External Supervisor: Dr. Martin Simonsson, Data Ductus

External Deputy Supervisor: Elin Åström, LKAB


LULEÅ UNIVERSITY OF TECHNOLOGY

Abstract

Department of Computer Science, Electrical and Space Engineering
Master of Science in Engineering in Space Engineering

Semantic Segmentation of Iron Ore Pellets with Neural Networks

by Terese Svensson


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to Data Ductus in Luleå for your warm welcome, support and for hosting me at your office during these months of thesis work. I want to give a special thanks to my supervisor Dr. Martin Simonsson for all of your guidance with this thesis, and for running the script on your computer almost daily. Thank you to my deputy supervisor Elin Åström and LKAB for supporting me with the image dataset, your knowledge and for answering all my questions regarding your work.

I thank Dr. Anita Enmark, my examiner at Luleå University of Technology.

I also wish to thank my family for their endless encouragement during my university studies, regardless of where in the world I have been located.


CONTENTS

1 Introduction
1.1 Motivation
1.2 Thesis Aim
1.3 Objectives
1.4 Delimitations
1.5 Thesis Outline

2 Theory
2.1 Semantic Segmentation
2.2 Neural Networks
2.3 Convolutional Neural Network
2.4 Network Training
2.5 Evaluating a Network
2.6 Iron Ore Pellets

3 Material & Methods
3.1 Software
3.2 Hardware
3.3 Dataset
3.4 Evaluation Metrics
3.5 Semantic Segmentation Models
3.6 Experiments

4 Results
4.1 Results on Public Datasets
4.2 Test Run
4.3 Data Size and Augmentation Impact
4.4 Parameter Tuning
4.5 The Models' Performance

5 Discussion
5.1 Dataset Size and Augmentation Impact
5.2 Comparing to Other Results of Same Models
5.3 Evaluation of the Evaluation Method
5.4 Evaluation of the Pellet Composition
5.5 The Potential for CNNs for Iron Ore Pellet Classification
5.6 Difficulties
5.7 Validity
5.8 Improvements

6 Conclusions
6.1 Best Model
6.2 Future Work

References

A Appendix for Dataset Size Impact on Performance
B Appendix for PSPNet's Performance


LIST OF FIGURES

2.1 Four different CV approaches.
2.2 A multilayer ANN.
2.3 A single neuron's composition.
2.4 A typical CNN architecture.
2.5 Machine learning cross-correlation operation.
2.6 Atrous convolutions.
2.7 Max pooling operation.
2.8 U-Net architecture.
2.9 Possible prediction outcomes.
2.10 Microscope image of iron ore pellet.
2.11 The different iron ore pellet constituents.
3.1 Demonstration of annotation with Ilastik.
3.2 Examples of the raw image data set of iron ore pellets.
3.3 RGB value of the classes and their respective colors used in this thesis.
3.4 The total number of pixels of each class.
3.5 PSPNet's architecture.
3.6 FC-DenseNet's architecture.
3.7 DeepLabv3+'s architecture.
3.8 BiSeNet's architecture.
3.9 GCN's architecture.
4.1 Resulting test segmentation with the five models.
4.2 The average accuracy in percentage of each model against dataset size.
4.3 The F1 score in percentage of each model against dataset size.
4.4 The average IOU in percentage of each model against dataset size.
4.5 The training time of each model in minutes against dataset size.
4.6 The mean of the average class accuracy of all models against dataset size.
4.7 The average validation accuracy over epochs for PSPNet.
4.8 The average IOU over epochs for PSPNet.


LIST OF TABLES

4.1 The models' results on public datasets.
4.2 The models' test run results, validation metrics.
4.3 The models' test run results, per class accuracy.
4.4 The models' results with 20% of the dataset.
4.5 The models' per class results with 20% of the dataset.
4.6 The models' results with 40% of the dataset.
4.7 The models' per class results with 40% of the dataset.
4.8 The models' results with 60% of the dataset.
4.9 The models' per class results with 60% of the dataset.
4.10 The models' results with full dataset.
4.11 The models' per class results with full dataset.
4.12 Average results for each dataset size.
4.13 Average per class results for each dataset size.
4.14 The models' results with full dataset and data augmentation.
4.15 The models' per class results with full dataset and data augmentation.
4.16 Selected models' results with varying learning rates.


ACRONYMS

ANN Artificial Neural Network
CNN Convolutional Neural Network
CPU Central Processing Unit
CV Computer Vision
FN False Negative
FP False Positive
FPS Frames Per Second
GCN Global Convolutional Network
GPU Graphical Processing Unit
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IOU Intersection Over Union
LKAB Luossavaara-Kiirunavaara AB
LTU Luleå Tekniska Universitet (Luleå University of Technology)
ReLU Rectified Linear Unit
RGB Red Green Blue
TN True Negative
TP True Positive


1

INTRODUCTION

The analysis of micro structures in iron ore pellets is a complicated process that requires experts and a large amount of time to classify the minerals in each image. Because of this, the analysis usually consists of visual estimations and small sample sizes. The identification process could be improved in terms of speed and accuracy by applying machine learning to this task. Machine learning has been and is used in many fields for visual analysis, with common examples being satellite imagery and medical imaging [11] [9]. The applications for machine learning in the mining industry are many: remote sensing imagery is a vital part of mineral potential mapping in mineral exploration, and visual microscopy analysis can be largely quantified and automated [7] [18].

Machine learning refers to computer programs that learn without being explicitly programmed for the given task. For image analysis, it requires multiple parts: images that are already classified, to be used as ground truth to train the system, and images to test and verify the system [10]. As a machine learning network is given more and more layers, it becomes what is called a deep learning network. Deep learning is built on a foundation of many applied mathematics fields, such as linear algebra and probability theory [10]. Key limiting factors of the effectiveness of machine learning are the computation power available, as well as the number of classified images for training and testing the system [10]. The progression of graphical processing units (GPUs) has therefore accelerated the training and testing of image classifiers, making them faster and able to analyze more data in the same amount of time, as has the development of larger datasets to train with.

In 2012 the annual competition ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was won with a convolutional neural network (CNN) called AlexNet [14], and from that year forward the competition has been dominated by deep CNNs [21]. This demonstrates how deep learning is leading the development of machine learning, and is a very viable choice for applications in image analysis fields. The ILSVRC evaluates image classification and object recognition algorithms, meaning that the algorithm predicts the object visible in an image, with many useful applications.

Another evaluation approach is semantic segmentation, where every pixel in an image is labeled with an object class. Datasets used to evaluate semantic segmentation networks include the Cityscapes [6] dataset from 2016 and the PASCAL Visual Object Classes (VOC) challenge [8] from 2012. A possible application of available deep learning networks is to analyze iron ore pellets by semantic segmentation and from that quantify the constituents of the pellet. This thesis will therefore evaluate the possibility of applying machine learning to analyze the micro structures in iron ore pellets.

1.1 Motivation

Data Ductus works with multiple complex projects in vision technology and uses machine learning to improve production quality and production speed and for preventive maintenance. Additionally, Data Ductus has access to a large amount of microscope images of iron ore pellets through Luossavaara-Kiirunavaara AB (LKAB), combined with enough GPU power to build and train the large networks necessary for this thesis.

Multiple approaches exist for automating the classification process of pellets, mainly studies at Vale's mines in Brazil. Wagner et al. [26] used digital microscopes to both acquire and process images of iron ore pellets. They created porosity maps and measured phase fractions with automated segmentation. Augusto et al. [1] classified hematite in iron ore from optical microscope images, identifying textures and shapes with an automatic analysis procedure. Castellanos et al. [3] used semi-automated processes in the software FiJi to process their microscope images from the same mining company.


As this method was not sufficient, the thesis supervisor Dr. Martin Simonsson at Data Ductus created a Pixel Classifier for LKAB [24]. This classifier uses the open source software Ilastik to train the pixel classifier for the task. Ilastik uses classical classification methods, for example Random Forest. The need for this thesis was identified during the work on the classifier, and the thesis can therefore be seen as the third step for LKAB.

Based on the above summary, there is an opportunity to utilize Data Ductus' expertise in machine learning, today's improvements in the field of semantic segmentation and the absence of previous attempts to use existing high performance CNN models to analyze micro structures of iron ore pellets.

1.2 Thesis Aim

The aim of this thesis is to evaluate existing semantic image segmentation models to design a system to identify micro structures in iron ore pellets.

1.3 Objectives

To reach the above defined aim, the work has been divided into the objectives listed below.

OBJECTIVE 1: Obtain an understanding of micro structures of iron ore pellet phases.

OBJECTIVE 2: Obtain an understanding of machine learning techniques (deep learning, CNNs, semantic segmentation).

OBJECTIVE 3: Set up evaluation criteria.

OBJECTIVE 4: Build, train, test and evaluate five machine learning models.

OBJECTIVE 5: Evaluate the potential of CNNs for image analysis applications in the mining industry.

1.4 Delimitations

The data set is limited to optical microscopy images of iron ore pellets, collected from the same laboratory with the same microscope. Different microscope and environmental settings and iron ore pellet treatments would require a larger amount of test data.

The thesis will not cover creating new machine learning algorithms or parts, as the aim is to evaluate existing models to execute the objectives as formulated in Section 1.3 above.

1.5 Thesis Outline


2

THEORY

2.1 Semantic Segmentation

Computer Vision (CV) is a common deep learning application, with the aim of teaching a computer to identify objects by vision, similar to the human eye [10]. It can be used to classify and evaluate information from images and is essential in many areas, from satellite images to microscopical scales. Semantic segmentation is a CV task which classifies every pixel in an image with a predicted label, without separating any possible instances of the classes; this is in contrast to classifying complete images with specific labels, recognizing objects in images, or localizing objects and then giving their pixels a label [5]. It can handle any number of classes. See Figure 2.1 below for a visual explanation of the usual CV approaches in the machine learning field, with an example of semantic segmentation to the far left.

Figure 2.1: Four different CV approaches. From left to right, semantic segmentation, object classification and localization, object detection and instance segmentation are displayed.

Semantic segmentation is selected because the goal is to identify the constituent of every pixel in the microscope images. Furthermore, there is no need to identify objects in the images, or to separate the instances of the constituents in the images, making the other approaches superfluous.

2.2 Neural Networks

The aim of an Artificial Neural Network (ANN) is to replicate the function of the human brain, building networks of neurons for information processing. It is a subset of machine learning, and consists of multiple connected neurons, which together can create incredibly complex systems. A simple ANN can be seen in Figure 2.2 below, with an input layer, two hidden layers and an output layer.


Figure 2.2: A multilayer ANN, with an input layer, two hidden layers and one output layer.

The hidden layers are the "black box" of the network, and consist of all actions and calculations performed on the input data, with the results given in the output [10]. By adding hidden layers, the net becomes deeper. A single neuron's composition of weights, a transfer function, an activation function with threshold and an output is shown in Figure 2.3 below.

Figure 2.3: A single neuron's composition, with inputs, weights, a transfer function, an activation function with threshold and activation.

The weight of each input is what controls the impact of that input on the system. During the training of the network, the weights are adjusted to fit the provided ground truth. A transfer function relates the output of a neuron to its inputs, and the activation function maps the output from the transfer function to a desired range, for instance between 0 and 1, to be sent to the next node. The output of one neuron is another neuron's input, building the network and layers of the ANN.
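As an illustration, below is a minimal Python sketch of a single neuron's forward pass, assuming a weighted sum as the transfer function and a sigmoid as the activation function; the names and values are illustrative and not taken from the thesis code.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Forward pass of a single artificial neuron.

    The transfer function is a weighted sum of the inputs plus a bias;
    the activation function (here a sigmoid) maps the result to (0, 1).
    """
    z = np.dot(weights, inputs) + bias   # transfer function
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

# Example: three inputs with hand-picked weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
print(neuron(x, w, bias=0.2))            # a value between 0 and 1
```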

2.3 Convolutional Neural Network

A CNN is a type of ANN that is used for processing data, commonly image data, for different purposes. According to Goodfellow et al. [10], a CNN uses convolution instead of general matrix multiplication, as in an ordinary ANN, in at least one layer. Usually, a basic CNN layer consists of a convolution stage, a detector stage and a pooling stage, with subsampling of the feature maps in between layers. Figure 2.4 shows a typical CNN architecture.

Figure 2.4: A typical CNN architecture.

When using CNNs for semantic segmentation tasks, it is crucial to know exactly where in the image a certain feature or pixel is located. The input and output images have to be of equal size, and the CNNs therefore have different design approaches than for other CV tasks.

The functionality of the layers in the CNNs is described in the following sections, with the theory mostly based upon Goodfellow et al.'s [10] work. The concepts introduced in this section are the building blocks used to create the CNN models presented in Section 3.5.

2.3.1 Convolution

Convolution is a mathematical operation on two functions used in machine learning. In general, the convolution of the functions f and g is defined by

$$ s(t) = (f \ast g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau, \tag{1} $$

where the first argument, f, is the input, τ is the age of the measurement, the second argument, g, is the kernel, and the result, s(t), is referred to as the output. The variable t can represent time, though it need not. The convolution can be interpreted as a weighted average of f(τ) at the moment t. When working with discrete applications, the convolution can be defined as

$$ s(t) = (f \ast g)(t) = \sum_{\tau=-\infty}^{\infty} f(\tau)\, g(t - \tau), \tag{2} $$

assuming that t takes only integer values. With the terminology used for convolutional neural networks, the first argument x is the input, the second argument y is the kernel and the output s is the feature map. As images are multi-dimensional arrays, the convolution has to be summed over multiple dimensions. Summing over a 2D image I with a 2D kernel K gives the formula

$$ S(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n), \tag{3} $$

with i, j, m and n being array indices. Convolution is said to be commutative, with the equation

$$ S(i, j) = (K \ast I)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n), \tag{4} $$

being equal to the one above. The latter is frequently used due to it being easier to implement for machine learning. In machine learning, convolution can be seen as matrix multiplications, where the kernel, the filter mask, is moved both horizontally and vertically along the image matrix to produce the output. As the multiplications are only carried out where the full kernel matrix is inside the image matrix, the resulting output is smaller than the input image. Also note that many neural network libraries instead use cross-correlation, which is similar to convolution except that the kernel is not flipped, as seen in the equation

$$ S(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n). \tag{5} $$

The above operation is often called convolution in machine learning library implementations, and that convention will be followed in this thesis. The cross-correlation "convolution" operation is illustrated in Figure 2.5 below.

Figure 2.5: Illustrated example of the cross-correlation operation.
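As an illustration of Equation 5, below is a minimal NumPy sketch of the "valid" cross-correlation, where only the positions where the full kernel fits inside the image are computed; the example values are arbitrary.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' cross-correlation: the kernel is slid over the image
    without flipping, and only positions where the full kernel fits
    are computed, so the output is smaller than the input."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(image, kernel))  # a 4x4 feature map
```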

2.3.2 Convolution Components

There are multiple components of the operation that can be adjusted to give different results, which are introduced below. These components are what is used to build the different CNN models that are introduced and described in Section 3.5.

Kernel Size

The kernel size refers to the width and height of the kernel. It directly impacts how large a region of the input contributes to each output value.

Kernel Coefficients

The coefficients in the kernel, or filter, are what determine the nature of the operation. The coefficients of all kernels are critical to the function of a CNN model. A learning operation will find the appropriate values for each kernel.

Stride

The stride refers to how many steps the kernel moves on the input matrix at a time. A stride of 1 moves the kernel one step to the right for each step, and a stride of 2 moves it two steps to the right. Adjusting the stride affects how large the output matrix is, and how much the input regions contributing to neighboring output values overlap.

Padding

Padding refers to adding extra values, typically zeros, around the border of the input matrix before the convolution is carried out. It allows control of the output size, for instance keeping the output the same size as the input.

Atrous Convolutions

Atrous, or dilated, convolutions refer to the separation of the kernel pixels. A larger dilation value results in a larger separation between the kernel pixels. This is illustrated in Figure 2.6 below.

Figure 2.6: Illustrated example of atrous convolutions.
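The components above together determine the size of the output feature map. The sketch below, based on the standard convolution arithmetic (an illustration, not code from the thesis), computes the output size along one dimension for a given kernel size, stride, padding and dilation.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension.

    A dilated kernel spans dilation * (kernel - 1) + 1 input pixels."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1

print(conv_output_size(224, kernel=3))                        # 222 (no padding)
print(conv_output_size(224, kernel=3, padding=1))             # 224 (size kept)
print(conv_output_size(224, kernel=3, stride=2, padding=1))   # 112 (downsampled)
print(conv_output_size(224, kernel=3, dilation=2, padding=2)) # 224 (atrous)
```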

2.3.3 Nonlinear Activation Function

A nonlinear activation function is used in the detector stage of a CNN layer. Its purpose is to introduce a nonlinearity to the system. Two commonly used nonlinear activation functions are the Rectified Linear Unit (ReLU) and the Sigmoid function, introduced below.

Rectified Linear Unit

The ReLU is the commonly recommended activation function for machine learning applications, and its function is

$$ g(z) = \max\{0, z\}. \tag{6} $$

According to Goodfellow et al. [10], although the ReLU is nonlinear, it stays close to linear and retains many of the properties that make linear models generalize well; its only differences from a linear unit are that a ReLU's output is 0 for half of its domain, and that its derivative is discontinuous at the point z = 0.

Sigmoid function

The Sigmoid function is a mathematical operation, where the logistic Sigmoid function maps the input values to a smooth range of values between 0 and 1:

$$ f(x) = \frac{1}{1 + e^{-x}}. \tag{7} $$

The Sigmoid function is a nonlinear activation function used by the CNN model in Section 3.5.5.
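As an illustration of Equations 6 and 7, a minimal NumPy sketch of the two activation functions:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit, Equation (6): max{0, z} element-wise."""
    return np.maximum(0.0, z)

def sigmoid(x):
    """Logistic sigmoid, Equation (7): maps inputs to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(z))  # values between 0 and 1, with 0.5 at z = 0
```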

2.3.4 Pooling

The pooling stage of a CNN layer is used to modify the output, usually to make it smaller by replacing the output at a certain location with a summary of nearby outputs. It helps the representation stay more or less invariant to small translations. There exist many kinds of pooling operations.


Figure 2.7: Illustrated example of the Max Pooling operation.

Max pooling is expressed with the formula [22]

$$ a_j = \max_{N \times N}\left(a_i^{n \times n}\, u(n, n)\right), \tag{8} $$

where a window function u(x, y) is applied to an n × n patch of the full N × N matrix, computing the max of the included pixels and resulting in a lower resolution matrix.
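A minimal NumPy sketch of non-overlapping max pooling as in Equation 8 (an illustration, not code from the thesis):

```python
import numpy as np

def max_pool2d(x, n=2):
    """Non-overlapping n x n max pooling of a 2D array.

    Each n x n patch is replaced by its maximum, giving a
    lower-resolution output (input sides must be divisible by n)."""
    h, w = x.shape
    patches = x.reshape(h // n, n, w // n, n)
    return patches.max(axis=(1, 3))

x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 1.],
              [1., 1., 5., 6.],
              [0., 2., 7., 8.]])
print(max_pool2d(x))  # [[4. 2.] [2. 8.]]
```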

2.3.5 Concatenation

Concatenation of tensors in machine learning applications is equivalent to string concatenation, meaning that the concatenation of the two tensors [a, b] and [c, d] results in [a, b, c, d].

2.3.6 Batch normalization

Batch normalization acts as a regularizer and allows for higher learning rates, re-scaling each mini-batch to a desired range [12]. Usually, it maps the output values from another layer to values between 0 and 1, or, as can be useful when working with image processing, 0 to 255.
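As an illustration of the training-time operation, below is a minimal NumPy sketch of batch normalization over a mini-batch; the parameters gamma and beta stand in for the learnable scale and shift.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean and unit variance per
    feature, then re-scale with the learnable parameters gamma/beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(8, 4) * 50 + 100  # mini-batch of 8 samples
print(batch_norm(batch).mean(axis=0))     # approximately 0 per feature
print(batch_norm(batch).std(axis=0))      # approximately 1 per feature
```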

2.3.7 Upsampling

Upsampling can be done with different methods, with the purpose of making low-dimension feature maps larger, closer to the original input size. Two usual methods are the deconvolution and unpooling operations. The first is a convolution in reverse: the input is instead smaller than the output, and padding and dilation of the input increase the matrix size, making the resulting convolution output larger than the original input. A pooling operation cannot be reversed, but it can store the locations of the maximum values, which the unpooling operation utilizes to place its matrix values at the corresponding positions.

Another method is bilinear interpolation. It is a resampling technique used when transferring a dataset from one cell size to another, which uses the four nearest neighbors to create an output surface.

2.3.8 Hyperparameters

Hyperparameters are the settings of a network that are chosen before training rather than learned during it, for example the learning rate, the batch size and the number of epochs. The learning rate controls how much the weights are adjusted at each update, and is tuned in the experiments described in Section 3.6.2.

2.3.9 Semantic segmentation CNN architectures

The common system design in current state-of-the-art CNNs is the encoder-decoder U-Net architecture by Ronneberger et al., shown in Figure 2.8 [20]. The input is convolved and pooled a number of times, then up-convolved to fit the output segmentation map. The original image data is preserved and copied to the output convolution.

Figure 2.8: U-Net architecture. [20]

The CNNs to be evaluated in this thesis are based on the encoder-decoder architecture, as it is the most common design for state-of-the-art CNNs, with tweaks and additions that enhance their performance.

2.4 Network Training

The training of a CNN consists of trial and error by the program. It can be supervised or unsupervised, based on the level of assistance the CNN is given. Unsupervised learning provides no correct answers, leaving the CNN to identify structures of interest in the provided datasets. In supervised learning, the program is given the correct answers, the ground truth, and evaluates its algorithm on the test images. Based on the evaluated accuracy, the weights are updated and the training iterates, continuing until the desired accuracy is reached.

The training of a CNN goes on for a number of epochs, where an epoch is one pass over the complete training dataset. An epoch count of 100 means that the CNN trains on each input in the dataset 100 times. In each epoch, the CNN evaluates one batch of the dataset at a time before updating its weights. The batch size can range from one to the size of the full dataset. Usually, mini-batches smaller than 32 are recommended for improved performance [23].
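As an illustration of epochs and batch size, below is a minimal Keras sketch of supervised training of a toy per-pixel classifier; the model, shapes and random data are placeholders and not the networks used in this thesis.

```python
import numpy as np
from tensorflow import keras

x_train = np.random.rand(16, 64, 64, 3)           # toy input images
y_train = np.random.randint(0, 8, (16, 64, 64))   # per-pixel class labels

model = keras.Sequential([
    keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                        input_shape=(64, 64, 3)),
    keras.layers.Conv2D(8, 1, activation="softmax"),  # 8 classes per pixel
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy")

# epochs: passes over the full training set; batch_size: images per
# weight update (a batch size of 1 updates after every image).
model.fit(x_train, y_train, epochs=150, batch_size=1, verbose=0)
```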


2.4.1 Ground truth

Ground truth, a dataset of correctly annotated images, is the core part of supervised machine learning and vital for successfully training a CNN. To generate ground truth for semantic segmentation, each pixel in the image dataset has to be annotated correctly. For other CV approaches, like image classification, it is sufficient to give one label to the entire image. Hence, ground truth generation for semantic segmentation is a more time consuming and complicated task than for other CV tasks.

These ground truth images are then what the CNNs are evaluated against, where each pixel in the resulting segmentation is either correct or incorrect in comparison to the ground truth annotation.

2.5 Evaluating a Network

To properly evaluate a network, the ground truth and original images are split into three image sets: one for training, one for testing and one for validation. Each set has a unique purpose, and all three contain both the original images and their annotated ground truth. The supervised training is carried out on the training dataset, the testing during training is done on the testing dataset, and the final validations are done on the validation dataset. The testing and validation datasets are separated to ensure that the CNN is not overfitted to the provided testing dataset, performing badly when new data is introduced, but instead has successfully identified and learned the parameters connected to the different classes, as described in Section 2.4 above.

The evaluation is introduced as a means of keeping track of the training of the CNN, as well as giving feedback to the CNN for continued training during the iterations. The evaluation is based on the pixel predictions compared to the provided ground truth. For every pixel prediction done by a CNN, there are four possible outcomes, as presented in Figure 2.9.

Figure 2.9: Illustration of possible prediction outcomes.

The above are binary classifications, assuming two classes and only true or false. In semantic segmentation, as there is a larger number of classes, the pixel classification is instead viewed as either the correct class or another class:

True Positive (TP): The pixel is correctly classified with the class label.
False Positive (FP): The pixel is incorrectly classified with the class label.
True Negative (TN): The pixel is correctly classified with another class label.
False Negative (FN): The pixel is incorrectly classified with another class label.

These definitions are used for calculating a number of the evaluation metrics introduced in Section 3.4. In many cases, including cancer detection in medical images or pedestrian detection for self-driving cars, FN predictions are the most dangerous for the system, as they would result in failing to diagnose the cancer, or in the car running over a person. Systems are therefore often designed to penalize FN predictions. In this thesis work, no design choices have been made to avoid FN predictions in particular.

The evaluation can be extended to the concept of a confusion matrix, which summarizes the prediction results for every involved class. It is well suited for visual interpretation of the data, and displays which classes are hard to predict.
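A minimal sketch of how such a confusion matrix can be computed from a ground truth mask and a prediction (illustrative code, not from the thesis):

```python
import numpy as np

def confusion_matrix(gt, pred, n_classes):
    """Entry [i, j] counts pixels of ground truth class i that were
    predicted as class j; the diagonal holds correct predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(gt.ravel(), pred.ravel()):
        cm[t, p] += 1
    return cm

gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(confusion_matrix(gt, pred, n_classes=3))
```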

2.6 Iron Ore Pellets

Iron ore pellets are one of the end products of LKAB's iron ore mining in northern Sweden. The company produces the pellets at multiple locations. Knowing the constituents of the pellets allows for more precise usage and dosage, making it easier for customers to know which pellet to use for which application. It also allows the producer to better tune their product, while selling the correct pellet for the correct purpose. Microscope images of a cut-through pellet can be seen in Figure 2.10, with the full pellet and a zoomed-in detail view.

Figure 2.10: Microscope images of iron ore pellet with different magnification, with a full pellet (left) and a zoomed in view (right).

As can be seen in the image, a pellet contains many different minerals, and the composition varies during the production process due to chemical reactions. The different constituents in iron ore pellets to be classified and analyzed in this thesis are listed below with an explanation of their visual appearance.

Magnetite

Magnetite can be identified as large grey areas, slightly darker than hematite, separated with crevices.

Hematite

Hematite can be identified as large areas of light-grey color, in texture very similar to the magnetite, also separated with crevices.

Pore

Pores can be identified as the large, very dark and round parts of the images. In Figure 2.10 there are two large pores in the magnified image, middle left and bottom right.

Epoxy

Epoxy is the molding substance. It can be identified as the dark grey parts of the images. It fills both larger areas and smaller crevices.

Wüstite

Wüstite is formed around magnetite during the reduction process, with more structured parts, almost like dots. It is microporous. It is difficult to differentiate wüstite and magnetite.

Olivine

The additive olivine is grain-shaped and placed patch-wise in the pellet. It has a similar color to the epoxy, the dark grey, but with a lighter grey ring around it instead of the dark ring around epoxy.

Slag

Slag is a light grey ring around a darker grey olivine grain. It only exists around the olivine grains.

Metallic iron

Metallic iron can be identified as the smaller, round white parts of the images. It is very easy to separate visually from the other constituents.


Figure 2.11: The different iron ore pellet constituents classified in this thesis.


3

MATERIAL & METHODS

3.1 Software

This subsection introduces the software used for the work done in this thesis.

3.1.1 Github

GitHub is an online software development platform, where millions of developers store their code and programs in repositories. It allows for version control using Git, and many of its repositories are free to use. GitHub hosts the repository used in this thesis.

3.1.2 Fiji

Fiji Is Just ImageJ (FiJi) is an open source image processing package based on ImageJ. It bundles a large collection of plugins that facilitate scientific image analysis, eliminating the need to collect components from different sites. In this project, FiJi is used for creating stacks of the large input images.

3.1.3 Ilastik

Ilastik is a software made for interactive image classification, segmentation and analysis. It is easy to use, and a great choice for the non-expert who still wishes to use machine learning to classify images or do simple segmentation of them [25]. The user annotates the input data, usually images, and receives real-time feedback from the program, and the trained model can be exported for use in other software. A wide variety of classifier types can be selected, including Random Forest, Single Decision Tree and Gaussian.

Below is an example image of the simple segmentation capabilities of Ilastik, where the segmentation training in the center image is divided into three labels: bird (yellow), hand (magenta) and background (turquoise).

Figure 3.1: Original image (left), hand annotation of image (center), and simple segmentation of image (right).

This process creates annotated images, which together train an image segmentation model. The trained model can then be used to classify other images with the given labels.

Previously, Dr. Simonsson [24] used Ilastik to train a model to classify micro structures in iron ore pellets, as described in Section 1.1. In this project, Ilastik will be used to generate an initial rough classification that will be manually improved and used as ground truth.


3.1.4 GIMP

GNU Image Manipulation Program (GIMP) is a free and open source software for image editing developed by a diverse group of volunteers. It has many capabilities and can be used for tasks ranging from drawing to advanced retouching, or converting file formats. It can work with multiple layers at once, making it a useful tool in this project for hand-made corrections of the ground truth annotation done in Ilastik.

3.1.5 Software Libraries

The following subsection consists of a list of the essential software libraries used in this thesis project.

NumPy

NumPy, short for Numerical Python, is an open source Python library that allows Python to have similar functionality to MATLAB. It introduces n-dimensional arrays and matrices and operations with them, making it useful for image analysis.

TensorFlow

TensorFlow is an open-source machine learning library for high performance numerical computation developed by the Google Brain team. Its main use is for machine and deep learning tasks, but it has many applications in production and research.

Keras

Keras is an open source neural network library, to be used in combination with TensorFlow or another backend engine. Keras contains implementations of common machine learning building blocks, with focus on fast and easy experimentation.

CUDA

CUDA is a parallel computing platform for GPU computations.

Pandas

Pandas is an open source Python library that allows Python to have similar functionality to the programming language R. It is a tool for data analysis and manipulation.

Matplotlib

Matplotlib is an open source Python library and extension to NumPy for creating 2D plots in Python. It is similar to the MATLAB plotting style.

Seaborn

Seaborn is an open source Python library for statistical data visualization. It complements and extends Matplotlib with Pandas’ data structure, and functions directly with complete data sets.

OpenCV

OpenCV (Open Source Computer Vision) is an open source software library aimed at real-time computer vision, containing over 2500 algorithms, including machine learning and computer vision algorithms.

3.2 Hardware

3.2.1 GPU

Two computers with GPUs were available during this thesis project, although the main GPU was the one at Data Ductus. Both were NVIDIA GeForce GTX 1080 Ti cards, with a memory speed of 11 Gbps and a standard memory configuration of 11 GB GDDR5X.

As a comparison, an example code run would take four months on a CPU-only computer, and 12 hours with a GPU.

3.2.2 Microscope

A microscope was used for data collection, to magnify, study and photograph cut-through pieces of iron ore pellets. The images in this thesis were obtained with a Leica DM 6000 M microscope, with a Leica CTR 6000 controller. The raw images from the microscope were saved as .tif files.

3.3 Dataset

A full sized pellet microscope image is around 25000 x 25000 pixels. These images are stitched together from camera photos of resolution 2048 x 1536. Because the stitching is not perfect, the complete pellet images have visible lines where the camera photos are pieced together. Furthermore, the size of each pellet image is not equal, as it is adjusted to the size of the pellet. One full-size pellet image file is around 400-600 MB.


Figure 3.2: Examples of the raw image data set of iron ore pellets.

3.3.1 Sample Preparation

Before the images were taken of the pellets, they were prepared by researchers at LKAB. The pellets were collected at different stages of the experimental blast furnace, and between three and six of them were molded together with epoxy in a 30 mm or 40 mm cylinder. The molded piece was cut, then ground, then polished to an appropriate depth where the pellets were cut through and the micro structures visible. The pellets are porous and easy to damage, so the grinding and polishing had to be carried out softly. Each pellet was about 0.5-0.7 mm in diameter.

3.3.2 Annotation Process

The ground truth was generated by first producing a rough classification of the images with the previously trained Ilastik pixel classifier, and then manually correcting it in GIMP, as described in Sections 3.1.3 and 3.1.4. Each class was assigned an RGB color, shown in Figure 3.3.

Figure 3.3: RGB value of the classes and their respective colors used in this thesis.

The distribution of classes in the dataset was not even, with some classes being much more frequent than others, based on the total number of pixels of each class. The total number of pixels of each class can be seen in Figure 3.4, and the exact numbers are presented in Appendix C.

Figure 3.4: The total number of pixels of each class.

3.3.3 Dataset Augmentation

The dataset used for the task was small, as previously mentioned, opening up the option to augment the data to create more data to train with. Because the dataset used in this thesis was not dependent on orientation in any way, the images could be flipped both horizontally and vertically as a means of augmentation. It was decided not to do any of the more complicated augmentations, such as brightness changes, other transformations, or additions of noise. The reason was that the images investigated in this thesis all come from a controlled lab environment, where the imaging conditions are constant thanks to the use of microscopes to capture the images.
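The sketch below generates the four orientation variants of an image and its ground truth mask with NumPy; the array shapes are illustrative, not those of the thesis data.

```python
import numpy as np

def augment_with_flips(image, mask):
    """Return the four orientation variants: original, horizontal
    flip, vertical flip, and both flips. The ground truth mask is
    flipped identically so it stays aligned with the image."""
    variants = []
    for flip_h in (False, True):
        for flip_v in (False, True):
            img, m = image, mask
            if flip_h:
                img, m = np.fliplr(img), np.fliplr(m)
            if flip_v:
                img, m = np.flipud(img), np.flipud(m)
            variants.append((img, m))
    return variants

image = np.random.rand(4, 4, 3)          # toy RGB image
mask = np.random.randint(0, 8, (4, 4))   # per-pixel class labels
print(len(augment_with_flips(image, mask)))  # 4 variants per image
```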

3.4 Evaluation Metrics


3.4.1 Training Evaluation

The training is evaluated with a loss function, a measure of how well the algorithm models the dataset. A good model gives a low loss value, while a worse one yields a larger value. The loss is fed back into the system during training to evaluate the progress of the training.

Today, the usual choice is a cross entropy method [10]. A binary classification problem would utilize the Sigmoid cross entropy loss, while in this thesis' case, where there are more than two classes, the Softmax cross entropy loss is the most viable choice.

Softmax cross entropy loss

A softmax is a normalizing operation, taking unnormalized log-probability inputs and producing a normalized probability distribution as output.

Cross entropy loss is a way to measure the training error. It computes how close the estimated probability output from the softmax operation is to the ground truth.
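As an illustration, a minimal NumPy sketch of the softmax operation and the resulting cross entropy loss for a single prediction (the logit values are arbitrary):

```python
import numpy as np

def softmax(logits):
    """Normalize unnormalized log-probabilities into a distribution."""
    shifted = logits - logits.max()  # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def cross_entropy(logits, true_class):
    """Softmax cross entropy loss: -log of the probability the model
    assigns to the correct class (low when confident and correct)."""
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 0.5, -1.0])          # scores for 3 classes
print(cross_entropy(logits, true_class=0))   # small loss
print(cross_entropy(logits, true_class=2))   # large loss
```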

3.4.2 Prediction Evaluation

The prediction metrics are used on classification problems, and there exist many different ways to evaluate the prediction performance of a CNN. A common way of measuring performance is the mean Intersection over Union. This is the metric used when comparing the performance of CNNs on datasets and competitions such as Cityscapes [6], PASCAL VOC 2012 [27] and CoCo [15]. The published paper for each CNN model evaluated in this thesis also measures the CNN's performance in terms of mean IOU, in combination with other metrics. In this thesis, aside from mean IOU, the performance of a semantic segmentation net will also be evaluated with average accuracy, precision, recall, F1 score and per-class accuracy. As all metrics are binary, a prediction is considered correct if the color of the ground truth pixel and the predicted pixel are equal, as explained in Section 2.5. Each metric is described below.

Intersection over Union

The Intersection Over Union (IOU, or Jaccard index) is a measure of similarity; in CV tasks, it gives a measure of how large a part of the predicted labels and the ground truth are equal. The measure is between 0 and 1, and the closer to 1, the more similar the two are. As seen in Equation 9 below, the IOU is the ratio of the intersection of A and B to the union of A and B:

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}. \tag{9} $$

In most cases, the appropriate measure is the mean IOU over all the classes, computed with the below equation [17]:

$$ \mathrm{MIOU} = \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}, \tag{10} $$

where $n_{ij}$ is the number of pixels of class i predicted to be of class j, $n_{cl}$ is the number of different classes, and $t_i = \sum_j n_{ij}$ is the total number of pixels belonging to class i.

Accuracy

The accuracy metric measures the proportion of true predictions among all predictions, between 0 and 1:

$$ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{11} $$

with the average accuracy defined as the below equation, where $n_{ii}$ is the number of pixels of class i predicted to belong to class i, out of $n_{cl}$ different classes [17]:

$$ \mathrm{mAcc} = \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_i}. \tag{12} $$


Precision

The precision metric measures how large a share of the pixels predicted as a class actually belong to that class, giving a value between 0 and 1:

$$ p = \frac{TP}{TP + FP}. \tag{13} $$

Recall

The recall metric measures how large a share of the pixels belonging to a class are correctly predicted, giving a value between 0 and 1:

$$ r = \frac{TP}{TP + FN}. \tag{14} $$

F1 score

The F1 score combines the precision and recall metrics, giving a value between 0 and 1 with the equation below:

$$ F_1 = \frac{2pr}{p + r}. \tag{15} $$
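As an illustration, the sketch below computes the per-class versions of these metrics, and the mean IOU of Equation 10, from a confusion matrix; the matrix values are arbitrary.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Per-class precision, recall, F1 and IOU from a confusion
    matrix cm, where cm[i, j] counts class-i pixels predicted as
    class j (Equations 9-15, treated per class)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as the class, wrongly
    fn = cm.sum(axis=1) - tp   # class pixels predicted as others
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)    # equals per-class accuracy n_ii / t_i
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

cm = np.array([[50,  5,  0],
               [10, 30,  5],
               [ 0,  5, 45]])
p, r, f1, iou = metrics_from_confusion(cm)
print("mean IOU: %.3f" % iou.mean())  # Equation (10)
```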

3.5 Semantic Segmentation Models

This subsection describes the selection process for the CNNs used in this thesis, and then introduces each of the CNN models selected for the task. All models will be run with a ResNet101 model pre-trained on ImageNet. The building blocks of each model were previously introduced in Section 2.3.

3.5.1 Model Selection Process

The CNN model selection was made from two GitHub repositories compiling up-to-date CNN models for semantic segmentation used in image processing applications. The first selection was based on two things: source code availability and the age of the model. The source code availability was of the greatest importance, as the aim was to use the models for experiments in the thesis. The age was of importance due to the speed of the advancements in this sector, where older models stand no chance against newer ones. This resulted in thirteen papers.

The next selection was based on framework and variation in design, where the acceptable framework was TensorFlow. Design variety was preferred over nets that are only slightly different from each other, and also included variation in presented computational cost. From the presented conditions, the final selection of CNN models to use in the thesis is the five below.

3.5.2 Pyramid Scene Parsing Network

Pyramid Scene Parsing Network, PSPNet, is a semantic segmentation model created by Zhao et al. [29] including their Pyramid Pooling Module, as displayed in Figure 3.5.


Figure 3.5: Overview of PSPNet's architecture. It contains pooling, convolutions, upsampling and concatenations [29].

PSPNet combines finding global features with pyramid pooling, effectively executing both scene parsing and pixel level predicting simultaneously.

PSPNet has 56.0M trainable parameters. Pretrained with MS-COCO, the model achieved the, as of April 2017, new state-of-the-art performance of 85.4% average IOU on PASCAL VOC 2012 and 80.2% average IOU on Cityscapes.

3.5.3 Fully Convolutional DenseNets

The fully convolutional DenseNets, or FC-DenseNets by Jégou et al. [13], extends the use of DenseNets to semantic segmentation applications. The overview of the FC-DenseNet is presented in Figure 3.6 below.

Figure 3.6: Overview of FC-DenseNet's architecture. It is U-formed and consists of convolutions, transition downs (pooling), transition ups (upsampling), concatenations and dense blocks. The dense blocks concatenate new feature maps [13].

FC-DenseNet56 achieves 58.9% average IOU on the CamVid dataset.

3.5.4 DeepLabv3+

DeepLabv3+ is a deep CNN developed by Chen et al. at Google Inc., released in February 2018 [4]. It combines spatial pyramid pooling modules and encoder-decoder structures, with the detailed structure presented in Figure 3.7.

Figure 3.7: Overview of DeepLabv3+’s architecture. The encoder stage contains atrous convolutions, pooling and convolutions. The decoder stage contains convolutions, upsampling and concatenation [4].

DeepLabv3+ combines a depthwise spatial convolution, applied to each input layer separately, with a pointwise convolution combining multiple layers at once, resulting in what Chen et al. [4] name atrous separable convolution. The model has 48.0M trainable parameters.

The DeepLabv3+ CNN model has a measured performance of 89.0% average IOU on the PASCAL VOC 2012 dataset, and 82.1% average IOU on the Cityscapes test set, making it the new state-of-the-art on the two datasets as of August 2018.

3.5.5 BiSeNet

BiSeNet, the Bilateral Segmentation Network by Yu et al. [28], is a semantic segmentation model designed for real-time performance, combining a spatial path and a context path. Its architecture is presented in Figure 3.8.

Figure 3.8: Overview of BiSeNet’s architecture. The Attention Refinement Module (ARM) consists of global pooling, convolution, batch normalization and a sigmoid. The Feature Fusion Module (FFM) consists of concatenation, convolutions, batch normalizations and rectified linear units, global pooling, sigmoid and summation. The full architecture combines the two through a spatial and a context path [28].

The output of the context path is refined with the Attention Refinement Module presented in Figure 3.8 above. The two paths are then combined with the Feature Fusion Module presented in the same figure.

BiSeNet has 47.6M trainable parameters. It achieves 68.4% average IOU on the Cityscapes test dataset, with a speed of 105 FPS.

3.5.6 Global Convolutional Network

The Global Convolutional Network (GCN) by Peng et al. [19] is a semantic segmentation model that uses large convolution kernels to address classification and localization simultaneously. Its architecture is presented in Figure 3.9.

Figure 3.9: Overview of GCN's architecture. The Global Convolutional Network consists of convolutions and summations, the Boundary Refinement (BR) consists of convolutions, rectified linear units and summations, and the whole pipeline consists of convolutions, the GCN, the BR, deconvolutions and summations [19].

The GCN CNN model has 43.0M trainable parameters. It has a measured performance of 83.6% average IOU on the PASCAL VOC 2012 dataset, and 76.9% average IOU on the Cityscapes test dataset, achieving the state-of-the-art as of March 2017.

3.6 Experiments

Experiments have to be carried out to evaluate and compare the performance of the selected models. The following three experiments have been selected to evaluate the performance of the nets, as well as the impact of dataset size, data augmentation, learning rate and batch size on model performance. The performance will be evaluated with the metrics defined in Section 3.4.

3.6.1 Impact of Data Size and Augmentation Techniques

The small size of the dataset that will be used in this thesis opens the possibility to investigate how data augmentation techniques can improve the training, validation and testing of the models. The images have no required orientation, so the data can be flipped both horizontally and vertically, enlarging the training dataset by a factor of four.

3.6.2 Parameter Tuning

Based on Section 2.3.8, the tuning of the learning rate is perhaps of greatest importance for achieving the best performance on the available dataset. In this experiment, two different learning rate values will be tried for the best models to find which one fits each model best. Thus, the learning rate hyperparameter will be adjusted to optimize the performance of the best nets.

3.6.3 Model Performance on Complete Dataset


4

RESULTS

4.1 Results on Public Datasets

This section presents a compilation of the CNN models' results on public datasets, as described in their published articles and/or on the public leaderboards. They can be seen in Table 4.1.

Table 4.1: Comparison of the selected models’ results on the public datasets Cityscapes, PASCAL VOC 2012 and CamVid, measured in average IOU (%).

Model          Cityscapes test  PASCAL VOC 2012  CamVid
               mIOU (%)         mIOU (%)         mIOU (%)
PSPNet         81.2             85.4             69.1
FC-DenseNet56  -                -                58.9
DeepLabv3+     82.1             89.0             -
BiSeNet        78.9             -                68.7
GCN            76.9             83.6             -

Not all models have published results on all public datasets, explaining the missing values in the compilation.

4.2 Test Run

4.2.1 System design

The CNN models were trained with a batch size of 1 for 150 epochs, for speed. The dataset contains 180 images, roughly annotated with Ilastik. The results are shown below in Tables 4.2 and 4.3.

4.2.2 Roughly annotated images

Table 4.2: Comparison of the selected models’ resulting validation metrics with a test run on the rough Ilastik annotated data, all metrics measured in percent (%).

Model mAcc Precision Recall F1 mIOU Training time (min)


Table 4.3: Comparison of the selected models’ results with a test run on the Ilastik annotated data, average per class accuracy measured in percent (%).

Model Pore Epoxy Hematite Magnetite Wustite Metallic Iron Slag Olivine

PSPNet 77.3 87.2 59.3 71.2 90.4 98.6 37.4 52.1

FC-DenseNet56 80.3 81.9 58.5 74.9 93.3 98.7 38.4 48.0

DeepLabv3+ 79.8 78.9 54.8 70.0 92.9 96.2 25.9 49.7

BiSeNet 63.2 79.1 55.2 61.7 88.5 96.6 15.4 46.9

GCN 73.8 84.9 58.5 70.4 92.5 97.9 16.0 51.1

A comparison of the resulting segmentations of a part of a pellet image is shown in Figure 4.1 below.

Figure 4.1: Resulting test segmentation with the five models: (a) ground truth, (b) BiSeNet, (c) DeepLabv3+, (d) FC-DenseNet56, (e) GCN, (f) PSPNet.


4.3 Data Size and Augmentation Impact

4.3.1 System design

The CNN models were trained with a batch size of 1, with a learning rate of 10^-3, for 200 epochs, on the datasets of varying sizes.

4.3.2 20 percent

The 20% training was made with 32 images, divided in 16 training, 10 testing and 6 validation images. The resulting model performances are listed in Tables 4.4 and 4.5 below.

Table 4.4: Comparison of the selected models’ resulting validation metrics with 20% of the dataset, all metrics measured in percent (%).

Model mAcc Precision Recall F1 mIOU Training time (min)

PSPNet 73.4 77.8 73.4 73.8 39.1 90

FC-DenseNet56 76.1 76.6 76.1 75.7 43.4 96

DeepLabv3+ 71.6 76.4 71.6 72.3 36.4 86

BiSeNet 63.1 75.7 63.1 63.4 27.8 149

GCN 72.5 76.5 72.5 73.2 34.8 161

Table 4.5: Comparison of the selected models’ results with 20% of the dataset, average per class accuracy measured in percent (%).

Model Pore Epoxy Hematite Magnetite Wustite Metallic Iron Slag Olivine

PSPNet 59.3 85.4 83.3 50.7 84.9 69.6 17.7 45.4
FC-DenseNet56 74.5 77.9 83.3 56.9 80.7 79.6 20.6 49.0
DeepLabv3+ 50.0 86.2 83.4 52.1 78.7 75.4 17.3 38.1
BiSeNet 31.8 91.7 83.3 50.1 77.3 59.5 16.7 33.3
GCN 36.8 85.0 83.4 51.1 81.2 77.2 16.8 35.7

4.3.3 40 percent

The 40% training was made with 73 images, divided in 36 training, 24 testing and 13 validation images. The resulting model performances are listed in Tables 4.6 and 4.7 below.

Table 4.6: Comparison of the selected models’ resulting validation metrics with 40% of the dataset, all metrics measured in percent (%).

Model mAcc Precision Recall F1 mIOU Training time (min)

PSPNet 78.7 79.6 78.7 78.3 50.4 140

FC-DenseNet56 88.9 89.4 88.9 88.8 57.7 128

DeepLabv3+ 82.8 82.3 82.8 81.8 51.7 104

BiSeNet 65.3 68.4 65.3 62.9 37.6 173


Table 4.7: Comparison of the selected models’ results with 40% of the dataset, average per class accuracy measured in percent (%).

Model Pore Epoxy Hematite Magnetite Wustite Metallic Iron Slag Olivine

PSPNet 71.9 82.9 91.4 71.4 87.0 89.2 40.1 54.0
FC-DenseNet56 64.3 87.4 97.3 81.4 90.6 89.9 48.0 64.2
DeepLabv3+ 65.5 84.0 95.9 70.4 83.8 88.6 43.0 68.7
BiSeNet 69.2 75.3 91.5 71.0 79.4 84.3 38.5 46.2
GCN 69.3 71.7 85.8 80.0 77.9 90.2 43.3 64.6

4.3.4 60 percent

The 60% training was made with 105 images, divided in 53 training, 34 testing and 18 validation images. The resulting model performances are listed in Tables 4.8 and 4.9 below.

Table 4.8: Comparison of the selected models’ resulting validation metrics with 60% of the dataset, all metrics measured in percent (%).

Model mAcc Precision Recall F1 mIOU Training time (min)

PSPNet 71.2 75.4 71.2 69.4 48.6 135

FC-DenseNet56 89.7 89.9 89.7 89.4 59.7 154

DeepLabv3+ 84.2 83.9 84.2 83.1 55.3 151

BiSeNet 67.6 67.5 67.6 64.6 40.9 279

GCN 71.5 69.9 71.5 69.2 47.2 208

Table 4.9: Comparison of the selected models’ results with 60% of the dataset, average per class accuracy measured in percent (%).

Model Pore Epoxy Hematite Magnetite Wustite Metallic Iron Slag Olivine

PSPNet 78.1 79.9 86.1 72.4 83.9 92.1 57.4 65.6
FC-DenseNet56 78.1 84.0 94.9 79.8 91.7 93.0 58.8 70.3
DeepLabv3+ 68.9 82.6 95.4 78.2 85.6 92.0 60.0 79.7
BiSeNet 56.1 79.1 88.9 68.0 85.7 87.5 49.8 63.8
GCN 60.4 83.9 85.2 56.1 88.4 91.9 50.4 75.4

4.3.5 Full dataset

The full dataset training was made with 180 images, divided in 91 training, 58 testing and 31 validation images. The resulting model performances are listed in Tables 4.10 and 4.11 below.

Table 4.10: Comparison of the selected models’ resulting validation metrics with the full dataset, all metrics measured in percent (%).

Model mAcc Precision Recall F1 mIOU Training time (min)

PSPNet 87.4 87.0 87.4 86.7 59.1 183

FC-DenseNet56 85.2 86.6 85.2 85.4 58.9 217

DeepLabv3+ 87.5 87.3 87.5 87.0 60.4 157

BiSeNet 78.7 78.7 78.7 77.8 46.2 223

GCN 86.9 88.1 86.9 87.0 60.5 237

Table 4.11: Comparison of the selected models’ results with the full dataset, average per class accuracy measured in percent (%).

Model Pore Epoxy Hematite Magnetite Wustite Metallic Iron Slag Olivine

PSPNet 76.8 85.4 94.5 75.8 90.0 88.6 66.8 65.5

FC-DenseNet56 68.9 87.1 96.6 65.0 93.0 89.3 64.5 62.3

DeepLabv3+ 70.0 85.1 94.8 74.9 90.0 87.0 69.0 76.3

BiSeNet 51.7 78.5 95.8 69.5 85.6 85.9 51.5 65.9

GCN 66.8 84.7 96.2 74.8 90.0 87.8 65.9 77.4

4.3.6 Graphs and Averages

Figure 4.2 below shows the average accuracy of each CNN model against the dataset size in number of images.

Figure 4.2: The average accuracy in percentage of each model against dataset size. FC-DenseNet56 performed the best with smaller dataset sizes, while placing fourth with the full dataset. DeepLabv3+ placed among the top three nets consistently. PSPNet and GCN performed almost equally over the different datasets. BiSeNet performed notably worse than the other models.

Figure 4.3 below shows the F1 score of each model against the dataset size in number of images.

Figure 4.3: The F1 score in percentage of each model against dataset size. FC-DenseNet56 performed the best with smaller datasets, while placing fourth with the full dataset. DeepLabv3+ displayed a steady increase with increasing dataset size, placing among the top for the full dataset. Both PSPNet and GCN showed a decrease in F1 score when increasing from 40% to 60% of the dataset, and performed with the best with the full dataset. BiSeNet performed notably worse than the other models.

Figure 4.4 below shows the average IOU of each model against the dataset size in number of images.

Figure 4.4: The average IOU in percentage of each model against dataset size. FC-DenseNet56 performed the best with smaller datasets, while placing fourth with the full dataset. DeepLabv3+ displayed a steady increase with increasing dataset size, placing among the top for the full dataset. PSPNet showed a decrease in average IOU when increasing from 40% to 60% of the dataset, and together with GCN performed with the best with the full dataset. BiSeNet performed notably worse than the other models.

Figure 4.5 below shows the training time of each model against the dataset size in number of images.

Figure 4.5: The training time of each model in minutes against dataset size. GCN displayed a steady increase in training time, always being among the slowest to train. FC-DenseNet56 was consistently in the center of the group. PSPNet and DeepLabv3+ were the two with the shortest training times. BiSeNet displayed a training time for the 60% dataset size at a great distance from all other measurements, both its own and the other models'. Otherwise, BiSeNet had one of the longest training times throughout.

Table 4.12: Average metrics for each dataset size, all metrics measured in percent (%).

Dataset size mAcc Precision Recall F1 mIOU Training time (min)

20% 71.3 76.6 71.3 71.7 36.3 117

40% 78.0 79.4 78.0 77.0 48.9 146

60% 76.8 77.3 76.8 75.1 50.3 186

100% 85.1 85.5 85.1 84.8 57.0 204

Data aug 88.0 88.6 88.0 87.9 60.9 206

Table 4.13: Average metrics for each dataset size, average per class accuracy measured in percent (%).

Dataset size Pore Epoxy Hematite Magnetite Wustite Metallic Iron Slag Olivine

20% 50.5 85.2 83.4 52.2 80.6 72.3 17.8 40.3

40% 68.0 80.3 92.4 74.8 83.7 88.4 48.6 59.5

60% 68.3 81.9 90.1 70.9 87.1 91.3 55.3 71.0

100% 66.8 84.2 95.6 72.0 89.7 87.7 63.5 69.5


Figure 4.6: The mean of the average class accuracy of all models against dataset size. Epoxy was the only class to be most accurately annotated with the smallest dataset size, all other classes demonstrated improved class accuracy with increased size. Hematite, followed by wüstite, metallic iron and epoxy were the four most accurately predicted classes. Magnetite, olivine, pore and slag were the four classes to be the hardest to predict accurately, with the slag class being the worst independent on dataset size.

4.3.7 With data augmentation

The selected data augmentations are vertical and horizontal flipping, resulting in a fourfold increase in dataset size. The dataset contains 180 images, divided in 91 training, 58 testing and 31 validation images. The resulting model performances are listed in Tables 4.14 and 4.15 below.

Table 4.14: Comparison of the selected models' resulting validation metrics with full dataset and data augmentation; all metrics measured in percent (%), training time in minutes.

Model           mAcc   Precision   Recall   F1     mIOU   Training time (min)
PSPNet          91.7   92.2        91.7     91.8   64.3   172
FC-DenseNet56   89.4   90.1        89.4     89.5   63.5   219
DeepLabv3+      88.3   88.7        88.3     88.1   63.7   156
BiSeNet         82.3   83.2        82.3     82.2   51.1   242
GCN             88.2   88.6        88.2     88.0   61.9   241

Table 4.15: Comparison of the selected models' results with full dataset and data augmentation, average per class accuracy measured in percent (%).

Model           Pore   Epoxy   Hematite   Magnetite   Wüstite   Metallic Iron   Slag   Olivine
PSPNet          74.0   84.0    95.5       79.8        93.4      87.9            66.9   70.8
FC-DenseNet56   67.9   86.6    94.2       72.1        93.9      88.7            65.1   77.6
DeepLabv3+      70.2   84.7    93.3       74.7        90.9      88.1            74.7   81.5
BiSeNet         61.0   80.4    93.6       69.6        84.6      87.8            60.4   66.1


Based on the above results, PSPNet and FC-DenseNet56 were selected for the continued experiments, as the two performed the best.

4.4 Parameter Tuning

4.4.1 System design

In the following tests, 180 images are used, divided into 91 training, 58 testing and 31 validation images, together with vertical and horizontal flipping as data augmentation. Each test trains for 200 epochs with a batch size of 1.

4.4.2 Initial learning rate

PSPNet and FC-DenseNet56 were trained with learning rates of 10⁻³ and 10⁻⁴. The resulting model performances are listed in Tables 4.16 and 4.17 below.
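A minimal sketch of how such a comparison can be set up is given below, assuming a Keras-style training API; build_pspnet and the data variables are hypothetical stand-ins, not the implementation used in this thesis.

    import tensorflow as tf

    for lr in (1e-3, 1e-4):
        model = build_pspnet(num_classes=8)   # hypothetical model factory
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )
        # Settings from Section 4.4.1: 200 epochs, batch size 1.
        model.fit(train_images, train_labels,
                  validation_data=(val_images, val_labels),
                  epochs=200, batch_size=1)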

Table 4.16: Comparison of the selected models' resulting validation metrics with varying learning rates; all metrics measured in percent (%), training time in minutes.

Model           Learning rate   mAcc   Precision   Recall   F1     mIOU   Training time (min)
PSPNet          10⁻⁴            91.7   92.2        91.7     91.8   64.3   172
PSPNet          10⁻³            91.9   92.5        91.9     92.0   62.9   159
FC-DenseNet56   10⁻⁴            89.4   90.1        89.4     89.5   63.5   219
FC-DenseNet56   10⁻³            89.7   90.3        89.7     89.7   67.7   211

Table 4.17: Comparison of the selected models' results with varying learning rates, average per class accuracy measured in percent (%).

Model           Learning rate   Pore   Epoxy   Hematite   Magnetite   Wüstite   Metallic Iron   Slag   Olivine
PSPNet          10⁻⁴            74.0   84.0    95.5       79.8        93.4      87.9            66.9   70.8
PSPNet          10⁻³            74.7   88.6    95.7       76.3        92.7      88.0            62.8   64.0
FC-DenseNet56   10⁻⁴            67.9   86.6    94.2       72.1        93.9      88.7            65.1   77.6
FC-DenseNet56   10⁻³            76.2   92.6    95.9       70.8        89.2      89.0            77.8   81.8


Figure 4.7: The average validation accuracy over epochs for PSPNet with learning rates 10⁻³ (left) and 10⁻⁴ (right). The larger learning rate resulted in larger oscillations in the average accuracy over the epochs, and a longer time to reach the top performance. The smaller learning rate resulted in a smoother and faster learning curve with smaller oscillations.


Figure 4.9: The training loss over epochs for PSPNet with learning rates 10⁻³ (left) and 10⁻⁴ (right). The larger learning rate resulted in larger oscillations in the average loss over the epochs, and a longer time to minimize the loss. The smaller learning rate resulted in a smoother and faster learning curve with smaller oscillations.

It can be noted that the larger learning rate resulted in larger oscillations for all measured metrics. While the observed variations in IOU and validation accuracy seem irregular and random, the average loss appears to peak in an almost cyclic pattern over the epochs. Investigating the nature of this near-cyclic loss increase could be of interest for further work on this topic.

4.5 The Models’ Performance

4.5.1 PSPNet

PSPNet performed as one of the best CNN models during the dataset size experiments, and with data augmentation it was the best of the options, as well as one of the fastest. PSPNet was the best at identifying the pore and magnetite classes. After tuning the learning rate, it reached almost 92% average accuracy on the validation dataset.

Appendix B contains five of the validation images of PSPNet, where it can be observed that its main difficulties are the slag and olivine classes.

4.5.2 FC-DenseNet56

FC-DenseNet56 performed as one of the best CNN models during the dataset size experiments. It was the best at identifying the epoxy, wüstite and metallic iron classes. While FC-DenseNet56 gave the best average IOU of the tested CNN models, it was outperformed on the remaining metrics only by PSPNet. Nevertheless, it is a good choice, and might be the best fitted CNN model for other tasks.

4.5.3 DeepLabv3+


4.5.4 BiSeNet

BiSeNet is designed to be faster than the other CNN models without compromising too much performance. In the experiments in this thesis it was, however, one of the slowest, and it consistently gave the worst performance of the tested CNN models. BiSeNet was the worst of the CNN models at correctly identifying the pore, epoxy, magnetite, wüstite, metallic iron, slag and olivine classes in the microscope images. Based on this, excluding it from further testing was an easy decision.

4.5.5 GCN


5 DISCUSSION

5.1 Dataset Size and Augmentation Impact

5.1.1 Dataset size vs training time

The time it takes to train a CNN model on a dataset is affected by many factors, the size of the dataset being one of them. In this thesis work, the training time ranged from 86 to 279 minutes, or roughly 1.5 to 4.5 hours. As seen in Figure 4.5, there was a large variation between the CNN models, and the training time increased when the dataset size increased. The few exceptions where the training time decreased with a larger dataset could be explained by other circumstances, for example the GPU simultaneously being busy with other tasks, as training was always running in the background during workdays and overnight. This makes the measured training times unreliable; they should be viewed as rough estimates rather than definitive figures. All in all, the dataset size increase from 32 to 180 images yielded an average CNN model training time increase of 75%.

5.1.2 The impact of dataset size on performance

The results showed that a larger dataset improved performance. Due to an uneven distribution of classes in the datasets, the CNN models may have been trained more on some classes than others, which was most visible in the smaller datasets. Tables 4.12 and 4.13 present the averages for each dataset size, where it can be observed that an increase in dataset size resulted in improved CNN model performance. The increase was largest between the smaller dataset sizes, which is logical, as performance approaches 100% accuracy asymptotically and cannot exceed it. Figure 4.6 also visualizes how the different classes' performances approached each other when the dataset size was increased, although there is a notable difference in how difficult the different classes are for each CNN model to identify correctly.

5.1.3 Data augmentation impact

Data augmentation is a viable choice when the available dataset is small. Merely flipping the dataset horizontally and vertically increased the average accuracy by 3.41% and the average IOU by 6.84%. Given the large variety of available data augmentation techniques, there are possibilities to further improve a CNN's performance without adding ground truth images.

5.2 Comparing to Other Results of Same Models

Based on the compilation of reported performances in Table 4.1, the potential of the tested CNN models was not reached in this thesis. The main explanation is that the public datasets contain thousands of segmented images, while the CNN models here had fewer than 200 images from which to learn to differentiate between the classes. In addition, the ground truth images in those large databases are in color, compared to the greyscale images used in this thesis work; color gives the CNN models more information with which to differentiate the classes, making them perform better on colored images than on greyscale.

The CNN models' stronger performance on larger datasets shows that the models have room for improvement in this task as well, provided that the ground truth dataset is enlarged.


5.3 Evaluation of the Evaluation Method

Many training runs of the different CNN models were carried out, allowing decisions to be based on measured CNN model performance. The dataset size tests displayed the size's impact on the CNN models' performance, which is further visualized with image examples in Appendix A.

Using multiple evaluation metrics allowed for a conclusive evaluation of the CNN models. When testing all five CNN models, the different metrics showed similar results, making the decision on which CNN models to continue testing with straightforward. However, when testing different learning rates with PSPNet and FC-DenseNet56, the metrics disagreed: FC-DenseNet56 with a learning rate of 10⁻³ performed the best based on average IOU, while PSPNet with a learning rate of 10⁻³ performed the best based on average accuracy and F1 score, despite having the worst average IOU of the four options.
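Such disagreements follow directly from how the metrics are defined. The sketch below computes the averaged metrics from a pixel-level confusion matrix using the standard definitions; it is not the evaluation code used in this thesis.

    import numpy as np

    def average_metrics(C):
        # C[i, j] counts pixels of true class i predicted as class j.
        # Assumes every class occurs and is predicted at least once.
        tp = np.diag(C).astype(float)
        fp = C.sum(axis=0) - tp          # other classes predicted as i
        fn = C.sum(axis=1) - tp          # class i predicted as others
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)          # equals per-class accuracy
        f1 = 2 * precision * recall / (precision + recall)
        iou = tp / (tp + fp + fn)
        return precision.mean(), recall.mean(), f1.mean(), iou.mean()

Because IOU places false positives and false negatives jointly in its denominator while F1 balances precision and recall multiplicatively, the two averages can rank models differently, as observed here.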

Although the fine annotation of the ground truth images did not significantly impact the performance of the CNN models, as can be seen from the results in Section 4.2, it is important for the actual accuracy relative to the original images. A CNN model can never perform better than the ground truth provided to it, so the time spent on correctly annotating images is important for producing results that are relevant to the real world.

There was an attempt to train with larger batch sizes; however, the available GPU memory was not sufficient for the task.

5.4 Evaluation of the Pellet Composition

The identification accuracy was not equal between the classes. Based on Table 4.13 and Figure 4.6, slag was the most difficult class to detect, followed by pores and olivine, while hematite, followed by wüstite and metallic iron, was the easiest.

Firstly, the number of pixels per class in the dataset was not balanced, as mentioned in Section 3.3.2: the dataset was skewed in favour of the epoxy and wüstite classes, with very little representation of the slag and olivine classes.

Since the algorithms generally need balanced class representation for ideal performance, this is a potential area of improvement. Python libraries exist for handling imbalanced datasets, for example the imbalanced-learn package. At the same time, the balance represented in this thesis is the natural class balance of the pellet composition; slag and olivine are naturally much less common than the other classes.
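A common alternative to resampling for segmentation is to weight the loss per class. Below is a sketch of median frequency balancing, a standard technique for deriving such weights from the ground truth masks; it was not applied in this thesis, and the names are illustrative.

    import numpy as np

    def median_frequency_weights(label_masks, num_classes=8):
        # Count the pixels of each class over all ground truth masks.
        counts = np.zeros(num_classes)
        for mask in label_masks:          # each mask: 2-D array of class IDs
            counts += np.bincount(mask.ravel(), minlength=num_classes)
        freq = counts / counts.sum()
        # Rare classes such as slag and olivine receive weights above 1,
        # common classes such as epoxy and wustite weights below 1.
        return np.median(freq) / freq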

Another notable complication is that there is no clear distinction between pores and epoxy in the ground truth. LKAB expressed that the two could in fact be combined into one class, as separating them is not critical for the investigation. This would probably increase the prediction accuracy for the two classes. In a similar manner, olivine and slag are complicated to identify, especially along their edges, as the two tend to blend in with the other classes.
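Merging pore and epoxy would be a simple preprocessing step on the ground truth masks, sketched below with hypothetical integer class IDs.

    import numpy as np

    PORE, EPOXY = 0, 1                    # hypothetical class IDs

    def merge_pore_and_epoxy(mask):
        # Relabel all pore pixels as epoxy, reducing 8 classes to 7.
        # Remaining IDs may afterwards need remapping to stay contiguous.
        merged = mask.copy()
        merged[merged == PORE] = EPOXY
        return merged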

5.5 The Potential for CNNs for Iron Ore Pellet Classification

The mining industry can benefit from CNNs in many ways, with two notable areas being time savings and improved analysis of the iron ore pellets. The CNNs perform satisfactorily at segmenting microscope images, including the iron ore pellet images assessed in this thesis, and the controlled environment of a lab means fewer outside factors disturb the data.

A good amount of useful data already exists for this task; it only needs to be segmented to create more ground truth, which can then be used to further improve the predictions.
