Machine vision for automation of earth-moving machines: Transfer learning experiments with YOLOv3

Carl Borngrund

Computer Science and Engineering, master's level (120 credits) 2019

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Machine vision for automation of earth-moving machines: Transfer learning experiments with YOLOv3

Carl Borngrund

Dept. of Computer Science, Electrical and Space Engineering
Luleå University of Technology

Luleå, Sweden

Supervisors:

Ulf Bodin, Fredrik Sandin


Abstract

This master's thesis investigates the possibility of creating a machine vision solution for the automation of earth-moving machines. This research was done because, without some type of vision system, it will not be possible to create a fully autonomous earth-moving machine that can safely be used around humans or other machines. Cameras were used as the primary sensors as they are cheap, provide high resolution and are the type of sensor that most closely mimics the human vision system.

The purpose of this master's thesis was to use existing real time object detectors together with transfer learning and examine whether they can successfully be used to extract information in environments such as construction, forestry and mining. The amount of data needed to successfully train a real time object detector was also investigated. Furthermore, the thesis examines whether there are specifically difficult situations for the defined object detector, how reliable the object detector is and, finally, how service-oriented architecture principles can be used to create deep learning systems.

To investigate the questions formulated above, three data sets were created in which different properties were varied. These properties were light conditions, ground material and dump truck orientation. The data sets were created using a toy dump truck together with a similarly sized wheel loader with a camera mounted on the roof of its cab. The first data set contained only indoor images where the dump truck was placed in different orientations but neither the light nor the ground material changed. The second data set contained images where the light source was kept constant but the dump truck orientation and ground materials changed. The last data set contained images where all properties were varied.

The real time object detector YOLOv3 was used to examine how a real time object detector would perform depending on which of the three data sets it was trained on.

No matter the data set, it was possible to train a model to perform real time object detection. Using an Nvidia 980 Ti, the inference time of the model was around 22 ms, which is more than enough to classify video running at 30 fps. All three data sets converged to a training loss of around 0.10.

The data sets that contained more varied data performed considerably better: the data set where all properties were changed reached a validation loss of 0.164, while the indoor data set, containing the least varied data, only reached a validation loss of 0.257. The size of the data set was also a factor in the performance; however, it was not as important as having varied data. The results also showed that all three data sets could reach a mAP score of around 0.98 using transfer learning.


Contents

Chapter 1 – Thesis Introduction
1.1 Background
1.2 Motivation
1.3 Problem Definition
1.4 Scope and Delimitation
1.5 Methodology
1.6 Thesis Structure

Chapter 2 – Theory
2.1 Computer Vision
2.2 Machine Learning
2.3 Artificial Neural Network
2.4 Convolutional Neural Network (CNN)
2.5 Transfer Learning
2.6 Real Time Object Detection
2.7 Service Oriented Architecture (SoA)

Chapter 3 – Experimental Setup
3.1 Data Collection
3.2 Data Labeling
3.3 Data Augmentation
3.4 Object Detection Implementation
3.5 Computational Hardware

Chapter 4 – Results
4.1 Training
4.2 Validation
4.3 Testing

Chapter 5 – Discussion
5.1 Conclusion
5.2 Future work

References


Chapter 1 Thesis Introduction

1.1 Background

Earth moving machines are a type of heavy duty machinery used in a vast number of industries. Examples of such industries are forestry, mining and construction. Without these machines, the industries relying on them would not function. There exist many different earth moving machines that all serve a vital role in these industries. Machines like wheel loaders are used to move smaller amounts of material or load material onto other machines, such as dump trucks. Dump trucks are used to move large amounts of material at once.

All industrial processes have been subject to automation in recent years and earth moving machines are no different. Automation of earth moving machines can bring great benefits to companies, for example improved worker safety and reduced costs.

However, as long as these companies need to equip the machines with a human in the driver's seat, the cost reduction will be negligible [8].

Typically, autonomous vehicles have a multi-module architecture where the perception module interprets all the data from different sensors and from that data generates a world model of what is around the vehicle. One of the main tasks of this perception module is object detection and classification.

Automation of earth moving machines can be seen to have five phases:

1. Manual operation
2. In-sight tele-operation
3. Tele-remote operation
4. Assisted tele-remote operation
5. Fully autonomous operation


Figure 1.1: System overview (graphics PC, control PC, switches, interface hardware, IMU, GPS, sensors, routers, 4G and fiber links to the Internet, and a VPN tunnel).

Currently, industry is only utilizing manual operation, in-sight tele-operation and tele-remote operation. The current research into automation of earth moving machines is between phase 3 and phase 4, as discussed in [8]. To achieve any of these five phases, different hardware and software functions are needed. The automation will also have to be proven safe due to EU directive 2006/42/EC [7]. The functionality can be categorized into the control room, the data link and the vehicle [15]. An overview of these functions can be seen in Figure 1.1.

Real time video feeds are typically used by tele-remote operators, and using the video feed for fully autonomous machines can be advantageous. Such information may include, but is not limited to, classification of objects within the feed, part maps of the objects or distance to the objects. Creating a rule-based algorithm to accomplish these tasks is very difficult because of the vast variation between images, which is why a different approach is needed. Because of this very large variation in the input images, a machine learning¹ approach may be preferred.

Most of the work related to autonomous driving of earth moving machines has been conducted in the broader realm of autonomous driving in environments such as highway driving, urban driving and country driving. Early projects such as the Eureka Prometheus Project produced great results which culminated in VaMoRs-P [9] in the late 80s to early 90s. The DARPA Grand Challenges, such as the 2005 Grand Challenge and the 2007 Urban Challenge, pushed the field forward with vehicles such as Stanley [27], Tartan [28] and Junior [20]. The work was continued in 2009 by Google's self driving

¹ Machine learning is described in Section 2.2.


project whose goal was to drive "fully autonomously over 10 uninterrupted 100-mile routes" [4]. This goal was achieved in a few months. Multiple start-ups have since the early-to-mid-2010s worked on autonomous driving, such as Waymo (Google), Tesla and Comma.ai. Most established automotive manufacturers have also conducted research into autonomous driving, including both cars and lorries.

The biggest difference between the research that has been done on autonomous driving of normal vehicles and autonomous driving of earth moving machines is the environments they operate in. For example, the task of autonomous highway driving is a very constrained problem where the feature space is quite small. The feature space for a fully autonomous earth moving machine in the intended industries is very large. The difference between strip mining and forestry is much larger than the difference between a highway in the USA and a highway in Germany. To our knowledge, no academic work has been conducted aimed towards solving autonomous driving of earth moving machines in the intended industries.

1.2 Motivation

To reach the fifth phase (fully autonomous operation) the vehicle must understand what exists around itself in the world. Without some type of sight, such a vehicle will never be able to function within the established industries. Not knowing what is in front of the vehicle before there is physical contact is incredibly dangerous. When humans maneuver any type of vehicle, sight is the main sense being used. Cameras will be the main sensors used, as most of the infrastructure surrounding earth moving machines is based on vision.

Other sensors have different properties which could be used, together with cameras, to allow the system to get a better understanding of its surroundings. The problem with most of these sensors is that they have low resolution, low range or are very expensive.

These sensors would also have a hard time giving a good understanding of the world when used as a single type of sensor. Cameras, which most closely resemble the human vision system, are cheap, provide very high resolution and have a high range, thus making them good for machine vision when only using a single type of sensor.

The human vision system is used to understand depth, see objects, be alerted to potential dangers and allow for planning. The visual information can also be used to perform tasks better than when only using other types of sensors. An example of this is filling a wheel loader's bucket: the operators use their sight extensively, yet the current research into automation of wheel loaders only uses pressure sensors [8].

Requiring different industries to produce copious amounts of data, such as videos of day to day operations, may not be feasible, especially for smaller companies. Thus, the machine learning model needs to be trained with a smaller amount of data. For comparison, a classic dataset for image recognition such as ImageNet contains over 14 million images [3]. It is therefore very important to either have the machine learning model generalize very well across multiple different industries or allow those different industries to train the model using a small amount of data. It is also important for the data which is being extracted from the real time video stream to be transported to the subsystems which


depend on it. Without such functionality it would be impossible to create a modular solution, and the system would then have to be able to perform every single task which might be needed in the intended industries.

1.3 Problem Definition

The problem being researched within this thesis can be separated into five questions. All these questions exist within the context of earth moving machines and the concerned industries as described above.

Q1: How can existing deep learning models be leveraged to perform object detection and classification from real time video, with the aim of enabling autonomous earth-moving machines?

Using a real time video feed puts specific requirements on both the hardware and software used. These requirements are far more stringent than for object detection using static images. What are these requirements and how are they met? What type of information is interesting to extract and what other systems would want to use it?

Q2: How much training data is needed to reach good performance when using transfer learning?²

What is quality data and how is it collected properly? What properties should the data have and how much variance is needed? Are these different when using transfer learning?

How well can the model perform when using transfer learning? How much data is needed to train the model sufficiently? Do there exist inherent problems with transfer learning?

Q3: What type of environmental variation can make object detection and classification difficult for the deep learning model?

What types of situations can make it difficult for a real time object detector to correctly detect objects? Such variations might be different backgrounds, different light sources or different ground materials.

Q4: How reliable is a deep learning model trained using transfer learning for the task of real time object detection?

When using transfer learning to train a deep learning model for the task of real time object detection, how reliable is such a system? How can the reliability be tested in such a way as to prove it safe?

Q5: How can such a deep learning model be created using SoA principles?³

Service-oriented architecture has a few important principles attached to it. How can these principles be used when creating a deep learning model, or is deep learning development fundamentally different from other software development?

² An explanation of what transfer learning is can be found in Section 2.5.

³ A full explanation of SoA can be found in Section 2.7.


Figure 1.2: Stages of the methodology (review and discussion, requirements, experiments, evaluation).

1.4 Scope and Delimitation

The scope of this thesis is to use real time video, extract information and transport the information over WiFi or cable. There will be no attempt at sending commands to or controlling any part of the earth moving machines. The research done is solely a computer vision and data transport problem. The real time video will come from a normal camera; there will be no extra inputs such as radar or LIDAR. It is also important to note that neural networks will be used together with transfer learning. There will be no attempt to research any other way to solve this problem, such as classic image processing or creating a large data set containing appropriate data.

1.5 Methodology

To be able to answer the questions described in Section 1.3, a combination of theory and experiments was used. Such a mix allows for well structured experiments with deep roots in theory, while the experiments show these theories being applied to real situations, which is an important step for widespread use within industry. A continuous dialogue with industry was upheld to understand what type of information would be interesting to extract from a live video feed. The main industry partner was Volvo Construction Equipment, and the research was thus aimed at answering the questions raised within that dialogue. A short literature review was conducted to allow for a good theoretical understanding.

The research was conducted in different stages and these stages were iterated upon.

The main stages were a research and discussion stage, a requirements definition stage, an experiments stage and an evaluation stage. A visualisation of these stages can be found in Figure 1.2.

Using the work flow described in Figure 1.2 minimizes the risk of incorrectly choosing the requirements or over-engineering one solution. Iterating over the stages also allows the knowledge gained from the experiments and evaluation to be used as a foundation


for a more in-depth discussion and research stage. This led to a deeper understanding of what requirements were needed to answer the research questions defined in Section 1.3. Some of the research questions defined in Section 1.3 might not be directly answered by experiments and will thus be addressed using a fully theoretical method.

An alternative approach could be to leverage simulations, as simulations allow a large amount of data to be created very quickly. Simulations do, however, have challenges in capturing all the possible edge cases that exist, and there has been almost no research conducted that examines how well computer vision systems trained in simulation generalize to the real world. These problems made simulations unsuitable for answering the questions in Section 1.3.

YOLOv3 [25] was chosen as the real time object detector as it is a high performance real time object detector on multiple data sets. It also has a very good implementation [14] in Python using the framework PyTorch, which is very easy to edit. Other real time object detectors with similar performance, such as SSD [17], could have been used. An alternative approach could have been to create a real time object detector from scratch; the problem with this approach is that it is quite difficult to make it fast enough while performing both detection and classification. Creating a completely new real time object detector would have made the scope of the thesis too large.

1.6 Thesis Structure

This thesis first presents some background theory needed to understand the experimental setup and results. The experimental setup and an explanation of how the data collection was performed follow. Penultimately, the results are presented and, lastly, a discussion about the results together with conclusions and some suggestions for future work.


Chapter 2 Theory

2.1 Computer Vision

Computer vision is a subject that has been studied by computer scientists since the 1960s and aims to give computers a way to extract high level information from images or videos.

Examples of such high level information are what objects exist within an image or how far an object is from the camera. There are multiple ways to try to achieve such information extraction, using both classic algorithms and artificial intelligence. Examples of algorithms used are SIFT [18], SURF [5] and BRIEF [6]. In recent years the computer vision space has been taken over by deep learning models, as many of the problems within the two fields have a large overlap. The increase in computing power in recent years has also contributed to the rise of deep learning within the computer vision space.

2.2 Machine Learning

Machine learning (ML) is the study of teaching computers to conduct specific tasks.

Unlike a rule based system, ML is used by humans to conduct specific tasks through data-driven model optimisation, thus avoiding the need for explicit solutions engineered from first principles. The model learns a task by experience with the use of a performance metric [11]. Mathematically this can be seen in Equation 2.1, where f is the model, x is the input, θ denotes the model parameters and y is the output.

y = f (x; θ) (2.1)

2.2.1 Task

The task that the model learns can be anything. A very common task to teach the model is classification of objects within different images. To teach the model how to classify


objects, it is shown features that have been extracted from the images. These features are usually described using a feature vector.

Let the feature space, F_s, be all the features that can be quantitatively extracted from the input to the task, T. Let F_v be the feature vector and f_i be the i-th element of F_v. Thus F_v = (f_1, f_2, ..., f_{n-1}, f_n), where f_i ∈ ℝ for all i. F_v is then used as input to the model to teach it T.

As an example, say the task T of the model is to classify different dog breeds from images. Then F_s could contain information like sex, age, height and weight, and F_v could hold all this information represented numerically. A common way to represent sex is to let males be 0 and females be 1. Thus the F_v of a female dog which is 45 cm tall, 11 years old and weighs 3 kg would be F_v = (1, 11, 0.45, 3). F_v is then used as input to the model to teach it T, as explained above.
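As a small illustration of the feature vector idea above, the following sketch (a hypothetical Python snippet, not part of the thesis implementation) encodes the dog example numerically:

    # Minimal sketch of encoding the dog example as a feature vector F_v.
    # The encoding (sex -> 0/1, height in metres) follows the text above.
    def encode_dog(sex: str, age_years: float, height_m: float, weight_kg: float):
        sex_code = 1.0 if sex == "female" else 0.0
        return [sex_code, age_years, height_m, weight_kg]

    # A female dog, 11 years old, 45 cm tall, 3 kg -> F_v = [1, 11, 0.45, 3]
    print(encode_dog("female", 11, 0.45, 3))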

2.2.2 Experience

As mentioned above, a model is taught a task by learning from experience. The ways of giving a model experience can loosely be divided into two categories: supervised learning and unsupervised learning.

Supervised Learning

Supervised learning means teaching the model a specific task using an input that consists of a feature vector and the correct output. As an example, let the input of a model being trained on an object classification task be a vector that contains the feature vector of the input and the correct output. This output has multiple different names such as label, target or ground truth; these terms will be used interchangeably. The input vector that contains the feature vector and the label is then fed into the model, at which point the model is allowed to learn which features correspond to what output.

Mathematically this can be described using Equation 2.1. The model is given an input, x, and produces, at the start, an arbitrary output, y_1. This output is then measured against the ground truth and the model adapts the parameters, θ, such that the model f outputs a value closer to the ground truth for the input x.

Unsupervised Learning

In contrast to supervised learning, unsupervised learning does not have a label which the inputs correspond to. Thus the model is left to itself to decide what it believes to be the important things to output. The output can be, but is not limited to, anomaly detection, clustering or association.

Using anomaly detection as an example, a person does not need to have a perfect understanding of high voltage (HV) underwater cables to notice anomalies if the person is shown a large number of X-ray images of HV cables. In this case the ability to find anomalies is enough; there is no need to understand what type of anomaly it is or to know whether the cable is safe for use, as cables with anomalies are by definition unsafe to use. Using


unsupervised learning can be favorable in other cases, as finding data that is perfectly labeled and categorized can be very difficult.

There are other examples of learning like semi-supervised learning and regression learning, however they are not relevant for this research.

2.2.3 Performance

To be able to evaluate the knowledge of the model, a performance metric, P, must be used. For a classification task, P might be a number that represents whether the model classified the object correctly or not. Using this, one can easily calculate the accuracy or error rate of the model. Different tasks will have different performance metrics. However, the most important metric for all tasks is how well the model performs on unseen input data; in other words, how well does the model generalize?

2.2.4 Data Set

A data set is a collection of data that is used for machine learning purposes. The data is usually split into two different subsets, called training data and test data. The training data usually contains around 80% of the data within the entire data set and the remaining 20% is the test data. The training data is the data that the machine learning model uses to learn a task and the test data is used to check how well the model performs on the task, given completely unseen data.

Only splitting the data into training and test data can make it difficult to counteract overfitting and underfitting, especially when the dataset is small. Overfitting describes the situation where the model has been trained too much on the training data and has a hard time generalizing to the test data. Overfitting is the result of the model learning the noise in the data rather than the underlying features. Underfitting describes the situation where the model does not perform well on either the training data or the test data.

Underfitting occurs because the model is too simple compared to the task it is trying to learn.

Overfitting and underfitting come from the bias-variance tradeoff, which states that there is a conflict when trying to minimize two error sources at the same time. As the name states, these two errors are bias and variance. Bias is the error that describes bad assumptions made in the learning algorithm. High bias can mean the model has a hard time learning the important features of the data set, in other words underfitting.

Variance, on the other hand, describes the error that occurs from small differences in the input compared to the training data. A high variance can mean that the model is very sensitive to noise, in other words overfitting.

One easy way to mitigate underfitting and overfitting is to preserve 20% of the training data and introduce a new subset called validation data. These 20% are never used to train the model, but are instead used to check how well the model is performing on unseen data after every training step. Whenever a model performs better than the previous best, that model is saved. This allows the model to be loaded after the training has been completed and can lead to a better performing model on never before seen input.
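A minimal sketch of such a split, assuming the data is available as a Python list of (image, label) pairs (the 80/20/20 proportions follow the text; the helper name and values are illustrative, not the thesis implementation):

    import random

    def split_dataset(samples, train_frac=0.8, val_frac=0.2, seed=0):
        """Split into train/validation/test: 80% train, then 20% of train as validation."""
        random.Random(seed).shuffle(samples)
        n_train = int(len(samples) * train_frac)
        train, test = samples[:n_train], samples[n_train:]
        n_val = int(len(train) * val_frac)
        val, train = train[:n_val], train[n_val:]
        return train, val, test

    # Example usage with dummy samples:
    data = [(f"img_{i}.png", f"label_{i}.txt") for i in range(100)]
    train, val, test = split_dataset(data)
    print(len(train), len(val), len(test))  # 64 16 20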


Figure 2.1: Perceptron, with inputs x_0, ..., x_n, weights w_0, ..., w_n, bias b, a summation and an activation function f producing the output y.

Underfitting and overfitting are not uniquely a problem of the actual data used; however, splitting the data set into validation data, training data and test data is quite an easy way to mitigate these problems.

Data Augmentation

Another important way to counteract overfitting and to increase the amount of data within a given dataset is to use data augmentation techniques. Data augmentation helps to counteract overfitting as it increases the amount of training data programmatically.

Data augmentation is a technique where the data within a given dataset is changed in some way during training. Some examples of augmentation techniques are random rotations, random resizing, reflection, etc.

2.3 Artificial Neural Network

There are multiple ways to model machine learning. A model that is widely used is the artificial neural network (ANN). ANNs are loosely modeled after biological structures and processes.

2.3.1 Perceptron

The main building blocks of artificial neural networks are artificial neurons, which have also been loosely modeled after biological processes. The simplest model of a neuron, called a perceptron, has some number of inputs and a single output. The perceptron was famously analysed by Minsky and Papert in 1969 [19]. Figure 2.1 gives an overview of the perceptron structure.

The perceptron can also be described mathematically by Equation 2.2, where x_i are the inputs, w_i are the corresponding weights, b is the bias and f is an activation function. Equation 2.2 can also be written in vector form, as shown in Equation 2.3.


Figure 2.2: Commonly used activation functions: (a) logistic sigmoid, (b) threshold, (c) ReLU, (d) tanh.

In Equation 2.3, x and w are vectors of the inputs and the weights, and g is an activation function suitable for vectors.

y = f(∑_{i=0}^{n} w_i · x_i + b)   (2.2)

y = g(W · x + b)   (2.3)

Activation functions are functions that map the input, the weights and the bias to a specific output. Activation functions are an abstraction of the biological process of a neuron firing. In the simplest form, a neuron is binary such that it is either firing or not firing. Activation functions are used within neural networks to allow a neural network to approximate any function.

The simplest form of activation function is the threshold function, shown in Figure 2.2(b). Using a threshold function as the activation function, the output can only be either 0 or 1. There is an endless number of activation functions; Figure 2.2 shows a few that are commonly used.

Activation functions are typically non-linear, as this helps the model to approximate any complex function.
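As a concrete illustration of Equations 2.2 and 2.3 and the activation functions above, the following sketch (a hypothetical NumPy snippet, not taken from the thesis) implements a single perceptron forward pass:

    import numpy as np

    def perceptron(x, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
        """Forward pass of a perceptron: y = g(w · x + b), Equation 2.3.
        The default activation is the logistic sigmoid from Figure 2.2(a)."""
        z = np.dot(w, x) + b
        return activation(z)

    x = np.array([0.5, -1.2, 3.0])   # inputs
    w = np.array([0.8, 0.1, -0.4])   # weights
    b = 0.2                          # bias
    print(perceptron(x, w, b))       # output y in (0, 1)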

2.3.2 Structure of ANNs

The structure of an ANN consists of three types of layers. These layers are called the input layer, the output layer and hidden layers. Every ANN has one input layer, one


Figure 2.3: ANN example structure, with three inputs, one hidden layer and two outputs.

output layer and a varying number of hidden layers. Each layer contains a finite number of perceptrons. Each layer's output connects to the next layer's input, which makes it possible for the network to create very complex connections and allows it to solve complex problems.

An overview of an ANN with one hidden layer can be seen in Figure 2.3.

As Figure 2.3 shows, the connections within the network do not form any cycles. This type of network is called a feedforward neural network, as the information always flows forwards in the network. The network shown in the figure is also a fully connected neural network, as every output of a perceptron in one layer is connected to the input of every perceptron in the next layer.

According to the universal approximation theorem, it is possible to represent any mapping from input to output using only one hidden layer. However, this is not always favorable, as it can add a lot of complexity to the hidden layer and its size might become unreasonably large.
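To make the structure concrete, the sketch below builds a small fully connected feedforward network like the one in Figure 2.3 using PyTorch (the framework used later in the thesis); the layer sizes are illustrative assumptions, not the thesis model:

    import torch
    import torch.nn as nn

    # Fully connected feedforward network: 3 inputs -> 1 hidden layer -> 2 outputs.
    model = nn.Sequential(
        nn.Linear(3, 8),   # input layer -> hidden layer
        nn.ReLU(),         # non-linear activation
        nn.Linear(8, 2),   # hidden layer -> output layer
    )

    x = torch.randn(1, 3)   # one example with three input features
    y = model(x)
    print(y.shape)          # torch.Size([1, 2])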

2.3.3 Training Scheme

Combining all the knowledge from Section 2.2, the next idea to discuss is how a training scheme could look for an artificial neural network. As supervised learning is the main focus of this report, the training scheme will be described from a supervised learning perspective.

Many of the ideas explained below do not apply to unsupervised learning, as there is no ground truth for the input.

To be able to properly explain the training scheme recall Equations 2.2 and 2.3.

Loss Function

As mentioned in Section 2.2.3, it is important to be able to analyze the performance of the model on a given task. A loss function is, in very simple terms, a function that calculates how incorrect the prediction of the model was compared to the ground truth for a given input.

Let the training data, D_t, be a set of pairs containing an input, i, and a label, l, as written in Equation 2.4.


        x_1     x_2     x_3
l_1     5       -3.2    -1.1
l_2     -6.2    -1.1    4
l_3     1       -3      2

Table 2.1: Example score matrix. x_i is the input and l_i is the corresponding label.

D_t = {(i_1, l_1), (i_2, l_2), ..., (i_{n-1}, l_{n-1}), (i_n, l_n)}   (2.4)

Let the loss function, L, be a summation of the losses of all individual examples, as shown in Equation 2.5, where L_i is the loss of a single example, f(x_i, W) is the output of the network for the i-th example and l_i is the label for the i-th example.

L(W) = (1/N) ∑_{i=0}^{N} L_i(f(x_i, W), l_i)   (2.5)

Different tasks use different loss functions to get the best result. Classification tasks usually use cross entropy loss or square loss, while regression most commonly uses mean square error.

As an example, let the weights within the network have been randomly initialized and let the scores of this network be as shown in Table 2.1. If class A gets a high score and class B gets a low score, the classifier believes with a high probability that the input is class A and with a low probability that the input is class B. Let the loss function within this example be a hinge loss function, defined in Equation 2.6, where L_i is the loss of the i-th input, s_j is the score of the i-th input for the j-th label and s_{y_i} is the score of the i-th input on its corresponding label.

L_i = ∑_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)   (2.6)

Using Equation 2.6 together with the values in Table 2.1 it is possible to calculate all the individual class losses.

L_1 = max(0, −6.2 − 5 + 1) + max(0, 1 − 5 + 1) = 0   (2.7)

L_2 = max(0, −3.2 − (−1.1) + 1) + max(0, −3 − (−1.1) + 1) = 0   (2.8)

L_3 = max(0, −1.1 − 2 + 1) + max(0, 4 − 2 + 1) = 3   (2.9)

Using Equations 2.7, 2.8 and 2.9 together with Equation 2.5 gives the current loss of the entire model as shown below.

L = L_1 + L_2 + L_3 = 0 + 0 + 3 = 3


Thus the loss for the example network above is L = 3. The goal is to minimize this value, preferably getting it to zero, as this would mean all predictions were correct. This is achieved by changing the weights of the network when a prediction is incorrect in order to get a better score for said prediction. This is done with gradient descent and back propagation, which are discussed below.
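The hinge loss computation in Equations 2.7–2.9 can be reproduced with a short Python sketch (illustrative only, using the scores from Table 2.1):

    # Hinge loss (Equation 2.6) over the score matrix in Table 2.1.
    # scores[j][i] is the score of input x_i for label l_j; input x_i has correct label l_i.
    scores = [
        [5.0, -3.2, -1.1],   # l_1
        [-6.2, -1.1, 4.0],   # l_2
        [1.0, -3.0, 2.0],    # l_3
    ]

    def hinge_loss(column, correct):
        s_y = column[correct]
        return sum(max(0.0, s_j - s_y + 1.0) for j, s_j in enumerate(column) if j != correct)

    losses = []
    for i in range(3):
        column = [scores[j][i] for j in range(3)]  # scores of input x_i for all labels
        losses.append(hinge_loss(column, correct=i))

    print(losses, sum(losses))  # [0.0, 0.0, 3.0] and total loss L = 3.0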

Gradient Descent

Gradient descent is used to minimize the loss function, ideally such that L(W) = 0. This is done by trying to find the global minimum of the continuous function L(W), where W is the point where the loss is minimized. The first step is to find the gradient of the current loss, X, as shown in Equation 2.10.

X = (d/dW) L(W)   (2.10)

The gradient of the current loss calculated in Equation 2.10 is then used in Equation 2.11, where X′ is the resulting update and lr is the learning rate. The learning rate is a value that determines how far each iteration step moves along the gradient.

X′ = −X · lr   (2.11)

Applying X′ to the weights thus moves them closer to a minimum of the function L(W). This process is repeated until the result is good enough. The best result is to find the global minimum of L(W); however, this can be very difficult if there are multiple local minima within L(W).

The number of training examples that the model works through before updating the internal parameters during gradient descent is a parameter usually called the batch size. Its value is usually set depending on the size of the model and the size of the input.
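A minimal sketch of the update rule in Equations 2.10 and 2.11, for a one-parameter loss L(W) = (W − 3)² (the loss function and learning rate are illustrative assumptions):

    # Gradient descent on L(W) = (W - 3)^2, whose minimum is at W = 3.
    def loss_gradient(w):
        return 2.0 * (w - 3.0)   # dL/dW

    w = 0.0      # initial weight
    lr = 0.1     # learning rate
    for step in range(50):
        x_grad = loss_gradient(w)      # X in Equation 2.10
        w += -x_grad * lr              # apply X' = -X * lr (Equation 2.11)

    print(w)  # close to 3.0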

Back Propagation

Back propagation is an efficient implementation of gradient descent that leverages the chain rule of calculus, as described in Equation 2.12. In simple terms, back propagation allows information about the gradient to flow backwards in the network. Recall that an ANN consists of some number of perceptrons where one perceptron's output becomes another perceptron's input, as shown in Figure 2.3. Also recall that information flows forward in the network, such that it is a feedforward neural network. Using this information and Equation 2.12, back propagation is explained below.

dx/dz = (dx/dy) · (dy/dz)   (2.12)

Let x be the input of an ANN and let the network have some number of layers, L_1, L_2, ..., L_{n-1}, L_n. Let x propagate through each of the layers and reach the output layer. At the output layer, calculate the output error using gradient descent, which will be the start of the back propagation. Move backwards in the ANN and calculate the error


in each of the layers. Update the weights and the biases within each layer and continue backwards. Once the input is reached, the network's weights and biases have been updated in such a way as to minimize the loss function.
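In practice, frameworks such as PyTorch implement back propagation automatically; the sketch below (illustrative, not the thesis code) shows one combined gradient descent and back propagation step for the small network defined earlier:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 2))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(4, 3)        # a batch of 4 inputs (batch size = 4)
    target = torch.randn(4, 2)   # corresponding ground truth

    prediction = model(x)            # forward pass
    loss = loss_fn(prediction, target)
    optimizer.zero_grad()
    loss.backward()                  # back propagation: gradients flow backwards
    optimizer.step()                 # gradient descent update of weights and biases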

2.4 Convolutional Neural Network (CNN)

Just like an ANN, the convolutional neural network (CNN) consists of an input layer, some number of hidden layers and an output layer. The main difference between a CNN and an ANN is that the CNN has at least one layer that uses convolution in place of the normal matrix multiplication shown in Equation 2.3. Convolution also allows the input size to vary, unlike in ANNs where the input size is fixed. Convolution is a mathematical operation that takes two functions f(t), g(t) and produces a third function that shows how the shape of one function affects the other. The definition of this operation is:

h(t) = (f ∗ g)(t) = ∫_{−∞}^{∞} f(τ) g(t − τ) dτ

CNN layers are built on three different principles, which are sparse interaction, parameter sharing and equivariant representation [11]. Sparse interaction means that the kernel size of the network is smaller than the input. This allows the network to be a lot more efficient, as the algorithm that matrix multiplication relies on is an O(n · m) algorithm, where n is the input size and m the output size. However, if the number of connections to each layer is limited to k, the run time will instead be O(k · m), which can allow for a big run time decrease as k is usually magnitudes smaller than n.

Parameter sharing refers to the fact that parameters are reused within a CNN in a way that they are not in normal ANNs. This has no effect on the run time; however, it allows for a lower memory requirement, which is very important in very large CNNs.

The final principle of a CNN, equivariant representation, is a side benefit of parameter sharing. Formally, a function f(x) is equivariant to g if and only if g(f(x)) = f(g(x)). From a machine learning perspective this means that if the input is, for example, shifted by 1, the output will also be shifted by 1. That being said, a CNN is not equivariant to all types of data augmentation, such as rotations and scaling; if this is to be supported, other techniques are needed. Convolutional layers thus contain a convolution and a non-linear activation function, such as ReLU.

Generally the CNN also has one or more pooling layers. A pooling layer is a layer that takes an input of some size and downsizes it to some other, smaller size. This is done by subsampling squares within the input and choosing the mean or the maximum value of said square. An example CNN can be seen in Figure 2.4.



Figure 2.4: Example CNN. Green layers are Convolutional layers, blue are pooling layers and yellow are fully connected layers. The gray is the input layer and red the output layer.
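A small PyTorch sketch of a CNN like the one in Figure 2.4, with convolutional, pooling and fully connected layers (the exact sizes are illustrative assumptions, not the thesis architecture):

    import torch
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution + ...
        nn.ReLU(),                                     # ...non-linear activation
        nn.MaxPool2d(2),                               # pooling layer (downsizes by 2)
        nn.Conv2d(16, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 16 * 16, 10),                   # fully connected output layer
    )

    x = torch.randn(1, 3, 64, 64)   # one RGB image of size 64x64
    print(cnn(x).shape)             # torch.Size([1, 10])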

2.5 Transfer Learning

Transfer learning refers to the practice of using a CNN that has been pretrained on a large amount of data to solve a task T1, replacing one or multiple layers within the CNN and then training it on a different, smaller dataset to solve a task T2. As an example, assume that the CNN can classify cars after being trained using a very large dataset. Replace the output layer with a linear layer and train it to classify trucks using a much smaller dataset. ImageNet [3] is a large dataset that contains around 14 million images and is often used as the dataset the CNN is pretrained on.

Formally transfer learning can be described as the following:

Definition 1. Let D_s be the source domain and T_s be the corresponding source task. Let D_t be the target domain and T_t be the corresponding target task. Let f_t be the predictive function for T_t. Transfer learning then aims to improve the learning of f_t in D_t using the already learned knowledge in D_s and T_s, where D_s ≠ D_t and T_s ≠ T_t [21].

The main reason why this type of learning approach works is that CNNs learn things that are universal for classification problems early. This includes, but is not limited to, edge detection, colour detection and lighting changes. The first time transfer learning was mentioned was in the paper Discriminability-Based Transfer between Neural Networks written by Pratt [22]. Transfer learning is used to decrease the two things that are needed to train a CNN from scratch: an incredibly large data set and a lot of computing power.

There exist two main ideas when it comes to transfer learning: fine-tuning and fixed feature extraction. Both of them revolve around changing the weights within the network while training; however, how they go about doing this is quite different. When using fixed feature extraction, the idea is to freeze the early layers in the network and only train the last layers. In contrast, when using fine-tuning, the network is trained normally, where all the weights are allowed to change. Both of these approaches still replace some number of layers in the network before training.
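The two ideas can be sketched in PyTorch as follows (illustrative only; the pretrained model, layer names and the number of target classes are assumptions, not the thesis setup):

    import torch.nn as nn
    from torchvision import models

    # Load a CNN pretrained on ImageNet and replace the output layer
    # so it predicts 3 classes instead of the original 1000.
    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 3)

    # Fixed feature extraction: freeze all pretrained layers,
    # only the new output layer will be trained.
    for name, param in model.named_parameters():
        if not name.startswith("fc."):
            param.requires_grad = False

    # Fine-tuning would instead leave requires_grad = True everywhere,
    # so all weights are allowed to change during training.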


When pretraining a network for the purposes of transfer learning, the most common dataset to use is ImageNet. According to [16], it seems that if a network has a high ImageNet accuracy, the network will also have a high transfer accuracy. Why this is the case is still in need of more research; however, [13] did explore this topic a bit.

2.6 Real Time Object Detection

Real time object detection is a sub-category of computer vision where the aim is to detect and classify objects in real time. This means that the inference time of the deep learning model has to be fast enough to deal with an input under real time constraints, such as video, whilst also maintaining a good accuracy.

Real time object detection is a field with quite a short history, as the hardware that was bleeding edge a few years ago did not have the computational power to perform inference in real time, i.e. within 33 ms. The paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [26] showed that there was a possibility to do inference fast enough to reach 15 fps. A few months later the first YOLO paper [24] was released. YOLO was extremely fast and could achieve inference speeds allowing 45 fps, whilst achieving a mAP score¹ of 63.4. Real time object detectors were then incrementally improved by SSD [17], YOLO9000 [23] and YOLOv3 [25].

2.6.1 You Only Look Once (YOLOv3)

YOLOv3² is an object detection system that focuses heavily on inference speed, making it possible to run in real time. To be able to run as fast as YOLOv3 does, it predicts bounding boxes and classifies the objects within the bounding boxes in one pass.

Bounding boxes are boxes that each encompass one object; thus every detected object will have a bounding box around it to show where it is. Older object detection systems had to first find the bounding box and then classify the object within it. As the name alludes to, YOLOv3 is the third version of the object detection algorithms in the YOLO family. YOLOv3 does prediction on a per-frame basis; no temporal information is used.

The network architecture of YOLOv3 consists of three different parts. The first is Darknet-53, which is the backbone network [23]. Secondly there is an upsampling network and lastly there are the detection layers, called YOLO layers. The network structure can be seen in Figure 2.5.

Darknet-53, the backbone network, is used to extract features from the input image. Darknet-53 mainly consists of residual blocks and 53 convolutional layers. A residual block is a block consisting of a pair of 3x3 and 1x1 convolutional layers together with a shortcut connection. A full overview of the Darknet-53 architecture can be found in Figure 2.6.

¹ A description of mAP can be found in Section 2.6.2.

² An in-depth explanation of how YOLOv3 works can be found in [25].


Figure 2.5: Overview of the YOLOv3 structure. The numbers below each layer show the dimension decrease of the input at that layer (1/2, 1/4, 1/8, 1/16, 1/32). The gray layer is the input layer, the blue layers are part of the backbone network Darknet-53, the red layers are upsampling layers and the yellow layers are YOLO layers, which produce the detections.

The upsampling network and the three YOLO layers use the features that Darknet-53 extracted to detect objects at three different scales. These scales are 1/32, 1/16 and 1/8 of the input size. This was done to remedy the fact that both YOLO and YOLO9000 (v2) had trouble detecting small objects. Each YOLO layer also consists of some number of convolutional layers with batch normalization and a ReLU activation function.

Since object detection is done at three different scales, the output of the entire network is a 1 x 1 x [3 x (4 + 1 + C)] tensor, where C is the number of classes in the dataset, 4 is the number of bounding box offsets and 1 is the confidence prediction. YOLOv3 also uses independent logistic classifiers to calculate the probability of an object belonging to each class in the dataset. This is done as there might exist classes within the dataset that are not mutually exclusive; an example of this could be "person" and "woman".


Figure 2.6: Darknet-53 architecture. Image from [25].

Loss Function

The YOLOv3 loss function consists of three different parts as described below:

• Bounding box error

• Confidence error

• Classification error

The bounding box error describes how incorrect the predicted coordinates were compared to the ground truth, using mean square error. The confidence error describes whether there is an object in the predicted bounding box or not; it also uses mean square error. Lastly, the classification error describes how incorrect the predicted class inside the bounding box was. The classification error uses cross entropy error. The loss function uses two scale factors, λ_coord and λ_noobj, to remedy model instability. In YOLO, λ_coord = 5 and λ_noobj = 0.5, where λ_coord is used to increase the loss of the bounding box coordinate predictions and λ_noobj is used to decrease the loss of the predicted boxes containing no objects.

The calculation of the loss can be seen in Equation 2.13.


L = λ_coord ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
  + λ_coord ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
  + ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
  + λ_noobj ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
  + ∑_{i=0}^{S²} 1_i^{obj} ∑_{c ∈ classes} (p_i(c) − p̂_i(c))²   (2.13)

where 1_i^{obj} denotes whether an object appears in cell i and 1_{ij}^{obj} denotes that the j-th bounding box predictor in cell i is "responsible" for that prediction [24]. (x̂_i, ŷ_i) denotes the center coordinates of the predicted bounding box in cell i and (x_i, y_i) denotes the coordinates of the ground truth bounding box. ŵ_i denotes the width of the predicted bounding box and ĥ_i the predicted height, whereas w_i denotes the ground truth width of the bounding box and h_i the ground truth height. Ĉ_i denotes the predicted confidence score of cell i and C_i denotes the ground truth confidence score. classes is the set of all classes being predicted, p_i(c) denotes the probability that the i-th bounding box contains an object of class c and p̂_i(c) denotes the predicted probability of bounding box i containing class c.

2.6.2 Performance Metrics

The metrics described below, especially mAP and the F1 score, are used together with the loss value to determine how well the model performed on a given dataset.

Intersection Over Union (IOU)

Intersection over union is a metric evaluating the overlap of two bounding boxes. Let B_p be the predicted bounding box, B_g be the ground truth bounding box and f(x) be a function calculating the area of a bounding box; the IOU is then calculated by Equation 2.14.

IOU = area of overlap / area of union = f(B_g ∩ B_p) / f(B_g ∪ B_p)   (2.14)
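A small sketch of Equation 2.14 for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (an illustrative helper, not the thesis implementation):

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)          # area of overlap
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter                        # area of union
        return inter / union if union > 0 else 0.0

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143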

Recall & Precision

Recall is used to calculate how well a model finds all the positives, and precision measures how accurate the model is, that is, how many of the objects it finds are relevant. How to calculate recall and precision can be found in Equations 2.15 and 2.16, where TP are all the


true positives such that IOU ≥ threshold, FP are all the false positives such that IOU < threshold and FN are all the false negatives.

Recall = TP / (TP + FN)   (2.15)

Precision = TP / (TP + FP)   (2.16)

Mean Average Precision (mAP)

Mean average precision (mAP) is an object detection metric that multiple data sets use to compare different object detectors. mAP is calculated as the area under a precision vs recall curve.

Thus mAP is calculated by interpolating all points on the precision vs recall curve and then calculating the area under the curve. This is done with Equation 2.17 together with Equation 2.18, where p(r̃) is the measured precision at recall r̃.

mAP = ∑_{r=0}^{1} (r_{n+1} − r_n) · p_interp(r_{n+1})   (2.17)

p_interp(r_{n+1}) = max_{r̃ : r̃ ≥ r_{n+1}} p(r̃)   (2.18)

F1 Score

The F1 score can be seen as the weighted average of the precision and recall, as shown in Equation 2.19. The F1 score a model can achieve falls between 0 and 1, where 0 is the minimum and 1 is the maximum. The higher the score, the better the model performs.

F1 = 2 · (precision · recall) / (precision + recall)   (2.19)
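The sketch below computes precision, recall and F1 score from counted detections (Equations 2.15, 2.16 and 2.19); the counts are made-up illustration values:

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall and F1 from true positives, false positives, false negatives."""
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
        return precision, recall, f1

    # Example: 90 correct detections, 10 spurious ones, 5 missed objects.
    print(precision_recall_f1(tp=90, fp=10, fn=5))  # (0.9, ~0.947, ~0.923)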

2.7 Service Oriented Architecture (SoA)

Service oriented architecture (SoA) [10] is a software architecture style that aims to create services rather than separate, complete software solutions. A service should be a logical representation of some business process that is used; an example of this could be a payment system.

All services should be encapsulated; however, they may be composed of other services and they should be loosely coupled from each other. Encapsulation means that the service should only present the user with some type of input to trigger an action; the user should not need to understand the subsystems at all. Using a payment as an example again, the user only needs to enter the bank card information and press pay. The


company that uses said payment service should only see that the payment went through.

Neither the customer nor the company should have to understand how the service works internally. As the system is loosely coupled, it is very easy to add or remove services without having to change the entire system. All these services should also be able to communicate through some type of network.

Each service can play one of three roles: service provider, service registry and service consumer. The service registry is the main block that makes information flow through the system. It enables the communication between providers and consumers by helping the consumers to discover the providers. The service registry is also used to authenticate which data each and every consumer can consume, thus making the communications more secure. Using this type of structure also makes the system more efficient, reliable and scalable.

The service provider provides the service registry with information that the service consumers can consume. The type of information that the service provider provides to the service registry can be anything; examples are real time sensor data or historical data.

The service providers also give the service registry information about what category the service should be listed under and what type of agreement is required to use the service.

As mentioned, the service consumers consume information from the service providers. To do so, the consumer locates entries in the service registry and then binds to the service provider. A single consumer can consume data from multiple different service providers at once.

One implementation using the SoA philosophy is the Arrowhead framework [1].

2.7.1 Arrowhead Framework (AHF)

The Arrowhead framework (AHF) [1] is an SoA style framework that aims to simplify development, deployment and operation of inter-connected heterogeneous devices. The main idea of the AHF is that information is usually quite local: if a system controlling the indoor temperature of a building relies on sensors to decide whether to increase or decrease the temperature, those sensors are geographically close to the indoor temperature system. To take advantage of this fact, the AHF uses the idea of local SoA clouds that are connected to allow for intercommunication [2]. This allows tasks that are within a local cloud to be protected and encapsulated. By using concepts such as late binding and loose coupling, the solution is allowed to scale during run time.

Late binding means that all methods are looked up by name during run time, rather than early binding where the methods are saved in the compiled program's virtual method table, or "v-table". Using late binding makes the program a lot less vulnerable to version conflicts as it does not require the compiler to reference libraries at compile time; however, it does have a performance impact, as finding methods by name is less efficient than finding them in the compiled program's v-table.

Loose coupling is a programming paradigm where each and every component in a system does not rely on knowing the structure of other components in the system. This makes it very easy to replace a single component's implementation in a system as long as


the new component provides the same service as the old one. A loosely coupled system is also less reliant on a specific platform or programming language. Distributed systems often leverage the loosely coupled design paradigm; an example of this is data replication.


Chapter 3 Experimental Setup

3.1 Data Collection

As working with real earth moving machines is a very difficult and time consuming process, the experimental setup instead uses a remote controlled wheel loader and a toy dump truck, both at smaller scales. The wheel loader was of scale 1:14 and the dump truck of scale 1:16. The wheel loader was equipped with a single camera to be able to collect the data.

The camera was a Foxeer 4K box action camera equipped with a wide angle lens. The camera had the possibility to record in 4K/30fps; however, all the data was recorded in 1080p/30fps. All other camera settings were factory defaults.

The data was collected by attaching the camera to the roof of the cab of the wheel loader and positioning the dump truck in different orientations. Examples of such orientations were turning the dump truck or raising the tipping body. Two examples of how the dump truck was positioned can be seen in Figures 3.1 and 3.2.

To be able to answer the questions defined in Section 1.3, data was collected in three different scenarios. The first scenario was an indoor scenario with the same light source and ground material. The second scenario was an outdoor scenario with similar lighting conditions but different ground materials. The third and final scenario was an outdoor scenario with similar ground materials and different lighting conditions. In each scenario the dump truck was positioned in such a way that the wheel loader would approach within the specific area shown in Figure 3.3. In the figure, the area in which the wheel loader can approach the dump truck is the area created by the diagonally drawn orange lines.

Examples of correct approach paths are the green arrows. The number of images collected in each scenario can be seen in Table 3.1.


Figure 3.1: Example orientation for the dump truck where the tipping body is in a down orientation.

Figure 3.2: Example orientation for the dump truck where the tipping body is in an up orientation.

Table 3.1: Number of images collected for each scenario

Scenario            Number of images
Indoor              362
Ground variation    187
Light variation     114


Figure 3.3: Bird's eye view of the dump truck (cab and tipping body). The wheel loader can approach the dump truck anywhere in the green area and it will be seen as a correct approach. θ ≈ 30°.

Indoor Scenario

The data collected during the indoor scenario was the least varied, as the ground material and light source were constant while the orientation of the dump truck changed, as seen in Figures 3.1 and 3.2. The background was also quite rigid in this scenario, as all the data was collected in the same room.

Ground Variation Scenario

In this scenario, the type of ground materials below the dump truck was varied. The different materials used were:

• Grass

• Gravel

• Snow

• Pavement

The light source and the background were kept as similar as possible between each image; however, as the data was collected outdoors, there was some variation.

Light Variation

In this scenario, the way the light could impact the appearance of the dump truck was varied. This means that the light was used to create shadows that affect the dump truck's appearance, not to abuse limitations in the camera, such as shining a light straight into the lens to overexpose it. The different light variations can be seen below:

• Normal day light

• Shadow on top of the entire dump truck but not the wheel loader


Table 3.2: Image split for each dataset

Scenario                      Train images    Validation images
Indoor                        290             72
Ground variation              150             37
Ground and light variation    242             59

• Shadow on top of parts of the dump truck but not the wheel loader

• Dump truck in sunlight and wheel loader in shadow

The ground and background variation was kept to a minimum; however, as the data was collected outdoors, there is some variation in both the ground materials and the background.

Data Sets

From the three scenarios described above, three data sets were created. The first comprised all the indoor images, the second all the ground variation images and the third all the ground variation and light variation images. The images were split into training and validation images such that 80% of the images were used for training and 20% were used for validation. The number of images in each data set can be seen in Table 3.2.

Some images from the data set can be seen in Figure 3.4.

3.2 Data Labeling

Since YOLOv3 uses a supervised learning algorithm, all the data collected as video had to be annotated. From the collected video, every 15th frame was extracted and saved as a .png file to be labeled. Each frame was given a corresponding .txt file as a label, such that the structure of the paths is, for example, /data/train/images/1.png for the image and /data/train/labels/1.txt for the label.

Each label file contained rows where each row corresponded to an object within the image; thus each image could contain any number of objects. Each row in the file had 5 entries, all separated by a whitespace character. An overview of the structure can be seen in Table 3.3.

Using the first row in Table 3.3 as an example, the entries can be understood as follows. The first entry (0) is a number which corresponds to the class the object belongs to. There were 3 classes in the data set, where 0 corresponds to a wheel, 1 to the cab and 2 to the tipping body.

The second and third entries are the normalized x and y coordinates of the center of each bounding box. That the coordinates are normalized means that the center coordinates were divided by the width or height of the full image to get a value between 0 and 1. Normalizing the coordinates in this way decreases the complexity when resizing and augmenting the data during training. Equation 3.1 shows how the x coordinate was normalized, where x0 is the non-normalized value of x and w0 is the width of the image.


Figure 3.4: Some examples from the different data sets.


Table 3.3: Label structure

Class x y w h

0 0.459896 0.739352 0.078125 0.137963
0 0.540395 0.730556 0.074479 0.131481
0 0.521615 0.640741 0.048438 0.025926
0 0.720313 0.705556 0.048958 0.105556
0 0.471615 0.646759 0.017188 0.019444
1 0.688542 0.556481 0.067708 0.088889
2 0.471354 0.512963 0.196875 0.424074

The y coordinate was normalized using Equation 3.2, where y0 is the non-normalized value of y and h0 is the height of the original image.

x = x0 / w0 (3.1)

y = y0 / h0 (3.2)

The fourth and fifth entries in the row are the width, w, and height, h, of the box, also normalized. In this data set the width and height of each bounding box are defined as the distance from the center of the bounding box to its edge; thus the width is the distance from the center to the left or right edge of the bounding box, and the height is the distance from the center to the top or bottom of the bounding box. A visualization of x, y, w and h can be seen in Figure 3.5.
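
To make the label format concrete, the following is a small Python sketch (not part of the implementation in [14]) that reads a label file and converts the normalized center coordinates back to pixel coordinates by inverting Equations 3.1 and 3.2. The class index, width and height are returned as stored, and the file path and image size in the usage example are only assumptions.

    def read_labels(label_path, image_width, image_height):
        """Parse a label file into (class, x_px, y_px, w, h) tuples."""
        objects = []
        with open(label_path) as label_file:
            for line in label_file:
                cls, x, y, w, h = line.split()
                # Invert Equations 3.1 and 3.2 by multiplying with the image size.
                x_px = float(x) * image_width
                y_px = float(y) * image_height
                objects.append((int(cls), x_px, y_px, float(w), float(h)))
        return objects

    # Example with a hypothetical label file and image resolution:
    # boxes = read_labels("/data/train/labels/1.txt", 1920, 1080)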

3.3 Data Augmentation

As discussed in Section 2.2.4, data augmentation is a powerful technique to increase the amount of available data programmatically. The augmentation techniques used in this implementation of YOLOv3 are translation, rotation, shear, scale, reflection, HSV saturation and HSV intensity [14]. All of these were randomly applied to images during training, and the labels were automatically updated to correspond to the augmented image.

Translation is an augmentation technique which moves the image along the x or y axis to help the neural network understand that objects can exist anywhere in an image. See Equations 3.3 and 3.4, where (x, y) are the old center coordinates of the image, (x0, y0) are the new center coordinates and D is the distance the object is moved along each axis.

x0 = x + Dx (3.3)

y0 = y + Dy (3.4)


Figure 3.5: Bounding box coordinates, showing the center (x, y) together with w and h.

Rotation is an augmentation technique that rotates the entire image around the center point by some angle. As it does not make sense for the earth-moving machine to be upside down, the rotation was limited to −5° to +5°. The equations for rotation can be seen in 3.5 and 3.6, where φ is the angle of rotation, (x, y) are the old coordinates of the image and (x0, y0) are the new coordinates.

x0 = x · cos(φ) − y · sin(φ) (3.5)

y0 = y · cos(φ) + x · sin(φ) (3.6)

Shear is an augmentation technique that maps a rectangular 2D image onto a parallelogram. Shearing is a linear transformation that moves each point of the image in a fixed direction, and it does not change the area of the image. In this implementation there was a random chance that the image could be sheared by ±2°.

Equations 3.7 and 3.8 show shearing, where (x0, y0) are the new center coordinates, (x, y) are the old center coordinates, and Sx and Sy are the values corresponding to the angle by which to shear the image.

x0 = x + Sx · y (3.7)

y0 = x · Sy + y (3.8)

Scale is an augmentation technique that rescales the input image by some percentage. The percentage change in this implementation was ±10%. The scaling equations can be seen in 3.9 and 3.10, where (x, y) are the old coordinates, (x0, y0) are the new coordinates and S is the scaling factor; if S > 1 the image will increase in size and if S < 1 the image will decrease in size.

x0 = x · Sx (3.9)

y0 = y · Sy (3.10)

Reflection is an augmentation technique that "flips" the input image through either the x-axis or the y-axis. As mentioned above, it makes no sense for the earth-moving machine to be upside down, hence the image only has a 50% chance of being reflected horizontally. Thus, to reflect a point (x, y) in the image over the y-axis to the new point (x0, y0), Equation 3.11 is used to calculate x0 while y0 = y.

x0 = −1 · x (3.11)
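
The geometric augmentations above (translation, rotation, shear, scale and reflection) can all be expressed as one affine transformation. The sketch below illustrates that idea with OpenCV and NumPy; it is not the augmentation code from [14], the parameter ranges simply mirror the ones stated above, and only the bounding-box centers are transformed.

    import math
    import random

    import cv2
    import numpy as np

    def random_geometric(image, centers_px):
        """Apply a random rotation, shear, scale, translation and horizontal flip
        to an image and to a list of bounding-box centers given in pixels."""
        h, w = image.shape[:2]
        angle = random.uniform(-5, 5)                 # rotation, +-5 degrees
        shear = math.radians(random.uniform(-2, 2))   # shear, +-2 degrees
        scale = random.uniform(0.9, 1.1)              # scale, +-10 %
        tx = random.uniform(-0.1, 0.1) * w            # translation in x
        ty = random.uniform(-0.1, 0.1) * h            # translation in y
        flip = random.random() < 0.5                  # 50 % horizontal reflection

        # Rotation and scale around the image center, then shear, then translation.
        R = np.vstack([cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale), [0, 0, 1]])
        S = np.array([[1, math.tan(shear), 0], [0, 1, 0], [0, 0, 1]])
        A = S @ R
        A[0, 2] += tx
        A[1, 2] += ty

        out = cv2.warpAffine(image, A[:2], (w, h))
        if flip:
            out = cv2.flip(out, 1)

        # Transform the box centers with the same matrix; mirror x if flipped.
        new_centers = []
        for x, y in centers_px:
            px, py, _ = A @ np.array([x, y, 1.0])
            new_centers.append((w - px if flip else px, py))
        return out, new_centers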

HSV Saturation is an augmentation technique that changes the saturation of the image. The saturation describes the intensity of colour in a given image. The saturation had a chance of being changed by up to ±50%.

The last augmentation technique used was HSV Intensity, which changes the brightness of the image by some amount, where a value of 0 is completely black and a value of 100 is completely white. The intensity also had a chance of being changed by up to ±50%.
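
The two HSV-based augmentations can be sketched as follows; the ±50% factors mirror the ranges stated above, but this is an illustrative approximation rather than the exact code in [14].

    import random

    import cv2
    import numpy as np

    def random_hsv(image):
        """Randomly scale the saturation and value (intensity) channels by up to +-50 %."""
        hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
        hsv[..., 1] = np.clip(hsv[..., 1] * random.uniform(0.5, 1.5), 0, 255)  # saturation
        hsv[..., 2] = np.clip(hsv[..., 2] * random.uniform(0.5, 1.5), 0, 255)  # intensity
        return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)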

3.4 Object Detection Implementation

The YOLOv3 implementation used was [14]. However, since YOLOv3 was originally created to be trained on COCO, which contains 80 classes, some small changes had to be made to the structure to make it handle fewer classes. This was achieved by changing the number of filters and classes in each yolo layer. As the implementation generates the model from a .cfg file, this was done by simply changing the filters such that filters = (4 + 1 + C) ∗ 3, where C is the number of classes, and setting classes to the number of classes. This means that filters = 24 and classes = 3.
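
For reference, the change corresponds to editing the three detection heads in the .cfg file so that each looks roughly like the excerpt below, which mirrors the first head of the official yolov3.cfg with only filters and classes changed; the other two heads are edited the same way but keep their own mask values.

    [convolutional]
    size=1
    stride=1
    pad=1
    # filters = (4 + 1 + C) * 3 = 24 for C = 3 classes
    filters=24
    activation=linear

    [yolo]
    mask = 6,7,8
    anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
    classes=3
    num=9
    jitter=.3
    ignore_thresh = .7
    truth_thresh = 1
    random=1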

To train the network using transfer learning, the official weights1 were imported and used as starting weights, rather than using randomly generated weights. The hyperparameters used can be seen in Table 3.4. The batch size was set to 3, as the GPU used during training was an NVIDIA 980 TI and any higher batch size would lead to a CUDA out-of-memory error.

Other values of weight decay, momentum and learning rate led to model instability during training. Momentum is used in stochastic gradient descent (SGD) to carry a fraction of the previous parameter update over into the current one, which smooths the updates, and weight decay is a factor by which the parameters of the model are shrunk every iteration. Weight decay is not only used to remedy model instability but also helps to avoid overfitting.

1 Official weights can be found at https://pjreddie.com/media/files/yolov3.weights


Table 3.4: Hyperparameters used during training

Name           Value
Epochs         200
Batch size     3
Weight decay   0.0005
Momentum       0.9
Freezing       No
Learning rate  0.001
Optimizer      SGD

Freezing was not used, as preliminary experiments showed that the model could not learn the intended task when layers were frozen.
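
In PyTorch terms, the settings in Table 3.4 correspond roughly to the optimizer construction below. This is an illustrative sketch rather than the training code of [14], and the placeholder model merely stands in for the YOLOv3 network initialised with the official weights.

    import torch

    # Placeholder network; in practice this would be the YOLOv3 model from [14]
    # loaded with the official yolov3.weights as starting point.
    model = torch.nn.Linear(10, 3)

    # No layers are frozen, i.e. requires_grad is left at True for all parameters.
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.001,             # learning rate
        momentum=0.9,         # fraction of the previous update carried over
        weight_decay=0.0005,  # shrinks the weights a little every iteration
    )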

As described in Section 2.6.1, YOLOv3 can have trouble detecting smaller objects; thus multi-scale training was used. Multi-scale training randomly resizes all training images every 10 batches, such that the input size to the model is between 320 by 320 and 640 by 640 pixels.
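
A small sketch of this multi-scale behaviour is given below: every 10 batches a new square input size between 320 and 640 pixels is drawn. The rounding to multiples of 32 is an assumption based on the network stride of YOLOv3, not something stated above.

    import random

    def pick_input_size(batch_index, current_size, low=320, high=640, stride=32):
        """Every 10 batches, draw a new square input size in [low, high],
        rounded to a multiple of the assumed network stride of 32."""
        if batch_index % 10 == 0:
            return random.randrange(low // stride, high // stride + 1) * stride
        return current_size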

3.5 Computational Hardware

As mentioned in Section 3.4, the training was conducted using an Nvidia GPU, specifically the Nvidia 980 TI. The server used for training was also equipped with 16 GB of RAM, an Intel Core i7-6700K CPU @ 4.00 GHz and 1 TB of drive space. Inference speed was tested on the Nvidia 980 TI, an Nvidia 1070 and an Intel Core i5-3320M CPU.


References
