

Degree project 30 credits, Spring term 2021

Object detection for a robotic lawn mower with neural network trained on automatically collected data

Henrik Sparr


Abstract

Object detection for a robotic lawn mower with neural network trained on automatically collected data

Henrik Sparr

Machine vision is a hot research topic with findings being published at a high pace and more and more companies currently developing automated vehicles. Robotic lawn mowers are also increasing in popularity, but most mowers still use relatively simple methods for cutting the lawn. No previous work had been published on machine learning networks that improve between cutting sessions by automatically collecting data and then using it for training. A data acquisition pipeline and neural network architecture that could help the mower avoid collisions were therefore developed. Nine neural networks were tested, of which a convolutional one reached the highest accuracy. The performance of the data acquisition routine and the networks shows that it is possible to design an object detection model that improves between runs.

ISSN: 1401-5757, UPTEC F 21023
Examiner: Tomas Nyberg
Subject reader: Thiemo Voigt
Supervisor: Joakim Eriksson, RISE Computer Science


Abbreviations

GNSS - Global Navigation Satellite System
IMU - Inertial Measurement Unit
RGB-D - Red, Green, Blue and Depth
NN - Neural Network
CNN - Convolutional Neural Network
DSC - Depthwise Separable Convolution
HGC - Hierarchical Group Convolution
ReLU - Rectified Linear Unit
TPR - True Positive Rate
TNR - True Negative Rate
TP - True Positive
TN - True Negative
FN - False Negative
FP - False Positive


Popular Science Summary (Populärvetenskaplig sammanfattning)

Self-driving vehicles have become more and more relevant in recent years, with several large companies taking on the challenge. Machine learning is one of the most promising and widely used techniques for solving this problem, and research advances are published at a high pace. At the same time, robotic lawn mowers are increasing in popularity, but even though both of these areas are very hot, robotic lawn mowers still use relatively simple methods for cutting the lawn.

No previous work had been published on methods that automatically collect data and then use it to improve a machine learning model.

In this thesis, a data collection procedure and nine neural networks intended to help lawn mowers avoid collisions are therefore developed and evaluated. A neural network is a machine learning method that is relatively computationally heavy, and some focus in this work is therefore also placed on minimizing the computational and storage requirements that these networks impose.

The results show that it is possible to design a method that helps robotic lawn mowers avoid objects by improving between runs. The results also show that deep convolutional networks, often called CNNs (Convolutional Neural Networks), perform best on this task, which agrees with prior knowledge in image analysis.


Contents

1 Introduction
  1.1 Background
  1.2 Objective
  1.3 Approach
  1.4 Results
  1.5 Outline
2 Theory
  2.1 Deep neural networks
    2.1.1 Dense neural networks
    2.1.2 Convolutional neural networks
    2.1.3 Depthwise separable convolution
    2.1.4 Hierarchical group convolution
    2.1.5 Weights per layer
    2.1.6 Training process
    2.1.7 Over-fitting of data
3 Method
  3.1 Equipment and environment setup
  3.2 Data collection
    3.2.1 First data set
    3.2.2 Second data set
  3.3 Neural network architecture
    3.3.1 Classical CNNs
    3.3.2 Light dense NNs
    3.3.3 HGC and DSC CNNs
4 Results
  4.1 First data set
  4.2 Second data set
    4.2.1 Samples from CNN-1
    4.2.2 Top and bottom performing objects
5 Discussion
  5.1 Limitations
  5.2 Further research
Appendices


1 Introduction

1.1 Background

To prevent grass from overtaking lawns, people have been cutting it down for hundreds of years. For the majority of that time this was done manually and could consume lots of time. During the last century there has been a rapid increase in automated solutions for mowing one's lawn. In 1995 the Swedish company Husqvarna released their first robotic lawn mower for consumers, and by 2016 one million units had been sold [1]. In a Swedish survey performed by HUI Research it was noted that the purchasing sentiment for robotic lawn mowers increased by 737% between 2011 and 2016 [2].

Figure 1: Purchasing sentiment per segment for lawn mowers in Sweden.

Most mowers currently employ a strategy of rolling around randomly within an area predefined by underground cables, going in straight lines and turning when an object or the boundary is detected. This strategy has the advantages of being easy to implement and not requiring much computational power, whilst still being able to cut almost every lawn through sheer brute force.

One reason for wanting to expand this cutting strategy to include object detection is that it would result in fewer tears and dents on the chassis and surroundings and reduce the risk of hurting animals.

During the last few years several companies have tried developing new ways for robotic lawn mowers to perform the task of cutting the lawn. Volta's model Mora uses an on-board camera to avoid objects and to not require a cable, iRobot's model Terra creates a map of the lawn to avoid the cable requirement, and Husqvarna's EPOS system achieves the same thing through satellite navigation [3] [4] [5].

1.2 Objective

This thesis seeks to evaluate the possibility of developing a light neural network and a data acquisition pipeline which can be used by a robotic lawn mower to prevent collisions. The thesis focuses on developing neural networks with small memory requirements and evaluating how these networks perform on data limited in amount to what can automatically be collected from a robotic mower's normal cutting routine. Avoiding collisions reduces the time and power required by the mower to cut the lawn and results in less wear and damage to the mower and its surroundings. The mower is a Husqvarna Automower® with externally added sensors connected through single-board computers running the Husqvarna Research Platform [6]. The sensors that have been fitted to the mower are two GNSSs, two IMUs and an RGB-D camera; however, the focus of this thesis is on using the camera.

1.3 Approach

Data is automatically collected and marked as belonging to a class by utilizing the fact that the mower can determine when it collides with an object. The objects are then rearranged, followed by another data collection run to simulate the mower's intended use. The data from the first run is used to train the models, which are then evaluated on the second run. This entire data collection routine is performed on two different occasions to allow improvements to be made to the data collection routine. How well the models perform is determined by their accuracy on the test set. Further analysis of when the model performs well and when it does not is also conducted. The end goal of the evaluation is to determine whether a light neural network trained on data that can automatically be extracted from normal cutting can perform well.

1.4 Results

The data collection routine was improved between the first and second data collection by changing the position of the camera. On the later data set multiple models performed well at detecting objects without producing many false positives. A classical convolutional neural network was the top performer of the models with a true positive rate of 0.791. The convolutional neural networks designed to be lighter than the classical version also performed well, with true positive rates of around 0.7.

1.5 Outline

The findings suggest that automatically extracting data and using it to train a model is possible. It was also shown that the methods used for collecting the data play a large role in the performance of the neural networks. Some findings suggest that the network architectures that were designed perform better on objects whose texture differs from grass, but further research is required.


2 Theory

To digitally detect objects in pictures, different image processing techniques have been used throughout the digital age. During the last few decades there has been a shift away from traditional image processing techniques, such as different transforms of the image, towards utilizing the flexibility of different deep neural networks to detect and classify objects in images [7]. Convolutional neural networks are a popular choice for images, and there already existed strategies for reducing the size of the networks, such as depthwise separable convolutions and hierarchical group convolutions.

2.1 Deep neural networks

A deep neural network, often referred to as just neural network (NN), uses artificial neurons that are loosely based on the neurons that can be found in the brain. These artificial neurons respond to an input by emitting an output according to an activation function, a commonly used one being the rectified linear unit (ReLU) seen in equation (1) which gives the artificial neurons a strong similarity in activation to real neurons [8].

ReLU (x) = max(0, x) (1)
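As a minimal illustration (not from the thesis itself), the activation in equation (1) can be written directly in Python with NumPy:

import numpy as np

def relu(x):
    # Rectified linear unit: passes positive inputs through and clips negative inputs to zero.
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]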

The word "deep" comes from the fact that these artificial neurons are placed together in layers that are then stacked onto each other and connected, resulting in a deep structure of layers between the input and output layers. The resulting deep network created by this stacking is very flexible and can be used to model complex dependencies like detecting objects in a picture.

2.1.1 Dense neural networks

The layers in a NN can be interconnected in different ways. In a dense NN every neuron in a layer is connected to every neuron in the next. This creates a ”dense” field of neuron connections where the activation of each neuron is correlated with every other neuron. A dense network is therefore able to learn very high level features at the cost of using a higher number of weights corresponding to the connections between each neuron.


Figure 2: Visualization of a dense neural network with ten input features and two output features. Each node visualizes a neuron and each edge a connection between two neurons.
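As a sketch of such a network (not code from the thesis), a small dense network like the one in Figure 2 can be defined with Keras; the hidden-layer width of 16 is an illustrative assumption:

import tensorflow as tf
from tensorflow.keras import layers

def small_dense_net():
    # Ten input features, one hidden dense layer, two output features,
    # mirroring the structure visualized in Figure 2.
    inp = layers.Input(shape=[10], name='input_features')
    x = layers.Dense(16, activation='relu', name='hidden')(inp)  # hidden width chosen for illustration
    out = layers.Dense(2, name='output')(x)
    return tf.keras.Model(inputs=inp, outputs=out)

small_dense_net().summary()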

2.1.2 Convolutional neural networks

One problem that arises when creating dense NNs is that connecting every neuron in a layer to every neuron in the next layer results in a very large model, often with too many weights to feasibly handle. A common strategy for reducing the number of weights used when processing images is introducing a kernel that is convolved with the image, allowing the same weights to be used multiple times in each layer. In addition to reducing the number of weights in the network, the convolution operation also has the property of being shift invariant, which is desirable when analyzing images as an object can be present anywhere in the picture. A traditional convolutional neural network for classifying images often "shrinks" the image in the spatial dimension with each layer whilst simultaneously increasing the number of features per layer, to increase the level of abstraction further down the network's pipeline. Shrinking the width and height of the image is usually done by moving the kernel that is multiplied with the image in steps larger than one (known as the stride) or by max pooling the resulting image. When max pooling, the image is split into equally large areas and the largest element in each segment is set to represent the entire segment.



Figure 3: Difference between using a 2 x 2 kernel or a 2 x 2 max pooling to reduce the dimension of the image. The weights of the kernel do not have to be the same but will be learned by the model during training.
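The following sketch (illustrative only, not code from the thesis) shows the two down-sampling options in Keras for a 240 x 240 x 3 input; the number of filters is an assumption:

import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=[240, 240, 3])
# A strided convolution halves the spatial size with learned weights ...
by_conv = layers.Conv2D(32, kernel_size=2, strides=2, activation='relu')(inp)
# ... while max pooling halves it without any weights.
by_pool = layers.MaxPooling2D(pool_size=2)(inp)

print(by_conv.shape)  # (None, 120, 120, 32)
print(by_pool.shape)  # (None, 120, 120, 3)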

2.1.3 Depthwise separable convolution

A common strategy for reducing the number of weights used in a convolution is to factorize or separate the multidimensional kernel into several kernels of a lower dimension. One common example of this is the Sobel kernel, which has been used for edge detection in image processing since at least 1968 [9]. The Sobel kernel can be separated into a one-dimensional kernel for each dimension, reducing the number of operations needed for the convolution. In deep learning a common architecture for separating the convolutional kernel is the depthwise separable convolution proposed by Chollet [10]. An illustration of this operation is seen in Figure 4. In this example, instead of using a 3x3x3 kernel to calculate the activation of a neuron in the next layer, a 3x3 kernel followed by a 1x3 kernel is used, reducing the number of weights from 27 to 12 whilst preserving some of the correlation between the output and all 27 neurons. The depthwise separable convolution is a popular choice in network architectures as it has been shown to increase efficiency whilst performing on par with the conventional convolution.


Figure 4: Difference between a conventional convolution operation and a depthwise separable convolution which uses fewer weights and operations.
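As a rough illustration of the weight savings (a sketch under assumed sizes, not the thesis's code), Keras's SeparableConv2D can be compared with a conventional Conv2D for a 3-channel input, 32 output features and a 4 x 4 kernel:

import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=[120, 120, 3])
conventional = tf.keras.Model(inp, layers.Conv2D(32, 4, strides=2)(inp))
separable = tf.keras.Model(inp, layers.SeparableConv2D(32, 4, strides=2)(inp))

# Conventional convolution: 3*32*4*4 + 32 = 1568 weights.
# Depthwise separable:      3*4*4 + 3*32 + 32 = 176 weights.
print(conventional.count_params(), separable.count_params())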

2.1.4 Hierarchical group convolution

In CNNs, 1 × 1 convolutional layers are often used; take for example Google's Inception architecture, in which about two thirds of the convolutional layers use a 1 × 1 kernel [11]. This is done to fuse the features generated by previous layers, and following them with an activation function also adds more non-linear behaviour to the model. Even though a 1 × 1 kernel is used in the spatial dimension, the number of weights can become large depending on the input and output dimensions. Therefore the hierarchical group convolution (HGC) layer was proposed, in which a separate 1 × 1 convolution is used for every feature but with the output of the previous feature concatenated with the next feature [12]. This combination of operations allows output features of higher and lower levels of abstraction to be learned by the layer. At the same time, the number of weights used by the layer drops from about the square of the input dimension to a constant times the input dimension.



Figure 5: The HGC layer, which reduces the number of weights used, visualized with the input X of size width × height × N and the output Y of the same size.
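A full Keras implementation of this layer appears in the appendix; the minimal functional-style sketch below (with 'same' padding and ReLU activations assumed) shows the core idea of chaining 1 x 1 convolutions, one input channel at a time, each also fed the previous output channel:

import tensorflow as tf
from tensorflow.keras import layers

def hgc(inputs):
    # Hierarchical group convolution: one 1x1 convolution per input channel,
    # where each convolution also sees the previous convolution's output.
    n = inputs.shape[-1]
    outputs = [layers.Conv2D(1, 1, activation='relu')(inputs[..., 0:1])]
    for i in range(1, n):
        pair = layers.Concatenate(axis=-1)([outputs[-1], inputs[..., i:i + 1]])
        outputs.append(layers.Conv2D(1, 1, activation='relu')(pair))
    # Total weights: 2 + 3*(n - 1) = 3n - 1, matching Table 1.
    return layers.Concatenate(axis=-1)(outputs)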

2.1.5 Weights per layer

When running and training the neural network on a device with limited storage and processing power it is necessary to keep in mind the number of weights that the model uses. For the dense layer every input feature will be connected to every output feature. Every output feature will also have a bias included that will have to be learned. For a picture the number of input features will be the depth/channels D_in of the image times the height H_in and width W_in of the image. These features will be mapped to D_out output features. For the conventional convolutional layer the number of weights used will be equal to the kernel size times the number of output features D_out plus one bias per output feature. The size of the kernel will be equal to the depth of the image times the height and width of the kernel (H_k and W_k). For the depthwise separable convolutional layer the kernels will instead be of size D_in × H_k × W_k and D_in × D_out, and D_out biases will still be used. The hierarchical group convolutional layer will require one kernel of size 1x1x1 for the first convolution and D_in − 1 kernels of size 2x1x1 for the rest of the convolutions. Every output feature D_out will also require a bias associated with it; note that the HGC has the same number of input and output features, resulting in D_out = D_in.


Layer            No. parameters
Dense            D_in * D_out * H_in * W_in + D_out
Convolutional    D_in * D_out * H_k * W_k + D_out
DSC              D_in * H_k * W_k + D_in * D_out + D_out
HGC              3 * D_in - 1

Table 1: The number of weights used per layer for the different layer architectures.

This information is summarized from the most to the least number of weights in Table 1. Typical values of the different parameters for small neural networks can be seen in Table 2.

Property        Typical value
D_in, D_out     10^0 - 10^3
H_in, W_in      10^0 - 10^3
H_k, W_k        10^0 - 10^1

Table 2: Typical values of the parameters used in Table 1. The values for D_in, D_out, H_k and W_k come from [7], whilst H_in and W_in get their upper limits from the fact that images used in networks rarely are of full high-definition formats.
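To make the formulas in Tables 1 and 2 concrete, the short calculation below (with example values chosen for illustration, not taken from the thesis) evaluates the weight count of each layer type:

# Example parameter counts for the formulas in Table 1 (values are illustrative).
D_in, D_out = 3, 32      # input and output feature depths
H_in, W_in = 120, 120    # input image height and width
H_k, W_k = 4, 4          # kernel height and width

dense = D_in * D_out * H_in * W_in + D_out     # 1382432
conv = D_in * D_out * H_k * W_k + D_out        # 1568
dsc = D_in * H_k * W_k + D_in * D_out + D_out  # 176
hgc = 3 * D_in - 1                             # 8

print(dense, conv, dsc, hgc)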

2.1.6 Training process

Fitting or training a NN means updating the parameters of the model to create a good mapping of inputs to outputs on some desired data set. To do this we first need some measurement of how well the network f with its current weights ω maps the input x to the desired output y. This is often referred to as a loss function (l), and during the training procedure our objective is to minimize this loss function.

min_ω l(f_ω(x), y)    (2)

During what is known as supervised machine learning our desired output y is known and is often denoted y_true, and the network's prediction f_ω(x) is often denoted y_pred.

One of the reasons that this can be such a difficult task is that the loss function with respect to ω will almost certainly be a non-convex function, which makes it difficult to find the optimal values for the parameters. Therefore, to try to find values of the weights that produce a good output, the weights are initialized randomly, followed by the loss function being evaluated on some portion of the data. This portion of the data is known as a batch, and after the loss function has been calculated for the batch, the gradient of the loss function with respect to the weights is calculated through backpropagation and the weights are updated to try to minimize the loss function. This process is repeated for new batches, and when the entire data set has been seen by the network, what is known as an epoch has passed. A network is typically trained for several epochs before the training process is done.
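As a minimal sketch of one such training step (not the thesis's training script; the loss function and learning rate are assumptions), the batch evaluation, backpropagation and weight update can be written with TensorFlow as:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # learning rate is an assumption

def train_step(model, x_batch, y_batch):
    # Evaluate the loss on one batch, backpropagate, and update the weights.
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)
        loss = loss_fn(y_batch, y_pred)
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss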


2.1.7 Over-fitting of data

One problem with the very flexible model that is a NN is the possibility of it learning the data that it is trained on specifically instead of the correlation between input and output features that we want it to learn, similar to a person memorizing instead of truly learning something.

This is referred to as over-fitting of the data and is something one has to be careful of, especially when using a small data set to train the network. One way of preventing this over-fitting from happening is to slightly alter the data that is being trained on between epochs, for example by adding a small Gaussian blur to the image. Another way of improving the training of the network is to randomly set a portion of the inputs to a specific layer to zero, which is known as a dropout layer. Both these methods are commonly used and have been shown to improve the performance of networks [13] [14].
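A brief sketch of these two techniques in Keras follows (illustrative only; the thesis later uses zero-centered Gaussian noise during training, and the noise level and dropout rate here are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=[120, 120, 3])
x = layers.GaussianNoise(stddev=0.1)(inp)                  # slightly alters the images, only active in training
x = layers.Conv2D(32, 4, strides=2, activation='relu')(x)
x = layers.Dropout(0.5)(x)                                 # randomly zeroes a portion of the layer's inputs
x = layers.Flatten()(x)
out = layers.Dense(2)(x)
model = tf.keras.Model(inp, out)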


3 Method

As object detection for robotic lawn mowers is a young scientific field, no gold standard existed for collecting data, designing methods for object detection and evaluating those methods, and therefore a lot of time had to be put into considering these aspects. For collecting data it was determined that one data set should first be collected and used to train the models, followed by evaluation, and then these results would be used to improve the data collection routine. When it came to determining the architecture of the networks that were to be tested, nine different networks of three different architecture types were chosen.

3.1 Equipment and environment setup

For testing and evaluating all methods a Husqvarna Automower 450x robotic lawn mower with extra sensors attached was used. The mower had previously been equipped with one Intel Realsense D435 depth camera, two PhidgetGPS global navigation satellite systems (GNSSs) and two PhidgetSpatial Precision 3/3/3 High Resolution inertial measurement units (IMUs).

The sensors and the mower had been interconnected by a modified version of the Husqvarna Research Platform running on one Raspberry Pi 3B+ and one Raspberry Pi 4B single-board computer [15]. The original and modified Husqvarna Research Platform were built on ROS Kinetic, which is an open source software framework for robotics. ROS Kinetic is targeted towards the Ubuntu 16.04 release, which receives general support until April 2021. Because the operating system used would lose general support during the duration of the project, all code was migrated to Ubuntu 20.04. To achieve this the Husqvarna Research Platform was tweaked to work with ROS Noetic instead of ROS Kinetic and altered to use Python 3 instead of Python 2, which was removed in Ubuntu 20.04. During these alterations it was also noted that the extra computing power from using two Raspberry Pis instead of one was not necessary for the methods that would be used, and the system was therefore altered to only use the Raspberry Pi 4B.


Figure 6: The robotic lawn mower with its added sensors.

3.2 Data collection

Two data sets for testing were collected to evaluate the different machine learning networks. Both times the data acquisition protocol consisted of preparing a lawn with different objects scattered across the area that was to be cut. The mower was then started and instructed to move in a similar manner as during an ordinary mowing session, during which sensory and collision data was collected and saved. This meant randomly choosing a direction to traverse and choosing a new direction whenever an object or the boundary was detected.

3.2.1 First data set

During the first data collection session the mower was placed on a lawn containing seven different types of objects that were to be avoided. Four of these objects remained stationary between the two consecutive runs (trees, walls, a chair and a glass bottle), three were moved (a pair of shoes, a backpack and a person) and one object was only present in the first run (an aluminum can).

To conserve memory space the camera was not recording at its full 1920 x 1080 resolution and 30 fps frame rate but at a 240 x 240 resolution and a frame rate of 6 fps. After both runs the frames before each collision were marked as "robot should turn" (1) and the rest as "robot should not turn" (0).

Figure 7: 12 samples from the first data set that was collected.

3.2.2 Second data set

During the collection of this data set, data from the camera was not constantly captured; instead two seconds of 240 x 240 video at 15 fps was kept in a buffer, and upon detecting a collision this data was saved with the last second marked as data where the mower should turn. The buffer was also automatically saved at a set interval, with the data contained in it marked as data where the mower should not turn.
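A hypothetical sketch of such a rolling buffer is shown below (the real system ran inside ROS; the function names and the periodic-save mechanism are assumptions, not the thesis's code):

from collections import deque

FPS = 15
BUFFER_SECONDS = 2

# Keep only the most recent two seconds of frames.
frame_buffer = deque(maxlen=FPS * BUFFER_SECONDS)

def on_new_frame(frame):
    frame_buffer.append(frame)

def on_collision(save_fn):
    frames = list(frame_buffer)
    # The last second before the collision is labelled "robot should turn" (1),
    # the second before that "robot should not turn" (0).
    save_fn(frames[:FPS], label=0)
    save_fn(frames[FPS:], label=1)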


The set of objects on the lawn during the collection of the second data set consisted of a glass bottle, two shoes, one computer bag, a storm drain/well, a wall and a plastic chair. All mobile objects were moved between the first run, in which the training data was collected, and the second run, where the data for evaluating the networks was collected. To analyze the difference in difficulty of detecting each object, each collision was manually marked with the relevant object.

For the collection of the second data set the camera had been placed at a lower position to better capture the objects. This positioning does not require the wooden rack seen in Figure 6, which also makes it a better candidate for production.

Figure 8: 18 samples from the second data set that was collected.

3.3 Neural network architecture

Nine different NN architectures were developed for the task of discriminating images where the mower should turn from images where it should not.

3.3.1 Classical CNNs

The three classical CNNs were designed in a similar manner as the smallest CNNs found in the book Advanced Applied Deep Learning: Convolutional Neural Networks and Object Detection [7], where a convolutional layer is followed by a ReLU activation function, which is then followed by a 2 × 2 max pooling or convolutional layer to reduce dimensionality. At the end of this block a dropout layer is normally added to assist the training of the network. These blocks are then stacked upon each other and in the end a dense layer assigns probabilities to each class. This architecture, with a few modifications, was used for all three networks, which are illustrated in Figure 9.

Figure 9: The three network architectures of conventional CNN style that were created.


3.3.1.1 CNN-1 architecture

The first CNN (CNN-1) was chosen to use a picture of size 120 × 120 × 3 pixels as input and to consist of two blocks. Both blocks consisted of a convolutional layer with a 4 × 4 kernel and a stride of 2, followed by a convolutional layer with a 2 × 2 kernel and a stride of 1, and lastly a dropout layer. The first two convolutions outputted an image of depth 32 and the last two an image of depth 64; all convolutional layers were followed by a ReLU activation function.

The image outputted by the final convolutional layer was flattened and connected to a dense layer which had an output of two logits representing the log-probabilities of the input picture belonging to the two classes ”robot should turn” and ”robot should not turn”. The resulting architecture needed 148290 parameters to be trained.

3.3.1.2 CNN-2 architecture

The second CNN (CNN-2) was chosen to use a smaller (60 × 60 × 3) picture as input, which would allow it to consist of more blocks while still using fewer weights. As the image that was being passed through the network was smaller, the width and height of the kernels used were also smaller. One block in this architecture consisted of a convolutional layer with a 2 × 2 kernel and a stride of 2 followed by a convolutional layer with a 1 × 1 kernel and a stride of 1. As with CNN-1 the blocks ended with a dropout layer, the convolutional layers were followed by a ReLU activation function and the network's final step consisted of a dense layer with two outputs. The output depth of the first two convolutional layers was 32, the following two had a depth of 64 and the last two had a depth of 32. The small kernels and input picture resulted in the lightest conventional CNN, with 26306 parameters to be determined by training.

3.3.1.3 CNN-3 architecture

Finally the architecture of a CNN which used a larger (240 × 240 × 3) input picture (CNN-3) was designed. To reduce the size of the input picture before the fully connected dense layer, the picture was passed through three convolutional layers with stride 2 and also passed through two max pooling layers with a 2 × 2 kernel. The kernel size of the convolutional layers was 4 × 4 and, like for the other networks, they were all followed by a ReLU activation function. The final CNN architecture needed 46450 parameters to be trained.

3.3.2 Light dense NNs

The dense NNs were designed by stacking dense layers onto each other. To prevent the number of weights from becoming too large, the input pictures were kept small.

Figure 10: The three network architectures of dense NN style that were created.

3.3.2.1 Deep NN-1 architecture

The first dense NN (Deep NN-1) used an input image of size (60 × 60 × 3) and consisted of just two dense layers. The first layer outputted 30 features and like all other networks the final dense layer outputted two features. The large first layer resulted in Deep NN-1 being the largest network with 324092 parameters to be trained.

3.3.2.2 Deep NN-2 architecture

To allow an additional layer without increasing the number of parameters of the network, the output or input dimension of the first layer in Deep NN-1 had to be changed. As the first layer only outputted 30 features, the height and width of the input picture were reduced to half the size of the previous network, resulting in an input size of (30 × 30 × 3). The first dense layer was followed by another layer which also had an output consisting of 30 features, and lastly a dense layer with two output features was added. This architecture resulted in 82022 parameters to be trained.


3.3.2.3 Deep NN-3 architecture

For the final dense architecture another approach to reducing the dimensionality of the input to the first dense layer was deployed. This approach was to use one of the blocks that had been designed for the CNNs before the first dense layer. The block consisted of a convolutional layer with a 4 × 4 kernel, a stride of 2, 16 output features and a ReLU activation function, followed by a 2 × 2 max pooling layer. This block was followed by the architecture of Deep NN-1, resulting in a network in between the architectures of the CNNs and the Deep NNs, with 63566 parameters to be trained.

3.3.3 HGC and DSC CNNs

Three CNNs that included HGC and DSC layers were designed as the final networks to be tested.

The first two networks (HGC&DSC-1 and HGC&DSC-2) had a design similar to CNN-1 and CNN-2 but with their two convolutional layers per block swapped for a DSC layer followed by an HGC layer. The final network that was designed (HGC&DSC-3) used the same input size as CNN-3, but as the DSC and HGC layers use fewer weights than the conventional convolutional layers, max pooling layers were not needed to keep down the number of weights; instead an architecture similar to HGC&DSC-1 and HGC&DSC-2 but with four blocks was opted for. The three neural networks are illustrated in Figure 11.


Figure 11: The three network architectures of a lighter CNN style that utilized DSC and HGC layers.


3.3.3.1 HGC&DSC-1 architecture

The architecture for HGC&DSC-1 used the same input and output sizes for each layer as CNN-1.

A (120 × 120 × 3) input picture was depthwise separably convolved with a 4 × 4 kernel using a stride of 2, followed by an HGC layer and a dropout layer. The second block of the structure used the same kernel sizes and strides. The number of output features for each block was 32 and 64 respectively. This architecture resulted in 104448 parameters, 30% fewer than CNN-1.

3.3.3.2 HGC&DSC-2 architecture

Like the first architecture, HGC&DSC-2 also used the same input sizes, output sizes, kernel height and kernel width as CNN-2, which means that all DSC layers used a 2 × 2 kernel and a stride of 2 and that the number of output features for the blocks was 32, 64 and 32. HGC&DSC-2 managed to reduce the number of parameters associated with the network by 66%, bringing it down to 8991 parameters.

3.3.3.3 HGC&DSC-3 architecture

The final architecture consisted of four of the blocks used in the previous two architectures. All DSC layers used a 2 × 2 kernel with a stride of 2. The four blocks had 16, 64, 128 and 16 output features, resulting in 29342 parameters to be trained, 37% fewer than CNN-3, although this comparison is less direct than for the two previous networks, since those HGC and DSC networks were more similar to their CNN counterparts than the third one was.


4 Results

All nine NNs were realized in Python using the Keras API running on top of the TensorFlow machine learning platform. All networks were trained using the cross-entropy loss between the labels and predictions as their loss function. The stochastic gradient descent algorithm Adam with α = 0.001, β_1 = 0.99, β_2 = 0.999 and ε = 10^-7 was used [16]. The training data was augmented by adding zero-centered Gaussian noise and all networks were trained for 20 epochs.
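A sketch of this training configuration using the Keras API is given below (not the exact thesis script; the batch size and the noise standard deviation are assumptions):

import tensorflow as tf

def train(model, x_train, y_train):
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.99,
                                         beta_2=0.999, epsilon=1e-7)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

    # Augment with zero-centered Gaussian noise and train for 20 epochs.
    x_augmented = x_train + tf.random.normal(tf.shape(x_train), stddev=0.1)
    model.fit(x_augmented, y_train, batch_size=32, epochs=20)
    return model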

4.1 First data set

The routine for collecting the first data set resulted in a training set of 1440 pictures and a testing set of 1416 pictures. Both the training and testing set contained 240 frames marked as "robot should turn" (class 1), which equals 17% of the data. After training all models, their accuracy was calculated as the share of pictures that were classified correctly.

Model        Input size   No. params   Train accuracy   Test accuracy
0 guess      -            -            0.83             0.83
CNN-1        120x120      148290       0.98             0.91
CNN-2        60x60        26306        0.95             0.91
CNN-3        240x240      46450        0.97             0.93
Deep NN-1    60x60        324092       0.87             0.87
Deep NN-2    30x30        82022        0.83             0.84
Deep NN-3    60x60        63566        0.97             0.91
HGC&DSC-1    120x120      104448       0.98             0.91
HGC&DSC-2    60x60        8991         0.94             0.93
HGC&DSC-3    240x240      29342        0.94             0.91

Table 3: Comparing the performance of always guessing zero and the nine models on the first data set.

The results were analyzed to determine which models would be used for the next data set and whether the data collection routine should be changed for the final evaluation. It was noted that the high placement of the camera resulted in some of the objects not being present in the pictures just before the collision, see Figure 12 for reference, which led to a change of camera placement for the collection of the second data set. This also led to the decision to use all nine models for the final evaluation, as the first data set was no longer representative of the second.


Figure 12: The 12 frames before a collision with a bottle. As seen, the camera's high placement caused the object to not be present in the final pictures before the collision.

4.2 Second data set

The data set collected for the final evaluation of the models contained 3300 images for training, of which 315 had been marked as belonging to class 1, and 2820 images for testing, of which 195 had been marked as class 1. After the models had been trained, their accuracy was tested and the true positive rate (TPR) and true negative rate (TNR) were calculated, also known as the model's sensitivity and specificity. These results can be seen in Table 4 and the models' accuracy on the test data is further broken down to accuracy per object in Table 5.
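As a reference for how these quantities are computed (a simple sketch, not the thesis's evaluation code), the TPR and TNR follow directly from the confusion-matrix counts:

import numpy as np

def tpr_tnr(y_true, y_pred):
    # Sensitivity and specificity from binary labels (1 = "robot should turn").
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)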


Model Train accuracy Test accuracy Test TPR Test TNR

0 Guess 0.905 0.931 0 1

CNN-1 0.998 0.984 0.791 0.999

CNN-2 0.989 0.973 0.637 0.998

CNN-3 0.998 0.971 0.602 0.999

Deep NN-1 0.996 0.972 0.622 0.998

Deep NN-2 0.970 0.946 0.289 0.997

Deep NN-3 0.982 0.958 0.448 0.997

HGC&DSC-1 0.995 0.974 0.667 0.998

HGC&DSC-2 0.975 0.977 0.696 0.998

HGC&DSC-3 0.960 0.978 0.721 0.998

Table 4: The models' accuracy, true positive rate and true negative rate on the second data set.

Model        Bottle   Shoe    Bag     Well     Wall    Chair
CNN-1        0.875    0.711   0.952   0.571    0.912   0.500
CNN-2        0.688    0.711   0.952   0.429    0.667   0.000
CNN-3        0.688    0.737   0.881   0.107    0.737   0.000
Deep NN-1    0.813    0.684   0.762   0.250    0.526   0.850
Deep NN-2    0.000    0.263   0.595   0.000    0.404   0.000
Deep NN-3    0.000    0.474   0.786   0.500    0.439   0.000
HGC&DSC-1    0.313    0.737   0.952   0.357    0.877   0.050
HGC&DSC-2    0.000    0.684   0.905   0.9285   0.912   0.000
HGC&DSC-3    0.063    0.684   0.905   0.893    0.965   0.000

Table 5: Comparison of the performance of the models per object on the second data set.

4.2.1 Samples from CNN-1

After the models had been trained, samples classified by CNN-1 were extracted to get examples of where the model failed and succeeded. Figures 13, 14, 15 and 16 show true positive (TP), true negative (TN), false negative (FN) and false positive (FP) samples to illustrate where the model did and did not succeed. A TP sample means that the model correctly predicted that an object was present, TN means that the prediction that no object was present was correct, FN means that the data was marked as containing an object but the model predicted that no object was present, and FP means that the model detected an object when the collected data was marked as containing nothing. The images have the object associated with them as their title, and eight samples were taken from each class except for the false positive class, as the model only produced three such samples.


Figure 13: Samples that were correctly classified as "containing an object", also called "the mower should turn". Note that the specific object type is not predicted by the model but has been added manually to better assess the model's performance.


Figure 14: Samples that were correctly classified as ”not containing an object” also called ”the mower should not turn”.


Figure 15: Samples that had been marked as containing an object but that the model failed to classify as such.


4.2.2 Top and bottom performing objects

In addition to extracting TP, TN, FN and FP samples, samples marked as chair, wall and computer bag were extracted. The chair was chosen as it was the object that the models performed worst on, and the wall and bag were chosen as the models performed best on these two objects. The samples of the three objects can be seen in Figures 17, 18 and 19.

Figure 17: Five random samples marked as containing a chair in the test data set, this object was the bottom performing object of the data set.

Figure 18: Five random samples marked as containing a wall in the test data set; this object was one of the objects the models performed the best on. As the data collected during the last second before a collision is marked as containing an object, there is no guarantee that a picture will contain an object, as noted in the fifth image.

Figure 19: Five random samples marked as containing a computer bag in the test data set, this object was the top performing object of the data set.


5 Discussion

The performance on the first data set showed that the position and angle of the camera is critical for collecting good data and thereby also for training the models. Looking at Tables 3 and 4, we can calculate an average test accuracy of 0.90 for data set one and 0.97 for data set two, meaning that the models performed much better on the later data set.

When looking at which models performed best on the final data set, it is clear that the "Deep NN" architectures performed worse than both the classical CNNs and the HGC and DSC CNNs. The CNN-1 architecture is quite clearly the best performing of the nine models as it has the highest test accuracy, test TPR and test TNR. The one downside of CNN-1 is that it is the model with the second largest number of parameters, so if size is very much at a premium one of the HGC&DSC models might be preferred. The size of the data that is needed for training will also play a role in determining which architecture is the best one to use; CNN-2 and HGC&DSC-2 both perform well with an input size of 60 × 60, which results in a lower storage demand for saving the training data for the model.

When it comes to assessing whether automatically collecting data and using it to train a light NN works, one way of doing so is to look at the test TPR and test TNR of the best performing network. CNN-1 has a test TPR of 0.791 and a test TNR of 0.999, meaning that out of a thousand pictures containing objects the network will detect 791 of those, while only producing one false positive per every thousand negative samples. Although the networks are far from perfect at detecting objects, the TPR and TNR suggest that the network can assist in preventing most collisions while producing few false positives. Looking at Figures 13-16, it is seen that the TPs contain objects that should be detectable, the TNs consist of mostly grass, the FNs seem to contain very little of the objects and the FPs contain little grass. This strongly suggests that CNN-1 learned what was intended for it to learn and that it did not learn to detect the objects in an unintended way. Analyzing which objects the models performed best and worst on also tells us something about the problem: the samples in Figure 17 show a chair with a texture somewhat similar to grass and a dark color making it look similar to a shadow. Both of the best performing objects, seen in Figures 18 and 19, are not shiny in their texture but are instead matte, suggesting that this could be one of the reasons why the model performs better on these two objects. The main difference between the easy and difficult objects, however, seems to be the width of the objects, with the easier-to-classify objects simply obscuring more of the picture than the slim legs of the chair.


5.1 Limitations

The surrounding conditions may change more between real runs than between the train and test runs recorded for this thesis's data sets. The data sets were also collected on a relatively small lawn with just six objects, and a real lawn can be several magnitudes larger in size and contain many more objects, which might mean that the light models in this thesis do not perform as well on a larger lawn.

Lastly, the mower did not collect data over several different runs and try to improve between each run; instead it was only trained on a single run and then evaluated on another. If the data collection and training routine deployed in this thesis is run for several runs in a row, the models might forget what they have previously learnt and end up only detecting objects that the mower collided with during the previous run.

5.2 Further research

The limitations of the method for object detection that was used in this thesis can be tackled by deploying the mower in a more realistic setting where it is allowed to collect data on a larger lawn with more objects and with real waiting times between runs. The method can be further developed by reducing the number of epochs it is trained for after the first run, including some data from older runs in the training set, or by adding a transfer learning approach to the network where some layers are trained before the product is shipped and only a few layers are trained on data from the lawn. If the same robotic lawn mower is used as in this thesis, one could also try to include the data from the other sensors in the NN.


References

[1] https://news.cision.com/se/husqvarna-ab/r/husqvarna-group-firar-1-miljon-robotgrasklippare-pa-en-vaxande-marknad,c2244277. (Visited on 04/12/2021).

[2] https://news.cision.com/se/hui-research/r/rekordstarkt-kopintresse-for-robotgrasklippare,c2030042. (Visited on 04/12/2021).

[3] https://www.volta.ai/mora/. (Visited on 04/12/2021).

[4] https://www.irobot.se/Terra. (Visited on 04/12/2021).

[5] https://www.husqvarna.com/se/grasmatta-tradgard/professionell-anvandning/professionell-robotgrasklippare/epos/. (Visited on 04/12/2021).

[6] https://github.com/HusqvarnaResearch/hrp. (Visited on 04/13/2021).

[7] Umberto Michelucci. Advanced Applied Deep Learning: Convolutional Neural Networks and Object Detection. 1st ed. Berkeley, CA: Apress, 2019. ISBN: 9781484249765.

[8] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas and H. S. Seung. "Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit". In: Nature (June 2000). URL: https://doi.org/10.1038/35016072.

[9] Irwin Sobel. "An Isotropic 3x3 Image Gradient Operator". In: Presentation at Stanford A.I. Project 1968 (Feb. 2014). URL: https://www.researchgate.net/publication/239398674_An_Isotropic_3x3_Image_Gradient_Operator.

[10] François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. 2017. arXiv: 1610.02357 [cs.CV].

[11] Christian Szegedy et al. "Going Deeper with Convolutions". In: CoRR abs/1409.4842 (2014). arXiv: 1409.4842. URL: http://arxiv.org/abs/1409.4842.

[12] Xukai Xie, Yuan Zhou, and Sun-Yuan Kung. "Exploring Highly Efficient Compact Neural Networks For Image Classification". In: 2020 IEEE International Conference on Image Processing (ICIP). 2020, pp. 2930-2934. DOI: 10.1109/ICIP40778.2020.9191334.

[13] Guozhong An. "The Effects of Adding Noise During Backpropagation Training on a Generalization Performance". In: Neural Computation 8.3 (1996), pp. 643-674. DOI: 10.1162/neco.1996.8.3.643.

[14] Nitish Srivastava et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". In: Journal of Machine Learning Research 15.56 (2014), pp. 1929-1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.

[15] https://github.com/TianzeLi/hrp_myversion. (Visited on 04/13/2021).


Appendices

The nine networks defined in Python.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


class HGC(tf.keras.layers.Layer):
    # Hierarchical group convolution layer: one 1x1 convolution per input
    # channel, where each convolution also receives the previous output channel.

    def __init__(self, layers):
        super(HGC, self).__init__()
        self.layers = layers  # number of input/output channels

    def build(self, input_shape):
        # One 1x1 convolution per channel.
        for i in range(self.layers):
            setattr(self, 'block' + str(i),
                    layers.Conv2D(1, 1, strides=1, padding='same',
                                  activation='relu', use_bias=True))

    def call(self, inputs):
        concat = layers.Concatenate(axis=-1)

        def unsqu(input):
            return tf.expand_dims(input, axis=-1)

        arr = [self.block0(unsqu(inputs[:, :, :, 0]))]
        for i in range(1, self.layers):
            arr.append(getattr(self, 'block' + str(i))(
                concat([arr[i - 1], unsqu(inputs[:, :, :, i])])))
        return concat(arr)


def CNN1():
    # Classical CNN with a 120x120x3 input, 148290 parameters.
    conv1 = layers.Conv2D(32, 4, strides=2, padding='valid', activation='relu', use_bias=True, name='conv1')
    conv2 = layers.Conv2D(32, 2, strides=1, padding='valid', activation='relu', use_bias=True, name='conv2')
    conv3 = layers.Conv2D(64, 4, strides=2, padding='valid', activation='relu', use_bias=True, name='conv3')
    conv4 = layers.Conv2D(64, 2, strides=1, padding='valid', activation='relu', use_bias=True, name='conv4')

    inp = layers.Input(shape=[120, 120, 3], name='input_image')
    x = conv1(inp)
    x = conv2(x)
    x = layers.Dropout(0.5, name='dropout1')(x)
    x = conv3(x)
    x = conv4(x)
    x = layers.Flatten(name='flatten')(x)
    x = layers.Dropout(0.3, name='dropout2')(x)
    last = layers.Dense(2, name='dense1')(x)
    return tf.keras.Model(inputs=inp, outputs=last)


def CNN2():
    # Classical CNN with a 60x60x3 input, 26306 parameters.
    conv1 = layers.Conv2D(32, 2, strides=2, padding='valid', activation='relu', use_bias=True, name='conv1')
    conv2 = layers.Conv2D(32, 1, strides=1, padding='valid', activation='relu', use_bias=True, name='conv2')
    conv3 = layers.Conv2D(64, 2, strides=2, padding='valid', activation='relu', use_bias=True, name='conv3')
    conv4 = layers.Conv2D(64, 1, strides=1, padding='valid', activation='relu', use_bias=True, name='conv4')
    conv5 = layers.Conv2D(32, 2, strides=2, padding='valid', activation='relu', use_bias=True, name='conv5')
    conv6 = layers.Conv2D(32, 1, strides=1, padding='valid', activation='relu', use_bias=True, name='conv6')

    inp = layers.Input(shape=[60, 60, 3], name='input_image')
    x = conv1(inp)
    x = conv2(x)
    x = layers.Dropout(0.3, name='dropout1')(x)
    x = conv3(x)
    x = conv4(x)
    x = layers.Dropout(0.3, name='dropout2')(x)
    x = conv5(x)
    x = conv6(x)
    x = layers.Flatten(name='flatten')(x)
    x = layers.Dropout(0.3, name='dropout3')(x)
    last = layers.Dense(2, name='dense1')(x)
    return tf.keras.Model(inputs=inp, outputs=last)


def CNN3():
    # Classical CNN with a 240x240x3 input and two max pooling layers, 46450 parameters.
    conv1 = layers.Conv2D(16, 4, strides=2, padding='valid', activation='relu', use_bias=True, name='conv1')
    conv2 = layers.Conv2D(32, 4, strides=2, padding='valid', activation='relu', use_bias=True, name='conv2')
    conv3 = layers.Conv2D(64, 4, strides=2, padding='valid', activation='relu', use_bias=True, name='conv3')

    inp = layers.Input(shape=[240, 240, 3], name='input_image')
    x = conv1(inp)
    x = layers.MaxPooling2D(name='maxPool1')(x)
    x = conv2(x)
    x = layers.Dropout(0.3, name='dropout1')(x)
    x = conv3(x)
    x = layers.MaxPooling2D(name='maxPool2')(x)
    x = layers.Flatten(name='flatten')(x)
    last = layers.Dense(2, name='dense1')(x)
    return tf.keras.Model(inputs=inp, outputs=last)
