
School of Innovation Design and Engineering

Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Dependable Systems, 30.0 credits

DIVERSE TIME REDUNDANT TRIPLEX PARALLEL CONVOLUTIONAL NEURAL NETWORKS FOR UNMANNED AERIAL VEHICLE DETECTION

Martin Bilger

martin.bilger@gmail.com

Hubert Stepien

hub.ste@hotmail.com

Examiner: Masoud Daneshtalab

Mälardalen University, Västerås, Sweden

Supervisors: Håkan Forsberg, Johan Hjorth

Mälardalen University, Västerås, Sweden

Company Supervisor: Billy Lindgren

SAAB Dynamics, Karlskoga, Sweden


Acknowledgment

We would like to wholeheartedly thank Dr. Håkan Forsberg and Johan Hjorth for their academic knowledge and support during this master thesis. They have provided superb guidance and fruitful discussions that allowed this thesis to become of higher quality. We also express gratitude towards Billy Lindgren for giving us the opportunity to perform this thesis at SAAB. This thesis would not have been possible without his support. Last but not least, we thank SAAB Dynamics for providing us with the necessary hardware and premises in which we could work on this thesis.


Abstract

Safe airspace at airports worldwide is crucial to ensure that passengers, workers, and airplanes are safe from external threats, whether malicious or not. In recent years, several airports worldwide experienced intrusions into their airspace by unmanned aerial vehicles. Based on this observation, there is a need for a reliable detection system capable of detecting unmanned aerial vehicles with high accuracy and integrity. This thesis proposes time redundant triplex parallel diverse convolutional neural network architectures trained to detect unmanned aerial vehicles to address the aforementioned issue. The thesis aims to produce a system capable of real-time performance coupled with the previously mentioned networks. The hypothesis is that this method will result in fewer mispredictions of objects other than drones and higher accuracy compared to singular convolutional neural networks. Several improvements in accuracy, fewer mispredictions, and faster detection times were observed during the experiments performed with the proposed system. Furthermore, a new way of interpreting the intersection over union results for all neural networks is introduced to ensure the correctness and reliability of results. Lastly, the system produced by this thesis is analyzed from a dependability viewpoint to provide an overview of how this work contributes to dependability research.


Sammanfattning

Safe airspace at airports worldwide is crucial to ensure that passengers, workers, and aircraft are safe from external threats, whether intentionally harmful or not. In recent years, many airports around the world have experienced intrusions into their airspace by unmanned drones. Based on this observation, there is a need for a reliable detection system that can detect unmanned aerial vehicles with high accuracy and integrity. This thesis proposes time redundant triplex parallel neural network architectures trained to detect unmanned drones to address the issue mentioned above. The thesis aims to produce a system capable of real-time performance in combination with the previously mentioned networks. The hypothesis is that this method will result in fewer mispredictions of objects other than drones and high accuracy compared to single neural networks. Several improvements in accuracy, fewer mispredictions, and faster detection times were observed during experiments with the proposed system. A new way of interpreting intersection over union results for all neural networks is introduced to further ensure the correctness and reliability of the results. Finally, the system designed in this thesis is analyzed from a dependability viewpoint to provide an overview of how it contributes to the research.


Contents

1. Introduction 1
2. Background 2
2.1 Unmanned Aerial Vehicles . . . 2
2.2 Neural Networks . . . 2
2.3 Convolutional Neural Network . . . 3
2.4 CNN Performance Evaluation . . . 5
2.5 Preprocessing of Data . . . 6
2.6 Dependability . . . 7
3. Related Work 8
3.1 Modular Redundancy in CNNs . . . 8
3.2 Drone Detection Methods . . . 8
3.3 Drone Focused Neural Network . . . 9
3.4 Dependability in CNN . . . 10
4. Problem Formulation 11
4.1 Hypothesis . . . 11
4.2 Expected Outcomes . . . 11
4.3 Limitations . . . 11
5. Method 12
6. Ethical and Societal Considerations 14
7. Test Preparations 15
7.1 Dataset . . . 15
7.2 Implemented Convolutional Neural Networks . . . 15
8. Testing Software 18
8.1 Intersection Over Union . . . 19
8.2 Region of Interest Based Re-inference . . . 21
8.3 Total Confidence of a Parallel CNN System . . . 21
9. Results 22
9.1 Diverse Parallel Architectures . . . 22
9.2 Time Redundancy . . . 26
10. Discussion 29
10.1 Limitations . . . 29
10.2 Methodology . . . 29
10.3 Hardware . . . 30
10.4 Neural Networks . . . 30
10.5 Results . . . 30
10.6 Dependability Evaluation . . . 31
11. Conclusions 33
11.1 Research Question 1 . . . 33
11.2 Research Question 2 . . . 33
12. Future Work 34
12.1 Further Research . . . 34
References 37


Appendix B Visual Representation of Mispredictions For All Neural Networks 39


List of Figures

1 Image of an unmanned aerial vehicle. . . 2

2 Visualization of a neural network. . . 3

3 Visualization of a convolutional layer. . . 4

4 Visualization of max pooling operation. . . 5

5 RICC Framework. . . 10

6 Flowchart for research question 1. . . 12

7 Flowchart for research question 2. . . 13

8 Sample images from training dataset. . . 15

9 Visualization of YOLO V5s detection procedure. . . 16

10 YOLO V5s average precision and recall . . . 16

11 MobileNet V2 average precision and recall. . . 17

12 EfficientDet D1 average precision and recall. . . 18

13 Testing software execution process. . . 19

14 Visualization of area of intersection. . . 20

15 Visualization of area of union. . . 21

16 Frames from testing video. . . 22

17 Results from the first test with 2% confidence threshold. . . 23

18 Visualization of the visual results from the first test. . . 24

19 Results of second test with 20% confidence threshold. . . 25

20 Response time for intersection over union. . . 26

21 Image of time redundancy interface. . . 26

22 YOLO V5s as time redundant network. . . 27

23 MobileNet V2 as time redundant network. . . 27

24 EfficientDet D1 as time redundant network. . . 28

25 Whole frame from the testing video. . . 38

26 Visualization of mispredictions. . . 39


List of Tables

1 Hardware setup. . . 18

2 Ground truth for test video. . . 22

3 Statistical results of first test with 2% confidence threshold. . . 23

4 Statistical results of first test with 20% confidence threshold. . . 25


Acronyms

ACRI Average Confidence Re-Inference. 28
AoI Area of Intersection. 20
AoU Area of Union. 20, 21
AP Average Precision. 6, 17
BiFPN Bi-Directional Feature Pyramid Network. 17
CIFAR Canadian Institute For Advanced Research Database. 5
CNN Convolutional Neural Network. iv, 1, 3–5, 7–10, 14, 16, 17, 19, 21, 22, 29, 30, 33, 34
DWSC Depth-wise Separable Convolutions. 16
FAA Federal Aviation Administration. 1
FN False Negative. 5, 25
FP False Positive. 5, 8, 9, 23, 25
IoU Intersection over Union. 19–21, 25, 26, 32–34
mAP Mean Average Precision. 6, 9
MNIST Modified National Institute of Standards and Technology Database. 5
MR Modular Redundancy. 8
n-Amount Any Amount of Positive Integers. 8
NN Neural Network. 1–10, 15–17, 19, 21–23, 26, 29–34
PD Percentage Difference. 28
PPD Percentage Point Difference. 28
RADAR Radio Detection and Ranging. 2, 8
ReLU Rectified Linear Unit. 2, 16
RF Radio Frequency. 8, 9
RGB Red, Green and Blue. 4, 17
RICC Robustness Interpretability Completeness Correctness. 10, 31
TAC Total Average Confidence. 28
TN True Negative. 22, 23, 25, 26
TP True Positive. 5, 9, 19, 22, 23, 25
UAV Unmanned Aerial Vehicle. 1, 2, 8, 15, 31, 32
YOLO You Only Look Once. 9, 15–17, 27, 29–31


1. Introduction

During the last five years, commercial drone sales have increased and are projected to continue growing [1]. This growth, coupled with the components needed to build a drone becoming more available and easier to use (plug and play), implies that nearly everyone who wants to own and fly a drone can do so. Unmanned Aerial Vehicles (UAV) are not a problem in themselves; it is when a drone operator knowingly or unknowingly enters an airport's restricted airspace unlawfully that a safety risk arises. A drone can pose a significant threat to aircraft and airport safety. The Federal Aviation Administration (FAA) in the United States of America tracks all possible drone intrusions in and around the proximity of its airports [2]. Judging from the FAA's data, drone sightings occur semi-frequently. There is therefore a need for dependable, robust, and accurate drone detection methods, one of which is detection with the help of Neural Networks (NN). The problem with neural networks is that their logic is based on probabilistic approaches that cannot be fully trusted in safety-critical environments.

Recently, a more significant focus has been directed at finding and identifying UAVs using Convolutional Neural Networks (CNNs). In 2020, relevant articles about obstacles and challenges in drone detection with CNNs were published by, amongst others, Pawelczyk et al. [3], who emphasize the need for large datasets depicting drones in diverse environments to better train CNNs. Furthermore, the detection of UAVs must be reliable and accurate when used in airport environments. A misdetection between an airplane and a UAV is explicitly undesired. Unfortunately, state of the art research on dependability in CNNs is scarce. Relevant work was conducted by Latifi et al. [4]. Their work describes how up to 30 parallel neural networks can help reduce mispredictions and further improve the reliability of outputs in CNNs. The evaluation of NNs from a dependability standpoint is necessary to better understand the implications of implementing NNs in safety-critical applications. Nuhrenberg et al. [5] proposed a framework for evaluating dependability in NNs.

This thesis will investigate how different combinations of CNN architectures affect mispredictions and drone detection accuracy. Furthermore, experiments regarding how time redundancy may affect the performance of diverse architectures are performed. The implemented architectures are YOLO V5s, MobileNet V2, and EfficientDet D1. Experimental research is the central aspect of gathering empirical evidence for analysis, discussion, and conclusion. This thesis shows an apparent reduction in mispredictions and higher detection accuracy by implementing diverse parallel CNNs with time redundancy compared to singular networks.


2. Background

The background section provides the essential knowledge needed to understand the conducted work and its relevance. The section touches on neural networks, unmanned aerial vehicles, data processing, and dependability aspects.

2.1 Unmanned Aerial Vehicles

Unmanned Aerial Vehicles (UAV), as seen in Figure 1, are aircraft that rely on remote control inputs from humans or on autonomous operation [6], [7]. These machines often take the shape of an airframe connected to multiple rotors that provide lift and steering capabilities.

Figure 1: UAV with an attached camera [8]. CC0 by Robert Lynch.

UAVs have a wide area of usage, ranging from search and rescue operations [9] to recreational drone racing. The airframes can therefore be equipped with numerous sensors, such as cameras [9] and Radio Detection and Ranging (RADAR), to enhance UAVs' operational capability. The above-described capabilities of UAVs provide positive results when used correctly within the proper boundaries and intended operation, meaning that no harmful interference originates directly or indirectly from the operation of a UAV. However, operators can fly UAVs in a manner that either immediately endangers human lives or provides information on secure areas, which must remain secret for numerous reasons. One particular scenario where UAVs pose a direct risk to human lives is the airport environment. A potential collision between an aircraft and a UAV could damage the aircraft's structure and vital components such as engines. This event could potentially lead to loss of life if the aircraft crashes.

2.2 Neural Networks

Neural Networks allow computers to learn and interpret the world around them. These networks use models that mimic the human nervous system, which has been evolving for more than 500 million years. The human nervous system has developed neural synapses that connect neurons [10] to learn based on received inputs. After a conducted learning process, the human brain can produce an output based on previously learned knowledge. This process is biologically controlled in humans [10], while computers must rely on complex mathematics to achieve a similar result [11]. Figure 2 shows a classical NN architecture. Human neurons continuously fire between each other and use the same mechanism to send information, while NNs can use different activation methods. These methods include, but are not restricted to, the rectified linear unit (ReLU) [12], see Equation 1,

$f(x) = \max(0, x)$ (1)

and the logistic sigmoid function [13], see Equation 2:

$f(x) = \frac{1}{1 + e^{-x}}$ (2)

These functions only send an output to the next artificial neuron if a threshold has been reached on the previous input's value. The functions help the system learn different features. The training process for NNs may take anything from a few minutes to several weeks or even years, depending on the size of the training dataset and the number of features to recognize. Furthermore, the architecture of a NN also affects training time, through variables such as image input size and the size of the network. The completed training process produces trained weight files, allowing the network to recognize previously learned features in new images that are not present in the training dataset.


Figure 2: Classic architecture of a Neural Network consisting of 1 input layer, 3 hidden layers and 1 output layer.

NNs are highly versatile, meaning the operations that can benefit from their usage range from computing slip in heavy autonomous haulers [14] to analyzing music genres [15]. It is important to emphasize that this versatility only holds if thorough training of a network is conducted with a large, diverse dataset consisting of non-homogenous images of the desired objects to detect [3].

2.3 Convolutional Neural Network

Convolutional Neural Networks are a subset of the NN domain. These networks specialize in tasks that require working with images. One of the first CNNs is LeNet, proposed by Yann LeCun et al. in 1989 [16]. This NN performs analysis on handwritten numbers and outputs the detected digits. This CNN is known for being one of the first NNs with practical applications. For example, it helped the US Postal Service detect postal codes efficiently and with fair accuracy, proving that CNNs can solve practical visual tasks [16].

CNNs learn to identify shapes inside images by feeding the image through multiple layers, such as convolution-, pooling- and fully connected layers [17]. The three layers act as a foundation upon which the network can further evolve. The previously mentioned layers’ role is to dissect the full image into subsets. As the image sequences propagate through the layers, patterns begin to emerge [18]. These patterns show different aspects of the image of varying importance decided by the network’s weights and bias. The learned patterns allow the network to detect whether a previously learned pattern is detected or not [19].


Convolutional layers

The convolutional layer's role, as depicted in Figure 3, is to extract high-level features from an image, for example, edges and color. These are commonly known as convolved features [19]. This convolution is possible because of the way that computers see pictures. For a computer, viewing an image consists of looking at pixels with different corresponding values, initially in the red, green, and blue (RGB) color scales. The RGB scale of an image is a triple matrix where each of the colors mentioned above has its own matrix. Processing a full-sized image with dimensions of 1920 × 1080 px is greatly resource-consuming for a CNN; hence, images are commonly reduced to greyscale to reduce their computational size [19]. Convolving images into a smaller form factor without losing the essential features reduces this problem. Performing a convolution is the process of applying a filter, which can be an N × M matrix, over the whole matrix representing the image to convolve. As the previously mentioned RGB representation consists of three different M × N matrices and the filter strides over each of them, the output is three convolved matrices. These matrices are added together to produce a one-dimensional matrix with essential features for further processing by the network. For greyscale pictures, the convolution outputs one M × N matrix.

Figure 3: Visualization of a convolutional layer presented by Maghrebi et al. [20].
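To make the sliding-filter idea above concrete, the sketch below applies a single 3 × 3 filter to a greyscale image with a stride of 1. It is a minimal illustration, not the implementation used in the thesis; the image and filter values are made up for the example.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a kernel over a greyscale image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # Element-wise multiply the patch with the filter and sum the result.
            output[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return output

# Hypothetical 5x5 greyscale image containing a vertical edge.
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

print(convolve2d(image, edge_filter))  # Strong responses where the vertical edge lies.
```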

Max pooling

To further down-sample an image after a convolution, a CNN may use a max-pooling method to extract the highest matrix values previously obtained from the convolutional layer. It is possible to preserve the previously detected features on a smaller scale by performing this operation [21]. This operation is performed similarly to a convolutional layer, see Figure 4. Furthermore, max-pooling may help prevent overfitting. Overfitting occurs when a NN has been overtrained on the training dataset and performs badly when running inference on images not included in the training dataset [22].


Figure 4: Max pooling performed on a 4x4 matrix using a 2x2 filter with a stride of 2. Every quadrant consists of 4 squares, only the square with the highest value will be saved in the matrix after max pooling for further processing by the CNN.
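The 2 × 2, stride-2 max pooling in Figure 4 can be sketched in a few lines; the 4 × 4 input below is a hypothetical example chosen only to mirror the figure.

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """Keep only the largest value in every non-overlapping 2x2 block."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            pooled[y // 2, x // 2] = feature_map[y:y + 2, x:x + 2].max()
    return pooled

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 9, 1],
                        [3, 1, 4, 8]], dtype=float)

print(max_pool_2x2(feature_map))
# [[6. 5.]
#  [7. 9.]]
```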

Dataset

Datasets are collections of records where every piece is from within the same domain. These sets can contain information about any number of parameters. Within the realm of NNs, some of the better known databases are the Modified National Institute of Standards and Technology database, MNIST [23], and the Canadian Institute For Advanced Research CIFAR-10 [24]. The MNIST dataset consists of 70 000 images of handwritten numbers for the training of NNs. The CIFAR-10 dataset is more diverse than MNIST, as it contains multiple classes ranging from airplanes to frogs and consists of 60 000 images. Datasets for training NNs are often large, as a significant variety of visual aids helps the network learn how to perform the desired task.

Transfer Learning

Transfer learning is commonly referred to as fine tuning a NN to perform additional detections of a newly introduced object. The NN uses its previous knowledge, also known as weights, to detect new objects [25]. An example of transfer learning is a network that can recognize camping items such as bottles, knives, backpacks, and sandwiches. When introducing a new object, such as a thermos, the old weights are retrained to help the network identify thermoses. Fine tuning takes already known data points and helps the network learn new features to detect. By transfer learning the network, there is no need to retrain it from the beginning, thus saving time. Furthermore, transfer learning is possible to perform on relatively small datasets, which further reduces the time required to train the network [25].

2.4 CNN Performance Evaluation

CNNs can be evaluated in several ways. For example, using multiple evaluations for a CNN can provide a better understanding of how a network is performing and what aspects of it need improvement.

Precision and Recall

Precision is the evaluation metric measuring what percentage of all positive predictions are correct, relative to all positive detections [26], see Equation 3.

$\mathit{Precision} = \frac{TP}{TP + FP}$ (3)

Equation 4 shows the recall, which is similar to precision. The difference is that it shows what percentage of all actual positives are detected [26].

$\mathit{Recall} = \frac{TP}{TP + FN}$ (4)

Mean Average Precision

Mean Average Precision (mAP) output is based on a multivariate analysis of several parameters. The parameters consider how well the ground truth bounding boxes overlap with detection bounding boxes, the average precision, the number of classes detected, and the number of classes to detect [27], see Equation 5. NNs trained on the COCO [28] datasets commonly use mAP to compare performance between each other, as COCO provides well defined evaluation datasets.

$\mathit{mAP} = \frac{\sum_{q=1}^{Q} AP(q)}{Q}$ (5)

where:

Q = Number of classes detected
AP(q) = Average Precision for class q

Balanced F-Score

The balanced F-Score, also known as the F1 score, is a gauge of the harmonic mean between precision and recall [29], see Equation 6. This metric ranges between 0 and 1. The lower this score is, the more imbalance exists between precision and recall. A high F1 is perceived as favorable for a NN.

$F_1 = 2 \times \frac{\mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}$ (6)
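As a sanity check of Equations 3-6, the short sketch below computes precision, recall, F1, and a per-class mean average precision from counted detections. The TP/FP/FN counts and per-class AP value are made-up example numbers, not results from the thesis.

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that are correct (Equation 3)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Share of actual positives that are detected (Equation 4)."""
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (Equation 6)."""
    return 2 * (p * r) / (p + r)

def mean_average_precision(ap_per_class: dict) -> float:
    """Mean of the per-class average precisions (Equation 5)."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical counts for a single 'drone' class.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1_score(p, r):.3f}")
print(f"mAP={mean_average_precision({'drone': 0.71}):.3f}")
```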

2.5

Preprocessing of Data

Preprocessing is an essential part of any NN. It can increase the network's accuracy by changing different aspects of the data. Real-world data is rarely perfect, and misinterpreted data can cause the network to train on faulty data. Preprocessing can address imperfect data by filtering out noise or feeding the NN missing data [30].

Data Augmentation

Data augmentation is part of the preprocessing as a way to enlarge a dataset or create new perspectives of the desired object, such as rotation, blurring, color space modification, or noise injection. The goal of the data augmentation is to increase the accuracy of the network by preparing it for scenarios that are not in the dataset [30].

Rotation

The object will rarely be in the same orientation in a real-world scenario. By performing rotation augmentation, the network will train on images rotated from 1° to 359°. The rotation degree parameter determines how many degrees each image is rotated. When using rotation augmentation, it is critical to note that the labels for the image need to be corrected to represent the rotated image [30].

Color space modification

One of the biggest challenges for image recognition is lighting bias, as the color values change when the light changes. The most effective countermeasure for this is to manipulate the lighting in the dataset, for example by changing the contrast or white balance, sharpening the image, or even blurring the images, since perfect focus cannot be achieved at all times in real-world applications. By altering these parameters, the network will be better trained [30].
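The following sketch shows how the rotation and lighting augmentations described above could be produced with the Pillow library; the file name, angles, and enhancement factors are arbitrary example values, not the preprocessing pipeline used for the thesis dataset.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def augment(image: Image.Image) -> Image.Image:
    """Return a randomly rotated, brightness/contrast-shifted, occasionally blurred copy."""
    angle = random.randint(1, 359)              # rotation augmentation
    out = image.rotate(angle, expand=True)      # note: bounding-box labels must be rotated too
    out = ImageEnhance.Brightness(out).enhance(random.uniform(0.7, 1.3))
    out = ImageEnhance.Contrast(out).enhance(random.uniform(0.8, 1.2))
    if random.random() < 0.3:                   # occasional blur to mimic imperfect focus
        out = out.filter(ImageFilter.GaussianBlur(radius=1))
    return out

if __name__ == "__main__":
    drone = Image.open("drone_example.jpg")     # hypothetical training image
    augment(drone).save("drone_example_aug.jpg")
```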


2.6 Dependability

"Dependability is the ability to deliver service that can justifiably be trusted" Laprie et al. [31, pp.1]. Dependability is an umbrella term consisting of several properties to enhance and provide guidance for developing dependable systems. There are three main areas in this domain: attributes, means, and threats. Attributes refer to the properties that a system should possess to achieve a higher standard of operation; such features are reliability, safety, and integrity [32]. Means tell about various methods used to predict, remove and tolerate faults. Threats classify different levels of unintended function in a system, which are in the following order of increasing severity: fault, error, and failure. Failure describes an event where incorrect system service occurs [31].

CNNs most often provide percentages as an output together with a label of the finding in an image. Thus, increasing the output certainty of a NN can be seen as an attribute of dependability, especially reliability.

Diverse Redundancy

Diverse redundancy is a method of providing high safety integrity and fault tolerance. It falls under the means category mentioned in the previous section. This redundancy method can mitigate faults such as design faults and common mode faults, which are easy to miss during development and may lie dormant for an unknown amount of time until causing a failure. Implementing diverse redundancy implies the use of parallel architectures that are diverse forms of each other. In practice, this redundancy can be implemented by coding different algorithms that aim to produce the same results, where the algorithms differ in the way they are coded. This method can lead to a reduction or nearly total elimination of design faults. The algorithms may even use different programming languages for further diversification [33].

Time Redundancy

Time redundancy is a robust method of detecting and masking faults that may arise during data transmission between different subsystems. Two common ways of implementing time redundancy in a system are re-transmission or re-computation of previously computed data to ensure the integrity of results, and sequential measurements of values output by a sensor. Sensor output values are then measured at close time intervals and compared. One example of time redundancy is ensuring that the velocity of a moving vehicle is correct. Logically, if at time = 0 the velocity is 2 m/s, at time = 1 the velocity is still 2 m/s, and at time = 3 the velocity is 2.1 m/s, the sensor is working correctly. Should the sensor at time = 4, for example, suddenly output a value of 99 m/s, then there is reason to suspect that the velocity sensor has failed. Time redundancy can be implemented in a triple voter; thus, it can help correct faults if two out of three inputs share the same value, meaning that a faulty third input does not propagate further through the system. This redundancy method protects against transient faults, which are faults that sporadically appear and quickly disappear due to various reasons such as bit flips [32].
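As an illustration of the velocity example and the triple voter above, the sketch below re-checks consecutive sensor readings against a plausibility limit and applies a simple two-out-of-three majority vote; the threshold and readings are invented for the example.

```python
from collections import Counter
from typing import Optional

MAX_STEP = 5.0  # hypothetical largest physically plausible change between samples (m/s)

def plausible(previous: float, current: float) -> bool:
    """Time redundancy: compare consecutive samples and flag implausible jumps."""
    return abs(current - previous) <= MAX_STEP

def vote_2oo3(a: float, b: float, c: float) -> Optional[float]:
    """Return the value agreed on by at least two of three inputs, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

readings = [2.0, 2.0, 2.1, 99.0]  # the last sample is a transient fault
for prev, curr in zip(readings, readings[1:]):
    print(f"{prev} -> {curr}: {'ok' if plausible(prev, curr) else 'suspect sensor fault'}")

print(vote_2oo3(2.0, 2.0, 99.0))  # 2.0 -- the faulty input is masked
```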


3. Related Work

This section familiarizes the reader with other researchers’ work and problems from the neural network domain. The work included in this section is written from a dependability standpoint. It includes modular redundancy, new frameworks for dependability in NNs, and different drone detection methods. Later in the report, the results will be compared to the related work section to make relative comparisons between the outcomes of this thesis and state of the art.

3.1 Modular Redundancy in CNNs

In the dependability domain, modular redundancy (MR) achieves a higher standard of reliability by implementing a module multiple times to provide robustness against errors in a system. Latifi et al. [4] propose using MR coupled with NNs to reduce false-positive (FP) results and increase the system's detection confidence. In their approach, up to 30 parallel NNs with identical architectures were fed differently pre-processed images. Pawelczyk et al. [3] did training individually on each of the networks. All of the weights are randomly initialized to achieve training diversity. During training, images were fed to the NNs in a random order. The different pre-processing methods are based on a pre-study conducted by Pawelczyk et al. [3]. The analyzed pre-processing methods range from simple horizontal flips to more complex techniques such as ImAdjust [34] in MatLab. The pre-study results show that more straightforward pre-processing methods that preserve an image's core features are preferred, as those end up with more accurate results from the CNN.

Furthermore, Pawelczyk et al. [3] provide evidence supporting their hypothesis that networks can improve the overall detection reliability by not looking at the outputs of the CNNs as a whole but instead analyzing how confident the networks are in a detected feature. This way of evaluating the outputs may help resolve the issue of NNs having high confidence in predicted outcomes coupled with a FP result. To achieve the previously mentioned outcomes, Pawelczyk et al. [3] have implemented a two-step method where all outputs from the network's softmax layer [35] are stored as probability vectors, which are then compared to a threshold value set by the user. Vectors that cross the threshold are assumed to be reliable outcomes. Since an n-Amount of networks work together, the correctness is further amplified if multiple networks come to the same conclusion. By combining all of the above steps, Pawelczyk et al. [3] managed to reduce false positive outcomes overall by 33.5% while achieving high accuracy of detected objects [4].
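The two-step idea described above (thresholding each network's softmax probabilities, then requiring agreement between several parallel networks) can be sketched as follows; the threshold, the required number of agreeing networks, and the probability vectors are placeholder values, not those of the cited work.

```python
import numpy as np
from typing import List, Optional

THRESHOLD = 0.6      # hypothetical per-network confidence threshold
MIN_AGREEING = 2     # hypothetical number of networks that must agree

def ensemble_decision(probability_vectors: List[List[float]],
                      threshold: float = THRESHOLD) -> Optional[int]:
    """Step 1: keep only confident outputs. Step 2: require agreement on the class."""
    votes = []
    for probs in probability_vectors:
        probs = np.asarray(probs)
        if probs.max() >= threshold:          # confident enough to count as a vote
            votes.append(int(probs.argmax()))
    if not votes:
        return None
    winner = max(set(votes), key=votes.count)
    return winner if votes.count(winner) >= MIN_AGREEING else None

# Softmax outputs from three parallel networks over classes [background, drone].
outputs = [[0.15, 0.85], [0.30, 0.70], [0.55, 0.45]]
print(ensemble_decision(outputs))  # 1 -> two networks confidently agree on 'drone'
```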

3.2 Drone Detection Methods

Currently, multiple methods for the detection of drones are available, including RADAR, acoustic, visual, and radio-frequency (RF) methods. All of these technologies have advantages and disadvantages that make them more or less suitable for different scenarios. There is also a suggested method that fuses the data from these techniques. While the complexity and cost would increase substantially, the advantages would be a broader application scenario and increased detection accuracy and range.

Furthermore, Yang et al. [36] explore how the different methods compare with regard to detection range and the different challenges of each method mentioned above. Their findings suggest that RADAR technology may detect drones at up to 3000 meters. An acoustic method focusing on detecting a drone in the time-frequency domain may detect a drone between 40 and 300 meters. A visual approach can detect a drone by detecting motion and features exclusive to UAVs between 100 and 1000 meters. The final researched method, RF, can detect drones up to 1000 meters by analyzing the incoming frequency of a control signal, commonly 2.4 GHz for amateur UAVs. Knowing these limitations, an experimental anti-drone system, "ADS-ZJU", was developed. The system consists of an acoustic array, vision and RF sensors, and an RF jamming unit. The whole system was tested in an urban environment, on a rooftop next to a football field. For analyzing the performance of the system, the researchers ran 10 000 detections per detection method. The probability of detection was calculated with Equation 7:

$P_d = \frac{N_d}{N_p}$ (7)

Nd is the number of accurate TP detections, and Np is the number of detections for each distance. Further testing shows the number of misdetections by this system, aiming to catch FP results by exposing the system to 10 000 detections where no drone is in the frame. The FP predictions were calculated with Equation 8:

$P_{fa} = \frac{N_{fa}}{N_n}$ (8)

Nn is the total number of detections made, and Nfa is the number of FP detections.

True-positive predictions made at 100 meters ranged from 21.1% using acoustics and 73.3% with RF, up to 95.4% when using a vision sensor. The false-alarm ratio showed a similar pattern, where the RF method scored highest with a probability of 2.8%, followed by acoustics with 2.2% and vision with 1.1% [36].

As the acoustic method is noise sensitive, it would not be suitable for drone detection at an airport. After all, airports are highly noisy environments where a single engine can produce 140 dB of audible noise [37]. Furthermore, the RF-based drone detection method cannot detect autonomously flown drones, and as demonstrated above, its detection confidence falls off significantly after 100 meters. Therefore, these options do not seem viable for this master thesis. Coupled with the complexity of developing and researching radar technology and the rapid development of CNNs, a visual-based approach is more optimal, as it can provide a robust solution with high accuracy and relatively low mispredictions.

Visual Detection

Several architectures have been tested and analyzed for use in drone detection. Taha et al. [38] compared multiple architectures, and YOLO v2 was concluded to be the most accurate network, at approximately 90%. When that article was published, YOLO v2 was state of the art. At the time of writing this thesis, YOLO v5 is the newest official version.

3.3 Drone Focused Neural Network

CNNs require immense amounts of diverse real-world data in the form of images to train on and to produce satisfyingly reliable results. For the data to be useful, it has to be pre-processed and annotated, which can be a time-consuming task if done by hand. Pawelczyk et al. [3] describe the current problems with acquiring datasets for training NNs in detecting UAVs. Because it is such a niche subject, datasets are relatively scarce, and there currently exist no databases with large UAV datasets. To address these problems, the researchers created their own datasets by filming UAVs in different environments, extracting individual frames, and annotating them. A dataset created by Zhao et al. [39] was investigated. This dataset, however, is very homogenous and does not accurately represent real-world scenarios, so Pawelczyk et al. [3] excluded it from the training. Based on the previously mentioned datasets, two different NNs were trained. One is a CNN based on the MobileNet V1 [40] architecture, and the other is a HAAR Cascade algorithm. The HAAR method scores 55% accuracy and a 32% F1 score, while the CNN scores 69.4% in accuracy and 60.2% in F1 score. For future work, Pawelczyk et al. [3] propose that more diverse and multiple datasets should be used to further increase the accuracy of a CNN. It is essential to underscore that the CNN was not trained to its full potential; the authors argue that although the model's mAP appears to stagnate after 750 000 iterations, it is still steadily increasing, meaning that further accuracy can be extracted from their model. The evidence supporting this comes from non-smoothed accuracy data, which rises from around 55% to 58.5% between iterations 750 000 and 1 000 000.


3.4 Dependability in CNN

The dependability aspect of engineering mainly focuses on improving the safety and security of products. As mentioned above, NNs are emerging in engineering as a cutting-edge method to solve tedious tasks and improve products' versatility. The products range from non-safety-critical, such as optimizing cameras in mobile phones, to safety-critical applications like self-driving cars. NNs possess little to no deterministic features in terms of outcome certainty when constantly processing new inputs, as the detections are probabilistic. NNs, however, behave deterministically when processing the same input multiple times. These challenges can be resolved if a framework can cover dependability aspects of the NN domain. Such a framework is described by Nuhrenberg et al. [5]. In their work, some key ground aspects are mentioned. These aspects follow the RICC abbreviation, see Figure 5.

Robustness focuses on the NN's ability to tolerate misrepresentations of inputs, together with a security aspect concerning inputs designed to trick the NN into believing that the input is something other than what is represented in reality. Interpretability is a term that may help in understanding how a NN learns. A sub-aspect of this is how precise the network is in classifying what it interprets. Completeness also relates to the NN's training process and aims to identify how all of the different scenarios to learn are covered by a NN. Correctness is directly related to the dependability framework proposed by Laprie et al. [31], where the correctness of operation is evaluated as the ability to perform a task without errors.

Figure 5: The proposed framework together with detailed relationship between different aspects of a neural network [5].


4. Problem Formulation

Airports are vulnerable to aerial intrusions [41], [42]. An intrusion can prove challenging to contain and poses a severe safety risk to airport passengers, equipment, and personnel. There is a need for a reliable method of finding drones in airport environments. This need is based on recent years' intrusions in and around airports by UAVs [2]. This master thesis will evaluate how diverse architectures implemented in parallel with each other affect mispredictions and whether higher detection accuracy can be achieved.

The authors of this thesis will focus on further experimenting with increasing detection reliability by exploring how time redundancy can be implemented into a CNN and, if it can, what kind of results can be achieved. For the authors to be able to answer the above inquiries, the following research questions and hypothesis have been formulated:

RQ1: How can redundant architectures of convolutional neural networks sustain high drone detection and low mispredictions of other objects?

RQ2: What impact does time redundancy have on the selected architectures in RQ1?

4.1 Hypothesis

The hypothesis is that time redundancy and diverse redundant architectures, including convolutional neural networks, can improve drone detection and specifically reduce false-positive predictions.

4.2 Expected Outcomes

This master thesis's primary focus is to provide research on neural networks from the standpoint of dependability. For this to be possible, clear and rational results must be obtained for further analysis. These results will then provide valid conclusions. One of the desired outcomes is to show that when diverse redundancy methods are applied to neural networks, an increase in accuracy and a reduction in mispredictions will occur. When it comes to time redundancy, the authors hope to further increase accuracy by validating and verifying results through recomputing the area of intersection over union produced by the parallel convolutional neural networks.

4.3 Limitations

As of the writing of this thesis, there is a pandemic of SARS-CoV-2. This imposes restrictions on accessing campus grounds and can limit the amount of work done at SAAB, as travel is not recommended. Neural networks require powerful hardware both to train and to run them with good results. The authors currently do not have access to state of the art hardware, as the current generation of graphical processing units is in extremely short supply worldwide. A possible solution to this problem is to use cloud computing services such as Google Colab [43] or Amazon Web Services [44]. This limitation compels the work to converge towards using networks with faster computational time, which are not very deep. These networks may not be as accurate as very deep neural networks. Thus, this has to be factored in during the evaluation of test results.


5. Method

To attempt to answer RQ1:

"How can redundant architectures of convolutional neural networks sustain high drone detection and low mispredictions of other objects?"

The question will be divided into three phases. The first phase, the pre-study phase, will consist of a literature study to compare and decide which diverse architectures are suitable for this application. During the second phase, the experimental phase, experimental research will be used as the method. Throughout this phase, the implementation of the chosen architectures will be performed. The phase will end with testing and validation of the implemented architectures, where each model will be tested separately and in combination with the other models on the same video; the flowchart can be seen in Figure 6. The goal of testing on the same video is to get results that will not be affected by the testing environment. The third phase will consist of analyzing the testing data from the experimental phase. The data should be extensive enough to conclude whether diverse redundant architectures can increase the accuracy and lower the number of mispredictions.

Figure 6: Flowchart for research question 1, covering the pre-study phase (literature study and conclusion on the most suitable architectures), the experimental phase (implementing the proposed architecture and testing), and the conclusion phase (data analysis and conclusion of RQ1).


For the second research question RQ2:

"What impact does time redundancy have on the selected architectures in RQ1?" Experimental research will be used as a method, see Figure 7. The hypothesis is that accuracy can be increased by recomputing the data that one or more neural networks have already processed. Both computations will be analyzed together to produce a good overview of produced results. The hypothesis for RQ2 is that when three neural networks detect the same object they will produce an intersection over union covering the detected object. To implement time redundancy, a cropped image based on the borders of the intersection over union will be fed to a neural network. This method can make it possible to further increase overall accuracy of detection and the reliability of results.


6. Ethical and Societal Considerations

This thesis aims to produce a reliable method of finding drones in airport environments by introducing diverse CNNs. The thesis may lay the grounds for further development of a surveillance system that will operate in civilian airports. The societal aspect is that passengers at airports can feel safer if this system is in use, knowing that something other than only humans is continuously monitoring the airspace around airports. The thesis is done in good faith and may help increase safety and security in civilian airport environments. The authors currently cannot express any concerns or knowledge about this work being implemented with nefarious intentions. Any technologies and methods produced will be turned over to SAAB AB, a Swedish defense company. They may use this as they deem fit, and it is not in the domain of the authors writing this thesis to speculate on how SAAB will use said developed methods. To protect the interests of SAAB and the Swedish Government, the students have signed multiple non-disclosure agreements. If any work in this thesis is deemed unsuitable for public release, two reports will be produced: one for SAAB and one for Mälardalen University. Any redacted information will be modified to suit the latter version.

The environmental impact of using this method will be negligible, as it is a computer-based system that will not train for weeks on end. This thesis does not involve any human research subjects, only material objects. Thus, there is no need to conceal personal data.


7. Test Preparations

This section of the master thesis familiarizes readers with the work performed by the authors in preparation for the results section. The work featured here covers the implemented CNNs, the produced Python algorithms, and plans for experiments together with test cases.

7.1 Dataset

The dataset on which the NNs from Section 3.3 were trained was initially created by Pawelczyk et al. [3]. The dataset consists of 50 000 drone images. The depicted UAVs in this dataset are not restricted to quadcopters; other objects such as remote controlled airplanes and cars, which technically can be classified as drones, are also included, see Figure 8. To eliminate unwanted items from the original dataset which are not quadcopters, the authors derived their own dataset based on the images from the original dataset. The derived dataset consists of 8800 handpicked images split between training and validation sets. The training set includes 8000 images, while the validation set consists of 800 images. The chosen images have been diversified as much as possible to achieve better detection rates. Diversification implies that the images depict quadcopters in varying scenarios and backgrounds with shifting blur, rotation, and brightness.

Figure 8: Example images from dataset on which all three CNNs are trained.

7.2 Implemented Convolutional Neural Networks

In this subsection, three fast and efficient convolutional neural networks are summarized. Their most outstanding features are described further down in this section. All of the networks are trained on Google Colab [43] using Nvidia Tesla K80 graphical processing units.

You Only Look Once V5s

YOLO V5s is the 5th iteration of the object detection model, implemented in the PyTorch framework [45]. This model is made by Ultralytics [46], a company specializing in artificial intelligence. All of the previous versions of YOLO, including this one, operate on the same principle of performing grid square detections of an image. First, the image is divided into M × M cells producing a grid system over the image, see Figure 9. This grid system makes it possible to perform object detection on every cell independently. Furthermore, should an object of interest be detected inside a cell, that cell will output a bounding box with information attached to it, such as the confidence of the detected object and the coordinates of where on the image it resides, together with the height and width of the bounding box.

Figure 9: Visualization of the different detection stages while running inference on YOLO: the image divided into an M × M grid, bounding boxes, a heat map based on class detection probabilities, and the final detection.

YOLO is known as the head of the NN responsible for predictions [47]. For the backbone, making the convolutions needed for a prediction, ResNet101 was chosen by the YOLO V5s creators. ResNet101 is part of the residual networks proposed by He et al. [48]. ResNets are defined by the ability to skip over specific layers containing ReLU functions and batch normalizations to provide more efficient training and better results than plain networks [48]. Furthermore, by skipping those layers, the vanishing gradient problem is reduced [48], see Figure 10.

Figure 10: Average Precision and Recall for the Pytorch implemented Yolo V5s after transfer learning on the handpicked dataset.

MobileNet V2

MobileNetV2 is a CNN architecture designed by Sandler et al. [49]. The architecture is built upon the original MobileNetV1 [40]. MobileNetV2 differentiates itself from its predecessor by including linear bottlenecks and inverted residuals in the core architecture. The core idea of a MobileNet is that it is optimized for running on resource-constrained devices such as mobile phones or embedded systems. The implementation of inverted residuals is possible by connecting the first layer of a convolutional block with the last fully connected layer, thus bypassing all of the layers in between. Since the last layer of a convolutional block in this CNN is a 1 × 1 × M kernel, the total number of parameters to learn is reduced. MobileNetV2 further reduces its computational costs by performing depth-wise separable convolutions (DWSC). The DWSC process is comprised of three stages. The first stage converts an input volume into an output volume for further convolution, see Equation 9.

$D_u \times D_u \times M \rightarrow D_v \times D_v \times N, \quad N > M$ (9)

where:

$D_u$, $D_v$ = Image width and height
M = Number of channels
N = Number of kernels

The second stage of this process maps the output volume per RGB channel. The mapping is comprised of M single-channel filters, see Equation 10.

$\sum_{1}^{M} (D_v \times D_v \times N \rightarrow D_r \times D_r \times 1)$ (10)

The last stage ensures that individual tensors are created by further reducing the input volume thus creating spatial filtering, see Equation 11.

$\sum_{1}^{N} (1 \times 1 \times M)$ (11)

The above volumes are linearly recombined kernels to produce a convolution over the input volume M times. Thus, by implementing DWSC, performance gains can be achieved by reducing the computational cost of convolutional operations [50].
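To give a feel for why DWSC is cheaper, the sketch below compares the multiplication cost of a standard convolution with the depthwise plus pointwise split for one hypothetical layer; the layer sizes are example values only.

```python
def standard_conv_cost(d_out: int, k: int, m: int, n: int) -> int:
    """Multiplications for a standard convolution: k*k*M*N per output position."""
    return d_out * d_out * k * k * m * n

def dwsc_cost(d_out: int, k: int, m: int, n: int) -> int:
    """Depthwise (k*k per channel) followed by pointwise (1x1xM, N kernels)."""
    depthwise = d_out * d_out * k * k * m
    pointwise = d_out * d_out * m * n
    return depthwise + pointwise

# Hypothetical layer: 56x56 output, 3x3 kernel, 64 input and 128 output channels.
std = standard_conv_cost(56, 3, 64, 128)
sep = dwsc_cost(56, 3, 64, 128)
print(f"standard: {std:,}  depthwise separable: {sep:,}  ratio: {sep / std:.2f}")
# The separable form needs roughly 1/N + 1/k^2 of the standard cost.
```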

MobileNet V2 is trained with 25 000 steps and a batch size of 32. The implemented model is based on TensorFlow 2 [51] and has previously been trained on the COCO 2017 [28] dataset. Figure 11 shows the average precision (AP) and average recall (AR) achieved by using transfer learning on MobileNet V2:

Figure 11: Average Precision and Recall for the implemented TensorFlow MobileNet V2 after transfer learning. The maxDets metric shows the maximum possible number of detections.

The validation images have been selected from the 8800 hand picked images from Pawelczyk et al. [3].

EfficientDet D1

EfficientDet is a family of NNs built upon the EfficientNet architecture [52] with an ImageNet backbone. Google develops this architecture. The unique features of the EfficientDet architecture are the Bi-directional Feature Pyramid Network (BiFPN) coupled with an updated set of rules for scaling the architecture. The scaling method is called compound scaling, and it aims to enlarge the depth, width, and resolution by the same factor. There are eight different EfficientDets, ranging from D0 to D7. The implemented EfficientDet is the D1 variant, selected due to its fast inference time and memory usage, and because it is a different architecture compared to YOLO V5s and MobileNet V2. The different variations of EfficientDet exist because expanding the maximum input resolution of a network changes its accuracy and computational costs, allowing for finer feature detection at the cost of computational performance.

EfficientDet D1 is fine-tuned on the above-mentioned handpicked drone dataset. The network has previously been trained on the COCO 2017 [28] dataset and is implemented in TensorFlow 2. Figure 12 shows the average precision and average recall achieved by using transfer learning on the model:

Figure 12: Average Precision and Recall for the implemented TensorFlow EfficientDet D1 after transfer learning.

This validation set is the same as that used for the other networks and does not contain any training images.

8. Testing Software

The test environment is set up on the Ubuntu 18.04.5 LTS (Bionic Beaver) operating system. The test software is created in Python 3.8 [53]. The hardware of the test setup can be seen in Table 1:

Operating System: Ubuntu 18.04.5 LTS
Processor: Intel Xeon W2123, 8 cores / 16 threads, 3.6 GHz
Graphics Unit: Nvidia Quadro P4000, 8 GB GDDR5 SDRAM, PCI-E 3.0 16x
RAM: 32 GB DDR4 ECC, 3200 MHz

Table 1: Hardware components of the testing setup

The software is initiated by starting each back-end for the models and waiting for every instance to finish loading. Each model will load the video and start to run inference on the first frame; when the inference is complete, it will publish a message containing the image, bounding box, and confidence over a socket to the main process and wait for a response. When each instance has published a message, a response is sent to every instance, telling it to run inference on the next frame. This handshake method keeps the networks from running inference out of sync, see Figure 13.


Figure 13: Flowchart depicting test software execution process for extracting data and visual confirmation of results.
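The lockstep behaviour described above can be sketched as a small coordinator loop. This is a simplified stand-in for the actual socket-based implementation: it uses in-process queues instead of network sockets, and the worker function, message fields, and frame count are assumptions made for the example.

```python
import queue
import threading

NUM_FRAMES = 3
NETWORKS = ["yolo_v5s", "mobilenet_v2", "efficientdet_d1"]

def run_network(name: str, results: queue.Queue, go: queue.Queue) -> None:
    """Worker: run (dummy) inference on one frame, publish the result, wait for 'next'."""
    for frame_id in range(NUM_FRAMES):
        detection = {"network": name, "frame": frame_id, "bbox": None, "confidence": 0.0}
        results.put(detection)       # publish the result to the coordinator
        go.get()                     # block until the coordinator releases the next frame

results_q: queue.Queue = queue.Queue()
go_queues = {name: queue.Queue() for name in NETWORKS}
threads = [threading.Thread(target=run_network, args=(n, results_q, go_queues[n]))
           for n in NETWORKS]
for t in threads:
    t.start()

for frame_id in range(NUM_FRAMES):
    # Handshake: wait until all three networks have reported on the current frame...
    batch = [results_q.get() for _ in NETWORKS]
    print(f"frame {frame_id}: got {len(batch)} detections, releasing next frame")
    # ...then tell every network to continue, keeping them in sync.
    for name in NETWORKS:
        go_queues[name].put("next")

for t in threads:
    t.join()
```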

8.1 Intersection Over Union

Intersection over Union is usually used to compare a network's bounding box with the ground truth. The result tells how accurately the network can detect an object. In this thesis, the IoU is used as an evaluation method to determine whether the NNs detect the same object. It ranges from 0 to 1, where 0 is no intersection and 1 is 100% intersection. The IoU of a parallel CNN system's output can be interpreted as a confidence between 0 and 1: the higher the value, the higher the possibility that the detected object is an object of interest. The threshold at which a detection shall be deemed a TP is up to the user to set.


The IoU is calculated according to Equations 12, 13, and 14:

$\mathit{Width} = \min(X_{1max}, X_{2max}) - \max(X_{1min}, X_{2min})$
$\mathit{Height} = \min(Y_{1max}, Y_{2max}) - \max(Y_{1min}, Y_{2min})$
$\mathit{AoI} = \mathit{Width} \times \mathit{Height}$ (12)

$BB_1 = (X_{1max} - X_{1min}) \times (Y_{1max} - Y_{1min})$
$BB_2 = (X_{2max} - X_{2min}) \times (Y_{2max} - Y_{2min})$
$\mathit{AoU} = BB_1 + BB_2 - \mathit{AoI}$ (13)

$\mathit{IoU} = \frac{\mathit{AoI}}{\mathit{AoU}}$ (14)

where:

$BB_n$ = Area of bounding box n
AoI = Area of Intersection
AoU = Area of Union

Figure 14: Visual representation of the area of intersection (overlap) between two bounding boxes.

Figure 15: Visual representation of the area of union (AoU), seen as the red area.
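Equations 12-14 translate directly into the helper below; the two example boxes reuse the corner coordinates shown in Figures 14 and 15 and are otherwise arbitrary.

```python
def iou(box1, box2) -> float:
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    width = min(box1[2], box2[2]) - max(box1[0], box2[0])
    height = min(box1[3], box2[3]) - max(box1[1], box2[1])
    if width <= 0 or height <= 0:
        return 0.0                       # the boxes do not overlap
    aoi = width * height                 # Equation 12: area of intersection
    bb1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    bb2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    aou = bb1 + bb2 - aoi                # Equation 13: area of union
    return aoi / aou                     # Equation 14

# Boxes matching the corner coordinates in Figures 14 and 15.
print(round(iou((100, 40, 180, 120), (120, 20, 200, 100)), 3))  # ~0.391
```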

8.2 Region of Interest Based Re-inference

To perform experiments with time redundancy, re-inference is chosen as the main method. Re-inference is based on the intersection over union of the projected bounding boxes from all of the networks running inference. The image on which inference will be performed again is based on the area of intersection from Figure 14. All of the NNs have their own requirements regarding input resolution; the input size of an image cannot be too small, as it would not be possible to convolve on it. By only cropping the area of overlap, the produced image would be too small to feed into the networks. To combat this sizing issue, the image is cropped 50 px outside the detection box on each side. This helps produce a larger image, which in turn makes it possible to run inference on it. This method should work for all CNNs. By cropping the image, it is expected that possible mispredictions will be reduced and the overall reliability of the results will increase. In addition, the results of re-inference will be added to the total confidence, and the image on which re-inference is run will be visually shown with a detection bounding box.
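A minimal sketch of the cropping step described above is shown here, assuming the frame is a NumPy array and the intersection box is given as (x_min, y_min, x_max, y_max); the 50 px margin follows the text, while the clamping to the frame borders is an added assumption.

```python
import numpy as np

MARGIN = 50  # pixels added around the intersection box, as described above

def crop_region_of_interest(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Cut out the intersection box plus a margin, clamped to the frame borders."""
    x_min, y_min, x_max, y_max = box
    h, w = frame.shape[:2]
    x0 = max(0, x_min - MARGIN)
    y0 = max(0, y_min - MARGIN)
    x1 = min(w, x_max + MARGIN)
    y1 = min(h, y_max + MARGIN)
    return frame[y0:y1, x0:x1]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # placeholder video frame
roi = crop_region_of_interest(frame, (120, 60, 180, 120))
print(roi.shape)  # (160, 160, 3): large enough to re-run inference on
```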

8.3 Total Confidence of a Parallel CNN System

The total confidence is used to measure how the accuracy changes when the three networks are combined. To interpret the total confidence, the IoU needs to be above 0.5; otherwise, the networks may have predicted different objects. The calculation of the total confidence can be seen in Equation 15:

$T_C = 1 - ((1 - C_{n1}) \times (1 - C_{n2}) \times (1 - C_{n3}))$ (15)

where:

$T_C$ = Total Confidence
$C_n$ = Confidence reported by network n
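Equation 15 and the IoU gate can be combined in a few lines; the confidence values below are invented to show the effect, and the 0.5 gate follows the text above.

```python
def total_confidence(c1: float, c2: float, c3: float, iou_all: float) -> float:
    """Equation 15, only trusted when the three boxes overlap enough (IoU > 0.5)."""
    if iou_all <= 0.5:
        return 0.0  # the networks may have predicted different objects
    return 1 - ((1 - c1) * (1 - c2) * (1 - c3))

# Three moderately confident networks agreeing on the same box.
print(round(total_confidence(0.70, 0.60, 0.50, iou_all=0.8), 3))  # 0.94
```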

9. Results

In the following sections, the results of the implementation are presented. The results cover all of the research questions and are divided into two main sections, one for each question.

9.1 Diverse Parallel Architectures

To evaluate the impact that diverse convolutional neural network architectures have on the reliability of results, test videos have been produced on which parallel inference is conducted according to the software described in Figure 13. In the first test, a quadcopter drone is flown from the right to the left side of the video and back, shown in Figure 16. The drone is briefly out of frame before making its return flight and again disappears out of frame towards the end of the video. The test script from Figure 13 produced results that have been plotted using MATLAB software to better visualize and analyze the data.

Figure 16: Example frames (a)-(f) from the analyzed test video on which inference is done.

Analysis of the first test

The test with triple diverse CNNs is conducted with a 2% confidence threshold for all networks. The three NNs can correctly identify the flying quadcopter when it is flying within the camera's view. The confidence of the bounding box accuracy for the detected object varies between the networks. On average, when disregarding mispredictions, the most accurate is YOLO V5s, followed by EfficientDet D1, with MobileNet V2 coming last, see Table 3. Table 2 shows all possible true detections.

Ground Truth
All frames   359
TP           184
TN           175

Table 2: The total number of frames in test video 1, all TP and all TN. All TN detections are detections where no drone is within the view of the camera.


Figure 17: Plot of accuracy relative to the frame number of the video from the first test. I over U is the intersection over union of the bounding boxes from all three neural networks. Total confidence is the combined prediction confidence of all of the networks (Equation 15).

Figure 17 shows that the quadcopter disappears out of sight between frames 94 and 225, which also occurs in the test video. While the intersection over union is 0 at frame 94, the NNs still report a confidence output, causing multiple FP predictions. EfficientDet D1 mispredicts the sign in the lower right corner of the video frame as a drone while the drone is out of frame, amounting to 49% of the frames in which no drone is present. MobileNet makes similar mispredictions at random locations on the screen. As its confidence is not 0, we classify those predictions as FP predictions. The bounding boxes of these predictions do not overlap, and due to the low overlap a decision is taken not to regard them as viable TP detections.

Detection Type    YoloV5         MobileNet      EfficientDet   IoU
True Positive     183 (99.4%)    182 (98.9%)    182 (98.9%)    184 (100%)
True Negative     112 (64%)      0 (0%)         0 (0%)         172 (98.2%)
False Positive    64 (26.7%)     177 (50.2%)    177 (50.2%)    3 (1.6%)

Table 3: Representations of all TP, FP, and TN predictions. False negative detections are not included as these have not been observed to occur. Confidence threshold is set at 2% for the inference.

To be able to calculate the rates of detections as percentages, the following Equations 16-19 were used:

Tpr = Tpd / Atf (16)

where:
Tpr = True positive detection rate
Tpd = True positive detections
Atf = All true frames

Tnr = Tnd / Anf (17)

where:
Tnr = True negative detection rate
Tnd = True negative detections
Anf = All negative frames

Fpr = Fpd / Af (18)

where:
Fpr = False positive detection rate
Fpd = False positive detections
Af = All frames in the test video

Fnr = Fnd / Af (19)

where:
Fnr = False negative detection rate
Fnd = False negative detections
Af = All frames in the test video
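As a worked example, the rates in Equations 16-19 can be computed from the counts in Tables 2 and 3 (184 true frames, 175 negative frames, 359 frames in total); the function below is an illustrative sketch, not the thesis code.

```python
def detection_rates(tp, tn, fp, fn, true_frames, negative_frames, all_frames):
    """Detection rates according to Equations 16-19."""
    return {
        "Tpr": tp / true_frames,      # Equation 16
        "Tnr": tn / negative_frames,  # Equation 17
        "Fpr": fp / all_frames,       # Equation 18
        "Fnr": fn / all_frames,       # Equation 19
    }

# YOLO V5s in the first test: 183 TP out of 184 true frames (Tpr ≈ 0.994)
# and 112 TN out of 175 negative frames (Tnr = 0.64), matching Table 3.
rates = detection_rates(183, 112, 64, 0, true_frames=184, negative_frames=175, all_frames=359)
```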

Furthermore, the three networks can cooperatively produce a more accurate visual prediction. In Figure 18, both EfficientDet and YoloV5 produce bounding boxes with an inaccurate height, while MobileNet produces a bounding box with a more accurate height at the cost of overall accuracy. The region where all of the networks' bounding boxes intersect has the highest probability of containing a drone, with a probability of 95.5%.

Figure 18: Screenshot from test video 1 on which inference has been performed, with graphical confirmation applied.

From the graph, it is evident that using the intersection over union from all three networks as an additional correctness-of-operation check is a viable strategy for achieving more reliable results from multiple networks capable of heavy mispredictions. For example, in Figure 19 EfficientDet mispredicts heavily when the drone is out of frame. However, those mispredictions have no effect on the intersection over union of all of the networks, which confidently does not show any detection.


Analysis of the second test

The second test is performed on the same video as the first test. The only parameter that is changed for this test is the confidence threshold, which is made stricter by increasing it from 2% to 20%.

Figure 19: Plot of accuracy relative to the frame number of the video from the first test. IoU is the intersection over union of the bounding boxes from all three neural networks. Total confidence is the combined prediction confidence of all of the networks (Equation 15).

Detection Type    YoloV5         MobileNet      EfficientDet   IoU
True Positive     183 (99.4%)    170 (89%)      184 (100%)     184 (100%)
True Negative     167 (95.4%)    167 (93.2%)    0 (0%)         175 (100%)
False Positive    9 (4.6%)       4 (2.2%)       130 (41.4%)    0 (0%)
False Negative    0 (0%)         7 (3.6%)       0 (0%)         0 (0%)

Table 4: Representations of all TP, FP, TN, and FN predictions. Confidence threshold set at 20%. Values for comparison with ground truth are the same as in Table 3. This table shows the total percentage of correctly identified detections and false detections relative to ground truth.

This test has shown that increasing the confidence threshold affects mispredictions in MobileNet but not in EfficientDet. Compared to the first test, the EfficientDet mispredictions, see Figure 20, have not changed at all due to the network's high confidence in those mispredictions. Increasing the confidence threshold has marginally increased the IoU accuracy. The IoU now adjusts itself much more quickly, as it only takes 1 frame to drop from a high confidence of 80% to 0% instead of 4. This result is shown more closely in Figure 20. The outcome implies that by raising the confidence threshold of all of the networks, it is possible to detect a TN more quickly while reducing FP detections.


Figure 20: Plot of the IoU at two different confidence thresholds converging towards a TN value.

9.2 Time Redundancy

The tests regarding time redundancy are performed in the same setting as in the previous Section 9.1 of this thesis. This test introduces a fourth instance of the previously implemented neural networks, which runs inference once more on a frame that inference has already been run on. Three tests were conducted in which every network got the opportunity to act as the time redundancy method. The images below were all taken on the same frame during each of the tests:

Figure 21: Example frames from the analyzed test video on which inference is done: (a) Yolo V5, (b) MobileNet V2, (c) EfficientDet D1, (d) Yolo V5 re-inference (confidence 74%), (e) MobileNet V2 re-inference (confidence 94.2%), (f) EfficientDet D1 re-inference (confidence 88.8%). Scale on the re-inference images is 1:1 compared to the original images.

From the above Figures 21a–21f, several differences emerge compared to the previously obtained results, which do not include re-inference. The main difference is that the bounding boxes produced by the NNs fit the drone more tightly than the "original" detections.


This improvement is mainly seen in YOLO V5s and MobileNet V2, while EfficientDet D1, in contrast, did not improve its bounding box accuracy. The improvements are most visible in correctly predicting the height of the object. During re-inference, YOLO produced a better bounding box prediction than MobileNet. It is vital to underscore that this improvement was seen consistently during the test and not only on the visualized frame 75. Another advantage of using re-inference as a time redundancy method is the overall improved reliability of the results. This method re-validates previous detections at the trade-off of being 1 frame behind the rest of the system. On average, the total accuracy improved to above 99%; see Figures 22, 23 and 24.
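A minimal sketch of the one-frame delay described above is shown below; the callables and the structure of the detection result are assumptions for illustration, not the actual implementation.

```python
from collections import deque

def delayed_reinference(frames, run_main, run_redundant, crop_roi):
    """Yield (main detections, re-inference confidence) for every frame.

    The three main networks process the current frame, while the redundant fourth
    network re-validates the region of interest cropped from the previous frame,
    so its result is always one frame behind the rest of the system.
    """
    pending = deque(maxlen=1)  # ROI waiting to be re-validated
    for frame in frames:
        detections = run_main(frame)                 # the three parallel networks
        re_conf = run_redundant(pending.popleft()) if pending else None
        pending.append(crop_roi(frame, detections))  # ROI for the next iteration
        yield detections, re_conf
```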

Figure 22: Graph showing the impact of recalculation on the total confidence of the system. Recalculation is done with an extra runtime of YOLO, which is 1 frame behind the main networks.

Figure 23: Graph showing the impact of recalculation on the total confidence of the system. Recalculation is done with an extra runtime of MobileNet, which is 1 frame behind the main networks.


Figure 24: Graph showing the impact of recalculation on the total confidence of the system. Recalculation is done with an extra runtime of EfficientDet, which is 1 frame behind the main networks.

On average, between frames 1 and 93, the overall accuracy has improved as follows:

Total Avg Confidence   NN                ACRI     PPD     PD
80.82%                 Yolo V5s          98.41%   17.59   21.80%
                       MobileNet V2      99.50%   18.68   23.10%
                       EfficientDet D1   99.33%   18.51   22.90%

Table 5: Table showing the total average confidence (TAC) from all three of the networks. Average Confidence Re-Inference (ACRI) provides the total confidence with one network doing re-inference. Percentage Point Difference (PPD) shows the difference in percentage points between total average confidence and ACRI. Percentage Difference (PD) shows the difference between ACRI and TAC as a percentage.

Compared to the tests done in Section 9.1, all of the above results are shown as the increase from the total confidence to the total re-inference confidence, expressed as a percentage.
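The two difference measures in Table 5 can be reproduced directly; for Yolo V5s, PPD = 98.41 − 80.82 = 17.59 percentage points and PD = 17.59 / 80.82 ≈ 21.8%. A minimal sketch (function names are illustrative):

```python
def percentage_point_difference(acri: float, tac: float) -> float:
    """PPD: difference in percentage points between ACRI and the total average confidence."""
    return acri - tac

def percentage_difference(acri: float, tac: float) -> float:
    """PD: difference between ACRI and TAC expressed as a percentage of TAC."""
    return (acri - tac) / tac * 100.0

print(percentage_point_difference(98.41, 80.82))  # ≈ 17.59
print(percentage_difference(98.41, 80.82))        # ≈ 21.8
```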


10. Discussion

This section discusses the performed work from different angles. It provides the reader with more insight into the results, thoughts about the chosen method, and a connection to the related work section.

10.1 Limitations

It is commonly known that running inference using very deep NNs requires high computational capability. Running multiple NNs in parallel in real time compounds this problem further. We have identified this as a hardware bottleneck. Based on this limitation, which could prove challenging to overcome, we had to prioritize NNs that are fast and efficient. This is the main reason why YOLO V5s, EfficientDet D1 and MobileNet V2 were chosen for this thesis, their diversity aside. Nevertheless, the selected networks are diverse in several aspects, such as the way they perform convolution and detection and how they differ from each other architecturally. These networks were deemed possible to implement as a parallel structure on a single graphics processing unit.

10.2 Methodology

The dataset contains around 50 000 labeled images of drones in XML VOC format. It includes drones shaped like airplanes as well as other drones in very unusual forms. To achieve better results, we chose to train our NNs on images of quadcopters, and images that did not depict quadcopters were removed. To gather the dataset needed, we manually selected 8000 training images. The selection helped us train the networks to detect only quadcopters and, to some degree, prevent other objects from being misdetected as drones. The same dataset was used to train all the networks.
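The selection of quadcopter images described above was done manually, but for completeness the sketch below shows how annotations in the XML VOC format can be inspected programmatically when filtering a dataset by class name; the class label, folder layout and file handling are assumptions for illustration only.

```python
import os
import xml.etree.ElementTree as ET

def contains_class(annotation_path: str, wanted: str = "drone") -> bool:
    """Return True if any <object><name> entry in a VOC XML file matches the wanted class."""
    root = ET.parse(annotation_path).getroot()
    return any(obj.findtext("name") == wanted for obj in root.iter("object"))

# kept = [f for f in os.listdir("annotations") if contains_class(os.path.join("annotations", f))]
```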

The main limitation of this dataset was that it almost exclusively contained drones depicted as quadcopters, meaning that running inference on anything that is not a quadcopter is not reliable. For this reason, we made a test video in which a quadcopter is seen airborne. To confirm the limitation, we ran inference on a video where a hexacopter was in the frame, and all networks had trouble detecting it with satisfactory accuracy. To better reflect real-world cases, where a drone can take any shape or form, the training dataset would need to be vastly more diverse; this poses major difficulties, such as the previously mentioned drones in the form of an airplane and, similarly, drones that look like helicopters. Distinguishing between those cases is necessary, and constructing a dataset for it is challenging and was not within the scope of this thesis.

We have constructed software that allows us to run parallel NNs on Linux-based desktop computers. The software is optimized to run as close to a real-time application as possible; however, this is currently limited by the inference time of the slowest network, which happens to be EfficientDet D1. The slower inference time does not negatively affect the results, which means that every frame can be analyzed. This is mainly because every frame of the test video has to be analyzed in order to produce reliable and accurate results. Skipping frames to achieve better performance is seen as unnecessary due to the low number of frames per second in the test video. Furthermore, skipping a frame, whether voluntarily or not, directly affects the results. By introducing this as a potential variable, separate results would have to be obtained and analyzed with that in mind. While those results could better reflect the real-world operation of a parallel NN architecture, it is not certain that they would; thus, it is better to provide clear semi-simulated results and show what kind of performance this method achieves. The main focus of this thesis is to show how diverse parallel CNNs affect the reliability of results and what kind of results can be achieved. As none of the authors had any previous experience with either Tensorflow or Pytorch, some parameters we had to optimize were only discovered in the later stages of implementation. One obstacle that was discovered was Tensorflow's usage of virtual random access memory. Tensorflow used all of the available virtual random access memory, meaning that the other networks could not allocate the memory they needed.
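The behaviour above likely relates to TensorFlow's default of reserving essentially all available memory for a single process. One common mitigation, which may or may not correspond to the exact workaround used in this work, is to enable memory growth on the visible GPUs:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front,
# leaving room for the other networks running on the same device.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```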

