
Faculty of Technology and Society
Computer Engineering

Bachelor Thesis

Dataset Evaluation Method for Vehicle Detection Using TensorFlow Object Detection API

Utvärderingsmetod för dataset inom fordonsigenkänning med användning av TensorFlow Object Detection API

Bojan Furundzic
Fabian Mathisson

Degree: Bachelor of Science in Engineering in Computer Science

Supervisor: Reza Malekian

Examiner: Majid Ashouri Mousaabadi


Acknowledgement

We are thankful for the help and support we have received from our supervisor Reza Malekian during this research. His comments and support have helped us tremendously. We would also like to thank our examiner Majid Ashouri Mousaabadi for his interest in our thesis and the comments and suggestions he has provided.

Last but not least, we would like to thank Magnus Krampell and Johan Holmgren for lending us the Axis IP camera that was used to produce our experimental results. Magnus Krampell’s comments and suggestions helped us a lot in the early stages of our research.


Abstract

Recent developments in the field of object detection have highlighted a significant variation in quality between visual datasets. As a result, there is a need for a standardized approach to validating visual dataset features and their contribution to performance. With a focus on vehicle detection, this thesis aims to develop an evaluation method for comparing visual datasets. This method was used to determine the dataset that contributed to the detection model with the greatest ability to detect vehicles. The visual datasets compared in this research were BDD100K, KITTI and Udacity, each used to train an individual model. Applying the developed evaluation method, a strong indication of BDD100K’s performance superiority was found. Further analysis and extraction of features such as dataset size, label distribution and average labels per image was conducted. In addition, real-world experiments were conducted in order to validate the developed evaluation method. All features and experimental results pointed to BDD100K’s superiority over the other datasets, validating the developed evaluation method. Furthermore, the TensorFlow Object Detection API’s ability to improve the performance gained from a visual dataset was studied. Through the use of augmentations, it was concluded that the TensorFlow Object Detection API serves as a great tool for increasing the performance gained from a visual dataset.


Sammanfattning

Within the field of object detection, recent developments have demonstrated large variations in quality between visual datasets. As a consequence, there is a need for standardized validation methods for comparing visual datasets and their performance. With a focus on vehicle detection, this thesis aims to develop a reliable validation method that can be used to compare visual datasets. This validation method was then used to determine the dataset that contributed to the system with the best ability to detect vehicles. The datasets used in this study were BDD100K, KITTI and Udacity, which were trained on individual detection models. By applying this validation method, it was determined that BDD100K was the dataset that contributed to the best performing detection system. An analysis of dataset size, label distribution and the average number of labels per image was also conducted. Together with an experiment carried out to test the models in real-world settings, it could be concluded that the validation method agreed with the established results. Finally, the ability of the TensorFlow Object Detection API to improve the performance obtained from a visual dataset was studied. Through the use of a modified dataset, it was established that the TensorFlow Object Detection API is a suitable modification tool that can be used to increase the performance of a visual dataset.


Contents

1 Introduction
  1.1 Background
  1.2 Research Questions
  1.3 Limitations

2 Theory
  2.1 Machine Learning
  2.2 Deep Learning
    2.2.1 Deep Neural Networks
    2.2.2 Supervised Learning
  2.3 Object Detection
  2.4 Transfer Learning
    2.4.1 Fine-Tuning
  2.5 TensorFlow Object Detection API
    2.5.1 TensorBoard
  2.6 Dataset
    2.6.1 Labeling
    2.6.2 Common Objects in Context (COCO)
    2.6.3 KITTI
    2.6.4 Berkeley DeepDrive - BDD100K
    2.6.5 Udacity
  2.7 Convolutional Neural Network - CNN
  2.8 Residual Network - ResNet
  2.9 Faster R-CNN
  2.10 Single-Shot Detection - SSD
  2.11 Parameters for Training a Model
    2.11.1 Mean Average Precision - mAP
    2.11.2 Learning Rate

3 Related Work
  3.1 Analysis Of the Influence of Training Data on Road User Detection
    3.1.1 Comments
  3.2 Object Detection Using Convolutional Neural Networks
    3.2.1 Comments
  3.3 A review: Comparison of Performance Metrics of Pretrained Models for Object Detection using the TensorFlow Framework
    3.3.1 Comments
  3.4 An Approach for Validating Quality of Datasets for Machine Learning
    3.4.1 Comments
  3.5 Comparison of Visual Datasets for Machine Learning
    3.5.1 Comments

4 Method
  4.1 Methodology of choice
  4.2 Construct a Conceptual Framework
  4.3 Develop a System Architecture
  4.4 Analyze and Design the System
  4.5 Build the (Prototype) System
  4.6 Observe and Evaluate the System

5 Results & Analysis
  5.1 Data Preparation
    5.1.1 KITTI
    5.1.2 Berkeley DeepDrive - BDD100K
    5.1.3 Udacity
  5.2 Evaluation Method
    5.2.1 Best Performing Dataset
    5.2.2 Features From the Best Performing Dataset
  5.3 Dataset Improvement Using the TensorFlow Object Detection API
  5.4 Experimental Results
    5.4.1 Axis IP Camera and Experimental Setup
    5.4.2 Experiment Conduction

6 Discussion
  6.1 Evaluation of Models
  6.2 Feature Extraction
  6.3 Data Augmentations
  6.4 Threats to Validity

7 Conclusion
  7.1 Future Work

References


List of Figures

1 Difference between image classification and object detection [23]
2 Inductive Transfer [26]
3 Region Proposal Network (RPN) [41]
4 SSD Layer Architecture [45]
5 The five stages of system development. Based on [55].
6 Speed comparison SSD and Faster R-CNN [53]
7 Label heatmaps
8 Average vehicle labels per image for the BDD100K, KITTI and Udacity datasets
9 Distance Variation
10 Vehicle occlusion and angle variety
11 Vehicle density and occlusion
12 Deployed model trained on KITTI
13 Deployed model trained on Udacity

List of Tables

1 Evaluation of model trained on BDD100K
2 Evaluation of model trained on KITTI
3 Evaluation of model trained on Udacity
4 Average mAP and standard deviation through all evaluations
5 Evaluation of Model Trained on the BDDv7 Dataset
6 Evaluation of Models Trained on Augmented BDD100K Dataset (mAP (%))
7 Average mAP and Standard Deviation of Augmented Dataset Evaluations


1 Introduction

This section describes the general area of this thesis and introduces research questions along with limitations.

1.1 Background

In recent years, there has been a significant increase in research on autonomous vehicle development. Autonomous vehicles are cognizant of their surroundings and have the ability to operate and navigate themselves between a starting point and a predetermined destination [1]. According to the standards organization Society of Automotive Engineers (SAE), driver systems are separated into six levels of autonomy, ranging from level 0 (no automation) to level 5 (full automation). Current development of such systems is approaching the third level of automation, meaning that the vehicle has full control, with the exception that a driver must be able to intervene at any time upon request [1].

An important aspect of autonomous vehicle development is ensuring the safety of drivers and pedestrians in traffic [2]. A study conducted by Aditi Ithal at the University of Bournemouth indicated promising results for the automotive industry and road safety [1]. The study investigated three categories of crashes, varying from high to low severity. In all three categories, autonomous vehicles experienced the lowest number of crashes per million miles driven. The study notes that autonomous vehicles have not been on the road for very long, but these early reports indicate that they do in fact provide higher safety for both vehicles and pedestrians [1].

However, vehicle detection in autonomous vehicles involves a rigorous and challenging development process. The task of simulating human vision and cognition has been a topic of discussion and research for numerous years [3]. Conventional object detection models include heuristic methods such as edge aggregation and template matching. These methods require expert knowledge and are uniquely engineered to detect predefined patterns [4]. In recent years, however, the explosion of data availability and computational power has led to the emergence of advanced object detection methods [3]. Deep learning is currently one of the leading approaches to accurate and fast object detection and has gained a substantial amount of traction amongst researchers and developers [5]. Due to its use of neural networks, deep learning based vehicle detection addresses many of the problems present in traditional detection methods [5].

Furthermore, training a model is an essential step in the development process of object detection systems using deep learning [6]. In the field of vehicle detection, large amounts of high quality data are important and have a great impact on the accuracy of the system [7]. However, datasets may be expensive and limited, and may contain duplicates, unlabeled data, noise and other anomalies. This may negatively impact the vehicle detection system and in turn produce inadequate results [7, 8]. Important features that determine the quality of a vehicle detection dataset include annotation quality, data amount, diversity of environments and label distribution [9, 10, 11]. A way of studying, comparing and determining important features of vehicle detection datasets has been proposed as an important topic of research [8]. In addition, constructing dataset evaluation methods in a standardized way would serve as a significant contribution to the field of vehicle detection development [8].


1.2 Research Questions

The aim of this thesis is to develop an evaluation method for visual datasets. This is done by studying and analyzing three diverse datasets that are used in the development of vehicle detection systems. The TensorFlow Object Detection API is used to develop, train and evaluate these systems. In addition, the datasets used in the training process are compared and analyzed in order to determine important features that impact generalized vehicle detection ability and accuracy. The following research questions are studied and evaluated.

RQ1: What method can be applied for determining the dataset that contributes to the highest performing vehicle detection system?

– RQ1.1: Which of the selected datasets contributed to the highest accuracy and generalization ability of a vehicle detection system?

– RQ1.2: What are some possible features from the dataset with the highest accuracy and generalization ability that contributed to its superiority over the other datasets?

– RQ1.3: How can the TensorFlow Object Detection API be used to improve the dataset that contributed to the model with the highest accuracy and generalization ability?

1.3 Limitations

• Object detection is limited to a front-view camera; rear and side camera angles are not the focus of this thesis.

• Only one object detection model is used in this thesis, namely the SSD ResNet 50 V1 FPN 640x640 (RetinaNet 50) model.

• The datasets BDD100K [12], KITTI [13] and Udacity [14] are the three datasets exclusively studied in this thesis. Nevertheless, methods applied for studying these datasets can be applied to other datasets.

• The system is not tested in an autonomous car.


2 Theory

This section describes some important and fundamental theoretical aspects related to this thesis. The aim of this section is to provide important information regarding the theory behind vehicle detection.

2.1 Machine Learning

Machine learning is a branch of Artificial Intelligence that has become a fundamental part of modern digital solutions [15]. The purpose of a machine learning algorithm is to learn over time, during its execution, through different kinds of pattern recognition methods. These algorithms are able to extract features from input data and use them to make decisions about data of a similar kind. Machine learning solutions are used in a large array of areas including robotics, traffic prediction and product recommendation [15, 16, 17].

2.2 Deep Learning

Traditional machine learning algorithms require careful, hand-engineered feature extractors, which in turn is time-intensive and prone to error [18]. However, modern solutions commonly make use of representation learning methods such as deep learning, which can be described as automatic feature extraction. These methods can contain multiple layers of extraction that are able to detect multilayered features without the help of a human engineer [18].

2.2.1 Deep Neural Networks

Deep Neural Networks (DNNs) are an extension of Artificial Neural Networks and are defined as networks with more than three layers. DNNs are designed to learn new tasks based only on raw data, which means they have the capability of learning without any hard-coding [19]. Because of this ability, DNNs are used in various application areas such as object detection and image classification [20]. Traditionally, linear functions are used when predicting object class labels in an input image, where weights are adjusted to more accurately output a correct label.

The following equation exemplifies a multinomial logistic classification function [20]:

Wx + B = y

where W represents the weights, x is the input, B is the bias and y represents the output, a one-dimensional array.

With DNNs, it is possible to use non-linear functions which can later be converted into linear functions. Using linear functions is more effective and stable. The following Rectified Linear Unit (ReLU) function y transforms non-linear functions into linear ones [20]:

y = 0, if x < 0
y = x, if x ≥ 0
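To make this concrete, a minimal NumPy sketch of the linear scoring function and the ReLU above; the array sizes are arbitrary assumptions for the example:

    import numpy as np

    def relu(x):
        # y = 0 where x < 0, y = x where x >= 0
        return np.maximum(0, x)

    W = np.random.randn(3, 4)   # weights for 3 classes and 4 input features (assumed sizes)
    B = np.zeros(3)             # bias
    x = np.random.randn(4)      # input vector

    y = W @ x + B               # linear scores: a one-dimensional array
    print(relu(y))              # element-wise ReLU of the scores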


2.2.2 Supervised Learning

Supervised learning is one of the most commonly used forms of deep learning. During the development of a system that detects objects in a given image, an algorithm is used to learn the mapping function that links the output to a given input [18]. With supervised learning, a model is trained using a dataset where the correct answers are known in advance. When the model tries to predict the answer but fails, the mapping function is corrected by adjusting weight vectors in a way that reduces the error loss. The training is complete when the system has reached a desirable level of performance, meaning that it predicts the output of a given input with a certain level of accuracy [21].
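As a minimal sketch of this correction loop, the following NumPy example fits a weight vector to a toy labeled dataset by repeatedly reducing a squared-error loss; the data, model and learning rate are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))                  # inputs
    y_true = X @ np.array([1.0, -2.0, 0.5, 3.0])   # correct answers, known in advance

    w = np.zeros(4)          # weight vector to be adjusted
    learning_rate = 0.1

    for _ in range(200):
        y_pred = X @ w                            # model predicts the answer
        grad = X.T @ (y_pred - y_true) / len(X)   # gradient of the mean squared error
        w -= learning_rate * grad                 # adjust weights to reduce the loss

    print(w)   # approaches the known answer vector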

2.3 Object Detection

Object detection is a computer vision technique that is used to identify objects from images and videos. Object detection is beneficial and used in several application areas. One heavily discussed and researched topic is its use in autonomous vehicles.

In state-of-the-art object detection models, deep learning based solutions using convolutional neural networks (CNNs), referenced in section 2.7, are used to correctly identify objects in images and videos. These concepts are explained in detail in upcoming parts of this thesis [22].

Figure 1 illustrates the difference between object detection and image classification. In cases where multiple objects ought to be localized and categorized, object detection methods are used. However, if only a single object in an image is to be categorized, image classification methods are used [22].

Figure 1: Difference between image classification and object detection [23]

2.4 Transfer Learning

In the real world, knowledge gathered from other domains is used and applied to learn, or to further deepen knowledge, in a new domain; this is the basis of transfer learning. Transfer learning is a commonly used machine learning technique in which a model that is pre-trained on one task is utilized to learn a new, related task [24].

Object detection models that are trained on large and diverse datasets can be further trained on more specific data and in return specialize their detection capabilities on distinct class sets. Traditionally, deep learning and machine learning algorithms have been designed to work in isolation and are trained to solve specific tasks. Transfer learning utilizes previous knowledge from existing systems and uses this knowledge to develop new task-solving capabilities [25]. In the field of deep learning, transfer learning exploits the inductive biases prevalent in deep learning algorithms, for example by adjusting the inductive bias to narrow down the hypothesis space and guide the search process. Figure 2 illustrates this process.

Figure 2: Inductive Transfer [26]

2.4.1 Fine-Tuning

Fine-tuning is a subcategory of transfer learning and is used to specialize a pre-trained model into performing new tasks [27]. Fine adjustments are made to the various weights of a model with the goal of improving performance. In machine learning, fine-tuning is often done as a last step and involves tweaking model hyperparameters, which helps with the fine adjustment of weights [27].

2.5 TensorFlow Object Detection API

The TensorFlow Object Detection API [28] is an open source framework based on TensorFlow [29]. Utilizing this API simplifies the training, evaluation and deployment process of object detection models [28]. TensorFlow can be used in a plethora of applications but has a focus on the training and analysis of deep neural networks [29]. It is built for major platforms like macOS, Windows, Linux, Android and iOS and has support for easy deployment on CPUs, GPUs, TPUs and clusters of computational devices [29]. TensorFlow computations are based on tensors, which can be defined as multi-dimensional typed arrays, and are represented as graphs: mathematical operations are carried out at the nodes, while the tensors flow along the edges between them [29].

2.5.1 TensorBoard

TensorBoard is a visualization tool used with TensorFlow to inspect and analyze graphed representations of how TensorFlow runs [30]. The visualizations include metrics such as mean Average Precision, training loss and evaluation loss. With TensorBoard, it is possible to evaluate the performance of a model over time, which makes it easier to analyze a model throughout its development [30].

2.6 Dataset

A dataset can be defined as a collection of related data represented in columns and rows [31]. If D = {X, Y} is a dataset, it can be broken down into pairs of input data (x, y), where x ∈ X is a vector containing features from the dataset D and y ∈ Y is the value correlated with these features [31].

In the field of object detection, datasets that are appropriate for and fit the recognition task at hand are important [8, 32]. When developing an object detection system, datasets are a necessary component for training as well as evaluating the developed object detection model. At the time of writing, there are recurring problems with datasets aimed at vehicle detection. These problems impact models in a negative way and result in poorly performing detection systems [8]. Some of these problematic features include mislabeled objects, lack of variety in vehicle viewpoints and small amounts of background variety [8]. A model trained on data with these features may perform well on evaluation data similar to the data used in training, but perform significantly worse on other types of evaluation data. This can be seen as a misrepresentation of the actual performance of the detection system and further exemplifies the importance of datasets and the quality of their features [8].

2.6.1 Labeling

Labeling is used to highlight data to make it readable for machines [33]. In the field of object detection, labeling is the process of drawing bounding boxes around the objects to detect. However, there is a significant difference between labeling and detecting objects in an image. Labeling is used in the training and evaluation stages when developing an object detection model, while detection is the resulting ability of the system to detect objects in unlabeled data [33]. Datasets are often labeled with different classes, depending on the domain that the dataset is to be deployed in [34]. In the field of object detection for autonomous vehicles, classes that often occur are ’car’, ’pedestrian’ and other traffic-related classes. The labeling process of datasets can be conducted manually by humans, or using some form of automation [33]. While these processes differ in approach, they all produce labeled data in the form of bounding boxes. These bounding boxes can be defined in different ways; most commonly they follow a common format defined by the following vector:

label = (x, y, w, h) ∈ R^4

where (x, y) is the position of the bounding box and (w, h) signify the width and height of the bounding box [33]. During the labeling process, the data collected and labeled may include errors, noise or other anomalies. This mislabeling of data can have a significant impact on the resulting performance of an object detection system [7].
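For illustration, a small sketch of this label vector in code, together with a conversion to the corner format (xmin, ymin, xmax, ymax) used by many tools; treating (x, y) as the top-left corner is an assumption of this sketch, since the convention varies between datasets:

    from dataclasses import dataclass

    @dataclass
    class BoxLabel:
        x: float   # horizontal position of the box (here assumed: left edge)
        y: float   # vertical position of the box (here assumed: top edge)
        w: float   # width of the box
        h: float   # height of the box

        def to_corners(self):
            # Convert (x, y, w, h) into (xmin, ymin, xmax, ymax).
            return (self.x, self.y, self.x + self.w, self.y + self.h)

    label = BoxLabel(x=120.0, y=80.0, w=64.0, h=48.0)
    print(label.to_corners())   # (120.0, 80.0, 184.0, 128.0)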

2.6.2 Common Objects in Context (COCO)

COCO is a large-scale dataset that includes a total of 328,000 images containing 2,500,000 annotations across a large variety of classes [35]. The creation of the dataset originates from three major problems in scene understanding: detecting objects that are not the focus of an image, precise 2D bounding box annotation and the importance of context between objects in an image [35]. COCO aims to include a majority of non-iconic images and objects that are captured in a natural context.

The COCO dataset is widely used for object detection purposes and is well suited for models that require great generalization. It is the dataset used for pre-training models in the TensorFlow 2 Detection Model Zoo [36].

2.6.3 KITTI

The KITTI dataset was recorded in Karlsruhe, Germany with a plethora of cameras and sensors. The dataset is diverse, with 6 hours of recordings in different traffic scenarios and environments, and contains 2D bounding boxes [13]. KITTI has a total of eight object classes: ’Car’, ’Pedestrian’, ’Van’, ’Truck’, ’Person (sitting)’, ’Cyclist’, ’Tram’ and ’Misc’. The dataset has a heavy bias towards the ’Car’ and ’Pedestrian’ classes, which displays the dataset’s focus on autonomous vehicles [13]. The KITTI dataset also provides benchmarks for several computer vision tasks, such as 2D object detection and optical flow, through the dataset website [13]. The KITTI dataset contains approximately 52,000 labeled objects, where the ’Car’ class accounts for circa 50% of these [37].

2.6.4 Berkeley DeepDrive - BDD100K

BDD100K is a large-scale dataset aimed at overcoming the limitations of existing datasets, which may include constraints in scene variation, annotation richness and geographical spread [12]. The dataset contains 100,000 diverse video clips of realistic visual driving scenes obtained from 50,000 dashcam sessions, mainly in the US. Images are labeled at the 10th second of every clip, resulting in a total of 100,000 labeled images split across training (70%), evaluation (10%) and testing (20%) data. As well as containing diverse scene types such as city and residential areas, BDD100K is recorded in a variety of extreme weather conditions and times of day. A total of ten different classes are annotated: ’Car’, ’Sign’, ’Light’, ’Person’, ’Truck’, ’Bus’, ’Bike’, ’Rider’, ’Motor’ and ’Train’ [12].

2.6.5 Udacity

The Udacity dataset is an open-source dataset with a focus on autonomous vehicles. Udacity is composed of 15,000 frames and contains over 65,000 labeled objects [14]. The dataset was recorded in Mountain View, California and contains images of vehicles in a variety of lighting conditions. Udacity has a total of four classes: ’Car’, ’Truck’, ’Pedestrian’ and ’Traffic light’ [14]. The objects in the dataset are labeled with 2D bounding boxes and were manually annotated by humans [14].

2.7 Convolutional Neural Network - CNN

A Convolutional Neural Network (CNN) is a specialized type of deep neural network which encompasses many different deep learning algorithms. CNNs are widely used for feature extraction in areas such as object detection and image classification. CNNs have the advantage of being able to convolve over large datasets at high speed as part of the training process [38].

CNNs are constructed using three different layers which include the following [39]:

• The Convolutional Layer consists of relatively small and learnable filters that are used to convolve small areas of the layer’s input. This process is done by periodically applying the dot product of the input and the filter until a feature map is generated. Feature maps serve as the output, show activations found in different parts of the input vector and are in turn used to extract features present in the input.

• The Pooling Layer has the objective of reducing the dimensions of the visual representation. This results in a minimized number of parameters and also reduces the complexity of the network.

• The Fully-Connected Layer collects the resulting data from the convolutional and pooling layers, analyzes the data from the layers independently and makes the final classification decision.


2.8 Residual Network - ResNet

ResNet is a particular type of CNN that addresses the degradation problem that may occur when a model is expanded with deeper layers. A deep network makes it difficult for the layers to propagate information from previous layers, resulting in a rapid decrease in accuracy. The creators [40] tackle this problem by adding "shortcut connections" between layers on different levels, allowing information to skip certain layers, as opposed to a linear flow. This method has been shown to provide accuracy gains from added deep layers, bypassing the degradation problem without adding complexity.
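A minimal Keras sketch of such a shortcut connection; the layer sizes are assumptions for illustration and do not reproduce the exact ResNet-50 architecture:

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters=64):
        # Main path: two stacked convolutions.
        y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        y = layers.Conv2D(filters, 3, padding='same')(y)
        # Shortcut connection: the input skips the convolutions and is added back,
        # letting information bypass layers instead of flowing strictly linearly.
        return layers.ReLU()(layers.Add()([x, y]))

    inputs = tf.keras.Input(shape=(64, 64, 64))
    model = tf.keras.Model(inputs, residual_block(inputs))
    model.summary()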

2.9 Faster R-CNN

Faster R-CNN is an object detection system consisting of two modules. The first module is a Region Proposal Network (RPN) and the second is a Fast R-CNN object detector [41]. These two modules unify into a two-stage object detection system. The RPN takes an image as input and outputs region proposals for the Fast R-CNN module, which makes the final object detection. The RPN is trained to produce the region proposals without the need for external methods such as Selective Search [42]. During the generation of the region proposals, the RPN slides a small network over the convolutional feature map output by the last shared layer. At the center of the sliding window an anchor is located, which has the role of predefining bounding boxes for the RPN. These anchor boxes are often depicted with different ratios and sizes, in order to fit the general shape and size of the various object classes defined in the dataset that the model is trained on [42]. This is illustrated in Figure 3. Using Regions Of Interest (ROI), the Fast R-CNN obtains feature vectors from the cropped image regions containing each object [41]. The vectors are sent to two connected layers: one that produces the probability estimates over the classes (the box-classification layer) and one that outputs four real numbers, the box coordinates, for each class (the box-regression layer) [42, 43].

Figure 3: Region Proposal Network (RPN) [41]


2.10 Single-Shot Detection - SSD

Unlike the Faster R-CNN system, which uses a two-stage detector architecture, the SSD system performs detections using a single-stage approach. This means that the region proposal stage is eliminated and that detection is performed in a single pass through the network. The first layers of the system consist of a network used for feature extraction. This is referred to as the base network and may be a standard backbone network such as the previously described ResNet or VGG [44].

The last classification layers of the base network are truncated and replaced by multi-scale feature maps [45]. Instead of using the feature map from the last layer, as in the case of Faster R-CNN, a multi-scale feature map combines multiple feature maps of different sizes from previous layers and can increase the detection accuracy for smaller objects [46]. Figure 4 shows how the layers progressively decrease in size deeper into the network. In comparison to the large layers at the beginning of the network, the smaller layers are able to construct abstract representations and are useful for detecting larger objects. An important concept in the SSD architecture is the use of default boxes. These are similar to the anchor boxes described for Faster R-CNN but are applied to multiple feature maps of various sizes [45].

Figure 4: SSD Layer Architecture [45]

2.11 Parameters for Training a Model

2.11.1 Mean Average Precision - mAP

Mean Average Precision (mAP) is one of the most commonly used evaluation metrics for object detection models [47]. mAP is based on Average Precision (AP), which in turn is calculated from the precision and recall of a model. These are defined by the following formulas [47]:

Precision = TP / (TP + FP)    (2.1)

Recall = TP / (TP + FN)    (2.2)

Precision in equation 2.1 is defined as the ability of an object detection model to identify objects; it is represented in percent. Recall in equation 2.2 is defined as the ability of an object detection model to identify all objects in the sample data; this is also represented in percent. To determine TP (True Positive), FP (False Positive) and FN (False Negative), and to define when a correct or incorrect detection is made, intersection over union (IoU) is needed. IoU measures the overlap between two bounding boxes, in order to determine how much of the detected object boundaries overlap with the true boundaries set in the dataset. A threshold is also predefined to determine whether the object was actually detected; this threshold is often 50% [47].

IoU = Area of Overlap / Area of Union    (2.3)

If the object detection system detects an object within the IoU threshold it is counted as a True Positive, otherwise it is counted as a False Positive. False Negatives occur when an object detection system fails to detect an object that is present in an image. AP is used to calculate the mAP of an object detection model, which is the average of the AP values for all classes of objects that the model is designed to detect, given by equation 2.4 [47], where N represents the total number of classes and AP_i is the average precision for class i.

mAP = (1/N) Σ_{i=1}^{N} AP_i    (2.4)
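To make equations 2.1-2.4 concrete, a small sketch that computes IoU for two corner-format boxes and applies the 50% threshold described above; the function and box values are invented for the example:

    def iou(box_a, box_b):
        # Boxes given as (xmin, ymin, xmax, ymax).
        ix1 = max(box_a[0], box_b[0])
        iy1 = max(box_a[1], box_b[1])
        ix2 = min(box_a[2], box_b[2])
        iy2 = min(box_a[3], box_b[3])
        overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)     # area of overlap
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - overlap                       # area of union
        return overlap / union if union > 0 else 0.0

    detection = (100, 100, 200, 200)      # predicted box
    ground_truth = (110, 110, 210, 210)   # labeled box from the dataset
    # IoU above the 50% threshold counts as a True Positive, otherwise a False Positive.
    print(iou(detection, ground_truth) >= 0.5)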

2.11.2 Learning Rate

Learning rate, in the field of deep learning, is a hyperparameter used in the process of training a deep neural network. The learning rate is the parameter that controls the size of the weight adjustments made during the training process in response to the estimated error [48]. Because of this, the learning rate is one of the most important parameters and has a tremendous impact on the final fully trained model [49]. Learning rates can also be implemented using some form of decay, which means that the learning rate starts at a certain value and is slowly reduced during the training process [48, 49]. This has been shown to improve the performance of object detection models [50].
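A sketch of such a decaying learning rate using Keras' built-in cosine decay schedule; the base rate and step count here are assumptions chosen to mirror the pipeline configuration discussed later in section 4.4:

    import tensorflow as tf

    # The learning rate starts at a base value and is slowly reduced along a cosine curve.
    schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=0.04,   # assumed base learning rate
        decay_steps=80000)            # assumed total number of training steps

    optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
    print([float(schedule(s)) for s in (0, 40000, 79999)])   # decaying values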


3 Related Work

In this section of the thesis, a number of related papers are examined and reviewed.

3.1 Analysis Of the Influence of Training Data on Road User Detection

In 2018, C. Guindel et al. [51] presented an analysis of the influence of training data on road user detection. Using deep learning solutions, the authors conduct several experiments utilizing two relatively small datasets in an area that commonly requires large amounts of training data. These experiments involve manipulating and combining data in various ways with the purpose of improving detection capabilities when dealing with limited amounts of data. In the first experiment, the authors studied the difference in accuracy using several combination techniques. First, the datasets were combined into a single dataset and the model was trained using the Faster R-CNN architecture. In this case, the model was trained from scratch and was able to randomly select any image from the mix of datasets. The result of this combination pointed to an increase in detection accuracy. The second combination utilized transfer learning, meaning that the model was pre-trained on one of the limited datasets and then transferred over and further trained on the other. The results of this combination reached a higher detection accuracy than using the datasets separately, but a lower accuracy than combining the data into a single dataset. Finally, the weights of a model pre-trained on a large and generalized dataset were transferred and further trained using one of the limited datasets. The results showed that pre-training a model using a large dataset can greatly increase the model’s generalization ability and detection accuracy.

An experiment was also conducted by augmenting the datasets in four different ways, in order to evaluate whether augmentation would have an effect on detection accuracy. These four augmentations were the following:

• The dataset was transformed by adding random values between -40 and 40 to each pixel in an image.

• The dataset was transformed by multiplying each pixel in an image by a random factor from the range [0.5, 1.5].

• The dataset was transformed by adding a small jitter to each pixel in an image, using a Gaussian distribution with mean 0 and standard deviation between 0 and 5.1.

• The dataset was transformed by adding a random value between -20 and 20 to the saturation and hue channels of each image.

The augmentations were applied in a random manner to all images in the dataset. Subsequently, an experiment was conducted to compare performance between a baseline dataset, an augmented dataset and an augmented combined dataset. Augmentation of the baseline dataset resulted in decreased detection accuracy compared to when no augmentation was done. Conducting the same experiment on the combined dataset resulted in an increase in detection accuracy.


3.1.1 Comments

Guindel’s experiments provide relevant results for this thesis. First of all, combination techniques are advantageous due to the limited amount of data present in several vehicle and pedestrian datasets. Secondly, transfer learning using a generalized dataset is a powerful and time-efficient way of increasing detection accuracy. This method is an essential part of the development of the vehicle detection system and will possibly make the training process faster and the results superior. Thirdly, the results regarding data augmentation have inspired new approaches to increasing the model’s generalizability and, as a result, its detection accuracy. This is used in the thesis as an insight into practical augmentations related to RQ1.3, as well as the use of transfer learning.

3.2 Object Detection Using Convolutional Neural Networks

In 2018, R. Galvez et al. [52] presented a comparison between two state-of-the-art object detection models using CNNs. The TensorFlow Object Detection API is used to implement the two models, with training being done on a relatively small dataset consisting of a total of 444 images.

SSD with MobileNetV1 was the first TensorFlow model used for training, with 30,113 steps conducted during training. Using TensorBoard, the training results showed a maximum loss of 17.63 and a minimum loss of 1.17. The model had relatively low accuracy rates, but because of its speed the researchers argued that it could be used in real-time applications.

The second model used in the comparison was Faster R-CNN with InceptionV2, with 68,122 steps conducted during training. Using TensorBoard, the maximum and minimum losses were recorded to be 2.39 and 0.01 respectively. Faster R-CNN performed with high accuracy rates compared to the SSD model, but was slower. The researchers concluded that there is a significant trade-off between speed and accuracy, and stated that the SSD model was good for real-time applications while Faster R-CNN was better when high accuracy is desired.

3.2.1 Comments

This paper is useful for the research in this thesis because of its use of the TensorFlow Object Detection API. The paper illustrated the use of the API while also comparing two relevant models.

3.3 A review: Comparison of Performance Metrics of Pretrained Models for Object Detection using the TensorFlow Framework

In 2020, S. A. Sanchez et al. [53] presented a review comparing state-of-the-art object detection models and their performance when trained on different datasets, using the TensorFlow framework. The review was conducted by comparing evaluations of different object detection models across a number of other research papers, and by displaying the approach used to determine the best performing object detection system.

The datasets used in the comparison were the COCO dataset and the PASCAL VOC 2007 and 2012 datasets [54]. The researchers referenced a comparison conducted by Shaoqing Ren, Kaiming He et al., in which a Faster R-CNN model with different object detection methods was trained using the datasets in different combinations. Using the mAP metric, the researchers determined that the Faster R-CNN model based on the RPN 300 + VGG architecture had the best performance, with an mAP of 75.9%. In the review, the researchers also analyzed the SSD method for different input sizes, based on research conducted by Wei Liu et al. SSD using an input size of 512x512, trained on a combination of the VOC 2007, VOC 2012 and COCO datasets, performed the best, with an mAP of 83.2%.

The researchers also presented the differences between methods such as SSD, Faster R-CNN and YOLO in order to compare their advantages and disadvantages. Lastly, the researchers state that the differences between object detectors are shrinking. Single-shot detectors such as SSD and YOLO are more complex than before, but remain effective at maintaining speed. The researchers argue, based on the training and evaluation conducted by other researchers, that the SSD and YOLO methods work well for applications where speed is one of the main requirements. The single-shot detectors are, according to the researchers, less accurate than the Faster R-CNN models, but in cases where objects are large they manage to be more accurate than Faster R-CNN and CNN based models.

3.3.1 Comments

These results are relevant to this thesis because they argue for the use of a single-shot detection model in applications where speed is a requirement. It is often in real-time environments that vehicle detection is a great necessity, which further argues for this approach. Furthermore, the approach of using mAP combined with comparing the same model trained on different datasets is relevant to RQ1 and can be used as guidance when exploring relevant methods for this purpose.

3.4 An Approach for Validating Quality of Datasets for Machine Learning

In 2018, J. Ding et al. [7] studied how the quality of datasets impacts the accuracy of machine learning models. The researchers exemplify this impact, and also develop a testing technique to validate a machine learning system and its dataset.

The researchers evaluate the quality of different datasets using three properties of each newly produced dataset: fidelity, variety and veracity. The fidelity property means that a newly produced data item is valid and contains relevant information about the non-augmented data. The variety property means that each newly produced data item is unique and not a direct copy of already existing data. Lastly, the veracity property means that the performance of a machine learning system trained on the augmented form of the collected data should be high.

The tests were done using Metamorphic Relations (MRs). During testing, nine MRs were applied to test the impact of the training dataset on a machine learning model. These nine MRs varied in how they augmented the dataset, from cropping an image to adding 10% more images to the training set. From the tests, the researchers concluded that dataset quality had a significant impact on machine learning accuracy. According to the researchers, most datasets had common problems, mainly duplicate data or similar images, which greatly impacted the performance of the system. The researchers claimed that these problems were hard to identify and study using traditional machine learning validation methods. They argued that the validation method they proposed adequately tested the system and its dataset in an effective way.

3.4.1 Comments

The results presented by J. Ding et al. indicate that dataset quality does in fact have a significant impact on the accuracy of an object detection system. This is relevant to the research conducted in this thesis because of its comparison of datasets and its determination of which features of a dataset contribute to its superiority over other datasets, which relates to RQ1.2.

3.5 Comparison of Visual Datasets for Machine Learning

K. Gauen et al. [11] emphasize the importance of, and the difficulties in, acquiring high quality labeled image data. The researchers bring forward the issues of image selection bias and labeling error present in many existing visual datasets, focusing on the "person" class. To investigate the differences between labeling approaches used in well-known visual datasets, the researchers propose two important quantitative comparison methods.

First of all, a solution comparing the label distribution of each dataset is put forward. By looking at all individual labels marked with the "person" class, the researchers are able to construct density maps showing the concentration as well as the location variety of labels across datasets. Secondly, the researchers present a solution that visualizes the size of labels relative to the image size.

Through examination of seven popular visual datasets using the above methods, K. Gauen et al. highlighted interesting results. Many of the examined datasets contain a centered label distribution. Furthermore, the researchers concluded that 70% of the total bounding boxes in two of the datasets covered more than 10% of the total image size, which would indicate large object sizes. Most of the datasets included bounding boxes that took up less than 10% of the total image, which would indicate that many small objects are prevalent in these datasets.

3.5.1 Comments

The approach used in this paper is relevant to this thesis because it illustrates two methods that can be used for analyzing visual datasets, which relates to RQ1.2. Firstly, the label density map can be utilized as a tool to determine the distribution of labels in a visual dataset. Secondly, it can be used to determine whether a dataset contains labels that are heavily concentrated around a specific area, which could indicate that objects are labeled with a bias towards specific camera angles.


4 Method

The aim of this section is to state the methodology used in this thesis and its relevancy. Each step in the chosen methodology is analyzed and broken down with the purpose of giving insight into the development steps.

4.1 Methodology of choice

In this thesis, the "Systems Development in Information Systems Research" methodology outlined by Nunamaker et al. [55] is chosen as the research methodology. The methodology is composed of a five-stage system development approach and is illustrated in Figure 5. Because of the method’s focus on gathering domain knowledge, as well as its ability to be used in an iterative approach, it is suitable for this thesis. The methodology can be used to construct a system from the ground up, with appropriate steps for research and planning of the system’s architecture and features. An iterative process is good for this thesis because it is a helpful tool for breaking down a system into smaller components. These components can be implemented separately, tested and analyzed before continuing further development of the system.

Figure 5: The five stages of system development. Based on [55].

4.2 Construct a Conceptual Framework

When constructing a conceptual framework, relevant research and literature studies are performed to develop knowledge in the area and problem domain. With the knowledge gained from the literature studies and research, appropriate research questions are formulated in the chosen problem domain.

4.3 Develop a System Architecture

This section describes the system architecture of the vehicle detection system used to achieve the stated objectives.

TensorFlow is one of the dominating machine learning frameworks [56]. The framework provides an extensive amount of documentation and tutorials, which is helpful when training, evaluating and deploying object detection models. In addition to its support for the user-friendly Object Detection API [28], TensorFlow provides a large library of models and tools such as TensorBoard. These tools can be used to monitor the training and evaluation process, while also providing easy access to metrics such as mAP [30, 28]. These advantages led to the conclusion that TensorFlow is a suitable framework for this thesis.

The first step involves researching and selecting appropriate datasets used for the training phase. These datasets include annotations for vehicles in various forms.


Thereafter, the selected datasets are reformatted and generalized for an easier observation and evaluation process. The reformatting process includes modifying annotations into a common vehicle class, as well as filtering out features that are irrelevant for this thesis. The vehicle class covers objects such as cars, trucks and vans. In addition, the datasets are converted into the TFRecord format [28], a binary format utilized by the TensorFlow Object Detection API. The reformatted dataset is used as part of the training phase of the appropriately selected model. The stage of choosing an object detection model involves researching and studying well-known and well-documented models that are suitable in a real-time setting. These models are part of the TensorFlow Object Detection API and have been pre-trained on large generalized datasets such as COCO. Using fine-tuning, the model is further trained on the selected dataset to specialize in vehicle detection. The output data from the vehicle detection systems is collected and evaluated to answer the relevant research questions.

Relevant output data used during the evaluation is mostly the mAP of the respective models trained on the three datasets. Key features from each dataset, such as label distribution, average labels per image and dataset size, are also collected.
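To make the TFRecord conversion mentioned above concrete, a hedged sketch of building a single training example; the feature keys follow the conventions used by the TensorFlow Object Detection API's dataset tools, while the helper name and the single 'vehicle' class mapping are assumptions of this sketch:

    import tensorflow as tf

    def make_tf_example(encoded_jpeg, width, height, boxes):
        # boxes: list of (xmin, ymin, xmax, ymax), normalized to [0, 1].
        feature = {
            'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
            'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'jpeg'])),
            'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
            'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
            'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=[b[0] for b in boxes])),
            'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=[b[1] for b in boxes])),
            'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=[b[2] for b in boxes])),
            'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=[b[3] for b in boxes])),
            # Every annotation is mapped to the single common 'vehicle' class.
            'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'vehicle'] * len(boxes))),
            'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[1] * len(boxes))),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))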

4.4 Analyze and Design the System

The TensorFlow 2 Detection Model Zoo [36] includes a wide variety of models pre-trained on the COCO dataset. As well as containing pre-trained weights that allow for transfer learning, these come with a pipeline configuration file allowing the user to define the training procedure. This file contains important configurable parameters that are essential for producing a vehicle detection system with great accuracy. To fairly compare each dataset’s impact on vehicle detection performance, their respective pipeline configurations are identical, with the exception of the training and evaluation data paths. The initial pipeline configuration values are gathered from the model download in the TensorFlow 2 Detection Model Zoo [36], and the following configurations are analyzed:

model: The selected model is listed here, as well as important parameters specific to its architecture, such as the meta-architecture and feature extractor. There are a large number of adjustable parameters which are largely dependent on the application; however, a necessary parameter is num_classes, which defines the number of classes that the model is able to detect. Since this system focuses on vehicle detection only, this is set to 1.

train_config: Defines the training parameters, such as SGD and input pre-processing. A modified parameter here is batch_size, which describes the number of training images that are propagated through the network. Based on previous testing, this is set to 4.

eval_config: Parameters that decide the evaluation methods used to estimate a model’s performance.

train_input_reader: Defines the dataset that the model should be trained on. This parameter will be adjusted for every new dataset that is used.

eval_input_reader: Defines the dataset that the model should be evaluated on. This parameter will be adjusted for every new dataset that is used.

Many important parameters are included under the train_config clause. The parameters that are not mentioned below are kept at their defaults for all datasets.


learning_rate: The learning rate of the SSD model is kept at its default value from the Model Zoo download. This implies a warmup learning rate of 0.013333 during the first 2000 steps and thereafter a base learning rate of 0.039999. The learning rate slowly decays throughout the training process, governed by a cosine-based decay factor.

num_steps: Represents the number of steps the training process will take before finishing and, as a result, determines the length of the training process. During each step, a number of images determined by batch_size is processed and the model’s parameters are updated. This parameter is set to 80,000 for all datasets.

batch_size: Represents the number of images that are propagated through the network for each network update. This parameter is set to 4 because of the limited amount of memory available on the GPUs used for this research.

fine_tune_checkpoint_type: This parameter is set to ’detection’ in order to configure the model for object detection.

use_bfloat16: This parameter is set to false because it is not supported with the GPUs used for this research.

data_augmentation_options: The augmentation options are removed when comparing the datasets with each other, in order to fairly evaluate each dataset using its raw data. However, data augmentations are added during further testing to answer RQ1.3.
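The following sketch shows how the above settings could be applied programmatically with the API's configuration utilities; the paths are placeholders, and the exact proto field access (for example configs['model'].ssd) is an assumption that depends on the chosen model type:

    from object_detection.utils import config_util

    # Load the pipeline.config shipped with the Model Zoo download (placeholder path).
    configs = config_util.get_configs_from_pipeline_file('pipeline.config')

    configs['model'].ssd.num_classes = 1                       # vehicle class only
    configs['train_config'].batch_size = 4                     # limited GPU memory
    configs['train_config'].fine_tune_checkpoint_type = 'detection'
    configs['train_config'].use_bfloat16 = False               # not supported on the GPUs used
    del configs['train_config'].data_augmentation_options[:]   # raw-data comparison

    # Swap the training/evaluation data paths for each dataset under comparison.
    configs['train_input_config'].tf_record_input_reader.input_path[:] = ['bdd100k_train.tfrecord']
    configs['eval_input_config'].tf_record_input_reader.input_path[:] = ['bdd100k_eval.tfrecord']

    pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
    config_util.save_pipeline_config(pipeline_proto, 'models/bdd100k_ssd')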

4.5 Build the (Prototype) System

Building and implementing the system involves several important steps. After a dataset and model were selected, training of the selected model commenced using fine-tuning. Before, during and after training of the model, TensorBoard is used to test and validate the model to make sure that it detects relevant objects in an image. TensorBoard provides a suite of helpful tools used to visualize models trained using TensorFlow.

There are several important evaluation metrics that are collected from TensorBoard. During training, the model saves event files that include metrics of the model: the accuracy of the model, the loss during training and evaluation, and examples of images in which the model detected vehicles. With the collected mAP metric, a comparison is made between the different datasets to determine which one produced the better performing model. The information gathered from TensorBoard, in combination with features gathered from the datasets such as label distribution, average labels per image and dataset size, is used in a method, explained in further detail in section 4.6, that determines which dataset contributed to the highest performing vehicle detection system. Determination of the best performing dataset also gives insight into which dataset contributed to the worst performing system. To achieve a better performing vehicle detection system using the same dataset, the best performing model setup is modified using the TensorFlow Object Detection API. Evaluation of the modified dataset is done in the aforementioned way and is compared with the dataset’s previous impact on the vehicle detection system. It should be noted that, when initially comparing the datasets, no changes are made to the sizing, color or other attributes of each dataset.


During the development process, it is observed that choosing an appropriate object detection model architecture is a crucial step when utilizing fine-tuning techniques. SSD and Faster R-CNN are the main architectures considered when testing and evaluating datasets. Both of these architectures are prevalent in the TensorFlow 2 Detection Model Zoo [36] and are well-known state-of-the-art detectors [57]. It is clear that one-stage detectors such as SSD are faster and suitable in applications where speed is an important aspect [58]. While the accuracy of two-stage detectors such as Faster R-CNN might be higher [59, 53], the balance between accuracy and high speed is an important feature of architectures such as SSD. In many cases, the accuracy of SSD and Faster R-CNN only differs by a few percent, while SSD architectures are often several times faster [59, 53]. According to the TensorFlow 2 Detection Model Zoo, the SSD ResNet50 V1 FPN 640x640 (RetinaNet50) architecture has a speed of 46 ms and a COCO mAP of 34.3%, while Faster R-CNN ResNet50 V1 640x640 has a speed of 53 ms and a COCO mAP of 29.3%, performing worse in both accuracy and speed. Therefore, SSD ResNet50 V1 FPN 640x640 (RetinaNet50) is selected as the model architecture for this thesis. However, it should be noted that Figure 6 illustrates that YOLO can be run at a higher frame rate than the SSD model. The TensorFlow Object Detection API does not include support for YOLO, which excludes this option [36].

Figure 6: Speed comparison SSD and Faster R-CNN [53]

The three datasets used for the training and evaluation process are KITTI, BDD100K and Udacity. There is a considerable number of datasets suited for vehicle detection, such as Waymo [60] and nuScenes [61]. However, after careful consideration, it was decided that KITTI, BDD100K and Udacity fit the time span of the thesis, considering the size of the datasets. Additionally, these datasets include an interesting variety of traffic, weather and location diversity and are therefore appropriate for this thesis [62]. Furthermore, because of the TensorFlow Object Detection API’s use of the TFRecord format, these datasets were selected for their convenient conversion into this format, as opposed to other datasets listed.

4.6 Observe and Evaluate the System

Evaluation of the systems is done to validate that they detect vehicles correctly, without detecting other objects not specified in the training configuration. If the systems do not work as expected, relevant steps of the methodology are repeated.


There is mainly one metric used to verify the performance of each model trained on a different dataset, this metric being the COCO mAP. Using this metric, it is possible to observe the model’s detection ability through a wide variety of functions. Each model trained on a different dataset is evaluated using all datasets, including its own, meaning that each model is evaluated a total of three times. This is done to ensure that the model has developed a generalized ability to detect vehicles using unfamiliar data. This approach is further supported by the results presented by J. Ponce et al. [8], who explain that high performance on evaluation datasets is not necessarily an indication of high performance on real images where a large background variety is present. It is important to note that the data used in the training process is not part of the evaluation process, in order to avoid false representations of a model’s performance. To answer RQ1.1, the evaluation metrics are gathered and an average of the COCO mAP is calculated for each model, as sketched below. The TensorFlow Object Detection API offers a wide variety of augmentation options that are examined to address RQ1.3. The examined augmentation options include random pixel scaling, random jitter boxes, random jpeg quality conversion and random adjustment of brightness. These augmentations are selected as they have been proven to be effective ways of improving dataset quality, as seen in the results presented by C. Guindel et al. [51]. In addition, random flipping and cropping of images are added as augmentations to study their impact on the performance of a vehicle detection system. The average COCO mAP of the augmented dataset models is compared to the non-augmented performance. In total, three models are trained using separate categories of augmentations, consisting of Flip & Crop, Pixel Manipulation and a combination of these two. The following section aims to clarify these categories and examine the data augmentations in detail.
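A small sketch of this aggregation step; the mAP values are placeholders, not the thesis results:

    import numpy as np

    # One row per trained model; one entry per evaluation dataset (BDD100K, KITTI, Udacity).
    map_scores = {
        'BDD100K model': [0.0, 0.0, 0.0],   # placeholder COCO mAP values
        'KITTI model':   [0.0, 0.0, 0.0],
        'Udacity model': [0.0, 0.0, 0.0],
    }
    for model_name, scores in map_scores.items():
        # Average mAP and standard deviation over the three evaluations.
        print(model_name, np.mean(scores), np.std(scores))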

Flip & Crop

Flipping of the images is done through the configuration file of the model. The flipping is done horizontally with a 50% probability of occurring. During the flipping process, all labels in the image are also flipped, ensuring that all objects remain correctly labeled in the dataset [28].

Cropping of the images is done in a similar manner to flipping [28]. Images in the dataset are randomly cropped with a random ratio factor between 0.1 and 1. The cropping is also applied to the bounding boxes in the image, to match any cropped object in the dataset.

Pixel Manipulation

This category includes augmentation options relevant to pixel manipulation. First, random pixel scaling multiplies each pixel in an image by a constant value in the range [0.9, 1.1], effectively adding slight distortions to the image. Random jitter boxes slightly modify the corners of the bounding boxes present in the image. Random jpeg quality modifies the jpeg encoding quality, inducing potential jpeg noise in the image. Lastly, random brightness adjustment randomly changes the brightness of an image.


Combined Augmentation

The last category of augmentations combines the Flip & Crop and Pixel Manipulation categories. This is done to analyze the impact of combined augmentations on a dataset.
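As an illustration, a minimal sketch of how the Combined Augmentation category might be expressed in the model's pipeline configuration is shown below. The option names follow the TensorFlow Object Detection API's preprocessor configuration, while the numeric values shown are illustrative assumptions rather than the exact settings used in this thesis.

```
# train_config section of pipeline.config (sketch): Combined Augmentation.
# Flip & Crop:
data_augmentation_options { random_horizontal_flip {} }
data_augmentation_options {
  random_crop_image { min_area: 0.1 max_area: 1.0 }
}
# Pixel Manipulation:
data_augmentation_options {
  random_pixel_value_scale { minval: 0.9 maxval: 1.1 }
}
data_augmentation_options { random_jitter_boxes {} }
data_augmentation_options {
  random_jpeg_quality { min_jpeg_quality: 75 max_jpeg_quality: 100 }  # assumed range
}
data_augmentation_options { random_adjust_brightness {} }
```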

Deployment of Models

To further validate the developed evaluation method, the three models trained on BDD100K, KITTI and Udacity are deployed in a real-world setting. The deployment of the models is conducted as a proof of concept for the developed evaluation method. During the deployment, the detection capabilities of the three models are compared and analyzed. This is done to validate that the obtained evaluation results correspond to the results from real-world usage.


5 Results & Analysis

In this thesis, three datasets, KITTI [13], Udacity [14] and BDD100K [12], were used in the training of an SSD ResNet50 V1 FPN 640x640 (RetinaNet50) [36] object detection model. The training was done using the TensorFlow Object Detection API [28].

During the evaluation stage, evaluation metrics were collected and assessed using a method to determine the impact of each dataset on the model. This method was based on related work gathered during the literature study and was applied during the conduction of the research. With this method, it could be determined which of the chosen datasets contributed to the best performing vehicle detection system. However, the method is general in its approach and can be applied to determine the impact of other datasets on vehicle detection performance.

During the training process, the model was configured identically for each dataset. This was done to ensure that every dataset had an equal starting ground, which made it easier to determine their impact on the vehicle detection system. In the configuration file of the model, all augmentation options were removed and the learning rate was set to the same value. Because this thesis focuses on vehicle detection, the number of classes in the configuration file was set to 1, indicating to the model that only one class of objects should be detected.
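These settings can be edited directly in pipeline.config or programmatically. Below is a hedged sketch using the API's config_util helpers; the paths are illustrative, and the learning-rate assignment is omitted for brevity.

```python
from object_detection.utils import config_util

# Load the pipeline configuration, restrict the model to a single class and
# strip every augmentation option, then write the configuration back to disk.
configs = config_util.get_configs_from_pipeline_file('pipeline.config')  # illustrative path
configs['model'].ssd.num_classes = 1  # only the merged 'vehicle' class
configs['train_config'].ClearField('data_augmentation_options')
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, 'model_dir/')  # illustrative directory
```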

The TensorFlow Object Detection API [28] creates checkpoint files during training, but by default only six checkpoints are kept at a time. A change was made in the TensorFlow Object Detection API [28] to ensure that all checkpoints were saved, not only the latest six.
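In practice, this change amounts to adjusting the max_to_keep argument of the tf.train.CheckpointManager created in the API's training loop (model_lib_v2.py). A minimal sketch of the idea, with a stand-in checkpoint object:

```python
import tensorflow as tf

# The API's training loop rotates checkpoint files through a
# tf.train.CheckpointManager; max_to_keep=None disables the rotation so that
# no checkpoint is ever deleted.
ckpt = tf.train.Checkpoint(step=tf.Variable(0))  # stand-in for the API's checkpoint object
manager = tf.train.CheckpointManager(ckpt, directory='model_dir', max_to_keep=None)
manager.save()  # each save now adds a checkpoint instead of replacing the oldest
```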

Before the training phase could commence, the datasets had to be prepared. The TensorFlow Object Detection API [28] expects inputs to be in the TFRecord format. During preparation, the datasets were converted to this format and specific modifications were made, as explained in the following section.

5.1 Data Preparation

In order to simplify the evaluation process, the annotations for each dataset were modified so that they only contain a generalized vehicle class. Classes that belong to the vehicle category, namely cars, trucks, vans and buses, were merged, while classes that do not belong to this category were ignored. Thereafter, each dataset was converted to TFRecords, the expected format for the TensorFlow Object Detection API [28].

5.1.1 KITTI

The classes ’Car’, ’Van’ and ’Truck’ were merged into a common vehicle class, while the other five classes were ignored. To do this, the labels of the dataset had to be changed, which was done by iterating through all label files and changing each occurrence of ’Car’, ’Van’ and ’Truck’ into ’Vehicle’. Using the KITTI dataset [13] tool present in the TensorFlow Object Detection API [28], the images and their complementary labels were converted into the TFRecord format. While features such as 3D bounding boxes and observation angle are included in the KITTI dataset [13], these were disregarded.
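A minimal sketch of this relabeling step is shown below, assuming KITTI's plain-text label format in which the class name is the first space-separated field on each line; the directory path is illustrative.

```python
import glob

# Merge the KITTI classes 'Car', 'Van' and 'Truck' into a single 'Vehicle'
# class by rewriting the first field of each annotation line.
VEHICLE_CLASSES = {'Car', 'Van', 'Truck'}

for path in glob.glob('kitti/training/label_2/*.txt'):  # illustrative path
    with open(path) as f:
        lines = f.readlines()
    with open(path, 'w') as f:
        for line in lines:
            fields = line.split(' ', 1)
            if fields[0] in VEHICLE_CLASSES:
                fields[0] = 'Vehicle'
            f.write(' '.join(fields))
```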


5.1.2 Berkeley DeepDrive - BDD100K

Out of the ten classes included in the BDD100K dataset, ’car’, ’bus’ and ’truck’ were merged into a common ’vehicle’ class, while the remaining classes were filtered out and ignored. By default, the evaluation data contains 10,000 labeled images; however, this was reduced to 1,000 images to match the evaluation data size of the other datasets. Out of the 70,000 training images, 730 images were found to not contain any labels for ’vehicle’. Lastly, the data was converted to the TFRecord format.
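A hedged sketch of this filtering step is given below. It assumes the published BDD100K detection label format, a single JSON list in which each image entry carries a 'labels' array of category entries; the path and key names should be treated as assumptions.

```python
import json

# Merge 'car', 'bus' and 'truck' into a common 'vehicle' class, drop all other
# annotations, and record images left without any vehicle label.
VEHICLE_CLASSES = {'car', 'bus', 'truck'}

with open('bdd100k/labels/bdd100k_labels_images_train.json') as f:  # illustrative path
    frames = json.load(f)

empty_frames = []
for frame in frames:
    kept = []
    for label in frame.get('labels', []):
        if label.get('category') in VEHICLE_CLASSES and 'box2d' in label:
            label['category'] = 'vehicle'
            kept.append(label)
    frame['labels'] = kept
    if not kept:
        empty_frames.append(frame['name'])

print(f'{len(empty_frames)} training images contain no vehicle labels')
```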

5.1.3 Udacity

Preparations for the Udacity dataset [14] were straightforward because of the simple format of the dataset. Firstly, the labels ’Car’ and ’Truck’ were merged into a common vehicle class, while the other two classes were ignored. The dataset was downloaded in a format consisting of 15,000 jpg images and a csv file containing all labels. This format cannot be used in the TensorFlow Object Detection API [28] and needed to be converted into the TFRecord format. Unlike KITTI and BDD100K, Udacity does not come prepared with evaluation data. Therefore, the dataset was split with a ratio of 90% training data and 10% evaluation data.
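As an illustration, a minimal sketch of building one TFRecord example for an image and its boxes is shown below. The feature keys follow the TensorFlow Object Detection API's standard TFRecord fields; grouping the csv rows into per-image box lists is assumed to have been done beforehand.

```python
import io
import tensorflow as tf
from PIL import Image

def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
def _floats(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))
def _ints(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(image_path, boxes):
    """boxes: list of (xmin, ymin, xmax, ymax) pixel coordinates for one image."""
    with open(image_path, 'rb') as f:
        encoded = f.read()
    width, height = Image.open(io.BytesIO(encoded)).size
    feature = {
        'image/encoded': _bytes([encoded]),
        'image/format': _bytes([b'jpg']),
        'image/height': _ints([height]),
        'image/width': _ints([width]),
        # The API expects box coordinates normalized to [0, 1].
        'image/object/bbox/xmin': _floats([b[0] / width for b in boxes]),
        'image/object/bbox/ymin': _floats([b[1] / height for b in boxes]),
        'image/object/bbox/xmax': _floats([b[2] / width for b in boxes]),
        'image/object/bbox/ymax': _floats([b[3] / height for b in boxes]),
        'image/object/class/text': _bytes([b'vehicle'] * len(boxes)),
        'image/object/class/label': _ints([1] * len(boxes)),  # single class id
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```

Each example is then serialized and written with tf.io.TFRecordWriter, with roughly 90% of the images going to the training record and the rest to the evaluation record.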

5.2 Evaluation Method

Evaluations were performed utilizing tools provided by the TensorFlow Object Detection API [28]. The following tables show the evaluation results for the models trained on the BDD100K [12], KITTI [13] and Udacity [14] datasets.

Table 1: Evaluation of model trained on BDD100K

Evaluation dataset    mAP (%)
KITTI                 38.8
BDD100K               42.5
Udacity               33.3

Table 2: Evaluation of model trained on KITTI

Evaluation dataset    mAP (%)
KITTI                 75.3
BDD100K               3.4
Udacity               10.5

Table 3: Evaluation of model trained on Udacity

Evaluation dataset    mAP (%)
KITTI                 13.5
BDD100K               13.1
Udacity               36.1


The results in Tables 1-3 illustrate the difference in mAP when each model is evaluated on its own dataset's evaluation set as well as on evaluation data gathered from the other datasets. Taking the results from the evaluation phase into consideration, an average mAP and standard deviation were calculated. Table 4 presents these metrics for the selected datasets.

Table 4: Average mAP and standard deviation through all evaluations

Dataset     Average mAP (%)    Standard deviation
KITTI       29.7               32.4
BDD100K     38.2               3.8
Udacity     20.9               10.8
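As a check, the figures in Table 4 can be reproduced from Tables 1-3 with a short NumPy sketch; note that the reported standard deviations correspond to NumPy's default population standard deviation.

```python
import numpy as np

# mAP per training dataset when evaluated on KITTI, BDD100K and Udacity
# (values taken from Tables 1-3).
results = {
    'BDD100K': [38.8, 42.5, 33.3],
    'KITTI':   [75.3, 3.4, 10.5],
    'Udacity': [13.5, 13.1, 36.1],
}

for name, maps in results.items():
    # np.std defaults to the population standard deviation (ddof=0).
    print(f'{name}: average mAP {np.mean(maps):.1f}%, std {np.std(maps):.1f}')
```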

5.2.1 Best Performing Dataset

From Tables 1-3 it is apparent that each model performs best on its own dataset's evaluation data. This is likely a consequence of similarities between the data used for training and evaluation. Even though no image occurs in both the training and evaluation sets, it is clear, based on the results from Tables 1-3, that similarities in data such as traffic environment and annotation standards have a significant impact on a model's performance metrics during evaluation. This shows that evaluating a model on data similar to the training data can misrepresent the model's generalized ability and accuracy in detecting vehicles in varying environments. This difference in accuracy and generalized detection ability is particularly noticeable for the models trained on the KITTI [13] and Udacity [14] datasets. The model trained on BDD100K [12] has considerably higher consistency across all evaluation data compared to KITTI and Udacity. To account for these deviations, an average mAP as well as a standard deviation was calculated for each model's evaluation results. By observing a model's average mAP in addition to its standard deviation across multiple evaluation datasets, it is possible to examine its universal performance. These metrics are presented in Table 4. Unsurprisingly, the dataset that stands out in this table is BDD100K, which differs significantly from the others. BDD100K contributed the highest average mAP as well as the lowest standard deviation, meaning that it is considered the dataset that contributed to the highest performing vehicle detection model, proving a generalized ability to detect vehicles in unfamiliar data while doing so with the highest average mAP. Lastly, this research has shown that the average mAP and standard deviation are suitable metrics for determining the dataset that contributed to the highest performing vehicle detection system, which answers RQ1. As a result, it can be concluded that BDD100K is the dataset that contributed the best generalized detection ability and accuracy, which answers RQ1.1.

5.2.2 Features From the Best Performing Dataset

From the section above, it can be determined that the BDD100K dataset contributed to the highest performing vehicle detection model while also proving its generalized ability to detect vehicles. This section relates to RQ1.2, where the features of this dataset are analyzed to determine the possible reasons behind its superiority compared to the other two datasets.


Dataset Size

The BDD100K dataset has a significantly larger number of images compared to the KITTI and Udacity datasets. This feature can possibly be one of the reasons behind its superiority over the other two datasets. To determine this, a new model was trained on 7,000 random images from the BDD100K dataset, compared to the 70,000 images used in the previous training presented in Table 1. To understand the impact of dataset size on the BDD100K dataset, Table 5 was created. It should be noted that this modified dataset is referred to as BDDv7 hereafter.
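A hedged sketch of drawing such a subset before TFRecord conversion is shown below; the label path and the seed are illustrative assumptions.

```python
import json
import random

# Load the BDD100K training labels and draw a reproducible random subset of
# 7,000 frames (the full split holds 70,000) before TFRecord conversion.
with open('bdd100k/labels/bdd100k_labels_images_train.json') as f:  # illustrative path
    frames = json.load(f)

random.seed(42)  # illustrative seed, used only for reproducibility
bddv7_frames = random.sample(frames, k=7000)
```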

Table 5: Evaluation of model trained on the BDDv7 dataset

Evaluation dataset    mAP (%)    Percentage point decrease in mAP
KITTI                 27.3       11.5
BDD100K               32.8       9.7
Udacity               28.6       4.7

Table 5 shows a decrease in mAP for all evaluations when a reduced number of training images is used for BDD100K. The most significant decrease in mAP was found when evaluating on the KITTI evaluation dataset, while the smallest decrease was found on the Udacity evaluation dataset. With the gathered results, the average mAP and standard deviation of the model were calculated. The average mAP had a value of 29.6% and the standard deviation was calculated to be 2.4, compared to the previous average mAP of 38.2% and standard deviation of 3.8 presented in Table 4. Despite the immense reduction in training images, BDDv7 acquired an average mAP that was only marginally lower than that of the model trained on the KITTI dataset, while still having a lower standard deviation. Compared to the model trained on the Udacity dataset, the model maintained a higher average mAP and a lower standard deviation. Overall, these results suggest that there are other features present in the BDD100K dataset that boost its performance compared to Udacity and KITTI. To find out which features boosted BDD100K's performance, the research was extended and label distribution was studied.

Label Distribution

Figure 7 illustrates label heatmaps for each of the datasets. The label heatmaps convey the label distribution of the datasets, revealing where labels are most prevalent in the processed dataset images. The heatmaps were constructed by extracting every bounding box from the datasets. Thereafter, each pixel value located inside a bounding box area was incremented by one, creating a distribution over a two-dimensional array. Lastly, a colormap was created and displayed using the Matplotlib plotting library for Python.
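A minimal sketch of this construction is given below, assuming boxes are provided as (xmin, ymin, xmax, ymax) pixel coordinates and that all images share one resolution; both the resolution and the example boxes are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def label_heatmap(boxes, height, width):
    """Accumulate a heatmap by incrementing every pixel inside each bounding box."""
    heat = np.zeros((height, width), dtype=np.int64)
    for xmin, ymin, xmax, ymax in boxes:
        heat[int(ymin):int(ymax), int(xmin):int(xmax)] += 1
    return heat

# Illustrative usage with two hypothetical boxes on a 720x1280 canvas.
boxes = [(100, 300, 400, 600), (500, 350, 900, 650)]
plt.imshow(label_heatmap(boxes, 720, 1280), cmap='hot')
plt.colorbar()
plt.savefig('label_heatmap.png')
```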


Figure 7: Label heatmaps [panels (a), (b) and (c), one per dataset]

References
