
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

2D object detection and semantic segmentation in the Carla simulator

CHEN WANG

KTH ROYAL INSTITUTE OF TECHNOLOGY

2D object detection and semantic segmentation in the Carla simulator /
2D-objekt detektering och semantisk segmentering i Carla-simulatorn

Chen Wang

Master of Engineering Thesis
Communication Systems
School of Information and Communication Technology
KTH Royal Institute of Technology
Stockholm, Sweden

20 Nov 2020

Supervisor: Yu Yang

Abstract

The subject of self-driving car technology has drawn growing interest in recent years. Many companies, such as Baidu and Tesla, have already introduced automated driving features in their newest cars for use in specific areas. However, there are still many challenges on the road toward fully autonomous driving. Tesla vehicles have been involved in several severe accidents while autonomous driving functions were engaged, which has made the public doubt self-driving car technology. Therefore, it is necessary to use a simulator environment to help verify and refine algorithms for the perception, planning, and decision-making of autonomous vehicles before they are deployed in real-world cars.

This project aims to build a benchmark for implementing a whole self-driving car system in software. An autonomous driving system has three main components: perception, planning, and control. This thesis focuses on two sub-tasks in the perception part, 2D object detection and semantic segmentation. All of the experiments are tested in a simulator environment called CARLA (Car Learning to Act), an open-source platform for autonomous driving research. The Carla simulator is built on the Unreal Engine 4 game engine and uses a server-client architecture that provides a flexible Python API. 2D object detection uses the You Only Look Once (YOLOv4) algorithm, which incorporates many of the latest deep learning techniques, both in network structure and in data augmentation, to strengthen the network's ability to learn objects. YOLOv4 achieves higher accuracy and shorter inference time compared with other popular object detection algorithms. Semantic segmentation uses ESPNetv2, a light-weight and power-efficient network that achieves performance comparable to other semantic segmentation algorithms while using fewer network parameters and FLOPs.

In this project, YOLOv4 and ESPNetv2 are implemented in the Carla simulator. The two modules work together to help the autonomous car understand the world. A minimal distance awareness application is implemented in the Carla simulator to detect the distance to the vehicles ahead. This application can be used as a basic function to avoid collisions. The experiments are run on a single Nvidia GPU (RTX 2060) under Ubuntu 18.04.

Sammanfattning

The subject of self-driving car technology has attracted growing interest in recent years. Many companies, such as Baidu and Tesla, have already introduced automated driving features in their newest cars for driving in specific areas. However, there are still many challenges ahead before fully autonomous cars become a reality.

This project aims to build a benchmark for implementing a whole self-driving car system in software. An autonomous driving system has three main components: perception, planning, and control. This thesis focuses on two sub-tasks in the perception part, 2D object detection and semantic segmentation. All experiments are tested in a simulator environment called CARLA (Car Learning to Act), an open-source platform for autonomous driving research. You Only Look Once (YOLOv4) and ESPNetv2 are implemented in this project to provide the object detection and semantic segmentation functions. A minimal distance awareness application is implemented in the Carla simulator to detect the distance to the vehicles ahead. This application can be used as a basic function to avoid collisions.

Keywords: object detection, semantic segmentation, YOLOv4, ESPnetv2, Carla

Acknowledgements

This report is the final version of the degree project course (EA246X) at the Department of Communication Systems, KTH Royal Institute of Technology. I would like to sincerely thank Prof. Ahmed Hemani and Mr. Yu Yang for their valuable suggestions on my work at each meeting.

Contents

1 Introduction
  1.1 Goal
  1.2 Research methodology
  1.3 Delimitation
  1.4 Hypothesis
  1.5 Structure of this thesis

2 Literature Study
  2.1 Autonomous driving system
  2.2 Object-detection
    2.2.1 DPM
    2.2.2 RCNN
    2.2.3 Fast-RCNN
    2.2.4 Faster-RCNN
    2.2.5 YOLO
  2.3 Semantic segmentation
    2.3.1 FCN
    2.3.2 Segnet
    2.3.3 ESPnetv2
  2.4 Simulator
    2.4.1 Carla

3 Implementation
  3.1 Testbed environment
  3.2 2D Object-detection module: YOLOv4
    3.2.1 Dataset Preparation
    3.2.2 YOLOv4 Network Modification
  3.3 Semantic segmentation model: ESPnetv2
    3.3.1 Dataset Preparation
    3.3.2 ESPnetv2 Network Modification
  3.4 Application: Minimal distance awareness

4 Result and Analysis
  4.1 2D object detection: YOLOv4
    4.1.1 YOLOv4: Training procedure
    4.1.2 YOLOv4: Model evaluation
  4.2 Semantic segmentation: ESPnetv2
    4.2.1 ESPnetv2: Training procedure
    4.2.2 ESPnetv2: Model Evaluation
  4.3 Application: Minimal distance awareness

5 Conclusion
  5.1 Limitations
    5.1.1 Carla simulator
    5.1.2 YOLOv4
    5.1.3 ESPnetv2
    5.1.4 Minimal distance awareness application
  5.2 Ethical Considerations
  5.3 Sustainability
  5.4 Future work
    5.4.1 Carla simulator
    5.4.2 YOLOv4 and ESPnetv2
    5.4.3 Minimal distance awareness application

List of Figures

2.1 Overview of autonomous driving system
2.2 Subtasks of environment perception [1]
2.3 The DPM model contains (a) a root filter, (b) multiple part filters at twice the resolution, and (c) a model for scoring the location and deformation of parts [2]
2.4 Basic model of RCNN [3]
2.5 Basic model of Fast-RCNN [4]
2.6 Basic model of Faster-RCNN [5]
2.7 Basic model of YOLOv4 [6]
2.8 Detailed information about the YOLOv4 network structure
2.9 YOLOv4 network units explanations
2.10 Mish activation function [7]
2.11 Basic structure of FCN and traditional CNN [8]
2.12 Basic structure of Segnet [9]
2.13 ESPnetv2 network structure
2.14 ESPnetv2 network unit
2.15 EESP model (GConv-n: n x n group convolution, DDConv-n: n x n depth-wise dilated convolution) [10]
2.16 ClearNoon
2.17 HardRainNoon
2.18 WetCloudyNoon
2.19 WetCloudySunset
3.1 System Pipeline
3.2 Town01 map positions
3.3 Server Window
3.4 Client Window
3.5 Learning rate policy: Hybrid [10]
3.6 Pixel coordinate to world coordinate [11]
4.1 Loss-chart window
4.2 mAP on the validation set
4.3 PR curve for cars in validation set
4.4 PR curve for cyclists in validation set
4.5 PR curve for cars in test set
4.6 PR curve for cyclists in test set
4.7 PR curve for cars in test set, sunny condition
4.8 PR curve for cyclists in test set, sunny condition
4.9 PR curve for cars in test set, cloudy condition
4.10 PR curve for cyclists in test set, cloudy condition
4.11 PR curve for cars in test set, rainy condition
4.12 PR curve for cyclists in test set, rainy condition
4.13 Low-clarity image example
4.14 YOLOv4: Inference time
4.15 GPU power consumption with YOLOv4
4.16 GPU memory used with YOLOv4
4.17 Learning rate strategy in training process
4.18 First stage
4.19 Second stage
4.20 Input image
4.21 Trained model with class weights in loss function
4.22 Trained model without class weights in loss function
4.23 Average inference time
4.24 GPU efficiency measurement
4.25 Server agent

List of Tables

3.1 YOLOv4 Dataset Distribution
3.2 YOLOv4 Network Configuration
3.3 Class weights in loss function
3.4 ESPnetv2 Network Configuration, First stage
3.5 ESPnetv2 Network Configuration, Second stage
4.1 Performance of best weights obtained in training process
4.2 Data distribution of test set
4.3 Network parameters and FLOPs in training stage
4.4 Confusion matrix for test set
4.5 Evaluation metrics

List of Acronyms and Abbreviations

PR Precision-Recall

SPP Spatial Pyramid Pooling

EESP Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions

IoU Intersection over Union

AP Average Precision

PA Pixel Accuracy

CPA Class Pixel Accuracy

FLOPS Floating-point Operations per Second


Chapter 1

Introduction

In 2019, Sweden recorded a historic low in traffic fatalities: a total of 223 people died in traffic crashes, according to preliminary estimates by the Swedish Transport Agency [12]. Moreover, about 22,800 traffic fatalities were reported in 2019 in the 27 Member States of the European Union (EU) [13]. While these statistics have declined over the last ten years according to EU reports, the EU has also released a range of new regulations in an effort to minimize car collisions. One of the 2019 rules requires advanced safety equipment to be mandatory in all new road vehicles sold on the EU market [14]. At the same time, autonomous cars are expected to enter the EU market from 2020. According to accident statistics, 95% of all road traffic accidents in the EU are caused by human error. Driverless cars could therefore greatly reduce the chance of traffic accidents and improve the reliability of driving. Besides, this new technology can gradually help address congestion and greenhouse gas emissions. Level 3 and level 4 autonomous vehicles have recently been tested in the real world, and fully autonomous vehicles are expected around 2030. In the next few years, the market for self-driving vehicles may grow dramatically, which could create more jobs and grow the EU automotive industry to up to €620 billion by 2025 [15].

Research on driverless cars has been a popular topic in the past decade. At the software level, autonomous driving has three main components: perception, planning, and control. Perception includes computer vision, sensor fusion, and positioning. Computer vision detects objects such as lanes, motorcycles, pedestrians, and traffic signs, and can effectively identify lane lines, license plates, traffic light signals, and other details; like human eyes, it allows the car to perceive the world. Sensor fusion deepens this understanding of the scene by obtaining information such as the distance to other vehicles and the speed of other moving objects, and by relating the ego vehicle to the surrounding world. Positioning is not simple GPS positioning: a self-driving car requires centimeter-level accuracy, and GPS with meter-level errors cannot meet this requirement, so self-positioning techniques based on landmarks, particle filtering, triangulation, and other methods are also required. Path planning on the navigation map is a global plan that guides the autonomous vehicle where to go. Control is the last step of autonomous driving; it operates the steering wheel, brakes, accelerator, lights, and other equipment.

The main motivation of this project is to build a benchmark by integrating recent technology in the area of autonomous driving into one project and measuring the performance of each module used. The project mainly focuses on the computer vision and sensor fusion modules; together these two modules are called environment perception. Cars make use of cameras and sensors to understand their surroundings. Two sub-tasks, object detection and semantic segmentation, are key parts of environment perception. Object detection is responsible for classifying what appears in an image and pointing out its position and category. Image segmentation is the task of dividing the image into regions belonging to the same object. A convolutional neural network (CNN) is trained to recognize various objects. One method is to employ sliding windows: as a window slides over the image, each crop is sent to the CNN to be classified. However, the sliding window approach is computationally expensive. The first version of the You Only Look Once (YOLO) algorithm was released in 2015; YOLO splits an image into a grid and runs the entire image through a convolutional neural network once [16]. YOLOv4 is applied in this project. Besides, a light-weight segmentation network called ESPNetv2 is employed to perform image segmentation. The main characteristic of ESPNetv2 is that it can learn representations from a large effective receptive field with fewer FLOPs and parameters. All of the experiments are done in a virtual environment. The Carla simulator is a popular open-source platform for testing autonomous driving functions. It provides a flexible Python API and configurable modules such as traffic generation and weather conditions, which can simulate most real-world situations.

1.1 Goal

1.2 Research methodology

This project relies on the analytical method. The performance of YOLOv4 and ESPnetv2 is obtained by measuring model prediction precision, prediction time, total GPU memory used, and GPU power consumption. These evaluation metrics reflect the basic performance of the models in terms of accuracy, inference speed, and GPU efficiency.

1.3 Delimitation

Our work has two main limitations:

• I made the dataset myself. The whole dataset contains around 500 pictures with ground-truth labels, so personal errors during labeling may occur. All of the pictures were obtained from the default map Town01 in the Carla simulator, so the performance of our model may not be ideal compared with the official results. Adding more training data could improve the results.

• The entire experiment is run on a single RTX 2060 GPU with 6213 MB of memory. The Carla server is started at 10 FPS, which is slower than a real-time frame rate, yet the two modules still cannot keep up smoothly; extra waiting time is therefore added in the code for the prediction results of the two modules. Meanwhile, the computing capability of my computer limits the network configuration when the batch size and the input image size are large.

1.4 Hypothesis

1.5 Structure of this thesis

Chapter 2

Literature Study

2.1 Autonomous driving system

An overview of a fully autonomous driving system is shown in Figure 2.1.

Figure 2.1: Overview of autonomous driving system

An autonomous driving system consists of three main components at the software level: perception, planning, and control [17]. Each component contains different modules. The perception module collects data from the sensors and builds an environment model to help the car understand the world. Travel planning is then decided based on the user's command. Finally, the car moves while observing the actual environment. In this project, we focus on environment perception, which contains four subtasks, as shown in Figure 2.2.

Figure 2.2: Subtasks of environment perception [1]

2.2 Object-detection

Object detection algorithms are commonly divided into one-stage and two-stage approaches. A one-stage algorithm has a short inference time but lower accuracy; the two-stage algorithm is just the opposite. The newest deep learning models for both strategies have made great progress in both accuracy and speed.

2.2.1 DPM

DPM is an extension of Histograms of Oriented Gradients (HOG) [21], and its general idea is the same as HOG: hand-crafted DPM features are extracted and a latent SVM is used for classification. DPM recognizes objects with a mixture of graphical models of deformable parts (as shown in Figure 2.3). The model consists of three major components:

• A root filter, i.e., a detection window that roughly covers the entire object.
• Multiple part filters that cover smaller parts of the object.
• A spatial model that scores the location and deformation of each part relative to the root filter.

Figure 2.3: The DPM model contains (a) a root filter, (b) multiple part filters at twice the resolution, and (c) a model for scoring the location and deformation of parts. [2]

This feature extraction method has several obvious limitations. On the one hand, the DPM model is complicated in feature extraction and slow at inference. On the other hand, the hand-crafted features perform poorly when objects are rotated or stretched.

2.2.2 RCNN


Figure 2.4: Basic model of RCNN[3]

The RCNN pipeline consists of the following steps:

• The Selective Search algorithm [3] extracts about 2,000 candidate regions from the image; these candidate regions may contain the targets to be identified.
• All candidate boxes are scaled to a fixed size.
• The features of each candidate are extracted by a deep CNN and concatenated into a fixed-length feature vector.
• An SVM takes the feature vector as input and outputs the object category, while a fully connected regression network determines the position of the object from the feature vector.

However, RCNN requires a long training time because it needs to process about 2,000 candidate regions per image. Moreover, RCNN cannot perform real-time detection: it takes about 47 seconds to process each image.

2.2.3 Fast-RCNN


Figure 2.5: Basic model of Fast-RCNN[4]

2.2.4 Faster-RCNN

Proposal extraction, feature extraction, bounding box regression, and classification are merged into one network in Faster-RCNN, which improves the model in terms of both speed and accuracy. Faster-RCNN first deploys a CNN to extract image features. The Region Proposal Network (RPN) is in charge of generating region proposals; this layer uses a softmax function to decide whether each anchor belongs to the foreground or the background, and the anchors are then refined through bounding box regression to obtain accurate proposals. The RoI pooling layer receives the refined proposals and the original feature maps as input, and its output is sent to fully connected layers to determine the target class. The category of each proposal is thus obtained, and the accurate position of the object is obtained from the RoI layer output through bounding box regression. Figure 2.6 shows the basic structure of Faster-RCNN.


2.2.5 YOLO

YOLO is a typical one-stage detector that directly combines feature extraction, candidate box regression, and classification in one neural network. The detection speed of YOLO is about ten times faster than Faster R-CNN. YOLO scales the input image to a uniform size and divides it into an n x n grid; each cell is responsible for detecting the objects whose center points fall inside it.

The newest version of YOLO is YOLOv4, which was released in April 2020. It allows anyone to train a fast and accurate object detector on a single GPU, which is one of the reasons YOLOv4 is employed in this project. It combines a large number of tricks from the latest deep learning techniques in one detector. These tricks are divided into two types: bag of freebies and bag of specials. Bag of freebies methods increase the training time or change the training strategy to improve detector accuracy without increasing the inference cost, for example CutMix [22] and DropBlock [23]. Bag of specials methods can significantly increase detector accuracy at a small additional inference cost, for example Spatial Pyramid Pooling (SPP) [24] and the Spatial Attention Module (SAM) [25]. The structure of YOLOv4 is shown in Figure 2.7. It consists of:

• Backbone: CSPDarknet53 [26]
• Neck: SPP, PAN [27]
• Head: YOLOv3 [28]

Figure 2.7: Basic model of YOLOv4[6]


Figure 2.8: Detailed information about the YOLOv4 network structure


The main component in the backbone is the CSP resblock unit, which combines the ideas of CSPNet [26] and ResNet [29]. CSPNet splits the feature map along the channel dimension, which reduces the amount of computation and memory used during training; the two parts are concatenated again at the end of the CSP resblock unit. ResNet is introduced to mitigate the gradient degradation problem. As shown in Figure 2.10, a new activation function, Mish, is used in the backbone, which is smooth at almost all points (a small sketch of Mish follows the figure). Regarding the neck, the SPP and PANet modules strengthen the network's ability to extract features. Finally, the head decodes the feature maps and obtains the bounding box categories and coordinates.

Figure 2.10: Mish activation function[7]
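For reference, Mish is defined as x * tanh(softplus(x)). A minimal NumPy sketch of the function (for illustration only, not the thesis's training code) is:

    import numpy as np

    def mish(x):
        # Mish activation: x * tanh(softplus(x)); smooth and non-monotonic
        return x * np.tanh(np.log1p(np.exp(x)))

    print(mish(np.array([-2.0, 0.0, 2.0])))  # approx. [-0.2525, 0.0, 1.9440]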

2.3 Semantic segmentation

Among deep-learning-based semantic segmentation approaches [30], encoder-decoder based models such as Segnet [9] and ESPnetv2 [10] have been a popular research direction in recent years.

2.3.1 FCN

Generally, a CNN is followed by several fully connected layers after the convolutional layers, so the output of the convolutional layers is mapped into a fixed-length feature vector. As shown in Figure 2.11, the FCN network instead takes the entire original image as input, discards the fully connected layers, and adds several convolutional layers at the end to obtain a heatmap. Through several upsampling layers, FCN produces a heatmap of the same size as the input image. Its disadvantage is also obvious: the upsampled result has low clarity, and it is hard to recover fine details from the FCN structure.

Figure 2.11: Basic structure of FCN and traditional CNN [8]

2.3.2 Segnet

Segnet records the max-pooling indices in the encoder and restores feature values to the corresponding positions when upsampling in the decoder. This decoder structure not only preserves boundary information but also avoids the extra parameters introduced by the deconvolution operation in FCN.

Figure 2.12: Basic structure of Segnet[9]

2.3.3 ESPnetv2


Figure 2.13: ESPnetv2 network structure


Figure 2.15: EESP model(GConv-n: n*n group convolution, DDConv-n: n*n depth-wise dilated convolution)[10]

2.4 Simulator

One of the main problems encountered in the research and development of autonomous vehicles is that it is difficult to train vehicles to handle all possible situations: unexpected situations occur with relatively low probability, and training vehicles in the real world is expensive. Thus, an autonomous driving simulator is needed to help researchers develop techniques for a fully autonomous driving system. There are several open-source autonomous driving simulation platforms, including Carla [32], AirSim [33], and Udacity [34]. This project deploys the Carla simulator since it is an open-source and free platform that allows users to configure their own experiment environment based on Carla modules.

2.4.1 Carla

Carla adopts a server-client architecture: the server, built on Unreal Engine 4, simulates and renders the scene, and the client API interacts with the server via sockets. The Carla simulator has several advantages:

• Flexible python API: users can control all modules in driving tasks such as traffic generation, pedestrian behavior, weather, sensors, etc.

• Users can configure various sensor kits that include lidar, camera, and GNSS sensor.

• Users can create their own maps to test the autonomous driving algorithm in a different environment.

The stable Carla version 0.8.4 is employed in our project. It contains two default maps, Town01 and Town02, and users can select 15 different weather conditions inside the game. Figures 2.16, 2.17, 2.18, and 2.19 show different scenes captured by the car camera.

Figure 2.16: ClearNoon
Figure 2.17: HardRainNoon
Figure 2.18: WetCloudyNoon
Figure 2.19: WetCloudySunset


Chapter 3

Implementation

Our project contains two different modules, each of which provides a key environment-perception function for autonomous driving. In detail, we implemented 2D object detection (YOLOv4) and semantic segmentation (ESPnetv2) in the Carla simulator. In addition, an application is implemented to perceive the minimal distance to the vehicles ahead. The next section describes our testbed. Our system pipeline is shown in Figure 3.1.

Figure 3.1: System Pipeline

3.1 Testbed environment

Our computer has a single Nvidia GPU (RTX 2060) with 6213 MB of memory. CUDA 10.2 and cuDNN 7.6.5 are installed under Ubuntu 18.04. All experiments are run in a virtual environment created with Anaconda [35]. In the computer vision field, object detection is considered to have good real-time performance once its prediction rate is equal to or larger than 30 frames per second (FPS). Regarding the configuration of the simulator, the Carla server agent runs at 10 FPS, and 300 vehicles are added to the game each time it starts. Four types of weather are set randomly in the game: clear noon, hard rain noon, wet cloudy noon, and wet cloudy sunset. We use the default map Town01 as the main experiment map. As shown in Figure 3.2, cars spawn at different places according to the numbered positions on the map. Carla supports various virtual sensors for autonomous cars inside the game; in this project, an RGB camera and a depth camera are attached to the car. The RGB camera behaves like a regular camera that captures the game scene, while the depth camera provides the distance of each pixel to the camera as raw data, from which the depth map is generated. More information about other sensors can be found in the Carla documentation [36]. The Carla simulator also provides a graphical user interface (GUI) that allows users to control cars manually with the keyboard. Each time the game starts, a client window and a server window are displayed: as shown in Figures 3.3 and 3.4, the server window displays more detailed game settings, while the client window shows the results of the object detection and semantic segmentation modules in real time.
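To make the setup concrete, the following sketch follows the style of the CARLA 0.8.x Python client examples for starting an episode with an RGB and a depth camera; the port, sensor placement, image size, and weather ID here are illustrative assumptions rather than the exact values used in this project.

    from carla.client import make_carla_client
    from carla.sensor import Camera
    from carla.settings import CarlaSettings

    with make_carla_client('localhost', 2000) as client:
        settings = CarlaSettings()
        settings.set(
            SynchronousMode=True,
            SendNonPlayerAgentsInfo=True,
            NumberOfVehicles=300,        # vehicles added at episode start
            NumberOfPedestrians=0,
            WeatherId=1)                 # e.g. ClearNoon

        # RGB camera for YOLOv4 / ESPnetv2 and a depth camera for distance estimation
        camera_rgb = Camera('CameraRGB')
        camera_rgb.set_image_size(800, 600)
        camera_rgb.set_position(2.0, 0.0, 1.4)
        settings.add_sensor(camera_rgb)

        camera_depth = Camera('CameraDepth', PostProcessing='Depth')
        camera_depth.set_image_size(800, 600)
        camera_depth.set_position(2.0, 0.0, 1.4)
        settings.add_sensor(camera_depth)

        client.load_settings(settings)
        client.start_episode(0)          # spawn at player start position 0

        for _ in range(1000):
            measurements, sensor_data = client.read_data()
            rgb_image = sensor_data.get('CameraRGB')
            depth_image = sensor_data.get('CameraDepth')
            # drive with the built-in autopilot while recording frames
            client.send_control(measurements.player_measurements.autopilot_control)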


Figure 3.3: Server Window
Figure 3.4: Client Window

3.2 2D Object-detection module: YOLOv4

The YOLOv4 code is taken from the official GitHub repository [37], which was released in April 2020. This module perceives dynamic objects such as cars and cyclists: it predicts on an image and outputs a text file containing the category and coordinates of each object in the image.

3.2.1 Dataset Preparation

We used the code [38] from GitHub to generate training data from the Carla driving simulator. Cars are set to autopilot mode, and an image is saved to disk each time cars or cyclists are observed. The code resets the environment if the agent is stuck, cannot find any agents, or has captured enough frames. Three main weather conditions, sunny, cloudy, and rainy, are used in the Carla simulator. LabelImg [39] was used to label the images. The training set contains 300 images, the validation set contains 100 images, and the test set contains 100 images. Table 3.1 shows detailed information about the data distribution, and a small label-conversion sketch follows the table.

Weather   Map      Total Frames   Cars   Cyclists
Sunny     Town01   157            212    87
Cloudy    Town01   196            315    93
Rainy     Town01   147            219    56
Total     Town01   500            746    236

Table 3.1: YOLOv4 Dataset Distribution
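LabelImg can export either PASCAL VOC XML or YOLO text annotations. If VOC XML was used, a small conversion step produces the darknet-style labels that YOLOv4 training expects; the sketch below makes this concrete, with hypothetical directory names and the assumption that the two classes were labeled "car" and "cyclist".

    import glob
    import xml.etree.ElementTree as ET

    CLASSES = ["car", "cyclist"]  # assumed label names used in LabelImg

    def voc_to_yolo(xml_path, txt_path):
        root = ET.parse(xml_path).getroot()
        img_w = float(root.find("size/width").text)
        img_h = float(root.find("size/height").text)
        lines = []
        for obj in root.findall("object"):
            cls_id = CLASSES.index(obj.find("name").text)
            box = obj.find("bndbox")
            xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
            xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
            # darknet format: class x_center y_center width height, normalized to [0, 1]
            cx, cy = (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h
            bw, bh = (xmax - xmin) / img_w, (ymax - ymin) / img_h
            lines.append(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
        with open(txt_path, "w") as f:
            f.write("\n".join(lines))

    for xml_file in glob.glob("labels_voc/*.xml"):
        voc_to_yolo(xml_file, xml_file.replace("labels_voc", "labels_yolo").replace(".xml", ".txt"))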


3.2.2 YOLOv4 Network Modification

We need to modify the YOLOv4 network configuration to fit our training case. The batch size defines the number of training images that are propagated through the network together; it is generally smaller than the number of training samples. A proper batch size makes the training procedure require less memory and run faster, but the smaller the batch size, the less accurate the gradient estimate. In this project, the mini-batch size and network size are set to 1 and 416x416 respectively, as our computer has limited computing power. In general, 2,000 iterations are enough to train the network for one class; since our project has 2 classes, max batches is set to 4,000. The learning rate is set to 0.0013 by default and is changed after 80% and 90% of max batches, so the steps variable is set to 3200, 3600. The input image size should be set as large as possible to allow the network to detect small objects; limited by the available computing power, 512x384 was set as the input image size. Some data augmentation methods, such as modifying the angle, saturation, exposure, and hue of input images, are deployed to improve the performance of the network. A noticeable method is Mosaic, which combines four training images into one in certain ratios. Table 3.2 shows the adjusted parameters of the YOLOv4 network.

Network Configuration

Batch          64        Exposure        1.5
Subdivisions   64        Hue             0.1
Width          416       Mosaic          1
Height         416       Learning rate   0.0013
Channels       3         Burn in         1000
Momentum       0.949     Max batches     4000
Decay          0.0005    Policy          Steps
Angle          0         Steps           3200, 3600
Saturation     1.5       Scales          0.1, 0.1

Table 3.2: YOLOv4 Network Configuration

Regarding the architecture of the YOLOv4 network, it includes three parts, backbone, neck, and head, as mentioned in Chapter 2. YOLOv4 uses the Mish activation function in the backbone and Leaky ReLU in the neck. There are three YOLO layers inside the network, and the number of classes in each YOLO layer is modified to the number of categories in the training dataset. In addition, the number of filters in the convolutional layer before each YOLO layer is calculated with the standard darknet formula filters = (classes + 5) x 3; with our 2 classes this gives (2 + 5) x 3 = 21 filters.


By setting the flag random to 1, the network is trained at different resolutions, which increases precision.

3.3 Semantic segmentation model: ESPnetv2

The ESPnetv2 code is taken from the official GitHub repository [40]. ESPnetv2 classifies each pixel in the image into three foreground categories, cars, cyclists, and drivable space; all other objects are treated as background. Different colors mark the different classes at pixel level: red, yellow, green, and black represent vehicles, cyclists, the drivable area, and background respectively. The model takes an image as input and outputs an image with per-pixel color labels.

3.3.1 Dataset Preparation

As mentioned in Section 3.2.1, this model uses the same dataset as YOLOv4. Labelme [41] is used for image annotation and stores the label information in JSON files. The Labelme utility script json_to_dataset.py is used to convert each JSON file into a mask image. The original images are stored in JPG format with a size of 1248 x 384, and the labeled images are stored as PNG files of the same size. The dataset is organized in the PASCAL VOC 2012 [42] format.
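Conceptually, the conversion fills each annotated polygon with its class index to produce a single-channel mask; a minimal sketch of that idea (with a hypothetical class mapping and file names, not the exact Labelme script) is:

    import json
    from PIL import Image, ImageDraw

    CLASS_TO_ID = {"background": 0, "car": 1, "road": 2, "cyclist": 3}  # assumed mapping

    def labelme_json_to_mask(json_path, out_png, width=1248, height=384):
        with open(json_path) as f:
            annotation = json.load(f)
        mask = Image.new("L", (width, height), CLASS_TO_ID["background"])
        draw = ImageDraw.Draw(mask)
        for shape in annotation["shapes"]:
            class_id = CLASS_TO_ID.get(shape["label"], 0)
            polygon = [tuple(point) for point in shape["points"]]
            draw.polygon(polygon, fill=class_id)  # rasterize the polygon with its class index
        mask.save(out_png)

    labelme_json_to_mask("frame_0001.json", "frame_0001.png")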

3.3.2 ESPnetv2 Network Modification

As mentioned in Section 3.3.1, our dataset is made in PASCAL VOC format, so some network modifications are needed before training. The default PASCAL VOC configuration has 21 classes; we first change the number of classes to 4 in the data loader voc.py. Regarding the learning rate policy, ESPnetv2 combines the features of linear and cyclic learning rate policies. As shown in Figure 3.5, the trend of the learning rate is similar to a cosine learning rate policy [43].


The training procedure is divided into two stages. In the first stage, ESPnetv2 trains the network with a small resolution and a large batch size. Since the batch normalization (BN) layers perform well with larger batch sizes, when the image resolution is increased in the second stage the BN layers are frozen to fine-tune the network. The ignore index is set to 255, which means that pixels with value 255 are not taken into account for training and evaluation. Considering the imbalance of our training samples, where the number of pixels belonging to cyclists is much smaller than for the other categories, we set class weights in the loss function. Each category has a class frequency, the number of pixels of that category divided by the total number of pixels in the dataset. We then compute the median of all class frequencies and divide this median by each class frequency to obtain the final class weights. This ensures that classes with a small proportion of pixels get a weight greater than 1 and classes with a large proportion get a weight less than 1, achieving a balancing effect (a small sketch of this computation follows Table 3.3). Table 3.3 shows the class weights in the loss function. Meanwhile, we also trained a model without class weights in the loss function; a comparison between the two strategies is shown in Chapter 4. Tables 3.4 and 3.5 show the network configuration of ESPnetv2. To evaluate the semantic segmentation model, we build a confusion matrix and use metrics such as pixel accuracy (PA), class pixel accuracy (CPA), mean pixel accuracy (MPA), and mean intersection over union (MIoU). These metrics are calculated with the following formulas:

PA = (sum of true positives over all classes) / (total number of pixels in the matrix)

CPA = (true positives of a class) / (sum of the corresponding column of the matrix)

MPA = (sum of the CPA values) / (number of classes)

IoU = (true positives of a class) / (sum of its row and its column in the matrix minus its true positives), and MIoU is the average IoU over all classes

Class     Background   Car         Road         Cyclist
Weights   0.12612097   4.0838232   0.56975791   12.82901674

Table 3.3: Class weights in loss function
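The median-frequency balancing described above can be computed directly from the label masks. A small sketch (assuming integer masks containing the four class indices) is:

    import numpy as np

    def median_frequency_weights(masks, num_classes=4):
        # masks: iterable of HxW integer arrays with values in [0, num_classes)
        pixel_counts = np.zeros(num_classes, dtype=np.float64)
        for mask in masks:
            pixel_counts += np.bincount(mask.ravel(), minlength=num_classes)[:num_classes]
        class_freq = pixel_counts / pixel_counts.sum()   # per-class pixel frequency
        return np.median(class_freq) / class_freq        # rare classes get weights > 1

With this scheme a frequent class such as background ends up with a weight below 1 and a rare class such as cyclist well above 1, consistent with Table 3.3.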


Network Configuration First Stage

Batch           8          Cropsize       512 x 384
Learning rate   0.007      Scheduler      Hybrid
Step size       20         Momentum       0.9
Epochs          50         Loss type      bce
Ignore-index    255        Clr-max        31
Cycle-length    5          Freeze-bn      False

Table 3.4: ESPnetv2 Network Configuration First stage

Network Configuration Second Stage

Batch           4          Cropsize       1024 x 384
Learning rate   0.001      Scheduler      Hybrid
Step size       20         Momentum       0.9
Epochs          50         Loss type      bce
Ignore-index    255        Clr-max        31
Cycle-length    5          Freeze-bn      True

Table 3.5: ESPnetv2 Network Configuration Second stage

3.4 Application: Minimal distance awareness

Given a pixel location (u, v) and its depth Z obtained from the depth camera, the corresponding camera coordinates can be calculated using the following formulas:

X = (u - cx) * Z / f

Y = (v - cy) * Z / f

(u, v) represents the location of a point at pixel level, where u is the row value and v is the column value. cx and cy are the coordinates of the image center point in the pixel coordinate system, and f is the focal length. These parameters can be found in the camera calibration matrix K.

    [ f   0   cx ]
K = [ 0   f   cy ]        (3.1)
    [ 0   0   1  ]

In the Carla simulator, the camera calibration matrix is given, and (u, v) is taken from the output of YOLOv4. Using similar triangles, X and Y are easily obtained. Finally, the distance can be calculated as:

distance = sqrt(X^2 + Y^2 + Z^2)
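Putting the pieces together, a sketch of the distance computation is shown below; it assumes the YOLOv4 box center (u, v), a depth map aligned with the RGB image, and the intrinsics f, cx, cy from the calibration matrix K.

    import math

    def pixel_to_camera(u, v, z, f, cx, cy):
        # Back-project a pixel (u, v) with depth z into camera coordinates (X, Y, Z)
        x = (u - cx) * z / f
        y = (v - cy) * z / f
        return x, y, z

    def distance_to_detection(box_center, depth_map, f, cx, cy):
        # Euclidean distance from the camera to the object at the detected box center
        u, v = box_center                     # (u, v) follow the row/column convention above
        z = float(depth_map[int(u), int(v)])  # depth (in meters) at that pixel
        x, y, z = pixel_to_camera(u, v, z, f, cx, cy)
        return math.sqrt(x * x + y * y + z * z)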


Chapter 4

Result and Analysis

In this chapter, we present the results gained in this project by implementing a 2D object detection module (YOLOv4) and a semantic segmentation module (ESPnetv2) in the Carla simulator. YOLOv4 classifies dynamic objects such as cars and cyclists in the image, and ESPnetv2 labels each pixel of the image with its corresponding class. These two modules provide key functions in the environment perception task of autonomous vehicles. With a view to implementing such functions in real-world cars in the future, the two modules are evaluated from different aspects such as real-time performance and power efficiency. Finally, we use the output of YOLOv4 to implement a function that detects the minimal distance to the vehicle ahead.

4.1 2D object detection: YOLOv4

This section presents the results of the 2D object detection module. In the first subsection, we analyze the logs from the training procedure and summarize metrics such as recall, precision, average precision (AP), and intersection over union (IoU). In the second subsection, we evaluate the performance of the model on the test set in terms of real-time performance and GPU efficiency.

4.1.1 YOLOv4: Training procedure

As shown in Figure 4.1, the training loss decreases dramatically within the first 800 iterations and fluctuates around 0.4 at the end of the training. During training, the mean average precision (mAP) on the validation set is calculated every 100 epochs after the first 1,000 iterations.

one epoch = (number of images in the validation set) / batch size


Figure 4.1: Loss-chart window
Figure 4.2: mAP on the validation set

As shown in Figure 4.2, the mAP reaches 91.39% at 1,600 batches and stabilizes around 90% at the end; overfitting may occur after 1,600 batches. We therefore saved the weights at 1,600 batches as the best weights. Table 4.1 shows more information about the performance of the model when using the best weights. The default confidence threshold is 0.25. Recall and F1-score are calculated with the following formulas:

Recall = TP / (TP + FN)

F1-score = 2 * Precision * Recall / (Precision + Recall)

The recall value reflects the ability of the model to recognize positive samples: the higher the recall, the stronger the model's ability to find positive samples. The precision value, in contrast, reflects how many of the predicted positives are actually correct, i.e., how well the model avoids false positives. The F1-score combines the two metrics, and a stable model usually has a high F1-score. A single confidence threshold is not enough to judge the performance of the trained model.

Validation Set

Class     Precision   True Positives (TP)   False Positives (FP)
Car       90.41%      151                   16
Cyclist   82.86%      29                    6

Total (Car + Cyclist)
Confidence Threshold   Recall   Average IoU   F1-score   mAP
0.25                   0.92     68.32%        0.88       0.845

Table 4.1: Performance of best weights obtained in training process


Figure 4.3: PR curve for cars in validation set

Figure 4.4: PR curve for cyclist in validation set

Figures 4.3 and 4.4 show the precision-recall (PR) curves for the two classes under different confidence thresholds on the validation set. The model maintains high precision for recall values between 0 and 0.8. Balancing precision and recall is generally a trade-off: when the recall value reaches 0.75, both classes still maintain high precision, but as recall increases further, the precision for cars and cyclists drops suddenly around a recall of 0.9.

4.1.2 YOLOv4: Model evaluation

We evaluate our model with the best weights on the 100-image test set. The PR curves for cars and cyclists are shown in Figures 4.5 and 4.6; the model maintains a high precision value until the recall value increases to 0.8 for both categories. We also examine the performance of the model under different weather conditions in the Carla simulator, so we divide the original test set into three new test sets collected mainly under sunny, cloudy, and rainy weather. The distribution of the three sets is shown in Table 4.2.

Test set

Weather   Frames   Cars   Cyclists
Sunny     31       33     25
Cloudy    29       40     14
Rainy     40       51     18

Table 4.2: Data distribution of test set


Under the sunny condition, the average precision for cars and cyclists reaches 0.9297 and 0.8787 respectively, which is higher than the average over the whole test set. The model recognizes more positive car samples than cyclist samples. Meanwhile, the precision value remains stable for recall values between 0 and 0.8, which indicates a stable and reliable model under sunny weather conditions.

Figure 4.5: PR curve for cars in test set
Figure 4.6: PR curve for cyclist in test set

Figure 4.7: PR curve for cars in test set Sunny condition

Figure 4.8: PR curve for cyclist in test set Sunny condition


Strong sunlight and rain may affect the clarity of the images obtained from the car camera. An extreme case is shown in Figure 4.13: cyclists and cars are hard to recognize when sunlight hits at particular angles.

Figure 4.9: PR curve for cars in test set Cloudy condition

Figure 4.10: PR curve for cyclist in test set Cloudy condition

Figure 4.11: PR curve for cars in test set Rainy condition


Figure 4.13: Low clarity image example


Figure 4.14: YOLOv4: Inference time

Figure 4.15: GPU Power consumption with YOLOv4

Figure 4.16: GPU memory used with YOLOv4

4.2 Semantic segmentation: ESPnetv2

This section presents the results of the semantic segmentation module. In the first subsection, we analyze the logs recorded in TensorBoard during the training procedure; in the second subsection, we evaluate the performance of the model in terms of accuracy and GPU efficiency.

4.2.1 ESPnetv2: Training procedure


Figure 4.17: Learning rate strategy in training process

As shown in Figures 4.18a, 4.18b, 4.18c, and 4.18d, the two strategies have almost the same performance in terms of mIoU and training loss in the first stage, and both metrics stabilize after 40 epochs. However, the training procedure with class weights fluctuates more before epoch 40, because the class weights change the range of the loss, which may affect the stability of training. Comparing the mIoU on the training and validation sets, the network trained without class weights in the loss function performs better in the first stage. Figures 4.19a, 4.19b, 4.19c, and 4.19d present the training procedure in the second stage: the mIoU of the network trained with class weights reaches 78.5%, which is larger than that of the other strategy (64.5%). The model trained with class weights also outperforms the model without class weights when recognizing small objects such as cyclists. Figures 4.20, 4.21, and 4.22 show the prediction results of the two trained models when the input image contains a small object. Compared with the performance of ESPnetv2 on the PASCAL VOC and Cityscapes datasets, our model achieves a high mIoU because our dataset contains simple objects. In addition, ESPnetv2 has few network parameters and floating-point operations (FLOPs) during training, as shown in Table 4.3, and it also has a short inference time and low energy consumption; more information about the inference time and GPU power consumption is given in the next subsection.

Training Stage   Network Parameters (Million)   FLOPs (Billion)
First Stage      0.779082                       0.917
Second Stage     0.779082                       1.833

Table 4.3: Network parameters and FLOPs in training stage


(a) First stage training set without class weights in loss function

(b) First stage training set with class weights in loss function

(c) First stage validation set without class weights in loss function

(d) First stage validation set with class weights in loss function

Figure 4.18: First stage

(a) Second stage training set without class weights in loss function

(b) Second stage training set with class weights in loss function

(c) Second stage validation set without class weights in loss function

(d) Second stage validation set with class weights in loss function

Figure 4.19: Second stage


Figure 4.21: Trained model with class weights in loss function

Figure 4.22: Trained model without class weights in loss function

4.2.2 ESPnetv2: Model Evaluation


Tables 4.4 and 4.5 show the confusion matrix and the evaluation metrics for the test set. When detecting a large object like cars, the CPA can reach 75.97%. Comparing the accuracy of recognizing cars and cyclists, the performance of ESPnetv2 is worse than that of YOLOv4.

Ground truth \ Prediction   Background   Car      Road     Cyclist
Background                  97.36%       0.75%    1.69%    0.2%
Car                         1.78%        96.94%   0.91%    0.37%
Road                        1.15%        0.66%    98.09%   0.1%
Cyclist                     1.52%        0.04%    0.006%   98.434%

Table 4.4: Confusion matrix for test set

Class   Background   Car      Road     Cyclist
PA      97.48% (all classes)
CPA     99.68%       75.97%   92.66%   46.93%
MPA     78.81% (all classes)
IoU     97.06%       74.19%   91.02%   46.58%
MIoU    77.21% (all classes)

Table 4.5: Evaluation metrics
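As a cross-check of the formulas in Section 3.3.2, the metrics in Table 4.5 can be computed directly from a pixel-level confusion matrix; a small sketch (rows as ground truth, columns as predictions) is:

    import numpy as np

    def segmentation_metrics(confusion):
        # confusion[i, j]: number of pixels of ground-truth class i predicted as class j
        confusion = confusion.astype(np.float64)
        tp = np.diag(confusion)
        pa = tp.sum() / confusion.sum()                 # pixel accuracy
        cpa = tp / confusion.sum(axis=0)                # per-class accuracy over predicted pixels
        mpa = cpa.mean()                                # mean pixel accuracy
        iou = tp / (confusion.sum(axis=0) + confusion.sum(axis=1) - tp)
        return pa, cpa, mpa, iou, iou.mean()            # last value is the MIoU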

Figure 4.23: Average inference time


The inference time of the first image includes extra time for loading the model configuration; the average inference time reaches 0.043 s per image. The test script runs for 13.43 s, which means our program achieves about 7 FPS; delays in the program cause the longer overall inference time. The total memory used and power consumption are shown in Figure 4.24. We started the monitoring script for memory and power consumption at the same time as the test script; as mentioned before, the total test-script running time is 13.43 s, which corresponds to the period between roughly 7 s and 21 s in the plots. After ESPnetv2 starts, memory usage and power consumption both peak at around 8 s, which corresponds to loading the network configuration, and then drop to a lower level. This indicates that GPU resources are not fully occupied while the test script runs. Overall, the average power consumption increases from 7.73 W to 37 W (sampled every 100 ms), and the total memory used increases from 597 MB to 1543 MB. Comparing ESPnetv2 and YOLOv4, ESPnetv2 consumes less power and memory.
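The memory and power traces in Figure 4.24 can be collected with a simple monitoring script; the sketch below uses nvidia-smi's query interface to sample roughly every 100 ms (the output path and sample count are illustrative assumptions).

    import csv
    import subprocess
    import time

    def monitor_gpu(out_csv="gpu_log.csv", samples=300):
        # Poll nvidia-smi and log power draw (W) and memory used (MiB)
        cmd = ["nvidia-smi",
               "--query-gpu=timestamp,power.draw,memory.used",
               "--format=csv,noheader,nounits"]
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "power_w", "memory_mib"])
            for _ in range(samples):
                line = subprocess.check_output(cmd, text=True).strip()
                writer.writerow([field.strip() for field in line.split(",")])
                time.sleep(0.1)  # roughly a 100 ms sampling period

    if __name__ == "__main__":
        monitor_gpu()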

(a) Power consumption with ESPnetv2
(b) GPU memory used with ESPnetv2

Figure 4.24: GPU efficiency measurement

4.3 Application: Minimal distance awareness


Figure 4.25: Server agent


Chapter 5

Conclusion

This project trained and implemented two important functions, object detection and semantic segmentation, in the Carla simulator for the environment perception task of autonomous vehicles. Apart from training and evaluating the two neural networks, we also built an application that automatically calculates the minimal distance to the vehicles ahead. The object detection module is in charge of detecting objects such as cars and cyclists in the scene, and the semantic segmentation module helps the autonomous vehicle understand the world. All modules are successfully implemented and work together. YOLOv4 contains plenty of tricks that are useful for training a neural network and achieves high accuracy in our project: the mAP for cars and cyclists reaches about 90%. ESPnetv2 is introduced for its light weight and power efficiency, and it performs well at recognizing drivable space and cars. Our application also runs smoothly with the server agent at 10 FPS. However, 10 FPS is not enough for real-time applications, which is one of the limitations of the Carla simulator. Besides, our project only performs experiments on the environment perception task of autonomous cars; other tasks such as vehicle localization and motion planning are not involved and can be addressed as future work.

5.1 Limitations

This section will discuss the limitations of the Carla simulator, YOLOv4, ESPnetv2, and minimal distance awareness application.

5.1.1 Carla simulator

The version of the Carla simulator used in our project is 0.8.4, which is not the latest version and only contains two default maps. Our experiments are run on the larger map, Town01, so we cannot guarantee that our models perform well in the newest release of the Carla simulator. Besides, the server agent cannot run at a high FPS when cameras are added to the cars inside the simulator; this is a known issue where the FPS drops while the scene is being captured with UE4.

5.1.2 YOLOv4

The dataset was made by myself, so some personal labeling errors may exist, which influences the accuracy of the model. Also, the training and validation sets together contain only 400 images; a larger dataset would stabilize and improve the performance of the model under different weather conditions. Limited by the computing power of our computer, the batch size was set to 1 and the input image size to 416x416, both of which may affect the precision of the model. Although the trained model already achieves high accuracy, it may not represent the best possible performance.

5.1.3 ESPnetv2

ESPnetv2 has issues similar to YOLOv4 regarding personal labeling errors and the size of the training set. Its performance in recognizing cars and cyclists is not ideal, which may be caused by the limited number of training samples, especially cyclist samples. Increasing the size of the training set may improve the performance on small objects.

5.1.4 Minimal distance awareness application

The average inference time of ESPnetv2 is longer than that of YOLOv4. Considering the program execution time, the server agent cannot start at a high FPS; even with the server agent at 10 FPS, we still set a time delay to wait for the prediction results of YOLOv4 and ESPnetv2 so that the program runs smoothly. Besides, the whole application performs a large number of I/O operations, which may stall the program at higher frame rates.

5.2 Ethical Considerations


5.3 Sustainability

In our project, we collected data from the simulator and tested our application in a simulation environment. The whole experiment can be carried out on a single computer, which reduces energy consumption and carbon dioxide emissions.

5.4 Future work

Our project can be improved in different ways. Following the limitations above, we discuss future work for the Carla simulator, YOLOv4, ESPnetv2, and the minimal distance awareness application.

5.4.1 Carla simulator

We can implement our models in the latest Carla simulator (0.9.10), which contains seven maps and therefore more scenarios. The newest release also provides a more configurable weather API, which can be used to test the stability of the models in more situations.

5.4.2 YOLOv4 and ESPnetv2

Regarding the dataset, a set of 500 images is not enough for training the models. More training samples need to be added to address the problem of data imbalance, especially between large and small objects. For the training procedure, high-performance computers could be used to train the models with different batch sizes and image sizes in order to obtain the best model.

5.4.3 Minimal distance awareness application

Bibliography

[1] DonkeyJason, "Introduction of autonomous driving," https://www.jianshu.com/p/260cf80025e5, 2018.

[2] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in 2008 IEEE conference on computer vision and pattern recognition. IEEE, 2008, pp. 1–8.

[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.

[4] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.

[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.

[6] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.

[7] D. Misra, "Mish: A Self Regularized Non-Monotonic Neural Activation Function," https://www.codenong.com/cs105949629/, 2020.

[8] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1411.4038, 2014. [Online]. Available: http://arxiv.org/abs/1411.4038

[9] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.


[10] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, “Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 9190–9200.

[11] OpenCV, "Camera Calibration and 3D Reconstruction," https://docs.opencv.org/master/d9/d0c/group__calib3d.html, 2020.

[12] The Local, "Deadly road accidents in Sweden drop to record low," 2020. [Online]. Available: https://www.thelocal.se/20200108/deadly-road-accidents-in-sweden-drop-to-record-low

[13] European Commission, "2019 road safety statistics," 2020. [Online]. Available: https://ec.europa.eu/commission/presscorner/detail/en/QANDA_20_1004

[14] European Parliament News, "Safer roads: new EU measures to reduce car accidents," 2020. [Online]. Available: https://www.europarl.europa.eu/news/en/headlines/society/20190307STO30715/safer-roads-new-eu-measures-to-reduce-car-accidents

[15] European Parliament News, "Self-driving cars in the EU: from science fiction to reality," 2020. [Online]. Available: https://www.europarl.europa.eu/news/en/headlines/economy/20190110STO23102/self-driving-cars-in-the-eu-from-science-fiction-to-reality

[16] pjreddie, “Yolo: Real-time object detection,” 2015. [Online]. Available: https://pjreddie.com/darknet/yolo/

[17] A. C. Serban, E. Poll, and J. Visser, "A standard driven software architecture for fully autonomous vehicles," in 2018 IEEE International Conference on Software Architecture Companion (ICSA-C). IEEE, 2018, pp. 120–127.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.

[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.


[21] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.

[22] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "Cutmix: Regularization strategy to train strong classifiers with localizable features," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6023–6032.

[23] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,” in Advances in Neural Information Processing Systems, 2018, pp. 10 727–10 737.

[24] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.

[25] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.

[26] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “Cspnet: A new backbone that can enhance learning capability of cnn,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.

[27] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8759–8768.

[28] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.

[29] S. Targ, D. Almeida, and K. Lyman, “Resnet in resnet: Generalizing residual architectures,” arXiv preprint arXiv:1603.08029, 2016.

[30] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” arXiv preprint arXiv:2001.05566, 2020.


[32] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "Carla: An open urban driving simulator," arXiv preprint arXiv:1711.03938, 2017.

[33] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "Airsim: High-fidelity visual and physical simulation for autonomous vehicles," CoRR, vol. abs/1705.05065, 2017. [Online]. Available: http://arxiv.org/abs/1705.05065

[34] M. Virgo, “Udacity’s self-driving car simulator,” 2020. [Online]. Available: https://github.com/udacity/self-driving-car-sim

[35] AnacondaDeveloper, “Your data science toolkit: Anaconda,” 2020. [Online]. Available: https://www.anaconda.com/

[36] Carla Developers, "Carla sensors reference," 2020. [Online]. Available: https://carla.readthedocs.io/en/latest/ref_sensors/#depth-camera

[37] AlexeyAB, “Yolo v4, v3 and v2 for windows and linux,” 2020. [Online]. Available: https://github.com/AlexeyAB/darknet

[38] S. Brekke, F. Vatsendvik, and F. Lindseth, "Multimodal 3d object detection from simulated pretraining," 2018. [Online]. Available: https://github.com/Ozzyz/carla-data-export

[39] Tzutalin, “Labelimg. git code (2015),” 2015. [Online]. Available: https://github.com/tzutalin/labelImg

[40] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, EdgeNets source code, 2019. [Online]. Available: https://github.com/sacmehta/EdgeNets

[41] K. Wada, "labelme: Image Polygonal Annotation with Python," https://github.com/wkentaro/labelme, 2016.

[42] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Homepage," http://host.robots.ox.ac.uk/pascal/VOC/, 2005.

[43] I. Loshchilov and F. Hutter, "Sgdr: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.

TRITA-EECS-EX-2021:40
