
A Combination of Object Recognition and Localisation for an Autonomous Racecar

JONATHAN CRESSELL ISAC TÖRNBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT


Master of Science Thesis TRITA-ITM-EX 2018:193

A Combined Approach for Object Recognition and Localisation for an Autonomous Racecar

Jonathan Cressell Isac Törnberg

Approved: 2018-06-19
Examiner: De-Jiu Chen
Supervisor: Lars Svensson
Commissioner: ÅF AB
Contact person: Tor Ericson

Abstract

With autonomous vehicles being a hot research topic, they have also become of interest in the world of motor sport. To run a vehicle autonomously, it needs to know its current pose and what the environment looks like. This thesis aims to solve this problem using SLAM and object detection with a 2D LiDAR and a camera as sensor input, evaluating the performance in terms of accuracy and latency.

The object detection problem was repurposed as an object recognition problem by utilising the 2D LiDAR for cone candidate extraction; the candidates were projected onto the camera image and verified by a Convolutional Neural Network (CNN). Two different CNN architectures were used, MobileNet and a minimalistic architecture with fewer than seven layers. The best performing CNN, with four convolutional layers and two fully connected layers, reached 87.3% accuracy with a classification time of 4.6ms on the demonstrator constructed.

Three different SLAM algorithms were implemented: Pose Graph Optimisation, Rao-Blackwellized Particle Filter and Extended Kalman Filter (EKF). When tested on the demonstrator, the EKF solution showed the best results with a mere 20mm average error in vehicle position and 39mm average error in cone position. Further, the end-to-end timing of the EKF algorithm was the fastest, at an average of 32ms.

The two best performing algorithms were combined for evaluation, with the output of the CNN as input to the EKF. The performance was measured to an average error of 19mm for the vehicle position and 51mm for the cones. Further, the latency was only increased by the 4.6ms that the CNN required for classification, to a total of 36.54ms.


A Combination of Object Recognition and Localisation in an Autonomous Racecar

Jonathan Cressell Isac Törnberg

Approved: 2018-06-19
Examiner: De-Jiu Chen
Supervisor: Lars Svensson
Commissioner: ÅF AB
Contact person: Tor Ericson

Summary

Autonomous vehicles are a current research topic and have recently also made their way into motor sport. To achieve autonomous driving, a vehicle needs to know its current position and what the surroundings look like. This thesis aims to solve that problem with SLAM and object detection using a 2D LiDAR and a camera as sensors, and to evaluate the solutions in terms of accuracy and latency.

The object detection problem was reformulated as an object recognition problem by using the 2D LiDAR sensor to extract cone candidates, which were then projected onto the camera image and verified by a Convolutional Neural Network (CNN). Two different CNN architectures were used, MobileNet and a minimalistic architecture with fewer than seven layers. The best performing CNN, with four convolutional layers and two fully connected layers, reached a total of 87.3% accuracy with a classification time of 4.6ms on the demonstrator.

Three different SLAM algorithms were implemented: Pose Graph Optimisation, Rao-Blackwellized Particle Filter and Extended Kalman Filter (EKF). When tested on the demonstrator, the EKF gave the best result, with only 20mm average error for the vehicle position and 39mm for the cone positions. The total time for the same algorithm was also the shortest, at an average of 32ms.

The two best performing algorithms were combined for evaluation, with the output of the CNN as input to the EKF. The performance was measured to an average error of 19mm for the vehicle position and 51mm for the cones. Further, the latency only increased by the time it took the CNN to classify the cones, that is 4.6ms, to a total of 36.54ms.


Abbreviations

CNN Convolutional Neural Network
DNN Deep Neural Network
EKF Extended Kalman Filter
GPS Global Positioning System
IC Individual Compatibility
ICNN Individual Compatibility Nearest Neighbour
ICP Iterative Closest Point
IMU Inertial Measurement Unit
JCBB Joint Compatibility Branch and Bound
LiDAR Light Detection and Ranging
mAP Mean Average Precision
MRPT Mobile Robot Programming Toolkit
NN Nearest Neighbour
PDF Probability Density Function
PGO Pose Graph Optimisation
RBPF Rao-Blackwellized Particle Filter
RMSE Root Mean Square Error
ROS Robot Operating System
SEIF Sparse Extended Information Filter
SIR Sampling Importance Re-sampling
SLAM Simultaneous Localisation and Mapping
SOTA State of the Art
SSD Single Shot MultiBox Detector
UKF Unscented Kalman Filter
VESC Vedder Electronic Speed Controller


Contents

1 Introduction
1.1 Background
1.2 Problem Description
1.3 Scope
1.4 Research Question
1.5 Methodology
1.6 Safety and Risks
1.7 Ethics
1.8 Sustainability
1.9 Division of Work
1.10 Report Outline

2 Simultaneous Localisation and Mapping
2.1 Introducing the SLAM problem
2.1.1 The SLAM Problem Formulation
2.2 Solving the SLAM problem
2.2.1 Filtering Techniques
2.2.2 Optimisation Techniques
2.2.3 Peripherals
2.2.4 Data Association
2.3 State of the Art
2.3.1 Pose Graph Optimisation
2.3.2 Rao-Blackwellized Particle Filter
2.3.3 Extended Kalman Filter
2.4 Related Work
2.4.1 A Platform for Indoor Localisation, Mapping, and Data Collection using an Autonomous Vehicle
2.4.2 MIT and Hypha Racecar
2.4.3 A Comparison of Data Association Techniques for Simultaneous Localization and Mapping
2.5 Implementation
2.5.1 Pose Graph Optimisation
2.5.2 Rao-Blackwellized Particle Filter
2.5.3 Extended Kalman Filter

3 Object Recognition
3.1 Computer Vision and Object Recognition Overview
3.1.1 Deep Learning Era
3.1.2 Convolutional Neural Networks
3.2 State of the Art
3.2.1 R-CNN and Fast R-CNN
3.2.2 YOLO: You only look once
3.2.3 SSD: Single Shot MultiBox Detector
3.2.4 MobileNet
3.3 Related Work
3.3.1 Real-time Traffic Cone Detection for Autonomous Vehicle
3.3.2 Cone Detection: Using a Combination of LiDAR and Vision-Based Machine Learning
3.4 Implementation
3.4.1 Cone Candidate Extraction
3.4.2 Transfer Learning
3.4.3 CNN
3.4.4 Training Data

4 Demonstrator
4.1 Demonstrator
4.2 Software Architecture
4.2.1 Robot Operating System

5 Results
5.1 SLAM
5.1.1 Accuracy
5.1.2 Latency
5.2 Object Recognition
5.3 Combined Performance

6 Discussion and Conclusions
6.1 SLAM
6.2 Object Recognition
6.3 Combined Performance

7 Future work
7.1 SLAM
7.2 Object Recognition
7.3 Combined Performance

A Confusion matrix

Bibliography


1 Introduction

This chapter introduces the problem, scope of the project, research questions and surrounding circumstances.

1.1 Background

Autonomous vehicles (AV) are currently a hot topic and can, to no surprise, be found high up in the Gartner hype curve, seen in Figure 1.1, currently on their way to the adoption stage. Companies in infrastructure, safety, automotive and service are all working towards autonomous vehicles [1]. The market for autonomous vehicles is forecast to grow at a compound annual growth rate of 36.9% between 2017 and 2027, eventually ending up as a $126.8 billion market according to [2]. These advancements and expectations have placed a heavy burden on various research fields such as sensors, perception and machine learning. One major area being researched is called Simultaneous Localisation and Mapping (SLAM), which addresses a problem that is considered fundamental in robotics. The SLAM solution is a key component for creating a moving autonomous system in an unknown environment and has been solved in a number of different ways [3].


Figure 1.1: Gartner hype curve 2017 [4].

The topic of autonomous vehicles has spread even to motor sport. The Formula Student racing series is a yearly set of events where student teams from around the globe participate in a competition using student-built race cars. Formula Student Germany introduced a Driverless Vehicle class in 2017, where the participants' vehicle is supposed to drive through a circuit outlined by cones, illustrated in Figure 1.2. In 2018, KTH Formula Student [5] aims to be the first Swedish team to present a functional solution in this competition. In cooperation with ÅF AB, who wish to improve their competence in the AV area, the SLAM and Object Recognition problems are addressed in this master thesis.


Figure 1.2: Formula Student Autonomous Drive [6].

1.2 Problem Description

The Formula Student competition comes with a set of rules and limitations. Firstly, the event will be held outdoors on a tarmac surface and the circuit will be outlined with small traffic cones distributed at a maximum distance of 5m from each other, with yellow cones for the inner border and blue for the outer border of the circuit. Larger orange cones are placed at start, finish and time keeping lines. A description of the track can be seen in Figure 1.3. The borders of the circuit are painted, and there may be further markings that are not part of the circuit. Participants are not provided with any further map data prior to the event and no modification of the circuit is allowed, nor is additional placement of landmarks. Other than the above, there are no limitations regarding solutions of the problem. The goal is to complete ten laps around the circuit in the shortest time possible [7], fully autonomously. This can be done by running an object avoidance algorithm during the first lap while mapping the track, and in the following laps using the obtained map and current position in a trajectory planner to obtain the optimal racing line. Mapping the track and providing accurate real-time positioning using SLAM and object recognition was set as the goal of this master thesis.


Figure 1.3: Formula Student Germany track drive for driverless [7].

1.3 Scope

The objects to be recognised were traffic cones in an unknown indoor environment. Two cones at the start/finish line were larger, and cones on the left and right hand sides of the track had different colours. The model for motion control was provided beforehand and was not a subject of this thesis. A scale demonstrator was constructed to test the system and obtain results. Implementation and analysis were limited to the autonomous racecar scenario.

1.4 Research Question

The SLAM problem has seen a great number of different solutions over the years. With different characteristics to each solution, there is no one-size-fits-all algorithm available. In a strictly defined scenario such as the one described in Section 1.2, it is interesting to know what aspects will affect the choice of SLAM algorithm and how the different algorithms will perform under the given circumstances. These questions are summarised as the first research question below.

• How does the choice of SLAM algorithm impact latency and accuracy for mapping and localisation in an autonomous racecar?

The SLAM solution relies on the sensors' ability to accurately identify relevant landmarks in the vicinity of the robot. While there are several ways of achieving this, the great majority of solutions utilise either camera or laser scans, which in turn are well suited to identify the cones used as landmarks in this scenario. A leading method for object recognition using a camera feed is the Convolutional Neural Network, an extension to the traditional Neural Network, which is well suited for identifying cones in a high speed racing environment.

This leads to the research question below.

• How does the choice of Convolutional Neural Network architecture for cone recognition with a small training set impact accuracy and latency in an autonomous racecar?

1.5 Methodology

This thesis begins with a qualitative study of related literature to obtain the required knowledge on the subject and identify state of the art approaches to solving the problems in question. After deciding on which approaches to pursue, implementation is attempted experimentally, starting from solutions available with open source licenses. These open source solutions are then modified to suit the scope of this thesis. A demonstrator is built in parallel to use as a test platform for the implementations. The results are gathered in a quantitative manner using the demonstrator on a pre-determined test track. The best performing algorithms are then combined to provide an understanding of how the object recognition and SLAM algorithms perform together.

1.6 Safety and Risks

The risks involved with the solutions presented in this work are related to those of the implementation, since the SLAM and Object Recognition algorithms alone do not output anything that can cause harm to the environment. However, these algorithms are commonly combined with techniques that allow the robot or vehicle to move autonomously.

In such cases, it would be necessary to verify the integrity of the SLAM output map upon which, for example, a trajectory planner bases its output. Further, if an exploration algorithm uses the object recognition output to determine the existence of obstacles, this would also require an extra layer of safety. Such improvements could be redundant algorithms or sensors which would prevent the robot or vehicle from inflicting harm on the environment. Further, depending on the implementation, standards such as ISO 26262 [8] would require investigation. This subject has been further investigated in a neighbouring master thesis between ÅF AB and KTH Formula Student [9].

1.7 Ethics

A number of ethical questions arise with the application of SLAM algorithms and object recognition in AVs. Many of these dilemmas are based on variants of the well known trolley problem [10]. If an accident cannot be avoided, how is the vehicle to choose whether it should hit a wall and sacrifice those in the vehicle, or hit a group of people with unknown consequences [11, 12]? According to six surveys performed on residents of the USA, there is a clear ethical dilemma with AVs and dangerous traffic situations [12]. One of the surveys showed that while it is seen as morally correct to program an AV to sacrifice the passengers for the sake of saving a greater number of pedestrians, the respondents themselves would rather not buy such a vehicle.

In the case of accidents there will also be an issue regarding liability. Should the owner, the vehicle producer or even the government who permitted the vehicle's use in traffic be held responsible for the outcome [13]? This also has to be decided before AVs can become reality on common roads.

An object recognition system can become an ethical problem in its application; for example, facial recognition software can act in a racist manner [14]. This can happen even though the intention is not to create a biased or racist system, and may instead be caused by naive assumptions or by a biased data set. A well performing system can still cause ethical issues, since there is a possibility that it might violate people's personal lives by gathering data about them if they are recognised and tracked by a system. The same applies here: even though the intention of the system is not to track a person, the data is still there and can in the worst case be exploited by another system or entity.

1.8 Sustainability

The algorithm itself has very little impact on sustainability without an application; as such, autonomous vehicles are again discussed. AVs can be beneficial in the attempt to reach global goals such as EU 2020 [15] and Transport 2050 [16]. From a general perspective, AVs will have a large impact on social, environmental and economic sustainability [17]. In social aspects, AVs are likely to reduce traffic related deaths and reduce the number of jobs in the transport sector. Environmentally, AVs can improve energy efficiency by, for example, car sharing, where the increased utilisation of each vehicle would reduce the total required number of vehicles in an area [18]. From an economic perspective, automotive manufacturers risk lower sales and their customers may change from end consumers to car sharing companies or even governing bodies that aim to improve infrastructure. Further, on a private economy level, transport related expenses may generally decrease. However, autonomous vehicles do not only introduce issues but will eventually reduce stress, improve productivity as well as reduce crash rates and insurance costs. They also enable shared vehicle possibilities and provide independent mobility for non-drivers such as adolescents and people with disabilities [19].

1.9 Division of Work

The work was divided between Isac and Jonathan, where Isac was responsible for object recognition and Jonathan for the SLAM solution. Remaining work was performed jointly.


1.10 Report Outline

The report is divided into seven chapters. The first chapter introduces the problem as well as the scope and project outline. The second and third chapters focus on the SLAM and object recognition problems respectively, each with a brief background, state of the art solutions and the algorithm implementation. The following chapter explains the hardware demonstrator implementation as well as the software in a system overview. Chapter 5 presents the results from the demonstrator and Chapter 6 discusses the results as well as the thesis as a whole, and draws a number of conclusions with answers to the research questions. Finally, the future work in Chapter 7 describes what could be improved, spin-off ideas as well as suggestions for continued work.


2 Simultaneous Localisation and Mapping

This chapter introduces the subject of Simultaneous Localisation and Mapping from a theoretical perspective, presents state of the art solutions and gives a brief description of the implementations.

2.1 Introducing the SLAM problem

A fundamental problem in the field of autonomous vehicles, and robotics in general, is the subject of localisation: allowing a robot or vehicle to identify its own position and orientation, further called pose, using onboard sensors. A common solution to this problem is to use a Global Positioning System (GPS), but for applications where GPS is sub-optimal, unreliable or unavailable, alternative solutions are required [20, 21]. Odometry, inertial measurements and similar solutions suffer from accumulative errors, and even the smallest drift will cause the pose to deviate greatly over time; thus other methods are necessary [22].

The localisation problem can be solved by using sensors to scan the environment in which the robot or vehicle is operating. By identifying range and bearing towards suitable landmarks in the vicinity, the robot's position can be acquired. However, this requires a map onto which the position can be projected. If this map is determined beforehand and available to the robot, the problem is solved; when this is not the case, a map of the environment has to be created during operation. A chicken-and-egg problem occurs where the robot location is required to create the map and the map is required to identify the location. This has come to be known as the Simultaneous Localization and Mapping (SLAM) problem. A visual description of the problem can be seen in Figure 2.1, where the white poses and yellow landmarks are ground truth, and the grey marks present the SLAM output.

Figure 2.1: The SLAM problem [23].

Solving the SLAM problem comes down to a matter of fusing data from multiple sources. A simple system would include a Light Detection and Ranging (LiDAR) device which outputs laser range scans, together with wheel odometry on a robot. The range finder would output observed features in the environment and the odometry would output the movement of the robot. A combination of these allows for creating a map of where the observations are located along with the trajectory traversed by the robot.


There are several solutions to the SLAM problem with one aspect in common: the solutions use probabilistic estimates. By predicting the robot's next pose using control input and previous pose data, then correcting this by using the predicted pose along with observation data, it is possible to simultaneously create a map of the environment and identify the robot's location.

2.1.1 The SLAM Problem Formulation

As mentioned above, the SLAM problem is solved using probabilistic estimates, with the input commonly being measurements and control signals and the output being the estimated map and robot pose. Here follows a mathematical formulation of the problem. The sets of control inputs $u_{1:t}$ and observations $z_{1:t}$ are considered known, where

$$u_{1:t} = \{u_1, u_2, ..., u_t\}, \qquad z_{1:t} = \{z_1, z_2, ..., z_t\},$$

followed by the set of past poses $x_{0:t}$ and the actual environment map $m$, which are considered unknown, where

$$x_{0:t} = \{x_0, x_1, x_2, ..., x_t\}.$$

The control inputs and observations are used to estimate the map and robot pose by the following probabilistic approach using Bayes' rule:

$$p(x_{0:t}, m \mid z_{1:t}, u_{1:t}).$$

The above posterior is known as the full SLAM problem and has commonly not been possible to solve in real time due to the computational complexity with respect to the number of variables. By disregarding the previous poses of the robot in the problem formulation and approximating only the current pose based on the latest sensor data, Online SLAM is defined in accordance with Equation 2.1:

$$p(x_t, m \mid z_{1:t}, u_{1:t}). \qquad (2.1)$$
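For clarity, the online posterior in Equation 2.1 can be computed recursively with the standard Bayes filter, alternating a prediction step based on the motion model and a correction step based on the measurement model. This textbook recursion, which underlies the filters discussed in the next section, is added here only as a reference and is not quoted from the thesis:

$$\overline{bel}(x_t, m) = \int p(x_t \mid x_{t-1}, u_t)\, bel(x_{t-1}, m)\, \mathrm{d}x_{t-1} \qquad \text{(prediction)}$$

$$bel(x_t, m) = \eta\, p(z_t \mid x_t, m)\, \overline{bel}(x_t, m) \qquad \text{(correction)}$$

where $\eta$ is a normalising constant. The EKF, UKF and particle filter solutions below differ mainly in how they represent the belief $bel$ and approximate these two steps.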


Even simplified, this proves to be a difficult problem to solve given that the environment and robot pose are both unknown, since the estimates for both map and pose correlate. This results in initial error propagation, but as the number of identified landmarks increases and previously identified landmarks are identified again, the error can be greatly reduced. This is in itself a difficult issue, called data association, which will be investigated further at a later stage in this report. Successful data association can result in loop closure when returning to a previously traversed area, adjusting the map to better mirror reality.

2.2 Solving the SLAM problem

A great number of methods have been conceived to solve the SLAM problem since it was formulated. These solutions are generally either filter or optimisation based and thus inherit characteristics from the method used.

2.2.1 Filtering Techniques

Known as Online SLAM techniques, filter based solutions incrementally update only the current pose of the robot together with the map, as discussed in Section 2.1. A number of different filters are commonly used; the most common ones are discussed below.

Extended Kalman Filter

Branched from the Kalman Filter [24], which is limited to linear cases, the Extended Kalman Filter (EKF) was developed to allow computation of non-linear systems by means of linearisation using a first-order Taylor expansion [25]. The EKF has been shown to perform well as long as the linearisation is made around the true value of the state vector [26], which is in practice not entirely possible since this is the value to be estimated. If the actual true value is not included in the estimate, the uncertainty will not converge [27]. However, since the estimates are mostly close enough to the true value, the EKF can be applied, as long as the maps do not grow too large [23]. The EKF calculates the Jacobian of the state vector in each update, resulting in a time complexity quadratic in the state vector size. A solution to this was presented by [28], where submaps of the recently updated area are used instead of the entire map, and correlations between submaps are maintained but not processed in every update.
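As a minimal illustration of the predict/correct cycle described above, the sketch below implements a generic EKF in Python with user-supplied motion and measurement models; the function names and structure are ours and are not taken from the thesis, which used the MRPT implementation described in Section 2.3.3.

import numpy as np

def ekf_predict(x, P, u, f, F_jac, Q):
    # Propagate the state mean through the (non-linear) motion model
    # and the covariance through its first-order Taylor expansion.
    x_pred = f(x, u)
    F = F_jac(x, u)                   # Jacobian of f with respect to the state
    P_pred = F @ P @ F.T + Q          # Q is the process noise covariance
    return x_pred, P_pred

def ekf_update(x, P, z, h, H_jac, R):
    # Correct the predicted state with measurement z.
    H = H_jac(x)                      # Jacobian of h with respect to the state
    y = z - h(x)                      # innovation
    S = H @ P @ H.T + R               # innovation covariance, R is measurement noise
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

In an EKF SLAM setting the state vector stacks the vehicle pose and every landmark position, which is what makes each update quadratic in the number of landmarks.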

Unscented Kalman Filter

In order to compensate for the limitations of the EKF linearisation method, a new version of the Kalman Filter called the Unscented Kalman Filter (UKF) was introduced by [29]. By selecting a number of sigma points in the vicinity of the expected value and passing these through the non-linear function to then calculate an estimate, the linearised estimate is greatly improved for non-linear functions. This, however, comes with greatly increased complexity, making implementation more difficult.

Information Filter

Another filter technique that can be utilised is the Sparse Extended Information Filter (SEIF), which is an information form of the EKF [30] that exploits the computational properties of sparse matrices. The implication of using SEIF is essentially faster computation at the cost of enforcing sparsity in the information filter and therefore disregarding some of the correlations between landmarks [31], reducing the accuracy.

Particle Filtering

Rao-Blackwellized particle filters (RBPF) were utilised in a solution presented by [32] which, unlike the Gaussian distribution that Kalman filters use, applies a probability distribution represented by a finite set of samples or hypotheses, also called particles. A high density of particles in an area corresponds to a high probability and a low density corresponds to a low probability. This can be used to represent any multi-modal distribution, given enough samples, which makes particle filters suitable for a wide range of estimation problems. The capacity to track multi-modal beliefs and include non-linear motion and measurement models makes the performance of particle filters particularly robust [33]. However, this comes at the cost of high computational requirements, making implementation more difficult.


A commonly used factorisation of the joint posterior presented in Section 2.1, known as Rao-Blackwellization, allows computation of the robot pose, or trajectory, given only the observation and odometry measurements. This can in turn be used to compute the map. The posterior is estimated using a particle filter, where each particle represents a potential trajectory of the robot with an individual map associated to each particle [32]. Using the estimated posterior, or pose, the technique "mapping with known poses" can be applied to analytically calculate the map [34].

Since each particle will not maintain a correct trajectory and map, hence the need for multiple particles, adjustments must be made in order to refresh the set of particles and prevent the uncertainty of the algorithm from diverging. For this purpose, a Sampling Importance Re-sampling (SIR) filter is commonly applied [35]. When used together with a Rao-Blackwellized particle filter, the map is incrementally updated when measurements and observations are made available, by updating the set of particles representing the vehicle pose and surrounding map. This has been explained as a four step process consisting of sampling, importance weighting, re-sampling and map estimation. The new set of particles is obtained from the existing set by sampling from a proposal distribution [35, 34]. The new particles are then assigned weights, as the proposal distribution is in general not accurate. The number of particles used in an application is often static; as such, the particles with the lowest importance weights are discarded to leave room for the more accurate particles.
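A minimal sketch of the importance weighting and re-sampling steps is shown below, assuming each particle carries a pose hypothesis and that a measurement likelihood function is available; it is meant to illustrate the SIR idea and is not the Gmapping or MRPT code used later in this thesis.

import numpy as np

def importance_weights(particles, z, measurement_likelihood):
    # Weight each particle by how well its pose (and map) explains observation z.
    w = np.array([measurement_likelihood(z, p) for p in particles])
    return w / np.sum(w)                        # normalised weights

def effective_sample_size(weights):
    # Used by adaptive schemes to decide whether re-sampling is needed at all.
    return 1.0 / np.sum(weights ** 2)

def systematic_resample(particles, weights, rng=None):
    # Draw a new particle set; low-weight particles are likely to be dropped
    # and high-weight particles duplicated.
    rng = rng or np.random.default_rng()
    n = len(particles)
    positions = (rng.random() + np.arange(n)) / n
    indices = np.searchsorted(np.cumsum(weights), positions)
    return [particles[i] for i in indices]

The adaptive re-sampling used by the implementations in Section 2.3.2 only re-samples when the effective sample size drops below a threshold, which helps maintain particle variety.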

2.2.2 Optimisation Techniques

In opposition to the simplified Online SLAM approach, optimisation techniques attempt to solve the Full SLAM problem, where the entire trajectory of the robot is updated rather than only the current pose. The Full SLAM approach can thus be advantageous depending on the purpose of the SLAM application. A common solution using optimisation, or smoothing, is the pose graph optimisation technique.

Pose Graph Optimisation

In Pose Graph Optimisation, the SLAM problem can be seen as a graph with nodes representing the poses of the robot over time as well as landmarks, and edges between nodes representing constraints between poses, as seen in Figure 2.2. The constraints are built up from observations and control signals to the robot.

Figure 2.2: A PGO visualisation. The dotted lines represent constraints, stars are landmarks and circles are poses.

Since the raw measurements from sensors include some kind of noise, the graph has to be optimised against this by minimising the error. This means that the configuration of robot poses is matched to best fit the constraints. Graph based SLAM can therefore be seen as two separate parts: the graph construction, also called the front-end, and the graph optimisation, also called the back-end [36]. The constraints between poses can be updated when revisiting a previously traversed area by loop closing, reducing the uncertainty of the graph. Since all poses are linked together with constraints, this update would be very costly. However, by introducing a hierarchical structure between poses, where closely linked poses are grouped together in several stages, the computational cost of each update can be reduced enough to realise good enough performance for an online application [37]. A visualisation of the constraint reduction can be seen in Figure 2.3.

Figure 2.3: The reduction of constraints in hierarchical PGO [38].
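To make the front-end/back-end split concrete, the toy example below builds a translation-only 2D pose graph with one loop closure and lets a non-linear least-squares solver find the pose configuration that best fits the constraints. The constraint values and the use of scipy are our own illustrative choices; production systems such as Cartographer solve the full SE(2)/SE(3) problem with the Ceres solver.

import numpy as np
from scipy.optimize import least_squares

# Each constraint is (i, j, measured displacement from pose i to pose j),
# here for four poses driven roughly around a loop.
constraints = [
    (0, 1, np.array([1.0, 0.0])),
    (1, 2, np.array([1.0, 0.1])),
    (2, 3, np.array([0.0, 1.0])),
    (3, 0, np.array([-2.0, -1.0])),   # loop closure back towards the start
]

def residuals(flat_poses):
    poses = flat_poses.reshape(-1, 2)
    res = [poses[0]]                           # anchor the first pose at the origin
    for i, j, d in constraints:
        res.append((poses[j] - poses[i]) - d)  # error of each relative constraint
    return np.concatenate(res)

initial_guess = np.zeros(4 * 2)                # poor odometry-style initialisation
result = least_squares(residuals, initial_guess)
print(result.x.reshape(-1, 2))                 # optimised 2D positions of the four poses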

2.2.3 Peripherals

A number of commonly occurring peripherals are used along with most SLAM implementations. For example, there are a number of standardised mapping styles, where grid maps and feature maps are the most common in the two-dimensional case.

Map Styles

SLAM algorithms output maps in different styles depending on what is preferred by the user and what is suitable for the algorithm. Two of the major methods are Occupancy Grid Maps, often simply called Grid Maps, and Feature Maps. The former is a matrix where cells are filled in as they are observed by the robot. A cell which maps to a position in reality that does not contain an object is labelled with a zero, whereas a cell which maps to an object is labelled with a one. Doing this results in a map of zeros and ones, often displayed in white and black, as seen in Figure 2.4.


Figure 2.4: An occupancy grid map.

Grid maps are often implemented with algorithms that rely heavily on LiDAR and scan matching methods, as the laser scans can almost directly be inserted into the map.

Algorithms that rely on features in the environment often apply feature map solutions. These maps are less intuitive, containing the positions and uncertainties of the landmarks or features that have been detected in the environment. An example can be seen in Figure 2.5.

Figure 2.5: A feature map.


Features are saved as a middle point with an uncertainty, visually displayed as an ellipse and often magnified for improved user experience.

Scan Matching

Scan matching techniques follow the development of range finders and are especially effective with 360° solutions, as the larger scanning area gives a higher number of points to use in matching. Iterative Closest Point (ICP) is a commonly used algorithm for scan matching. It aims to find the transformation between a laser scan and a grid map section by minimising the squared errors between the two. It is iterative in the sense that it recalculates the error as it decreases; as such, a decent initial guess is required for it not to get stuck in the first local minimum [39]. In practice, this means that a new scan cannot be too far from a previous scan, unless there is other information that allows the scan matcher to identify the initial guess, such as reliable odometry.
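The sketch below shows a single ICP iteration for 2D point sets: match every scan point to its nearest map point, then solve the rigid alignment in closed form with an SVD. It is an illustrative outline rather than the scan matchers used by the evaluated packages, and in practice the matching and alignment steps are repeated until the error stops decreasing.

import numpy as np
from scipy.spatial import cKDTree

def icp_step(scan, reference):
    # Associate every scan point with its nearest neighbour in the reference map.
    tree = cKDTree(reference)
    _, idx = tree.query(scan)
    matched = reference[idx]

    # Closed-form rigid alignment (Kabsch): rotation and translation that
    # minimise the squared error between the matched point sets.
    scan_c = scan - scan.mean(axis=0)
    ref_c = matched - matched.mean(axis=0)
    U, _, Vt = np.linalg.svd(scan_c.T @ ref_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = matched.mean(axis=0) - R @ scan.mean(axis=0)
    return R, t                        # apply as scan @ R.T + t and iterate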

2.2.4 Data Association

With a purpose similar to that of scan matching, data association is commonly implemented with algorithms using feature maps. Upon returning to an already traversed location and identifying previously mapped landmarks, uncertainties regarding vehicle pose and mapped landmarks can be reduced by associating new observations with already measured landmarks. However, with the above uncertainties and the sensor uncertainties, it is not trivial to identify which mapped landmark corresponds to the measured landmark at a given time, as shown in Figure 2.6. If the uncertainties are too large there is a risk of associating an observation with the wrong landmark, something that may cause the algorithm to diverge.

Figure 2.6: Within the robot pose uncertainty, the measurements can often be associated in more than one way [23].

Individual Compatibility Nearest Neighbour

A common and fairly straight-forward method of solving the data association problem can be found in the Nearest Neighbour (NN) filter. Individual Compatibility Nearest Neighbour (ICNN) judges the compatibility between an observation and a mapped landmark by the minimal Mahalanobis distance [40] and will commonly achieve an accuracy of around 70% with millisecond computation times [41, 42]. The Individual Compatibility (IC) test evaluates the uncertainty distribution of the landmark and measurement against a threshold χ² distribution value, used to determine the plausibility of a data association.

The low computational time requirement of NN comes from the fact that it is linear with the size of the map, since it will at most perform mn tests, where m is the number of map features and n is the number of measurements. Aside from the computational requirements, ICNN is not considered sufficiently robust to be used in most applications, since a faulty association may result in the SLAM algorithm diverging. However, if the cumulative error of measurements and estimates is smaller than the distance between the features, and the number of spurious measurements is small, the ICNN algorithm can be considered adequate to solve the data association problem [43].
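A compact sketch of the individual compatibility test and the nearest-neighbour choice is given below, assuming that for each map landmark the filter provides a predicted measurement and its innovation covariance; the variable names and the 2D measurement dimension are our own illustrative choices.

import numpy as np
from scipy.stats import chi2

GATE = chi2.ppf(0.95, df=2)   # individual compatibility threshold for a 2D range-bearing measurement

def icnn_associate(z, predicted_measurements, innovation_covariances):
    # Return the index of the nearest individually compatible landmark, or None.
    best, best_d2 = None, GATE
    for i, (z_hat, S) in enumerate(zip(predicted_measurements, innovation_covariances)):
        v = z - z_hat                           # innovation
        d2 = v @ np.linalg.inv(S) @ v           # squared Mahalanobis distance
        if d2 < best_d2:                        # passes the chi-square gate and is the nearest so far
            best, best_d2 = i, d2
    return best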

Joint Compatibility Branch and Bound

The problem of spurious measurements being associated with map features can be limited by reconsidering the established associations [43]. The Joint Compatibility Branch and Bound (JCBB) algorithm considers all measurements in one update [42] by using the IC test on a matrix rather than a single value, thus expanding the ICNN concept to cover all possible data associations across the measurement set [21]. This expansion improves accuracy, though at the cost of computation times that are up to two orders of magnitude higher than those of ICNN [43]. Since the computational effort increases with the number of landmarks, it is often considered applicable only for smaller environments with fewer features.

2.3 State of the Art

A study of the State of the Art (SOTA) applications available as open source with the Robot Operating System (ROS) was made to identify relevant software. Here follow those that implement some of the algorithms presented in the theory.

2.3.1 Pose Graph Optimisation

Google Cartographer [44] was developed as a solution for mapping indoor environments in 2D and 3D using SLAM, by mounting the hardware as a backpack to be carried by a person. Cartographer uses pose graph optimisation and a local-to-global map setup to allow mapping of very large areas in real-time. The LiDAR scan data is inserted into a local submap and matched using a scan matching algorithm based on the Ceres solver [45], built for solving complicated optimisation problems. When the robot has moved out of the local submap area, this submap is sent to the global map for fixation and loop closure. This process is then repeated for each submap that is completed. This solution prevents the scan-to-scan accumulative error which would occur when using a single map and scan matching. Cartographer is also able to make use of landmarks to improve loop closure, and uses IMU data to stabilise the LiDAR scan data.

2.3.2 Rao-Blackwellized Particle Filter

As previously mentioned, the Rao-Blackwellized Particle Filter can be tuned to be suitable as a SLAM solution. There are numerous applications available; two of them are presented below.


Mobile Robot Programming Toolkit

The Mobile Robot Programming Toolkit (MRPT) provides several different configurations for both grid and feature maps, and has been developed at the University of Málaga since 2004 [46]. Other than SLAM solutions, MRPT contains a great number of solutions for various issues mobile robots may face. An improved technique for implementing particle filters to achieve real-time performance, which takes into account the accuracy of the robot's sensors and applies an adaptive re-sampling technique that assists in maintaining a reasonable level of particle variety, was suggested by [35]. This improved technique is implemented in the MRPT RBPF solution, which also uses ICP scan matching to create grid maps, similar to Gmapping [47], which is also based on the solutions presented by [35, 34] and discussed in further detail below.

Gmapping

Another RBPF implementation is Gmapping, developed by the authors of [35, 34], which uses LiDAR scan data and odometry to map the environment into a grid map. Gmapping implements the same adaptive re-sampling technique as MRPT's RBPF, as well as an approach for calculating an accurate proposal distribution that improves the prediction step, increasing overall performance [47]. Loop closure is achieved by re-evaluating which particle is the most likely one, by matching scans of the area, the estimated position and the maps of all particles when the robot returns to a previously traversed area.

2.3.3 Extended Kalman Filter

As one of the first and most fundamental solutions to the SLAM problem, Extended Kalman Filters are still viable for feature map solutions when combined with accurate and efficient data association techniques.

Mobile Robot Programming Toolkit

The implementation of the EKF in MRPT is not far from the theoretical background presented in Section 2.2.1 and allows for data association using ICNN and JCBB, based on either Maximum Likelihood [48] or Mahalanobis distance [49]. Important for feature based SLAM techniques is the accurate evaluation of sensor uncertainties, as these are used for data association; thus the ability to adjust this is also included.

Hector SLAM

Developed at the Darmstadt University of Technology for an autonomous search and rescue robot competition series [50], Hector SLAM is a lightweight grid mapping SLAM algorithm using only LiDAR scan data. Since it requires no odometry or other input than the LiDAR scans, it can be easily configured and deployed on lightweight platforms. The algorithm is based on the EKF for state estimation and uses a scan matching method derived from a Gauss-Newton approach originating from computer vision [51]. With low computational requirements and simple configuration, it is a modern and viable method for running primarily indoor SLAM systems.

2.4 Related Work

This section presents work related to the SLAM topic of this thesis, as well as some work regarding the demonstrator.

2.4.1 A Platform for Indoor Localisation, Mapping, and Data Collection using an Autonomous Vehicle

State of the art open source SLAM applications were evaluated by [22] as part of a master thesis project to build an indoor localisation device. Algorithms were evaluated on loop closure, map accuracy, performance and trajectory using the same input data. While Gmapping and Cartographer had similar performance, the reliability of Cartographer made it the recommended algorithm. Little discussion was made regarding the nature of the SLAM algorithms used in the applications, though the performance results are a good pointer towards which applications to investigate.

2.4.2 MIT and Hypha Racecar

Scale autonomous race cars have been implemented by [52] and [53] as part of university learning projects. RC cars are rebuilt to house sensors and embedded processing units, and run SLAM algorithms to navigate through unknown indoor environments. Both projects focus on accurate navigation and high movement speeds, which makes the available documentation valuable to this project, though the mapping process is rarely discussed. Further, the open source code provided has proved useful when creating a demonstrator similar to the RC race cars used in the mentioned projects.

2.4.3 A Comparison of Data Association Techniques for Simultaneous Localization and Mapping

A master thesis was written on the subject of comparing data association methods [21] for SLAM. EKF and FastSLAM (an RBPF implementation) were used for each data association algorithm, where ICNN and JCBB were two of the algorithms evaluated using tests made in simulation and on real world data sets. It was shown that JCBB provides greater accuracy at the cost of longer computation times, and while ICNN results in lower accuracy and shorter computation times, ICNN can perform almost equally to JCBB in areas with sparse feature density, since the risk of making a faulty association is reduced when features are further apart. The number of spurious measurements also affects ICNN to a greater extent than JCBB, where a higher number will result in lower accuracy and the risk of forcing the SLAM algorithm to diverge due to erring associations. This work gave a good idea of which data association methods to investigate, and what to expect of them.

2.5 Implementation

The following SLAM algorithms were selected for comparison due to their inherently different characteristics. The algorithms were implemented in ROS using combined parts of various open source solutions, the main parts of which are presented in Section 2.3. These solutions were stripped and adjusted to suit the application at hand, as described below.

2.5.1 Pose Graph Optimisation

Pose Graph Optimisation was implemented with grid maps by fusing laser scans with ICP based scan matching, IMU data and odometry. It also incorporated landmark support for loop closure to further improve performance. As previously mentioned for grid mapping in general, matching the map resolution to the LiDAR resolution is a sensitive point for optimal performance, together with the ratio of scan matching update frequency to laser scan frequency. Further, the algorithm was tuned by adjusting the parameters regarding local update frequency, the weighting towards adjusting previous map insertions and the score required to update the map. The scan matching was tuned by adjusting the weights for translation and rotation, to determine whether the odometry readings or the scan matching result should be trusted. The parameters for PGO were chosen as presented in Table 2.1. The scores are a measure of how certain the scan matching needs to be for a scan to be inserted into the map. The map resolution was set to 5cm, which mirrors the uncertainty of the LiDAR at ~4m range, which was evaluated to be slightly higher than that specified by the data sheet [54].

Table 2.1: PGO parameters

Parameter          Value  Unit
Map resolution     0.05   m
Min. Score         0.65   -
Update frequency   30     Hz
SM Trans. weight   0.1    -
SM Rot. weight     0.2    -

2.5.2 Rao-Blackwellized Particle Filter

The Particle Filter solution was implemented as a Rao-Blackwellized Particle Filter with grid mapping, using laser scans, ICP scan matching and odometry. As with the PGO, it was extended to make use of landmarks for improved loop closure. Additionally, the adaptive method for adjusting the number of particles discussed in Section 2.2.1 was used to improve performance. The parameters were chosen as presented in Table 2.2. The linear distance and angular thresholds determine the maximum robot movement allowed before the map is updated. The min/max particle numbers are used in the adaptive re-sampling method.


Table 2.2: RBPF parameters

Parameter         Value  Unit
Map resolution    0.05   m
Min. Score        54     -
Linear Dist. th.  0.02   m
Linear Ang. th.   1      deg
Min Particles     30     -
Max Particles     150    -

2.5.3 Extended Kalman Filter

The Extended Kalman Filter algorithm was implemented as a lightweight solution to the Online SLAM problem. The EKF solution was implemented with a feature map, a cone detection algorithm using laser scans, and odometry. EKF solutions rely heavily on the data association method used; as such, a comparison between Individual Compatibility Nearest Neighbour and Joint Compatibility Branch and Bound was made, splitting the EKF results into two. The EKF data association was implemented with the IC χ² and Mahalanobis IC thresholds shown in Table 2.3.

Table 2.3: Extended Kalman Filter parameters

Parameter    Value  Unit
IC χ² th.    0.95   -
Maha IC th.  0.90   -


3 Object Recognition

This chapter discusses the object recognition problem and presents a number of solutions with implementations.

3.1 Computer Vision and Object Recognition Overview

Machine vision is a research topic which has grown more and more in recent years and is now heavily researched. One of the first papers written on machine vision, in 1963, is the PhD thesis Machine perception of three-dimensional solids [55], which describes a computer program that is able to transform a photograph into a line drawing and then finally into a 3D representation. In the 1990s there were a lot of different approaches to object recognition, such as colour histograms [56], eigenfaces [57] and feature-based methods, which were thoroughly researched. The area then continued to expand in the 2000s as parts-and-shape models were introduced, as well as bags of words. It was during this time researchers started utilising machine learning in computer vision, and deep learning models such as Restricted Boltzmann Machines [58] and Deep Belief Networks [59] were created.

3.1.1 Deep Learning Era

The springboard to where deep learning currently is in computer vision came in 2012, when the first Convolutional Neural Network reduced the error rate by almost half in the ILSVRC competition, an object recognition challenge [60], beating all the traditional algorithms [61]. Since then, deep learning algorithms have been used more and more in computer vision, resulting in other well performing image recognition nets such as VGG Net, GoogleNet and Microsoft ResNet, which all used different variants of architectures for improving accuracy and performance [62].

3.1.2 Convolutional Neural Networks

The first implementation of convolutional neural networks (CNNs) in computer vision was in the late 1990s [63], but they gained traction in the computer vision field in 2012 after the ILSVRC competition. What differentiates a CNN from an ordinary Neural Network is the assumption that the input is formatted as an image, i.e. with a height, width and number of channels, allowing certain properties to be exploited within the network. For a larger image, e.g. 200x200x3, each neuron in an ordinary neural net would require 120 000 different weights. With a CNN this is not the case, as only a small region is connected to each neuron [62] when using convolutional layers.

The convolutional layers consist of learnable filters, also called kernels. These filters have a width, height and depth, and slide over the input using the stride as step length. For each step across the input, a convolution is computed between the input and the kernel, resulting in a single number. An example of a convolutional layer in 2D, meaning the kernel only has a width and height, can be seen in Figure 3.1.


Figure 3.1: Convolutional layer.

As Figure 3.1 shows, the result is smaller than the input. To get an output of the same size as the input, zero padding can be used. This means the input matrix is padded with zeroes, thereby making the output larger. An example of this can be seen in Figure 3.2.



Figure 3.2: Example of zero padding.

In CNNs, down sampling layers are usually used. This means the input to the layer is down sampled into a smaller size, reducing the spatial size. This is done in a similar way to the convolutional layers, except the mathematical operation for each stride is different. One way to do this is by using max pooling, where each subregion is processed with a max operation, outputting the maximum number in that subregion [62]. An example of this can be seen in Figure 3.3, where a stride of 2 is used.


Figure 3.3: Example of max pool layer.
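To make the operations in Figures 3.1-3.3 concrete, the short numpy sketch below computes a 2D convolution (as in CNNs, without flipping the kernel) with optional zero padding and a stride, followed by max pooling; the example image and kernel values are our own and are not taken from the figures.

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # 2D convolution of a single-channel image, with zero padding and stride.
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # one number per kernel position
    return out

def max_pool(feature_map, size=2, stride=2):
    # Down sample by taking the maximum of each subregion.
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

image = np.random.randint(0, 2, (4, 4))
kernel = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
feature = conv2d(image, kernel, padding=1)   # zero padding keeps the 4x4 size
pooled = max_pool(feature)                   # max pooling halves it to 2x2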

A CNN can be built in many different ways and there are a lot of different layers which can be used. The layers mostly used are the following [62]:

• Input - Raw input data to the CNN, which in the case of image processing consists of the width, height and number of colour channels.

• Convolutional layers - The convolutional layers use filters that are convolved with the image. These convolutional layers can detect different features such as edges or curves and use four parameters which do not change during training, also known as hyperparameters: the number of filters, the size of the filters, the stride and the amount of zero padding.


• Rectified linear unit layers - The ReLU layers apply an activation function to each element, e.g. setting all negative elements to zero.

• Pooling layers - The pooling layers perform down sampling of the width and height, with several different types of down sampling where max pooling is the most popular.

• Fully connected layers - The fully connected layers are like the normal layers used in neural networks, where each node is connected to all the nodes in the previous layer. These layers are mostly used at the end of the network for classification.

3.2 State of the Art

In this section some of the most recent and popular algorithms in object recognition and object detection are presented.

3.2.1 R-CNN and Fast R-CNN

R-CNN is based on a Region-based Convolutional Network which combines the following insights according to [64]:

• "One can apply high-capacity CNNs to bottom-up region pro- posals in order to localize and segment objects"

• "When labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost"

First, R-CNN uses a process called selective search to find region proposals, which are then run through a CNN for feature extraction. The features are then classified using a support vector machine, SVM. The basic idea of R-CNN can be seen in Figure 3.4.

R-CNN achieves good accuracy but suffers from several drawbacks. One is the multi-stage training process, requiring separate training of the CNN and the SVM. R-CNN is also expensive in both time and space, which results in slow object detection, reaching 0.022 fps [65]. Fast R-CNN improves on R-CNN by implementing single-stage training, achieving a higher mean average precision (mAP) and running the detection faster, at 3.3 fps excluding the time for object proposals. This is done by utilising shared convolutions between proposals, essentially meaning the CNN only has to run once on every image [65].


Figure 3.4: R-CNN overview.

3.2.2 YOLO: You only look once

YOLO differs from R-CNN in that it is not a traditional image classifier repurposed for object detection. YOLO uses a single neural net both for localisation of bounding boxes and for recognition. This means that the whole network can be evaluated and trained for detection performance at once. The network divides the image into a total of S × S grid cells, where each cell is responsible for detection if an object's centre is within that cell. B bounding boxes for each cell are then predicted and scored. In the case of the VOC 2007 data set, where 20 different classes in a scene should be recognised [66], the parameters were set to S = 7 and B = 2, resulting in a total of 98 bounding boxes to be scored. The network reached up to 45 fps with 63.4% mAP on the VOC 2007 data set, and Fast YOLO reached up to 155 fps with 52.7% mAP [67]. Since the first version of YOLO there have been updates with YOLO v2/YOLO9000 [68], which introduced the possibility of trading off accuracy and fps, and later on YOLO v3 [69], which mainly focused on increasing the accuracy of the network.

3.2.3 SSD: Single Shot MultiBox Detector

SSD is a method for detecting objects using a single deep neural net, relying on features from multiple layers for classification. With SSD a high accuracy can be reached even on images with low resolution, and training is simple and end-to-end. The differences between the SSD and YOLO architectures can be seen in Figure 3.5. The SSD algorithm manages fast, high accuracy detection, 59 fps with 74.3% mAP on the VOC 2007 [70] test data, where Faster R-CNN performs at 7 fps. As SSD performs very well on large objects but suffers on the detection of small objects, there is still room for improvement [71].


Figure 3.5: SSD and YOLO architecture [71].

3.2.4 MobileNet

MobileNet is a set of object recognition architectures created to be able to run computer vision applications on embedded and mobile devices. The MobileNet architecture is built first and foremost to optimise against latency, but also size, to be able to fit within the limited resources available. This is done by creating lightweight deep neural networks, DNNs, utilising depthwise separable convolutions, which essentially means the computations, as well as the model size, can be drastically reduced. This is done by separating each convolution into two steps: a convolution for each input channel, and then a combination of these with pointwise convolutions, see Figure 3.6. MobileNet also introduces a parameter called the width multiplier, α, to easily change the input channel size uniformly and reduce or increase the resources needed [72]. The total number of layers is 28 when counting pointwise and depthwise layers separately.


Figure 3.6: Left: Normal convolution with batchnorm and ReLU. Right: Depthwise separable convolution with batchnorm and ReLU [72].
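The difference between the two blocks in Figure 3.6 can be sketched in tf.keras as below; the filter count and kernel size are illustrative placeholders, and this is not the MobileNet implementation used in the thesis.

import tensorflow as tf
from tensorflow.keras import layers

def standard_block(x, filters):
    # Left side of Figure 3.6: full 3x3 convolution, batch norm, ReLU.
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def depthwise_separable_block(x, filters):
    # Right side of Figure 3.6: depthwise 3x3 convolution per input channel,
    # then a 1x1 pointwise convolution that combines the channels.
    x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)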

3.3 Related Work

There is previously conducted work related to the detection of cones using computer vision. Here follow two projects with different approaches to the problem of identifying cones.

3.3.1 Real-time Traffic Cone Detection for Autonomous Vehicle

In [73], radar data was used to focus the camera and find interesting image patches that correspond to potential cones. First, chamfer matching was used for shape detection on the image patches. The chamfer matching used a distance image which was created from a binary image of the shape of a cone. This distance image was then matched to the template of the cone using rotation, translation and re-sizing. Canny edge detection was used for detecting edges; even though more accurate algorithms exist, these were not used because of the real-time performance requirement. To choose the threshold for classification, a data set of 400 positive and 600 negative samples was used to create a probability density function, PDF, to get an accurate threshold. To deal with false positives when only using chamfer matching, a simple colour segmentation was then used to check if the pixels correspond to those created with the PDF [73].


3.3.2 Cone Detection: Using a Combination of LiDAR and Vision-Based Machine Learning

A combination of a 3D LiDAR and a vision system based on a small CNN with an input size of 128 × 128 was used to detect cones in [74]. The LiDAR data was filtered by looking at the intensities, where cones show a high intensity due to their highly reflective surface. The scan data was also filtered so that only cones in front of the car were considered. The potential cone candidates from the LiDAR data were then transformed into image patches, which were classified by a CNN trained on over 24000 cone images. The CNN reached a total of 95.4% accuracy [74].

3.4 Implementation

The cone detection was implemented using an image classifier on image patches instead of object localisation on a full frame, similar to the methods used in [74] and [73]. This was done to reduce the computation time and complexity of the cone detection, so that it could run in real-time on an embedded system. The flow chart of the full cone classification algorithm can be seen in Figure 3.7 and is explained below.

3.4.1 Cone Candidate Extraction

The image patches used for cone classification were acquired by utilising the laser scan data, which was processed to identify possible cone objects. The process consisted of transforming the laser scan into a point cloud in 2D euclidean space. This made it possible to use euclidean clustering to extract objects with a size similar to a cone. The extracted clusters were then projected into the rectified camera frame by utilising a pinhole camera model and the camera matrix [75].
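As an illustration of the projection step, the sketch below maps a cluster centroid to pixel coordinates with a pinhole model; the camera matrix and the LiDAR-to-camera transform are assumed example values, not the calibration used on the demonstrator.

# Project a clustered LiDAR point into the rectified camera image (sketch).
import numpy as np

K = np.array([[700.0,   0.0, 640.0],     # assumed camera matrix of the
              [  0.0, 700.0, 360.0],     # rectified image (fx, fy, cx, cy)
              [  0.0,   0.0,   1.0]])

R = np.eye(3)                            # assumed LiDAR-to-camera rotation
t = np.array([0.0, 0.08, -0.05])         # assumed LiDAR-to-camera translation [m]

def project_cluster_centroid(p_lidar):
    """Project a 3D point expressed in the camera's optical axes convention
    (x right, y down, z forward) to pixel coordinates (u, v)."""
    p_cam = R @ p_lidar + t              # static transform between the sensors
    u, v, w = K @ p_cam                  # pinhole projection
    return u / w, v / w

# A cone cluster centred 2 m straight ahead of the sensors:
print(project_cluster_centroid(np.array([0.0, 0.0, 2.0])))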

Figure 3.7: Flow chart of the cone classification algorithm.

3.4.2 Transfer Learning

The first cone detection algorithm was implemented using transfer learning on the existing MobileNet image classifiers [72] mentioned in Section 3.2.4. By utilising transfer learning, a model that has been trained on e.g. ImageNet, which consists of 1.2 million images over 1000 categories, can be re-used to recognise new categories in an efficient way. This means the full architecture does not need to learn to find features in images; instead, only the final layers are re-trained to classify the new classes of choice [62]. The transfer learning was implemented using the TensorFlow machine learning framework [76] on four different MobileNets. The input image was resized to a square image of either 128×128 pixels or 224×224 pixels depending on which net was used. For the training process, cross-entropy loss was used to evaluate the error of the network at each training iteration. The MobileNet classifier was chosen for its relatively high accuracy and a speed suitable for embedded systems.
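As a rough illustration of this setup, the following Keras-style sketch freezes a pre-trained MobileNet and trains only a new final layer for the four cone classes. The thesis implementation used the TensorFlow framework directly, so the layer choices, the width multiplier and the hyperparameters here are illustrative assumptions.

# Transfer-learning sketch: frozen MobileNet feature extractor + new classifier head.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.MobileNet(
    input_shape=(128, 128, 3), alpha=0.5,        # example MobileNet variant
    include_top=False, weights="imagenet")
base.trainable = False                           # keep the pre-trained features

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),       # blue, orange, yellow and no cone
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # cross-entropy loss as in the text
              metrics=["accuracy"])
# model.fit(train_patches, validation_data=val_patches, epochs=...)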

3.4.3 CNN

The second cone detection algorithm was implemented as a small CNN.

As cones have few features compared with more feature-complex objects such as faces or animals, and a lightweight network was required, a minimalistic network architecture was chosen. Each convolutional layer was followed by a max pool and a ReLU activation function, and the classification was done by a fully connected layer. The input image was resized to a square image as mentioned in Section 3.4.2, but only to the size of 128 × 128 pixels, and cross-entropy loss was again used for training. The small CNN architecture was chosen for its fast performance and the hypothesis that few layers are enough for classification of cones.

A small hyperparameter search was conducted by varying the number of convolutional layers, fully connected layers and learning rate.
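The sketch below illustrates both the architecture described above and a small grid search of this kind; the filter counts, dense layer width and the candidate values in the grid are assumptions for illustration, not the values evaluated in the thesis.

# Minimalistic cone CNN and hyperparameter grid (Keras sketch, assumed values).
import itertools
import tensorflow as tf
from tensorflow.keras import layers

def build_cone_cnn(num_conv, num_fc, learning_rate, num_classes=4):
    model = tf.keras.Sequential([layers.InputLayer(input_shape=(128, 128, 3))])
    filters = 16
    for _ in range(num_conv):
        # Conv -> max pool -> ReLU, as described in Section 3.4.3
        model.add(layers.Conv2D(filters, 3, padding="same"))
        model.add(layers.MaxPooling2D())
        model.add(layers.ReLU())
        filters *= 2
    model.add(layers.Flatten())
    for _ in range(num_fc - 1):
        model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="categorical_crossentropy",   # cross-entropy loss
                  metrics=["accuracy"])
    return model

# Small grid over the three hyperparameters that were varied:
for n_conv, n_fc, lr in itertools.product([2, 3, 4], [1, 2], [1e-3, 1e-4]):
    model = build_cone_cnn(n_conv, n_fc, lr)
    # model.fit(train_patches, validation_data=val_patches, epochs=...)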

3.4.4 Training Data

The training data was gathered using the demonstrator, described in Section 4.1, by saving the cone candidate image patches from the LiDAR and camera. This was done both in an indoor office environment with artificial lighting and in an outdoor environment with natural lighting.

The image patches gathered from the demonstrator were also complemented by images of cones taken with a Samsung Galaxy S7 smartphone camera, covering different views, scales, lighting and scenery, both with and without occlusions, in an outdoor environment. All the images were then labelled into the following classes: blue cone, orange cone, yellow cone and no cone. Examples of training data can be seen in Figures 3.8, 3.9 and 3.10.

Figure 3.8: Indoor image patch from LiDAR candidate

Figure 3.9: Outdoor on grass taken with smartphone camera

Figure 3.10: Outdoor on asphalt taken with smartphone camera

For the training data, 120 images were gathered for each class, and each data set was then augmented using Augmentor [77] to increase the amount of data. The images were augmented by flipping them vertically, varying the contrast, varying the brightness and introducing distortions, increasing the data size to 480 images for each class.
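The augmentation step could look roughly as follows with the Augmentor library [77]; the directory layout, probabilities and factor ranges are assumptions for illustration.

# Augmentation sketch with Augmentor; one pipeline per class directory (assumed path).
import Augmentor

pipeline = Augmentor.Pipeline("training_data/blue_cone")
pipeline.flip_top_bottom(probability=0.5)                   # vertical flip
pipeline.random_contrast(probability=0.5, min_factor=0.7, max_factor=1.3)
pipeline.random_brightness(probability=0.5, min_factor=0.7, max_factor=1.3)
pipeline.random_distortion(probability=0.5, grid_width=4,
                           grid_height=4, magnitude=4)
pipeline.sample(360)   # 120 originals + 360 generated ≈ 480 images per class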


Demonstrator

This chapter presents the hardware and software solutions used to construct a demonstrator for testing the previously presented implementations.

4.1 Demonstrator

The hardware demonstrator was based on the components used in other small-scale autonomous vehicles such as the MIT Racecar [52] and F1/10 [78], as well as JetsonHacks, who have written tutorials for the Jetson boards and the MIT Racecar [79]. The main components of the demonstrator can be seen in Table 4.1 and only differ slightly from those of the aforementioned platforms.

Table 4.1: Main parts for the demonstrator

Type              Product
RC Car            Himoto Tanto BL 1:10
Embedded system   Nvidia Jetson TX2 Dev kit
Stereo camera     Stereolabs ZED
LiDAR             RPLidar A2M8
IMU               Sparkfun MPU-9250 9-DoF
Motor driver      vESC 4.12

The demonstrator was, in short, an RC car with an overhead construction housing the components. Platforms were laser cut from 4mm acrylic sheets. The Jetson TX2 development board handled communication with all devices and ran all software. The hardware communication can be seen in the flow chart in Figure 4.1.

The positioning of the sensors was made to resemble the ones used on the MIT Racecar and F1/10. This created a compact car with the sensors aligned in the vertical plane, making the transformations between sensors simpler, and placed the centre of gravity as low as possible to allow for higher-speed manoeuvres.

Figure 4.1: Hardware flow chart.

The RC car was equipped with a brushless DC motor which allowed sensorless odometry by measuring the back EMF [80], achieved using the Vedder Electronic Speed Controller (vESC) [81], which also controlled both the motor and the steering servo. Movement was also measured using an Inertial Measurement Unit (IMU), the nine-degree-of-freedom MPU-9250 [82], of which the accelerometer and gyroscope were used. Visual data was obtained using a Stereolabs ZED stereo camera [83], and the RPLidar A2 from Slamtec [84] provided 360° 2D laser scan data. All sensors were sampled at 15Hz to match the upper limit of the LiDAR.
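As a rough sketch of how such sensorless odometry can be derived, the snippet below converts an electrical RPM (ERPM) reading, as reported by the vESC, to vehicle speed and dead-reckons the pose with a kinematic bicycle model; the pole count, gear ratio, wheel diameter and wheelbase are assumed example values, not the demonstrator's parameters.

# Back-EMF based odometry sketch (assumed vehicle constants).
import math

MOTOR_POLE_PAIRS = 2          # assumed; ERPM = mechanical RPM * pole pairs
GEAR_RATIO = 8.0              # assumed motor-to-wheel reduction
WHEEL_DIAMETER = 0.1          # assumed wheel diameter [m]
WHEELBASE = 0.33              # assumed distance between axles [m]

def erpm_to_speed(erpm):
    """Convert an ERPM reading to longitudinal vehicle speed [m/s]."""
    wheel_rpm = erpm / MOTOR_POLE_PAIRS / GEAR_RATIO
    return wheel_rpm / 60.0 * math.pi * WHEEL_DIAMETER

def integrate_pose(pose, erpm, steering_angle, dt):
    """Dead-reckon the planar pose (x, y, yaw) with a kinematic bicycle model."""
    x, y, yaw = pose
    v = erpm_to_speed(erpm)
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / WHEELBASE * math.tan(steering_angle) * dt
    return (x, y, yaw)

# Example: 5000 ERPM with 0.1 rad steering over one 15 Hz sample period.
print(integrate_pose((0.0, 0.0, 0.0), 5000, 0.1, 1.0 / 15.0))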

The final demonstrator can be seen in Figure 4.2.


Figure 4.2: The Demonstrator.

4.2 Software Architecture

The software architecture was constructed as shown in Figure 4.3. The dotted lines represent data streams which were used in different SLAM algorithms.
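Assuming a ROS-based stack, as on the MIT Racecar and F1/10 platforms the demonstrator builds on, pairing the 15Hz laser scan and camera streams before cone classification could be sketched as below; the node and topic names are assumptions, not the demonstrator's actual interfaces.

#!/usr/bin/env python
# Sensor synchronisation sketch (assumed ROS setup and topic names).
import rospy
import message_filters
from sensor_msgs.msg import LaserScan, Image

def synced_callback(scan, image):
    # Scan/image pairs arrive together here at roughly the 15 Hz sensor rate,
    # ready for cone candidate extraction and classification.
    rospy.loginfo("scan stamp %s, image stamp %s",
                  scan.header.stamp, image.header.stamp)

if __name__ == "__main__":
    rospy.init_node("sensor_sync_example")
    scan_sub = message_filters.Subscriber("/scan", LaserScan)
    image_sub = message_filters.Subscriber("/zed/rgb/image_rect_color", Image)
    sync = message_filters.ApproximateTimeSynchronizer(
        [scan_sub, image_sub], queue_size=10, slop=1.0 / 15.0)
    sync.registerCallback(synced_callback)
    rospy.spin()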
