
Institutionen för systemteknik

Department of Electrical Engineering

Master's thesis (Examensarbete)

Large Scale SLAM in an Urban Environment

Master's thesis performed in Automatic Control at Linköping Institute of Technology

by

Jonas Callmer & Karl Granström

LITH-ISY-EX--08/4115--SE Linköping 2008

Department of Electrical Engineering
Linköpings tekniska högskola
Linköpings universitet

(2)
(3)

Large Scale SLAM in an Urban Environment

Master's thesis performed in Automatic Control
at Linköping Institute of Technology

by

Jonas Callmer & Karl Granström

LITH-ISY-EX--08/4115--SE

Supervisors: Juan Nieto, ACFR, University of Sydney

Fabio Ramos, ACFR, University of Sydney

David Törnqvist, ISY, Linköpings universitet

Examiner: Thomas Schön, ISY, Linköpings universitet

(4)
(5)

Division, Department: Division of Automatic Control, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2008-05-16

Language: English

Report category: Master's thesis (Examensarbete)

URL for electronic version: http://www.control.isy.liu.se, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-11833

ISRN: LITH-ISY-EX--08/4115--SE

Title: Storskaligt SLAM i Stadsmiljö (Large Scale SLAM in an Urban Environment)

Authors: Jonas Callmer & Karl Granström


Keywords: SLAM, Exactly Sparse Delayed State Filters, Tree of Words, CRF-match, ICP, binary classifier, Adaboost


Abstract

In robotics, the Simultaneous Localisation And Mapping (SLAM) problem consists of letting a robot map a previously unknown environment, while simultaneously localising the robot in the same map. In this thesis, an attempt to solve the SLAM problem in constant time in a complex environment, such as a suburban area, is made. Such a solution must handle increasing amounts of data without significant increase in computation time.

A delayed state information filter is used to estimate the robot’s trajectory, and camera and laser range sensors are used to acquire spatial information about the environment along the trajectory. Two approaches to loop closure detection are proposed. The first is image based using Tree of Words for image comparison. The second is laser based using a trained classifier for laser scan comparison. The relative pose, the difference in position and heading, of two poses matched in loop closure is calculated with laser scan alignment using a combination of Conditional Random Field-Match and Iterative Closest Point.

Experiments show that both image and laser based loop closure detection works well in a suburban area, and results in good estimation of the map as well as the robot’s trajectory.

Sammanfattning

Simultaneous Localisation And Mapping (SLAM)-problemet är ett robotikproblem som består av att låta en robot kartlägga ett tidigare okänt område, och samtidigt lokalisera sig i den skapade kartan. Det här exjobbet presenterar ett försök till en lösning på SLAM-problemet som fungerar i konstant tid i en urban miljö. En sådan lösning måste hantera en datamängd som ständigt ökar, utan att beräkningskomplexiteten ökar signifikant.

Ett informationsfilter på fördröjd tillståndsform används för estimering av robotens trajektoria, och kamera och laseravståndssensorer används för att samla spatial information om omgivningarna längs färdvägen. Två olika metoder för att detektera loopslutningar föreslås. Den första är bildbaserad och använder Tree of Words för jämförelse av bilder. Den andra metoden är laserbaserad och använder en tränad klassificerare för att jämföra laserscans. När två posar, position och riktning, kopplats samman i en loopslutning beräknas den relativa posen med laserscansinriktning med hjälp av en kombination av Conditional Random Field-Match och Iterative Closest Point.

Experiment visar att både bild- och laserscansbaserad loopslutningsdetektion fungerar bra i stadsmiljö, och resulterar i god estimering av kartan såväl som robotens trajektoria.


Acknowledgments

In doing this work, we have received invaluable help from a number of people. First we would like to thank Prof. Hugh Durrant-Whyte for inviting us to work at ACFR. It has been a truly inspiring, challenging and fun time for us. Thank you for reminding us that above all, the most important thing is to learn.

Our supervisors at ACFR, Dr. Juan Nieto and Dr. Fabio Ramos, have been a great support. Thank you for helping us find an interesting project to work on, for getting us out of all the problems we got ourselves into, and for showing us the good life that is academic research. Without your great ideas, delicious barbecues and exquisite wines it would have taken twice the time.

Thanks also go out to Dr. Ian Mahon at ACFR, for assisting us with information filters, image processing and Linux problems. We were lucky to find you had plenty of time to spend with a couple of Swedes lost in the world of SLAM.

Dr. Thomas Schön and Ph.D. student David Törnqvist, our examiner and supervisor at Linköpings universitet, have both been helpful, especially with the writing of this report. You managed to provide us with great input, despite the vast geographical distance and our rather scarce attempts to actually communicate what we were doing down there.

Thanks to all the friends (and enemies) we made during our years as master's students. Those long days deep within the realms of the occult, commonly known as Campus Valla, would not have been nearly as fun without you.

And finally, to our families. Thank you for supporting us through good times and bad, and for believing in us despite all the shenanigans we got up to as students.

Karl and Jonas

Linköping, May 2008


Contents

1 Introduction
   1.1 Problem Formulation
   1.2 Approach
   1.3 Related Work
   1.4 ACFR
   1.5 Chapter Outline

2 Dynamic Motion Model
   2.1 State Vector
      2.1.1 State Vector in Information Form
      2.1.2 Relationship Between Covariance and Information
   2.2 Vehicle Motion Model
      2.2.1 Steering Model
      2.2.2 Turn Rate Model
      2.2.3 Laser Model
      2.2.4 Process Noise

3 Laser Signal Processing
   3.1 The Laser Sensor
      3.1.1 Laser Scans
   3.2 Laser Matching
      3.2.1 Features
      3.2.2 Classifiers
      3.2.3 Boosting
      3.2.4 Training with Tenfold Cross Validation
   3.3 Laser Scan Alignment
      3.3.1 Iterative Closest Point
      3.3.2 Conditional Random Field Matching
      3.3.3 Covariance of Laser Scan Alignment

4 Image Processing
   4.1 Feature Extraction
      4.1.1 Region Detectors
      4.1.2 Region Descriptors
      4.1.3 Implemented Extractors
   4.2 Tree of Words
      4.2.1 Background
      4.2.2 Building a Vocabulary Tree
      4.2.3 Descriptor Classification
      4.2.4 Add Image to Base
      4.2.5 Compare Image to Base
      4.2.6 Spatial Consistency

5 SLAM
   5.1 Exactly Sparse Delayed State Filter
      5.1.1 Prediction with Augmentation
      5.1.2 Prediction without Augmentation
      5.1.3 Observation
      5.1.4 Measurement Update
      5.1.5 Recovering the State Vector
   5.2 Candidates for Loop Closure
      5.2.1 Initial Set of Loop Closure Candidates
      5.2.2 Final Set of Loop Closure Candidates
      5.2.3 Recovering the Covariance Matrix
   5.3 Laser Scan Alignment
      5.3.1 Laser Scan Alignment Tests
   5.4 Innovation Gating
   5.5 SLAM algorithm
   5.6 Classification Definitions
   5.7 ESDF simulations

6 Experimental Results: Image Based SLAM
   6.1 Loop Closure Detection
      6.1.1 Tree of Words in SLAM
      6.1.2 Loop Closure Experiment Settings
      6.1.3 Loop Closure Performance
   6.2 Ground Vehicle Experiments
      6.2.1 Data Set 1
      6.2.2 Data Set 2
   6.3 Underwater Experiments
   6.4 Discussion
   6.5 Future Work

7 Experimental Results: Laser Based SLAM
   7.1 SLAM algorithm
   7.2 Receiver Operating Characteristic
   7.3 Test of Laser Scan Matching
   7.4 Test of Rotation Invariance
   7.5 Ground Vehicle Experiments
      7.5.1 Data set 1
      7.5.2 Data set 2
      7.5.3 Data set 3
   7.6 Discussion
   7.7 Future Work

8 Conclusions

A K Means Clustering
B The Compounding Operations
   B.1 Compounding
   B.2 Inverse Relationship
   B.3 Composite Relationships
   B.4 Robot Examples
C Abbreviations

Bibliography


Chapter 1

Introduction

The simultaneous localisation and mapping (SLAM) problem is one of the most central in robotics research. It asks if a mobile robot, put in an unknown location in an unknown environment, can incrementally build a consistent map of the environment and simultaneously determine its location within this map. A general solution would enable truly autonomous robots.

There are several areas in which autonomous vehicles of different sorts would be of great aid. Some examples are working in dangerous environments such as mines, herding animals on remote farms with planes or helicopters, inspecting and monitoring coral reefs with submarines, and space excursions to other planets.

1.1 Problem Formulation

The mathematics behind the SLAM problem is today more or less fully explained [9]. Further, solutions to the SLAM problem in simplified laboratory settings have been suggested and proved functional. Among those are the Extended Kalman Filter solution, called EKF-SLAM, and the particle filter solution, called FastSLAM [29]. The properties of these two, and other suggested solutions, have been thoroughly examined.

What remains is a solution to a larger scale problem, set outdoors in a more complex environment. In this thesis, the SLAM problem was imagined to cover an area as large as 100 km² of an urban environment. All elements of the solution must function as well and as fast at this scale as they would in a laboratory setting.

The vehicle used in the experiments is a utility vehicle (UTE), Figure 1.1, equipped with a forward facing camera, one forward facing and one backwards facing laser sensor, wheel encoders, a steering wheel encoder and an inertial measurement unit (IMU).


Figure 1.1. The vehicle used for ground experiments.

1.2 Approach

A simple upscaling of a laboratory solution cannot be used since the dimensionality of the filter would soon make each iteration painstakingly slow.

In the ordinary solution, each new landmark is augmented into the state vector causing it to grow throughout the simulation. A state vector in delayed state format saves all previous positions and an Exactly Sparse Delayed State Filter (ESDF) makes the increasing dimensionality manageable due to its exploitation of the sparsity of the information matrix.

For loop closure, i.e. detecting that the robot is revisiting a certain place, two solutions are proposed.

The first approach is image and laser based:

• Images are used to detect if a previous position is revisited.

• Tree of Words [34] performs the image comparison to detect loop closures.

• Laser scans from the two positions are aligned to compute the relative position of the vehicle.

The second approach is solely based on laser scans:

• Laser scans are used to detect if a previous position is revisited.

• Adaboost is used to train a classifier to detect loop closures from laser scan comparisons.

• If a loop closure is detected, the corresponding laser scans are used to compute the relative pose of the vehicle.

1.3 Related Work

The SLAM problem has been worked on since the 1980s, and a vast amount of research effort has been put into it. The following is a selection of previous research relevant to this thesis.

Eustice et al have examined delayed state filters for view based SLAM [38]. They show that the information matrix is exactly sparse if the state vector is in delayed state format. This results in constant execution time for time and measurement updates, i.e. an upscaling of the experiment does not result in a slower filter.

Newman et al have presented several papers regarding vision based SLAM, both with [31, 15, 32] and without [8] the incorporation of laser range measurements. The first paper uses sequences of images to detect loop closures and laser to compute relative pose, the second one uses a combination of laser and vision for loop closure detection, while the third one uses images to detect loop closures and laser scans for the relative pose. The vision based paper uses a Bayesian approach to match clusters of image features for safer loop closure detection.

Nieto et al have researched laser based SLAM using Iterative Closest Point (ICP) for scan alignment [33]; Ramos et al suggest doing the scan alignment with Conditional Random Fields (CRF) [35]. In ICP a mean square cost function is minimized to align the scans, while CRF estimates the conditional probability of the joint association of the scans.

Different approaches to solving the SLAM problem using laser have been taken in the past. Nieto et al [33] proposed a solution using scan matching and EKF-SLAM. More recent work includes the histogram approach taken by Bosse et al [6]. In this paper the authors work within the Atlas framework [5], in which a graph of coordinate frames is maintained. Each vertex contains a local map, and the edges represent the transformations between adjacent maps. During operation, the robot only operates on one local map at a time. According to the authors, their results provided the largest laser map ever created as of the date of publication.

Tree of Words was first proposed by Nistér and Stewénius [34] as a hierarchical approach to Bag of Words, introduced by Sivic and Zisserman [36], which suggests how images can be represented with words for fast database queries. Bag of Words has previously been used in SLAM implementations by Newman et al [31, 15]. In these experiments the loop closure detection is performed without exploiting the estimated vehicle location. In the first paper the images acquired during the experiment are also used to create the vocabulary used to detect loop closures, and thus a real time implementation is not possible. An approach with an independent, predefined vocabulary enables real time operation, and using the estimated location when finding loop closure candidates restricts the number of image comparisons.


1.4 ACFR

This research project was undertaken at the Australian Centre for Field Robotics (ACFR). ACFR is based in the School of Aerospace, Mechanical and Mechatronic Engineering at The University of Sydney (USYD), Sydney, Australia. It is dedicated to research, development, application and dissemination of field robotics principles. It forms part of groups like the ARC (Australian Research Council) Centre of Excellence for Autonomous Systems, as well as CRC (Cooperative Research Centres) Mining Australia, a major cooperative research and development program for the mining industry. Major research areas include autonomous vehicles on land, in air and underwater, used for agricultural, defence, mining and transport purposes among others.

ACFR is the largest robotics and automation research group in Australia and is also one of the largest of its kind in the world. It has an academic staff of about 25 people, approximately 25 PhD students and a technical staff of 20 people working primarily in its field laboratories.

1.5 Chapter Outline

Chapter 2 explains the state vector and vehicle motion models. Laser sensor signal processing is presented in Chapter 3, followed by image signal processing in Chapter 4. Chapter 5 gives a further description of the SLAM problem, along with an overview of the Exactly Sparse Delayed State Filter and simulations using this filter. Results from ground experiments using Tree of Words and laser scan alignment are provided in Chapter 6. Chapter 7 contains results from ground experiments using laser scan matching to detect loop closures. Finally, Chapter 8 gives the overall conclusions.


Chapter 2

Dynamic Motion Model

This chapter contains a description of the state vector and the vehicle models. A delayed state vector has been used, containing a history of past vehicle poses (positions and headings), and the conversion from a state vector in covariance form to information form is shown.

Three different vehicle models have been used; their respective mathematical equations and Jacobians are presented. Two of the vehicle models are troubled by singularities; the reason for this problem and a suggested solution are briefly discussed.

2.1 State Vector

The vehicle pose vector is

\[
\mathbf{x}(t_k) = \begin{bmatrix} x(t_k) \\ y(t_k) \\ \phi(t_k) \end{bmatrix} \tag{2.1}
\]

where $x(t_k)$ and $y(t_k)$ are the vehicle's position in a two dimensional plane, and $\phi(t_k)$ is its heading in the same plane, see Figure 2.1. The vehicle pose uncertainty is represented by the covariance matrix

\[
\mathbf{P}(t_k) = \begin{bmatrix} \sigma_{xx}(t_k) & \sigma_{xy}(t_k) & \sigma_{x\phi}(t_k) \\ \sigma_{yx}(t_k) & \sigma_{yy}(t_k) & \sigma_{y\phi}(t_k) \\ \sigma_{\phi x}(t_k) & \sigma_{\phi y}(t_k) & \sigma_{\phi\phi}(t_k) \end{bmatrix}. \tag{2.2}
\]

A delayed state vector is used, which contains the current estimated vehicle pose and a history of past estimated vehicle poses. The current delayed state vector $\mathbf{X}(t_k)$ and its associated covariance matrix $\mathbf{P}(t_k)$ have the following format

\[
\mathbf{X}(t_k) = \begin{bmatrix} \mathbf{x}_t(t_k) \\ \mathbf{x}_v(t_k) \end{bmatrix} \tag{2.3}
\]
\[
\mathbf{P}(t_k) = \begin{bmatrix} \mathbf{P}_{tt}(t_k) & \mathbf{P}_{tv}(t_k) \\ \mathbf{P}_{vt}(t_k) & \mathbf{P}_{vv}(t_k) \end{bmatrix} \tag{2.4}
\]


Figure 2.1. Schematic illustration of the vehicle states, showing x-position, y-position and heading $\phi$ in a global reference frame.

where $\mathbf{x}_t(t_k)$ is the history of past vehicle poses augmented at times $t_0$ to $t_N$, $[\mathbf{x}_{t_0}(t_k) \dots \mathbf{x}_{t_N}(t_k)]^T$, and $\mathbf{x}_v(t_k)$ is the current vehicle pose. All vehicle poses and uncertainties, past and current, have the format given by (2.1) and (2.2).

2.1.1 State Vector in Information Form

The estimate of the augmented state vector, $\hat{\mathbf{X}}(t_k)$, is assumed to be Gaussian distributed with mean $\mathbf{X}(t_k)$ and covariance $\mathbf{P}(t_k)$. The estimated state vector and covariance are expressed in information form by the information vector $\hat{\mathbf{B}}(t_k)$ and information matrix $\boldsymbol{\Lambda}(t_k)$

\[
\hat{\mathbf{B}}(t_k) = \boldsymbol{\Lambda}(t_k)\hat{\mathbf{X}}(t_k) = \begin{bmatrix} \hat{\mathbf{b}}_t(t_k) \\ \hat{\mathbf{b}}_v(t_k) \end{bmatrix}, \tag{2.5}
\]
\[
\boldsymbol{\Lambda}(t_k) = \mathbf{P}^{-1}(t_k) = \begin{bmatrix} \boldsymbol{\Lambda}_{tt}(t_k) & \boldsymbol{\Lambda}_{tv}(t_k) \\ \boldsymbol{\Lambda}_{vt}(t_k) & \boldsymbol{\Lambda}_{vv}(t_k) \end{bmatrix}. \tag{2.6}
\]

2.1.2 Relationship Between Covariance and Information

Given the state covariance and information matrix in (2.4) and (2.6), the exact relationship between the elements in the two matrices can be expressed using the factorization of inversion of block matrices. For simplicity, time indices are omitted in this section.

\[
\boldsymbol{\Lambda} = \begin{bmatrix} \boldsymbol{\Lambda}_{tt} & \boldsymbol{\Lambda}_{tv} \\ \boldsymbol{\Lambda}_{vt} & \boldsymbol{\Lambda}_{vv} \end{bmatrix}
= \mathbf{P}^{-1} = \begin{bmatrix} \mathbf{P}_{tt} & \mathbf{P}_{tv} \\ \mathbf{P}_{vt} & \mathbf{P}_{vv} \end{bmatrix}^{-1}
= \begin{bmatrix} \mathbf{P}_{tt}^{-1} + \mathbf{P}_{tt}^{-1}\mathbf{P}_{tv}\Delta^{-1}\mathbf{P}_{vt}\mathbf{P}_{tt}^{-1} & -\mathbf{P}_{tt}^{-1}\mathbf{P}_{tv}\Delta^{-1} \\ -\Delta^{-1}\mathbf{P}_{vt}\mathbf{P}_{tt}^{-1} & \Delta^{-1} \end{bmatrix}, \tag{2.7}
\]


where

\[
\Delta = \mathbf{P}_{vv} - \mathbf{P}_{vt}\mathbf{P}_{tt}^{-1}\mathbf{P}_{tv}. \tag{2.8}
\]

The relationship between the estimated state vector and the information vector is given by (2.5), which together with the equations above yield

\[
\hat{\mathbf{B}} = \boldsymbol{\Lambda}\hat{\mathbf{X}}
= \begin{bmatrix} \boldsymbol{\Lambda}_{tt}\hat{\mathbf{x}}_t + \boldsymbol{\Lambda}_{tv}\hat{\mathbf{x}}_v \\ \boldsymbol{\Lambda}_{vt}\hat{\mathbf{x}}_t + \boldsymbol{\Lambda}_{vv}\hat{\mathbf{x}}_v \end{bmatrix}
= \begin{bmatrix} \mathbf{P}_{tt}^{-1}\hat{\mathbf{x}}_t + \mathbf{P}_{tt}^{-1}\mathbf{P}_{tv}\Delta^{-1}\mathbf{P}_{vt}\mathbf{P}_{tt}^{-1}\hat{\mathbf{x}}_t - \mathbf{P}_{tt}^{-1}\mathbf{P}_{tv}\Delta^{-1}\hat{\mathbf{x}}_v \\ -\Delta^{-1}\mathbf{P}_{vt}\mathbf{P}_{tt}^{-1}\hat{\mathbf{x}}_t + \Delta^{-1}\hat{\mathbf{x}}_v \end{bmatrix}
= \begin{bmatrix} \mathbf{P}_{tt}^{-1}\left(\hat{\mathbf{x}}_t + \mathbf{P}_{tv}\Delta^{-1}\mathbf{P}_{vt}\mathbf{P}_{tt}^{-1}\hat{\mathbf{x}}_t - \mathbf{P}_{tv}\Delta^{-1}\hat{\mathbf{x}}_v\right) \\ \Delta^{-1}\left(-\mathbf{P}_{vt}\mathbf{P}_{tt}^{-1}\hat{\mathbf{x}}_t + \hat{\mathbf{x}}_v\right) \end{bmatrix} \tag{2.9}
\]
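As a concrete illustration of (2.5)-(2.9), the following Python sketch (our own, not from the thesis; the sizes and values are made up) converts a small delayed state estimate to information form and checks the block expressions numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy delayed state: one past pose x_t and the current pose x_v (3 states each).
x_hat = rng.normal(size=6)                     # stacked estimate [x_t; x_v]
A = rng.normal(size=(6, 6))
P = A @ A.T + 6 * np.eye(6)                    # random positive definite covariance

# Information form, eqs. (2.5)-(2.6): Lambda = P^-1, B = Lambda * x_hat.
Lam = np.linalg.inv(P)
B = Lam @ x_hat

# Block expressions of eqs. (2.7)-(2.8), t = past block, v = current block.
Ptt, Ptv = P[:3, :3], P[:3, 3:]
Pvt, Pvv = P[3:, :3], P[3:, 3:]
Delta = Pvv - Pvt @ np.linalg.inv(Ptt) @ Ptv   # Schur complement, eq. (2.8)

# The lower right block of Lambda equals the inverse Schur complement.
assert np.allclose(Lam[3:, 3:], np.linalg.inv(Delta))

# Recovering the state vector from information form inverts eq. (2.5).
assert np.allclose(np.linalg.solve(Lam, B), x_hat)
```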

2.2 Vehicle Motion Model

Three different vehicle models are used for simulations and for experiments with acquired sensor information from the utility vehicle (UTE) operated at ACFR. Different models are utilised due to the varying quality of the sensor signals in the different data sets that have been used. The UTE is fitted with a wheel encoder that measures vehicle speed, a steering wheel encoder that measures the steering angle, and an inertial measurement unit (IMU) that measures turn rates. For data sets with good data from the steering wheel encoder, a steering model is utilised, and for data sets with good data from the IMU, a turn rate model is utilised. Simulations using the exactly sparse delayed state filter were run in a MatLab environment, in which the steering vehicle model was used.

A third and final model was used for laser based experiments. In those experiments, alignment of consecutive laser scans was used to obtain the vehicle's motion, described as a rotation and a translation.

2.2.1 Steering Model

The steering vehicle model has two input signals $\mathbf{u}(t_k) = [u_1(t_k)\ u_2(t_k)]^T$, corresponding to speed $u_1(t_k)$ [m/s] and steering angle $u_2(t_k)$ [rad], respectively. The coordinate system is vehicle centered, meaning that the vehicle pose corresponds to the center of the vehicle. The model equations are

\[
\mathbf{x}_v(t_{k+1}) = \begin{bmatrix} x_v(t_{k+1}) \\ y_v(t_{k+1}) \\ \phi_v(t_{k+1}) \end{bmatrix} = f(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))
= \begin{bmatrix} x_v(t_k) \\ y_v(t_k) \\ \phi_v(t_k) \end{bmatrix}
+ T_s \begin{bmatrix} u_1(t_{k+1})\cos\left(\phi_v(t_k) + u_2(t_{k+1})\right) \\ u_1(t_{k+1})\sin\left(\phi_v(t_k) + u_2(t_{k+1})\right) \\ u_1(t_{k+1})\sin\left(u_2(t_{k+1})\right)/W_B \end{bmatrix} \tag{2.10}
\]

where $T_s$ is the time step and $W_B$ is the vehicle wheel base.


\[
\mathbf{F}_{x_v}(t_{k+1}) = \frac{df(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))}{d(x(t_k), y(t_k), \phi(t_k))}
= \begin{bmatrix} 1 & 0 & -T_s u_1(t_{k+1})\sin(\phi(t_k) + u_2(t_{k+1})) \\ 0 & 1 & T_s u_1(t_{k+1})\cos(\phi(t_k) + u_2(t_{k+1})) \\ 0 & 0 & 1 \end{bmatrix} \tag{2.11}
\]
\[
\mathbf{F}_u(t_{k+1}) = \frac{df(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))}{d(u_1(t_{k+1}), u_2(t_{k+1}))}
= T_s \begin{bmatrix} \cos(\phi(t_k) + u_2(t_{k+1})) & -u_1(t_{k+1})\sin(\phi(t_k) + u_2(t_{k+1})) \\ \sin(\phi(t_k) + u_2(t_{k+1})) & u_1(t_{k+1})\cos(\phi(t_k) + u_2(t_{k+1})) \\ \sin(u_2(t_{k+1}))/W_B & u_1(t_{k+1})\cos(u_2(t_{k+1}))/W_B \end{bmatrix} \tag{2.12}
\]
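The steering model translates directly into code. Below is a minimal Python sketch of one prediction step and its Jacobians (our own illustration, not the thesis implementation; the time step and wheel base values are made up).

```python
import numpy as np

Ts = 0.2   # time step [s] (illustrative value)
WB = 2.5   # wheel base [m] (illustrative value)

def steering_step(x, u):
    """One prediction step of eq. (2.10); x = [x, y, phi], u = [speed, steer angle]."""
    px, py, phi = x
    v, alpha = u
    return np.array([px + Ts * v * np.cos(phi + alpha),
                     py + Ts * v * np.sin(phi + alpha),
                     phi + Ts * v * np.sin(alpha) / WB])

def steering_jacobians(x, u):
    """The Jacobians of eqs. (2.11) and (2.12)."""
    phi = x[2]
    v, alpha = u
    Fx = np.array([[1.0, 0.0, -Ts * v * np.sin(phi + alpha)],
                   [0.0, 1.0,  Ts * v * np.cos(phi + alpha)],
                   [0.0, 0.0,  1.0]])
    Fu = Ts * np.array([[np.cos(phi + alpha), -v * np.sin(phi + alpha)],
                        [np.sin(phi + alpha),  v * np.cos(phi + alpha)],
                        [np.sin(alpha) / WB,   v * np.cos(alpha) / WB]])
    return Fx, Fu

x = np.zeros(3)
for _ in range(10):                     # drive forward while turning slightly
    x = steering_step(x, np.array([5.0, 0.05]))
print(x)
```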

2.2.2 Turn Rate Model

The turn rate vehicle model has two input signals $\mathbf{u}(t_k) = [u_1(t_k)\ u_2(t_k)]^T$, corresponding to speed [m/s] and turn rate [rad/s], respectively. The coordinate system for this model is centered at the laser, which is attached to the front of the car. The model equations are

\[
\mathbf{x}_v(t_{k+1}) = \begin{bmatrix} x_v(t_{k+1}) \\ y_v(t_{k+1}) \\ \phi_v(t_{k+1}) \end{bmatrix} = f(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))
= \begin{bmatrix} x_v(t_k) \\ y_v(t_k) \\ \phi_v(t_k) \end{bmatrix}
+ T_s \begin{bmatrix} \cos(\phi_v(t_k)) & -h\cos(\phi_v(t_k)) - T_1 \\ \sin(\phi_v(t_k)) & -h\sin(\phi_v(t_k)) + T_2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_1(t_{k+1}) \\ u_2(t_{k+1}) \end{bmatrix}, \tag{2.13a}
\]
\[
T_1 = a\sin(\phi_v(t_k)) + b\cos(\phi_v(t_k)), \tag{2.13b}
\]
\[
T_2 = a\cos(\phi_v(t_k)) - b\sin(\phi_v(t_k)), \tag{2.13c}
\]

where $a$, $b$ and $h$ are vehicle specific parameters (constants) describing the position of the laser sensor, and $T_s$ is the time step.

\[
\mathbf{F}_{x_v}(t_{k+1}) = \frac{df(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))}{d(x(t_k), y(t_k), \phi(t_k))}
= \begin{bmatrix} 1 & 0 & T_s\left(-\sin(\phi(t_k))u_1(t_{k+1}) + (h\sin(\phi(t_k)) - T_2)u_2(t_{k+1})\right) \\ 0 & 1 & T_s\left(\cos(\phi(t_k))u_1(t_{k+1}) - (h\cos(\phi(t_k)) + T_1)u_2(t_{k+1})\right) \\ 0 & 0 & 1 \end{bmatrix} \tag{2.14}
\]
\[
\mathbf{F}_u(t_{k+1}) = \frac{df(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))}{d(u_1(t_{k+1}), u_2(t_{k+1}))}
= T_s \begin{bmatrix} \cos(\phi(t_k)) & -h\cos(\phi(t_k)) - T_1 \\ \sin(\phi(t_k)) & -h\sin(\phi(t_k)) + T_2 \\ 0 & 1 \end{bmatrix} \tag{2.15}
\]


The first two models assume that there is no lateral slip, i.e. the vehicle cannot move sideways. This assumption results in problems with the inversion of the process noise covariance matrix.

2.2.3 Laser Model

When the laser is used to estimate vehicle movement instead of the dead reckoning sensors mentioned above, a motion model with three input signals is used. The reason for this is that laser scan alignment results in a three dimensional transformation: a translation and a rotation. The rotation is comparable to the steering or turn rate, whereas the translation in the (x, y)-direction is different from the vehicle speed. The translation could be converted to speed using the time stamps of the two laser scans that have been aligned. However, such a conversion would result in a very noisy signal. Furthermore, there are models available for a translation and rotation input signal, and therefore the conversion from translation to speed is deemed unnecessary and a third vehicle model is presented.

The laser based model has three input signals $\mathbf{u}(t_k) = [u_1(t_k)\ u_2(t_k)\ u_3(t_k)]^T$, where $[u_1(t_k)\ u_2(t_k)]^T$ corresponds to translation and $u_3(t_k)$ corresponds to rotation. The model equations are

\[
\mathbf{x}_v(t_{k+1}) = \begin{bmatrix} x_v(t_{k+1}) \\ y_v(t_{k+1}) \\ \phi_v(t_{k+1}) \end{bmatrix} = f(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1})) = \mathbf{x}_v(t_k) \oplus \mathbf{u}(t_{k+1}), \tag{2.16}
\]
\[
\mathbf{F}_{x_v}(t_{k+1}) = \frac{df(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))}{d(x(t_k), y(t_k), \phi(t_k))} = \mathbf{J}_{1\oplus}(t_{k+1}), \tag{2.17}
\]
\[
\mathbf{F}_u(t_{k+1}) = \frac{df(\mathbf{x}_v(t_k), \mathbf{u}(t_{k+1}))}{d(u_1(t_{k+1}), u_2(t_{k+1}), u_3(t_{k+1}))} = \mathbf{J}_{2\oplus}(t_{k+1}). \tag{2.18}
\]

The compounding operator $\oplus$ and the compounding operator Jacobian $\mathbf{J}_\oplus(t_k) = [\mathbf{J}_{1\oplus}(t_k)\ \mathbf{J}_{2\oplus}(t_k)]$ are further explained in Appendix B.
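For quick reference, here is a Python sketch of the 2D pose compounding operation assumed in (2.16) (our own illustration of the standard formulation; Appendix B gives the exact definitions and Jacobians used in the thesis).

```python
import numpy as np

def compound(x1, x2):
    """Pose compounding x1 (+) x2: the relative pose x2, expressed in the
    frame of x1, composed into the global frame (standard 2D formulation)."""
    x, y, phi = x1
    dx, dy, dphi = x2
    return np.array([x + dx * np.cos(phi) - dy * np.sin(phi),
                     y + dx * np.sin(phi) + dy * np.cos(phi),
                     phi + dphi])

# Chaining scan-to-scan alignments accumulates the trajectory, eq. (2.16):
pose = np.zeros(3)
for u in [np.array([1.0, 0.0, 0.1])] * 5:   # five identical laser increments
    pose = compound(pose, u)
print(pose)
```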

2.2.4 Process Noise

The vehicle motion model's input signals $\mathbf{u}$ are affected by process noise $\mathbf{v}(t_k)$, which is assumed to be additive zero mean Gaussian with covariance $\mathbf{Q}_v(t_k)$. Depending on which vehicle model is being used, $\mathbf{v}(t_k)$ is either two or three dimensional. For the two dimensional case, $\mathbf{v}(t_k) = [v_1(t_k)\ v_2(t_k)]^T$ and

\[
\mathbf{Q}_v(t_k) = \begin{bmatrix} \sigma_{v_1}(t_k) & 0 \\ 0 & \sigma_{v_2}(t_k) \end{bmatrix}. \tag{2.19}
\]

It is further assumed that $\mathbf{v}(t_k)$ and $\mathbf{v}(t_l)$ are independent for all $k \neq l$.

A problem with the first two vehicle models is that the ESDF requires the inverse of the matrix $\mathbf{Q}(t_k) = \mathbf{F}_u(t_k)\mathbf{Q}_v(t_k)\mathbf{F}_u^T(t_k)$. However, for the first two models, the matrix $\mathbf{Q}(t_k)$ is singular and therefore cannot be inverted: $\mathbf{F}_u(t_k)$ has only two columns, so the $3 \times 3$ matrix $\mathbf{Q}(t_k)$ has at most rank two.


A suggested solution to the problem is to add a stabilising noise matrix $\mathbf{Q}_s$ to $\mathbf{F}_u\mathbf{Q}_v\mathbf{F}_u^T$. This avoids the inversion of $\mathbf{Q}$, which is shown below via the matrix inversion lemma (omitting time indices)

\[
\left(\mathbf{Q}_s + \mathbf{F}_u\mathbf{Q}_v\mathbf{F}_u^T\right)^{-1}
= \mathbf{Q}_s^{-1} - \mathbf{Q}_s^{-1}\mathbf{F}_u\left(\mathbf{Q}_v^{-1} + \mathbf{F}_u^T\mathbf{Q}_s^{-1}\mathbf{F}_u\right)^{-1}\mathbf{F}_u^T\mathbf{Q}_s^{-1} \tag{2.20}
\]

The stabilising noise matrix is chosen to be a diagonal matrix

\[
\mathbf{Q}_s = \begin{bmatrix} q_{s_1} & 0 & 0 \\ 0 & q_{s_2} & 0 \\ 0 & 0 & q_{s_3} \end{bmatrix} \tag{2.21}
\]

where the numerical values of $q_{s_i}$ are chosen as small as possible to avoid adding more artificial noise than necessary.
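A numeric sketch of this trick (our own illustration; the Jacobian and noise values are made up): $\mathbf{Q} = \mathbf{F}_u\mathbf{Q}_v\mathbf{F}_u^T$ is singular for a two-input model, the stabilised matrix is not, and eq. (2.20) only ever inverts small or diagonal matrices.

```python
import numpy as np

Fu = np.array([[0.9, -0.1],       # 3x2 input Jacobian -> Q has rank at most 2
               [0.4,  0.3],
               [0.0,  1.0]])
Qv = np.diag([0.5, 0.1])          # input noise covariance (illustrative values)
Q = Fu @ Qv @ Fu.T
print(np.linalg.matrix_rank(Q))   # 2: the 3x3 matrix Q is singular

qs = 1e-4
Qs_inv = np.eye(3) / qs           # Qs = qs * I, trivially invertible

# Matrix inversion lemma, eq. (2.20): only a 2x2 inverse is required.
inner = np.linalg.inv(np.linalg.inv(Qv) + Fu.T @ Qs_inv @ Fu)
lhs = Qs_inv - Qs_inv @ Fu @ inner @ Fu.T @ Qs_inv
assert np.allclose(lhs, np.linalg.inv(qs * np.eye(3) + Q))
```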


Chapter 3

Laser Signal Processing

A central part of any SLAM solution is the detection of loop closures. A detected loop closure lets the filter adjust its entire estimate of the vehicle's trajectory. A false loop closure can be devastating to the estimate, while missed loop closures leave out a lot of information. Therefore, a safe and reliable way to detect loop closures is critical.

Loop closures can be detected in many ways. Two ways have been developed: one image based and one laser scan based. This chapter describes loop closure detection using laser range measurements.

This chapter contains information about the SICK laser sensor that was used, as well as the mathematical representation of the laser range measurements and their respective uncertainty. For detection of loop closure, a feature based laser scan matching method has been developed.

The features are described, some illustrated, and the training procedure used to obtain a reliable match detector is explained. Results from training of the loop closure detection are presented and discussed.

Finally, two methods for alignment of matching scans are presented, along with a discussion of how to estimate the covariance of laser scan alignment.

3.1 The Laser Sensor

SICK LMS 291-S05 laser range sensors, see Figure 3.1, were used to acquire range measurements. The sensors have a 180 degree field of view, and take range measurements every half a degree, i.e. each laser sweep consists of 361 range measurements. Laser sweeps are taken at 5 Hz. The laser range sensors time out after a certain distance. Therefore a maximum range gate, $r_{max}$, is used in some features to filter out those measurements whose range is considered too long. Further information about the sensor may be found at www.sick.com.


Figure 3.1. SICK LMS 291-S05

3.1.1 Laser Scans

A laser measurement $L_i$ consists of a range measurement $r_i$ and an angle measurement $\alpha_i$

\[
L_i = \begin{bmatrix} r_i \\ \alpha_i \end{bmatrix}. \tag{3.1}
\]

Each laser scan $L^j$ consists of $N$ measurements of range $r$ and angle $\alpha$, i.e. $L^j = \{r_i^j, \alpha_i^j\}_{i=1}^{N}$. The range and angle measurement can be transformed into Cartesian coordinates according to

\[
\begin{bmatrix} x_i \\ y_i \end{bmatrix} = \begin{bmatrix} r_i\cos(\alpha_i) \\ r_i\sin(\alpha_i) \end{bmatrix} \tag{3.2}
\]

As with any other sensor, there is uncertainty involved in laser measurements. The measurement noise for the laser sensor is assumed to be zero mean Gaussian with covariance

\[
\mathbf{R}_L = \begin{bmatrix} \sigma_r\left(1 + \frac{r_i}{r_{max}}\right)^2 & 0 \\ 0 & \sigma_\alpha^2 \end{bmatrix}, \tag{3.3}
\]

i.e. the standard deviation in range is larger for longer range measurements. The uncertainty in Cartesian coordinates associated with a laser measurement $L_i$ is given by

\[
\mathbf{R}_{L,cart} = \mathbf{J}_L \mathbf{R}_L \mathbf{J}_L^T, \tag{3.4}
\]

where $\mathbf{J}_L$ is the Jacobian of (3.2),

\[
\mathbf{J}_L = \begin{bmatrix} \cos(\alpha_i) & -r_i\sin(\alpha_i) \\ \sin(\alpha_i) & r_i\cos(\alpha_i) \end{bmatrix} \tag{3.5}
\]

Figure 3.2 shows a laser scan where uncertainty ellipses are shown for every fifteenth laser point, with $\sigma_r = 1$ and $\sigma_\alpha = 1^\circ$.

Figure 3.2. Illustration of a laser scan. Uncertainty ellipses are shown for every fifteenth point. The robot's position is indicated by the circle and the robot's heading is indicated by the line.
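A small Python sketch of eqs. (3.2)-(3.5) (our own illustration, not the thesis code; the noise values match those used in Figure 3.2).

```python
import numpy as np

r_max = 50.0                                   # maximum range gate (assumed value)
sigma_r, sigma_alpha = 1.0, np.deg2rad(1.0)    # values used in Figure 3.2

def to_cartesian_with_cov(r, alpha):
    """Transform one (range, bearing) measurement and its covariance."""
    xy = np.array([r * np.cos(alpha), r * np.sin(alpha)])    # eq. (3.2)
    # Measurement covariance, eq. (3.3): range noise grows with range.
    RL = np.diag([sigma_r * (1 + r / r_max) ** 2, sigma_alpha ** 2])
    # Jacobian of the polar-to-Cartesian map, eq. (3.5).
    JL = np.array([[np.cos(alpha), -r * np.sin(alpha)],
                   [np.sin(alpha),  r * np.cos(alpha)]])
    return xy, JL @ RL @ JL.T                                # eq. (3.4)

xy, R_cart = to_cartesian_with_cov(20.0, np.deg2rad(45.0))
print(xy, R_cart, sep="\n")
```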

3.2 Laser Matching

In order to close a loop, the robot must be able to detect that it has returned to a previously visited position. Here that is done by comparing the spatial appearance of two laser scans to each other, and classifying them as either matching or non-matching. A number of features, explained below, create weak classifiers which combined can create a strong classifier. A properly trained strong classifier is capable of robust and reliable classification.

3.2.1 Features

Features refer to spatial properties of laser scans that are of certain interest, and that are easy to compare. These include, among others, the area covered by the laser scan and the distance covered by the scan. All features are meant to be utilised in loop closure detection, and it is therefore important that they are invariant to rotation. Rotation invariance means that two laser scans obtained in the same (x, y) position should give the same feature values, regardless of the heading of the robot when the scans were obtained. If the features are not invariant to rotation, loop closure cannot be detected from an arbitrary direction. 26 different features were implemented; 6 of those were discarded due to not being invariant to rotation. All remaining 20 features are described below.

Some of the features are illustrated with figures; Figure 3.3(a) shows the raw laser scan that was used for those illustrations.

Area Feature

The area feature measures the area covered by a laser scan. Points whose range is greater than $r_{max}$ have their range set to $r_{max}$. Figure 3.3(b) illustrates the area feature.

\[
f_{area} = \sum_{i=1}^{N-1} \min(r_i, r_{max})\,\min(r_{i+1}, r_{max})\,\sin\left(\frac{\alpha_{i+1} - \alpha_i}{2}\right) \tag{3.6}
\]
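A minimal Python sketch of the area feature (our own illustration; the synthetic half-circle scan is made up to make the result easy to check).

```python
import numpy as np

def area_feature(r, alpha, r_max=50.0):
    """Eq. (3.6): accumulate the area spanned by consecutive range-gated rays."""
    rc = np.minimum(r, r_max)                   # clip ranges at the gate
    return np.sum(rc[:-1] * rc[1:] * np.sin(np.diff(alpha) / 2))

alpha = np.linspace(0, np.pi, 361)              # 361 bearings over 180 degrees
r = np.full(361, 10.0)                          # synthetic scan: constant range
print(area_feature(r, alpha))                   # close to a half disc, pi*10^2/2
```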

Average Range Feature

The average range feature measures the average range of the individual range measurements in the laser scan. Ranges greater than or equal to $r_{max}$ are set equal to $r_{max}$.

\[
f_{average\ range} = \frac{1}{N}\sum_{i=1}^{N} \min(r_i, r_{max}) \tag{3.7}
\]

Centroid Features

The centroid features measure the distance from the origin to the mean position. The mean position is calculated as

\[
\begin{bmatrix} x_{mean} \\ y_{mean} \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum_{i:\, r_i < r_{max}} x_i \\ \frac{1}{N}\sum_{i:\, r_i < r_{max}} y_i \end{bmatrix}. \tag{3.8}
\]

The distance to the origin is then calculated as

\[
f_{mean\ centroid} = \sqrt{x_{mean}^2 + y_{mean}^2} \tag{3.9}
\]

Figure 3.3(c) illustrates the centroid feature.

Close Area Feature

The close area feature measures the area covered by the laser scan, excluding the area covered by range measurements whose range is greater than or equal to rmax.

\[
f_{close\ area} = \sum_{i:\, r_i < r_{max}} r_i^2 \sin\left(\frac{\delta_\alpha}{2}\right), \tag{3.10}
\]


Figure 3.3. Feature illustrations: (a) raw data, (b) area feature, (c) centroid feature, (d) close area feature.

where $\delta_\alpha$ is the angle interval at which range measurements are taken. If 361 range measurements are acquired in a 180 degree field of view, then $\delta_\alpha = \frac{180}{361-1}\cdot\frac{\pi}{180} = \frac{\pi}{360}$ radians. Figure 3.3(d) illustrates the close area feature.

Close Distance Feature

The close distance feature measures the sum of the distances between consecutive points whose range is smaller than $r_{max}$, excluding distances that are larger than a maximum distance gate, $g_{max\ dist}$. Figure 3.4(a) illustrates the close distance feature.

\[
d_i = \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2}, \quad i : r_i < r_{max} \tag{3.11a}
\]
\[
f_{close\ dist} = \sum_{i:\, d_i < g_{max\ dist}} d_i \tag{3.11b}
\]


Circularity Feature

The circularity feature fits a circle to the points in the laser scan in a least squares sense. Two features are then given as the residual sum of squares and the radius of the circle.

A circle can be described by its centre coordinates $(x_c, y_c)$ and radius $r$. For all points $(x_{e,i}, y_{e,i})$ along the edge of the circle the following holds:

\[
r^2 = (x_c - x_{e,i})^2 + (y_c - y_{e,i})^2 = x_c^2 - 2x_c x_{e,i} + x_{e,i}^2 + y_c^2 - 2y_c y_{e,i} + y_{e,i}^2. \tag{3.12}
\]

A circle can be fitted to the laser scan $L = \{x_i, y_i\}_{i=1}^{N}$ by solving the $N$ equations

\[
r^2 - x_c^2 - y_c^2 + 2x_c x_i + 2y_c y_i = x_i^2 + y_i^2, \quad i = 1, \dots, N, \tag{3.13}
\]

in the least squares sense. Figure 3.4(b) illustrates the circularity feature.

Figure 3.4. Feature illustrations: (a) close distance feature, (b) circularity feature, (c) distance feature, (d) far distance feature.
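Since eq. (3.13) is linear in the parameters $2x_c$, $2y_c$ and $r^2 - x_c^2 - y_c^2$, the circle fit is an ordinary linear least squares problem. A Python sketch (our own illustration; the residual here is computed geometrically):

```python
import numpy as np

def fit_circle(x, y):
    """Least squares circle fit of eq. (3.13).
    Solved for the parameters [2*xc, 2*yc, r^2 - xc^2 - yc^2]."""
    A = np.column_stack([x, y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    (p1, p2, p3), *_ = np.linalg.lstsq(A, b, rcond=None)
    xc, yc = p1 / 2, p2 / 2
    r = np.sqrt(p3 + xc ** 2 + yc ** 2)
    rss = np.sum((np.hypot(x - xc, y - yc) - r) ** 2)   # residual sum of squares
    return xc, yc, r, rss

theta = np.linspace(0, 2 * np.pi, 50)
x = 3 + 5 * np.cos(theta) + 0.05 * np.random.randn(50)   # noisy circle
y = -1 + 5 * np.sin(theta) + 0.05 * np.random.randn(50)  # centre (3, -1), radius 5
print(fit_circle(x, y))
```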

Curvature Features

The curvature features are based on the curvature along the points in the laser scan. Let $\mathbf{x}_a = [x_a, y_a]^T$, $\mathbf{x}_b = [x_b, y_b]^T$ and $\mathbf{x}_c = [x_c, y_c]^T$ be three consecutive points, let $A$ be the area covered by the triangle with corners in $\mathbf{x}_a$, $\mathbf{x}_b$ and $\mathbf{x}_c$, and let $d_a$, $d_b$ and $d_c$ be the distances between the points. Then, the curvature of the boundary at $\mathbf{x}_b$ is calculated as

\[
k = \frac{4A}{d_a d_b d_c} \tag{3.14}
\]

The curvatures over all points, excluding points whose range is greater than or equal to $r_{max}$, are calculated and two features are then given as the mean curvature and the standard deviation of the curvatures.

Distance Feature

The distance feature measures the sum of the distances between consecutive points, excluding points whose range is greater than or equal to $r_{max}$. Figure 3.4(c) illustrates the distance feature.

\[
f_{dist} = \sum_{i:\, r_i < r_{max}} \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2} \tag{3.15}
\]

Far Distance Feature

The far distance feature measures the sum of the distances between all consecutive points, i.e. including points whose range is greater than or equal to $r_{max}$. Figure 3.4(d) illustrates the far distance feature.

\[
f_{far\ dist} = \sum_{i=1}^{N-1} \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2} \tag{3.16}
\]

Group Features

The group features are two features that measure the number of groups and the mean group size. A group is defined as a cluster of laser points, in which the distance between consecutive points is less than a maximum distance gate $g_{max\ dist}$. To be considered a group, the cluster has to contain more than a certain number of points, specified by the minimum group size gate $g_{min\ size}$.

Maximum Range Feature

The maximum range feature measures the number of points in the laser scan whose range is greater than or equal to the maximum range gate.

\[
f_{max\ range} = \sum_i \mathbf{1}\{r_i \geq r_{max}\}, \tag{3.17}
\]

where

\[
\mathbf{1}\{r_i \geq r_{max}\} = \begin{cases} 1 & \text{if } r_i \geq r_{max} \\ 0 & \text{otherwise} \end{cases} \tag{3.18}
\]


Mean Angular Difference Feature

The mean angular difference (MAD) feature measures the sum of the angles between consecutive point to point vectors. Given two consecutive laser points $L_i$ and $L_{i+1}$, a vector that connects the points is given as $\bar{\mathbf{x}}_{i,i+1} = [x_{i+1} - x_i,\ y_{i+1} - y_i]^T$. The feature is calculated as

\[
f_{MAD} = \sum_{i:\, r_i < r_{max}} \arccos\left(\frac{\bar{\mathbf{x}}_{i,i+1} \bullet \bar{\mathbf{x}}_{i+1,i+2}}{\left\|\bar{\mathbf{x}}_{i,i+1}\right\|\,\left\|\bar{\mathbf{x}}_{i+1,i+2}\right\|}\right), \tag{3.19}
\]

where $\bullet$ denotes the scalar product.

Mean Deviation Feature

The mean deviation feature calculates the mean deviation from the mean of the laser scan. The feature is calculated as

\[
f_{mean\ deviation} = \frac{1}{N}\sum_{i:\, r_i < r_{max}} \sqrt{(x_i - x_{mean})^2 + (y_i - y_{mean})^2}, \tag{3.20}
\]

where $x_{mean}$ and $y_{mean}$ are calculated as in (3.8).

Regularity Feature

The regularity feature measures the regularity of the laser scan, which is defined as the standard deviation of the distances between consecutive points in the laser scan. Laser points whose range is greater than or equal to $r_{max}$ are not considered.

Let $d_{i,i+1}$ be the distance between the laser points with indices $i$ and $i+1$, and let $\bar{d}$ be the mean value of $d_i$, $\forall i : r_i < r_{max}$. The regularity feature is then calculated as

\[
f_{regularity} = \sqrt{\frac{1}{N-1}\sum_{i:\, r_i < r_{max}} \left(d_{i,i+1} - \bar{d}\right)^2} \tag{3.21}
\]

Size Feature

The size feature measures the number of points which have a range shorter than $r_{max}$.

\[
f_{size} = \sum_i \mathbf{1}\{r_i < r_{max}\}, \tag{3.22}
\]

where

\[
\mathbf{1}\{r_i < r_{max}\} = \begin{cases} 1 & \text{if } r_i < r_{max} \\ 0 & \text{otherwise} \end{cases} \tag{3.23}
\]


Standard Deviation of Range Feature

The standard deviation of range feature measures the standard deviation of all the ranges that are less than or equal to $r_{max}$. The feature is calculated as

\[
f_{std\ range} = \sqrt{\frac{1}{N-1}\sum_{i:\, r_i < r_{max}} (r_i - r_{mean})^2}, \tag{3.24}
\]

where $r_{mean}$ is the mean of all the ranges that are less than or equal to $r_{max}$.

3.2.2 Classifiers

A classifier is an operator that classifies its input arguments as belonging to a certain class. Binary classifiers classify the arguments as belonging to one of two classes. In this chapter, the binary classifiers sort laser scan pairs into the classes match and no match.

A weak classifier is a classifier that has a low correlation with the true classi-fication. On their own, weak classifiers are not sufficient for robust and reliable classification. A combination of weak classifiers, called a strong classifier, could perform that task though.

For all features, weak classifiers are constructed as

\[
c_i\left(L^j, L^k\right) = \left| f_i(L^j) - f_i(L^k) \right|. \tag{3.25}
\]

$c_i$ is then tested against a threshold to determine the classification. A small $c_i$ means that the values of $f_i^j$ and $f_i^k$ are similar, which implies that the laser scans were taken in the same position. Similarly, a large $c_i$ implies that the laser scans were not taken in the same position. Choosing an appropriate threshold to test whether $c_i$ is large or small is both important and difficult.

One way to find a good combination of weak classifiers and appropriate thresholds for each one of them is to use boosting.

3.2.3 Boosting

Boosting refers to machine learning algorithms that combine a set of weak classifiers into a strong classifier. For the boosting algorithm, given are the example laser pairs

\[
\left([L_1^1, L_1^2], y_1\right), \dots, \left([L_n^1, L_n^2], y_n\right), \tag{3.26}
\]

where $y_i$ is a binary variable, $y_i = 0, 1$ for negative (non-matching) and positive (matching) laser pairs respectively. The type of boosting used here is Adaboost, which was first presented in [12]. Adaboost, short for adaptive boosting, is an iterative procedure that, given the training data, finds a good combination of weak classifiers.

The training data is classified with all previously added weak classifiers, and the weak classifier that improves the classification most is added to the set of weak classifiers. The training data weights are then updated, and misclassified data is given a higher weight so that the following classifiers focus more on the misclassified data. The procedure is then repeated until $T$ weak classifiers have been added. The set of $T$ weak classifiers together create a strong classifier. Adaboost is given in Algorithm 1 below.

Algorithm 1 Adaboost

Initialize weights for all example pairs: $W_1^i = \frac{1}{2N_n}$ for negative pairs and $W_1^i = \frac{1}{2N_p}$ for positive pairs, where $N_n$ and $N_p$ are the number of negative pairs and positive pairs respectively.

1: for $t = 1, \dots, T$ do
2: Normalize the weights,
\[
\tilde{W}_t^i = \frac{W_t^i}{\sum_{j=1}^{N_n+N_p} W_t^j} \tag{3.27}
\]
3: For each feature $f_f$, train a weak classifier $c_f$ and evaluate the error with respect to the weights,
\[
\epsilon_f = \sum_i \tilde{W}_t^i \left| c_f\left(L_i^1, L_i^2\right) - y_i \right| \tag{3.28}
\]
4: Given the previously selected weak classifiers, select the best weak classifier $c_t$, i.e. the one that minimizes the classification error $\epsilon_t$.
5: Update the weights:
\[
W_{t+1}^i = \tilde{W}_t^i \beta_t^{1-e_i} \tag{3.29}
\]
where $e_i = 0$ if $[L_i^1, L_i^2]$ is classified correctly and 1 otherwise, and $\beta_t = \frac{\epsilon_t}{1-\epsilon_t}$.
6: end for

The strong classifier is:

\[
c\left(L_i^1, L_i^2\right) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T}\alpha_t c_t\left(L_i^1, L_i^2\right) \leq 0.5\sum_{t=1}^{T}\alpha_t \\ 0 & \text{otherwise} \end{cases} \tag{3.30}
\]

where $\alpha_t = \log\frac{1}{\beta_t}$.

The factor 0.5 in (3.30) can be increased or decreased in order to change the detection rate and false alarm rate of the classifier.
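To make Algorithm 1 concrete, here is a minimal Python sketch using threshold stumps on the feature differences of eq. (3.25) (our own illustration, not the thesis implementation; the match vote convention is flipped relative to eq. (3.30) but equivalent).

```python
import numpy as np

def adaboost(C, y, T=50):
    """Train a strong classifier from weak threshold stumps.
    C: (n_pairs, n_features) array of c_i = |f_i(L^j) - f_i(L^k)|,
    y: labels, 1 for matching pairs and 0 for non-matching pairs."""
    n, m = C.shape
    Np, Nn = y.sum(), n - y.sum()
    w = np.where(y == 1, 1.0 / (2 * Np), 1.0 / (2 * Nn))  # initial weights
    strong = []
    for _ in range(T):
        w = w / w.sum()                               # eq. (3.27)
        best = None
        for f in range(m):
            for thr in np.unique(C[:, f]):
                pred = (C[:, f] <= thr).astype(int)   # small difference -> match
                err = np.sum(w * np.abs(pred - y))    # eq. (3.28)
                if best is None or err < best[0]:
                    best = (err, f, thr, pred)
        err, f, thr, pred = best
        beta = max(err, 1e-12) / max(1 - err, 1e-12)  # eq. (3.29)
        w = w * beta ** (pred == y)                   # downweight correct pairs
        strong.append((f, thr, np.log(1 / beta)))
    return strong

def classify(strong, c):
    """Weighted vote of eq. (3.30) on one feature-difference vector c."""
    votes = sum(alpha * (c[f] <= thr) for f, thr, alpha in strong)
    return int(votes >= 0.5 * sum(alpha for _, _, alpha in strong))
```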

3.2.4 Training with Tenfold Cross Validation

To decide the optimal value for the parameter T , strong classifiers were trained for different values of T , T = 1, . . . , Tn, and then tested with tenfold cross validation.

Tenfold cross validation is performed as follows.

The order of the positive and negative laser pairs is shuffled, and the complete shuffled set of laser pairs is split into ten subsets. For each of the ten subsets, the remaining nine subsets are used for training, and the parameters are then validated on the tenth subset. The validation results from all ten subsets are then pooled into a total result. Cross validation was performed for each $T = 1, \dots, T_n$.


The validation was evaluated by a comparison of the number of classification errors. Three types of errors were considered, False Alarm (FA), Missed Detection (MD) and Total Error. False Alarm is when a non-matching pair is incorrectly classified as matching, and Missed Detection is when a matching pair is incorrectly classified as non-matching. Total Error is the sum of the number of False Alarms and the number of Missed Detections.

Results from cross validation for $T = 1, \dots, 150$ are presented in Figure 3.5. From the cross validation it was decided that $T = 50$ would be a good choice.

Figure 3.5. Tenfold cross validation of Adaboost: False Alarm, Missed Detection and Total Error as functions of T.
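A Python sketch of the tenfold cross validation loop (our own illustration; `train` and `evaluate` are hypothetical stand-ins for the Adaboost training and the error counting described above).

```python
import numpy as np

def tenfold_cv(pairs, labels, train, evaluate, folds=10):
    """Shuffle the laser pairs, split into ten subsets, train on nine and
    validate on the tenth, then pool the validation results.
    pairs, labels: numpy arrays; train/evaluate: user-supplied callables."""
    idx = np.random.permutation(len(pairs))
    splits = np.array_split(idx, folds)
    pooled = []
    for k in range(folds):
        test = splits[k]
        rest = np.concatenate([splits[j] for j in range(folds) if j != k])
        model = train(pairs[rest], labels[rest])
        pooled.append(evaluate(model, pairs[test], labels[test]))
    return pooled   # e.g. (false alarms, missed detections) per fold
```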

3.3 Laser Scan Alignment

When loop closure has been detected, the matched laser scans must be aligned. Two different methods for laser scan alignment have been used: ICP and CRF-match.

3.3.1 Iterative Closest Point

ICP, first introduced in [7], finds, for every point in scan i, the nearest point in scan j. Any point to point distance larger than a certain distance gate is disregarded.


From the point to point associations, a rotation and translation that minimizes a square cost function is calculated. Laser scan i is translated and rotated, and the procedure is repeated with a progressively smaller distance gate.

ICP is a rather naïve approach that only searches for the nearest neighbour, and the risk of an incorrect alignment is therefore large if the initial misalignment between the two scans is big. Because of this, an initial guess for the rotation and translation is needed in most cases.
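For reference, a minimal 2D point-to-point ICP in Python (our own sketch of the procedure described above, not the thesis implementation; the gate values and the closed-form SVD alignment step are our choices).

```python
import numpy as np

def icp(scan_i, scan_j, iterations=20, gate=5.0):
    """Align scan_i to scan_j; both are (N, 2) arrays of Cartesian points.
    Returns the accumulated rotation R and translation t."""
    R, t = np.eye(2), np.zeros(2)
    src = scan_i.copy()
    for _ in range(iterations):
        # Nearest neighbour association, discarding pairs beyond the gate.
        d = np.linalg.norm(src[:, None, :] - scan_j[None, :, :], axis=2)
        nn = d.argmin(axis=1)
        keep = d[np.arange(len(src)), nn] < gate
        if keep.sum() < 3:
            break
        p, q = src[keep], scan_j[nn[keep]]
        # Closed-form least squares rigid transform (SVD / Kabsch method).
        p0, q0 = p - p.mean(0), q - q.mean(0)
        U, _, Vt = np.linalg.svd(p0.T @ q0)
        Rk = Vt.T @ U.T
        if np.linalg.det(Rk) < 0:              # avoid reflections
            Vt[-1] *= -1
            Rk = Vt.T @ U.T
        tk = q.mean(0) - Rk @ p.mean(0)
        src = src @ Rk.T + tk
        R, t = Rk @ R, Rk @ t + tk
        gate *= 0.9                            # progressively smaller gate
    return R, t
```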

3.3.2 Conditional Random Field Matching

CRF-match is a probabilistic method that considers all possible point to point associations by comparing the geometric features in both laser scans. This makes it a more sophisticated approach to laser scan alignment than ICP. For a thorough explanation of CRF-match, refer to [35]. CRF-match is computationally very heavy: the computation time scales quadratically with the number of points in the laser scans. Therefore, subsampling of the points in the laser scan is needed.

The subsampling procedure is performed in four steps. First, all points with range larger than $r_{max}$ are removed from the laser scan. Second, the remaining points are separated into segments. A segment is defined as a group of laser points, where the distance between consecutive points is shorter than a distance gate. The third step is to remove all segments with fewer points than a minimum size gate. The final step, which consists of removing every few points (e.g. every third), is only performed if the difference between the target number of points and the remaining number of points is too large.

Subsampling is illustrated in Figure 3.6. The raw laser scan contains 722 points, after the first step 604 points remain, after the second and third 496, and after the fourth step 124 points remain. The target number of points was 100.

Figure 3.6. Subsampling of a laser scan: 604 points after the first step, 496 after the second and third, and 124 after the fourth.
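A Python sketch of the four subsampling steps (our own illustration; all gate values are made up).

```python
import numpy as np

def subsample_scan(xy, r, r_max=50.0, seg_gate=1.0, min_size=5, target=100):
    """The four subsampling steps described above (a sketch with assumed gates).
    xy: (N, 2) Cartesian points, r: (N,) ranges."""
    # Step 1: drop max-range returns.
    xy = xy[r < r_max]
    # Step 2: split into segments where consecutive points are close.
    gaps = np.linalg.norm(np.diff(xy, axis=0), axis=1) > seg_gate
    segments = np.split(xy, np.flatnonzero(gaps) + 1)
    # Step 3: drop segments with too few points.
    xy = np.vstack([s for s in segments if len(s) >= min_size])
    # Step 4: if still far above the target, keep every k-th point.
    k = int(np.ceil(len(xy) / target))
    return xy[::k] if k > 1 else xy
```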

3.3.3 Covariance of Laser Scan Alignment

It is very difficult to precisely determine the correct translation and rotation, even with an advanced method such as CRF-match, and it is important to correctly estimate this uncertainty of the translation and rotation. Initially, a constant diagonal covariance matrix was used for experiments. Although the results were not poor, it was evident that they could be improved with an estimate of the covariance matrix of the alignment.

In [23] the authors linearized the error function that is minimized by the alignment procedure, and then an analytical solution of the covariance matrix could be found through linear regression. Another approach was taken in [4], where the authors computed the covariance using the Hessian of the error function.

Neither of these approaches is satisfactory though, as the former method tends to underestimate the uncertainty and the latter method tends to overestimate it. Another drawback is that neither method takes measurement noise into account.

The method used in this thesis, called line variance, is based on the covariances of the individual laser points in the two scans that are aligned. These covariances are combined using a nearest neighbour approach, giving a covariance of each point to point translation and rotation. The covariance of the total translation and rotation is then calculated from the point to point covariances.

Figure 3.7. Estimated covariance of laser scan alignment, comparing the line variance, sample, and Hessian methods. The ring and straight line represent the vehicle relative pose. Results from the Hessian and sample method are too small to be seen in this figure.


standard deviation was set to 1 degree. This did, however, result in covariances that were considered too optimistic. As a solution to this, range and bearing standard deviations were increased. This gave realistic results, although the increased standard deviations for the measurement noise are much higher than the manufacturer's specifications (statistical error (1-sigma) for range is 5 to 10 mm).

For comparison, an implementation of the Hessian method in [4] was tested, along with a sample algorithm based on a method that was suggested in [39]. The Hessian method was found to yield covariances that were far too small, with standard deviations in the range of 1 to 2 centimeters in the x and y directions. The sample algorithm randomly samples initial translation and rotation, and then runs ICP to align the scans. The covariance matrix is then calculated as the covariance of the sample ICP results. However, this method was also found to yield covariances that were far too small.

As a final comparison, a method that combined the sampling method with the line variance method was tested. Samples were generated in the same way as above, and the covariance of each sample was then computed with line variance. This method gave approximately the same results as line variance on its own, but took longer to compute. Therefore, line variance was used without sampling. Results for covariance computation using the four methods described are presented in Figure 3.7. The figure only shows the uncertainty in translation and not the uncertainty in the third dimension, rotation.

A method for covariance computation that is reliable and robust would be preferred to a method with inflated range and bearing noise. However, designing such a method is beyond the scope of this thesis.


Chapter 4

Image Processing

This chapter is about image comparison. It starts by describing the concept of extracting feature descriptors from images. Features are certain areas of interest in an image described in vector form. The Tree of Words algorithm is thereafter introduced as a way of describing and compressing an image and comparing it to a database of images, using predefined descriptors.

4.1 Feature Extraction

A feature extractor detects and describes interesting areas of an image in vector form. The area to be described, usually an interest point surrounded by an elliptical area, is selected by a region detector. The detector must show a high degree of repeatability, i.e. the ability to detect the same area in similar images. The area is described using a region descriptor that must be able to describe similar regions in a similar way for later comparison. There are many different ways of doing this and some common ones are briefly described. Refer to the citations for a more thorough presentation of the methods.

4.1.1 Region Detectors

Region detectors usually detect either regional changes in an image, or regions that are homogeneous in some way. Imagine an image of a blue painting hanging on a yellow wall: the former detects the change from blue to yellow while the latter detects the blue and the yellow regions separately.

One of the oldest and most widely used region detectors is the Harris corner detector. It was first proposed by Harris and Stephens [14] as a way of using the eigenvalues of the second order moment matrix, also known as the autocorrelation matrix, applied to an area smoothed by a Gaussian window of scale $\sigma_l$, to detect corners and edges. The scale-adapted second moment matrix in a point $\mathbf{x} = (x, y)$ is defined by

\[
\mu(\mathbf{x}, \sigma_l, \sigma_D) = \sigma_D^2\, g(\sigma_l) * \begin{bmatrix} l_x^2(\mathbf{x}, \sigma_D) & l_x l_y(\mathbf{x}, \sigma_D) \\ l_y l_x(\mathbf{x}, \sigma_D) & l_y^2(\mathbf{x}, \sigma_D) \end{bmatrix} \tag{4.1}
\]

where $\sigma_D$ is the differentiation scale, $*$ is convolution and $l_x(\mathbf{x}, \sigma)$ is the Gaussian derivative computed in the $x$ direction [26].

The Harris corner detector's inability to handle scale differences led to the introduction of automatic scale selection as a means of choosing the ideal $\sigma_D$. Characteristic scale is selected on maximum similarity between the feature detector operator and the image based structure around the interest point [18]. Differentiation scale is chosen as characteristic scale. The characteristic scale is preferably determined using the Laplacian operator [25]. An affine transformation between two images would make a circular region around an interest point cover different areas. To handle this, an elliptical region around the interest point is selected. The shape and direction of the elliptical region are based on the eigenvalues of the second moment matrix [19, 2]. This has been shown to make the interest regions fairly stable to affine transformations up to an angular difference of about 30° [28]. Finally, rotation invariance was introduced using normalization based on the orientation of the dominant gradient of the area of interest [26].

The Hessian matrix provides an alternative way of finding interest points. It is defined as

\[
\mathbf{H}(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{yx}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \tag{4.2}
\]

where $\sigma$ is scale and $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2}g(\sigma)$ with the image $I$ in point $\mathbf{x}$ [3]. The location and scale of interest points are selected by studying the determinant of the Hessian. Detectors using the Hessian matrix have been shown to respond strongly to shapes like blobs and ridges [28]. Scale, rotation and affine transformation are handled in the same way as for Harris points.

Maximally stable extremal regions, MSER [24], are homogeneous regions of the same shade. To compute them, the image is first turned into greyscale and a blank canvas the size of the image is created. First the darkest pixels of the image are added to the canvas. All connected and single pixels are considered to be separate regions. Then the second darkest pixels are added. If a new pixel has no neighbours it forms a new region. If it is connected to a previous region, that region grows. If it connects two regions they are turned into one region. The whole scale of pixel intensities is consecutively added and the growth rate of all regions is tracked throughout the session. An MSER is a region whose growth rate has a local minimum, i.e. a region much darker than all pixels around it, so the growth of that region pauses. After all such regions have been detected, the whole procedure is repeated starting with adding the brightest pixels to the blank canvas. This produces regions of arbitrary shape with a distinct difference in pixel intensity to the background. For a fast FPGA implementation, see [17].


4.1.2 Region Descriptors

Feature descriptors describe the region specified by the region detector in vector form. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector.

The simplest descriptor is a vector of image pixels around the interest point. These descriptors usually suffer from high dimensionality, causing complex computations.

Distribution based descriptors use histograms to represent the region. One such descriptor describes the distribution of the pixel intensities in a histogram. This solution obviously suffers from illumination changes. An alternative approach is to look at the reciprocal relations between pixel intensities rather than the intensities themselves, i.e. studying the shape of the histogram rather than focusing on the values [40]. Lowe [21] divided the region into n subregions and then divided each subregion into m subsubregions in which the gradient is computed. The subregion uses a histogram to classify its m gradients into 8 predefined orientations. The descriptor vector of the whole region is the n appended histograms [22].

Other approaches include Spatial-Frequency techniques that use the Gabor transform [13] to describe the frequency of the region, and differential descriptors that compute local derivatives up to a given order to approximate the region [27]. For a performance evaluation of region descriptors, see Mikolajczyk and Schmid [27].

4.1.3 Implemented Extractors

Implementations of some of the above mentioned region detectors and descriptors can be downloaded for free.

The robotics group at the University of Oxford offers binaries for both detectors and descriptors which can be used in any combination: http://www.robots.ox.ac.uk/~vgg/research/affine/index.html.

For MatLab implementations of some extractors, see Andrea Vedaldi's page: http://www.vision.ucla.edu/~vedaldi/index.html.

Scale Invariant Feature Transform

Possibly the most commonly used feature extractor today is the Scale Invariant Feature Transform (SIFT) [21]. It is a rotation and affine invariant Harris point extractor that uses a difference-of-Gaussian function instead of a Laplacian to determine scale. It alternates between smoothing with a Gaussian kernel and subsampling, and each representation is obtained by subtracting two successive smoothed images. Local extrema in the image pyramid 3D space represent the position and scale of an interest point. Difference-of-Gaussian is computationally advantageous compared to Laplacian methods. For region description SIFT uses gradients in subspaces. It uses a 4 x 4 subspace, each represented by the gradients in 8 orientations. Each feature is therefore described using a 4 x 4 x 8 = 128 byte vector. The implementation used is made by Lowe and can be downloaded from http://www.cs.ubc.ca/~lowe/keypoints/


SIFT has proven to be an excellent feature extractor - fast, robust and with a high degree of repeatability [27].

Feature extraction of a 480 x 640 pixel greyscale image of an urban environment takes about 2 - 3.5 seconds and returns 1000 to 2500 features, see Figure 4.1. A 1024 x 1360 greyscale underwater environment takes about 8 - 12 seconds to process and returns 3000 to 6000 features. Both feature extractions were run on a 2.8 GHz Intel Pentium 4 with 2.5 GB RAM running Ubuntu.

A large portion of the features, ∼50-70%, are high-frequency features that do not appear in a similar image. High-frequency features describe small objects, such as a single leaf in a picture of a tree. This estimate comes from feature matching of similar images.

Figure 4.1. Extracted SIFT descriptors. The position, size and direction of an arrow illustrate the size and the direction of a region of interest.

Speeded Up Robust Features

Speeded Up Robust Features (SURF) was developed to be a fast, yet reliable, extractor, taking the computational simplifications a step further [3, 30]. It detects regions of interest using Hessian box filters, which are precomputed approximations of common Gaussian second order partial derivatives. These filters come in different sizes to determine scale, and can be convolved with the image very fast using so called integral images.
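The integral image trick can be sketched in a few lines of MatLab. This is our own illustration; 'scene.png' is a hypothetical input and boundary handling is omitted:

% Integral image: entry (r, c) holds the sum of all intensities in I(1:r, 1:c).
I = double(imread('scene.png'));      % hypothetical greyscale input image
ii = cumsum(cumsum(I, 1), 2);
% The sum over any box I(r1:r2, c1:c2) then needs only four lookups,
% independent of the box size (assumes r1 > 1 and c1 > 1):
r1 = 10; c1 = 10; r2 = 40; c2 = 40;   % an arbitrary example box
s = ii(r2, c2) - ii(r1-1, c2) - ii(r2, c1-1) + ii(r1-1, c1-1);

This is what makes the box filter responses equally cheap to evaluate at all scales.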

The region of interest is circular and its orientation is calculated using Haar wavelets. For region description, a square region with the same orientation as the interest region is selected and divided into 4 x 4 subspaces. In each of these subspaces the orientation is calculated in the x and y directions, dx and dy, using Haar wavelets. Subsequently, dx, dy, |dx| and |dy| are used to describe each subspace, resulting in a final descriptor vector of length 4 x 4 x 4 = 64 bytes.

A binary implementation of SURF can be downloaded from http://www.vision.ee.ethz.ch/~surf/index.html. Lowe's implementation of SIFT contains MatLab code for calling a feature extracting binary. It can easily be modified to extract SURF features.

Feature extraction of a 480 x 640 pixel greyscale image of an urban environment takes about 0.4 - 1.0 seconds and returns 300 to 1000 features, see Figure 4.2. As for SIFT, many of the features are high-frequency features that do not appear in a similar image. A 1024 x 1360 pixel greyscale image of an underwater environment takes about 1.0 - 2.0 seconds to process and returns 400 to 1000 features. Execution time depends mostly on image size and the number of extracted features.

Figure 4.2. Extracted SURF descriptors. The position and size of a circle illustrates a region of interest.

4.2 Tree of Words

For the image based loop closure detection we have used the Tree of Words algorithm. Tree of Words has previously been used to find the closest matches of an image in a large database, ∼1 million images [34]. The algorithm is simple and fast, which suits our SLAM application. An introduction to Tree of Words and a modified algorithm for safer loop closure detection are given below.


4.2.1 Background

Tree of Words and Bag of Words use the same basic principle for image compression. Feature descriptors are extracted from an image using SIFT or SURF, see Section 4.1. Each descriptor is compared to a large number of predefined descriptor vectors, called words, to find its nearest match. If descriptor a is classified as word x, the image is said to contain word x, no matter exactly how well it matches. The feature descriptor itself is thereafter dismissed; all that remains is the presence of word x in the image.

The set of predefined words is called the vocabulary. Tree of Words introduced a hierarchical tree search for fast classification of the descriptors, called the vocabulary tree.

After all extracted descriptors from an image have been classified as different words, the image can be described as a list of these words. This list is subsequently saved, since this highly compressed representation of the image can easily be compared to a different image's list.

These lists of words are organised using perfect indices. A perfect index is an index where each item has a complete list of its occurrences in, for example, a text. In Tree of Words, each word in the vocabulary is linked to its own index containing all the images in the database where the word has been present.

When a query image is compared to all previously classified images in a database to find the closest match, its features are classified as above and each word it contains returns its perfect index. By comparing the indices returned by the words, a list of similar images can easily be retrieved.

For the SLAM solution, this enables a fast way to compare a new scene to the large number of scenes saved earlier in the experiment, in order to detect loop closures, see Algorithm 2.

Algorithm 2 Tree of Words - basic database query of image I
Require: A precomputed tree with a database of classified images
 1: Extract feature descriptors from image I
 2: Classify each descriptor as a word using the tree
 3: Use the inverted indices of each word to find the most likely match, I_similar
 4: Return I_similar
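Step 3 of Algorithm 2 could be sketched in MatLab as follows. This is a simplified, unweighted version, and the function and variable names are our own; each word's index is assumed to list every image at most once:

function Isimilar = findmatch(words, index, nImages)
% FINDMATCH  Return the database image sharing the most words with a query.
%   words   - vector of word IDs extracted from the query image
%   index   - cell array; index{w} lists the IDs of all images containing w
%   nImages - number of images in the database
votes = zeros(nImages, 1);
for w = words(:)'                  % loop over the query image's words
    ids = index{w};                % the perfect index of word w
    votes(ids) = votes(ids) + 1;   % one vote per image containing the word
end
[~, Isimilar] = max(votes);        % the image sharing the most words
end

In the full system the votes are also weighted by how discriminative each word is, cf. the word weights and scoring normalizers stored in the rootnode, Section 4.2.2.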

4.2.2 Building a Vocabulary Tree

The vocabulary is built using a large number of feature descriptors extracted from an image database. This database does not necessarily need to contain images of the environment in which the experiment will take place. These descriptors are clustered into a tree structure using k-means clustering, see Appendix A.

The rootnode uses all descriptor vectors to compute k clusters. Each cluster, described with a cluster centre vector, is represented in the tree by a node. These nodes use all the descriptors assigned to their cluster, i.e. all descriptors that are most similar to the node’s cluster vector, to create k new child nodes each.


This procedure continues down to a predefined depth, to the tree's leaves, which each represent a word. All the leaves' cluster vectors constitute the vocabulary. Each node contains its child nodes and their k-means cluster vectors for easy comparison. 10 branches in each node and 5 or 6 levels gives a vocabulary of 100'000 or 1'000'000 words.

Algorithm 3 Build a tree with depth 2
 1: Collect a lot of images
 2: Extract ∼500'000 - 1'000'000 feature descriptors
 3: Compute k cluster centre vectors, µ_i, using k-means clustering, see Appendix A
 4: Create rootnode with k child nodes, each described by a µ_i
 5: for i = 1, . . . , k do
 6:   Find set S_i: all initial descriptors most similar to µ_i
 7:   Compute k cluster centre vectors µ_j of S_i
 8:   Create node with k child nodes, each described by its µ_j
 9:   for j = 1, . . . , k do
10:     Create leaf with µ_j as its word
11:   end for
12: end for

The tree is constructed in MatLab using struct. Each node is a struct containing its child nodes' cluster vectors, the k child nodes and the node ID. The leaves of the tree represent the words and are identified by the node ID.

k-means clustering is performed using MatLab’s function kmeans.

The rootnode also contains the words’ weights, normalizers for scoring, number of added images, feature extractor used, etc.
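A minimal sketch of how such a node struct could be built recursively is given below. The function and field names are our own illustration; the real implementation stores the additional fields described above:

function node = buildnode(descriptors, k, depth, id)
% BUILDNODE  Recursively cluster descriptors into a vocabulary tree node.
%   descriptors - N x D matrix with one feature descriptor per row
%   k, depth    - branching factor and remaining depth of the tree
%   id          - node ID; here the path of branch choices from the rootnode
node.id = id;
node.children = {};                     % leaves, i.e. words, keep this empty
if depth > 0
    % Cluster with cosine distance to match the angular comparison used
    % during classification (error handling for small clusters omitted).
    [idx, node.centres] = kmeans(descriptors, k, 'Distance', 'cosine');
    for i = 1:k
        Si = descriptors(idx == i, :);  % the descriptors of cluster i
        node.children{i} = buildnode(Si, k, depth - 1, [id, i]);
    end
end
end

Called as, e.g., tree = buildnode(D, 10, 5, []), this gives the 10 branches and 5 levels mentioned above, i.e. 10^5 = 100'000 words.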

4.2.3 Descriptor Classification

A fixed vocabulary makes classification of an extracted feature easy. At each level of the tree the descriptor vector is compared to the child nodes' cluster vectors to find its nearest match, here defined as the cluster vector to which the descriptor vector has the smallest angular difference. The descriptor then descends into that cluster's child node, where it is compared to this node's child nodes' cluster vectors. In this way the most similar leaf, i.e. word, is found for each descriptor vector, and the image's ID (Section 4.2.4) is added to the word's perfect index.

The tree structure keeps the number of comparisons needed to classify a feature descriptor as a word very low. For each level, k comparisons are needed, so with tree depth d a total of k · d comparisons are needed for every descriptor vector. This makes it possible to use a larger vocabulary, which gives a more reliable matching procedure.

The angle θ between two vectors a and b can be calculated using the scalar product, cos θ = (a · b)/(|a| |b|); the smaller the angle, the more similar the vectors.
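As an illustration, the descent could be implemented in MatLab as follows, assuming the node struct sketched in Section 4.2.2 (function and field names are our own):

function wordID = classify(node, d)
% CLASSIFY  Descend the vocabulary tree to the word nearest descriptor d.
%   node is the rootnode of the tree, d a 1 x D feature descriptor.
d = d / norm(d);                              % normalise the query descriptor
while ~isempty(node.children)                 % stop at a leaf, i.e. a word
    C = node.centres;                         % the k child cluster vectors
    cosang = (C * d') ./ sqrt(sum(C.^2, 2));  % cosine of angle to each child
    [~, best] = max(cosang);                  % largest cosine = smallest angle
    node = node.children{best};               % descend into nearest cluster
end
wordID = node.id;
end

With k comparisons per level, this performs exactly the k · d comparisons discussed above.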
