
Institutionen för systemteknik

Department of Electrical Engineering

Examensarbete (Master's thesis)

Simultaneous Localisation and Mapping of Indoor Environments Using a Stereo Camera and a Laser Camera

Master's thesis carried out in Automatic Control at the Institute of Technology, Linköping University

by

Jon Bjärkefur and Anders Karlsson

LiTH-ISY-EX--10/4427--SE

Linköping 2010

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Simultaneous Localisation and Mapping of Indoor Environments Using a Stereo Camera and a Laser Camera

Master's thesis carried out in Automatic Control at the Institute of Technology, Linköping University

by

Jon Bjärkefur and Anders Karlsson

LiTH-ISY-EX--10/4427--SE

Supervisors: Fredrik Lindsten

isy, Linköpings universitet

Zoran Sjanic

isy, Linköpings universitet

Christina Grönwall

FOI

Joakim Rydell

FOI

Examiner: Thomas Schön

isy, Linköpings universitet


Division, Department: Division of Automatic Control, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2010-06-20
Language: English
Report category: Examensarbete (Master's thesis)
URL for electronic version: http://www.control.isy.liu.se, http://www.ep.liu.se
ISRN: LiTH-ISY-EX--10/4427--SE

Title (Swedish): Simultan lokalisering och kartering av inomhusmiljöer med en stereokamera och en laserkamera
Title: Simultaneous Localisation and Mapping of Indoor Environments Using a Stereo Camera and a Laser Camera

Author: Jon Bjärkefur and Anders Karlsson


Abstract

This thesis describes and investigates different approaches to indoor mapping and navigation. A system capable of mapping large indoor areas with a stereo camera and/or a laser camera mounted to e.g. a robot or a human is developed. The approaches investigated in this report are either based on SLAM (Simultaneous Localisation and Mapping) techniques, e.g. Extended Kalman Filter-SLAM (EKF-SLAM) and Smoothing and Mapping (SAM), or registration techniques, e.g. Iterated Closest Point (ICP) and Normal Distributions Transform (NDT).

In SLAM, it is demonstrated that the laser camera can contribute to the stereo camera by providing accurate distance estimates. By combining these sensors in EKF-SLAM, it is possible to obtain more accurate maps and trajectories compared to when the stereo camera is used alone.

It is also demonstrated that by dividing the environment into smaller parts, i.e. submaps, it is possible to build large maps in close to linear time. A new approach to SLAM based on EKF-SLAM and SAM, called Submap Joining Smoothing and Mapping (SJSAM), is implemented to demonstrate this.

NDT is also implemented and the objective is to register two point clouds from the laser camera to each other so that the relative motion can be obtained. The NDT implementation is compared to ICP and the results show that NDT performs better at estimating the angular difference between the point clouds.


Acknowledgments

A lot of people have been helpful during this work. When we started the work, we soon realised that mapping and navigation is a very broad and at the same time deep topic. However, thanks to good guidance in the beginning, especially from our examiner Dr Thomas Schön, the project got off to a good and smooth start.

This master thesis has been performed at the Swedish Defence Research Agency, FOI. Our supervisors at FOI, Dr Joakim Rydell and Dr Christina Grönwall, have been very helpful. We have had many interesting discussions and lately they have proofread our report in a rigorous and helpful way. We would also like to thank our supervisors at ISY, Zoran Sjanic and Fredrik Lindsten. Dr Jonas Nygårds helped us with interesting discussions when we defined the main part of the project. Also a big thanks from Anders Karlsson to Ulrike Münzner, whose happy heart has power to make a stone a flower. Her support and inspiration made this work much easier.


Contents

1 Introduction
  1.1 Simultaneous Localisation and Mapping
  1.2 A Current Application of SLAM
  1.3 Motivation for Indoor SLAM
  1.4 Laser Scan Registration
  1.5 Goal and Problem Definition
  1.6 Limitations
  1.7 Method

2 Sensors
  2.1 Bumblebee Stereo Camera
  2.2 PMD Laser Camera
  2.3 The Rig
  2.4 Camera Calibration

3 Feature Extraction
  3.1 Feature Detection in Stereo Vision
    3.1.1 Scale Invariant Feature Transform (SIFT)
    3.1.2 Speeded Up Robust Features (SURF)
    3.1.3 CenSurE and SUSurE
  3.2 Comparison of SIFT and SURF
    3.2.1 Stereo Pair Features
  3.3 Laser Vision
  3.4 Conclusion

4 Data Association
  4.1 Measurements
  4.2 Nearest Neighbour Association
  4.3 Mahalanobis Association
  4.4 Removal of Unstable Landmarks
  4.5 Conclusion

5 SLAM Models
  5.1 2D SLAM With Simulated Measurements
    5.1.1 Rig State
    5.1.2 Coordinate Systems
    5.1.3 Measurement Model
    5.1.4 System Dynamics
  5.2 3D SLAM With Stereo Camera Measurements
    5.2.1 Robot State
    5.2.2 Coordinate Systems
    5.2.3 Measurement Model
    5.2.4 System Dynamics

6 SLAM Algorithms
  6.1 The SLAM Problem
  6.2 Extended Kalman Filter SLAM
    6.2.1 Time Update
    6.2.2 Measurement Update
    6.2.3 3D Visual EKF-SLAM
    6.2.4 Pros and Cons With EKF-SLAM
    6.2.5 Partitioned Updates
  6.3 Submaps
    6.3.1 Conditionally Independent Divide and Conquer SLAM
    6.3.2 Conditionally Independent Graph SLAM
    6.3.3 Tectonic SAM
  6.4 Information Form SLAM
    6.4.1 Sparse Extended Information Filter
    6.4.2 Exactly Sparse Extended Information Filter
  6.5 Smoothing and Mapping
    6.5.1 The Least Squares Problem
    6.5.2 The Measurement Matrix A and the Vector b
    6.5.3 Incremental Smoothing and Mapping
  6.6 Complexity of SLAM Algorithms
    6.6.1 Choice of Algorithms for This Thesis

7 Submap Joining Smoothing and Mapping
  7.1 Conditionally Independent Submaps
  7.2 Coordinate Systems and Rotation Matrices
  7.3 Measurement Equation
  7.4 System Dynamics
  7.5 Algorithm
  7.6 Time Complexity
  7.7 Consistency
  7.8 Similar Work

8 Laser Scan Registration
  8.1 Iterative Closest Point
  8.2 Normal Distributions Transform
    8.2.1 2D-NDT
    8.2.2 3D-NDT

9 Experimental Results With SLAM
  9.1 Experiments With EKF
    9.1.1 Stereo Camera Mounted to a Rig
    9.1.2 Stereo Camera Held in Hand
    9.1.3 Fireman Looking Around in a Room With the Stereo Camera
    9.1.4 Sweeping Movement - Validity Check of Angle Estimates
    9.1.5 Mapping of a Conference Room
    9.1.6 Consistency Problems
  9.2 Experiments With SJSAM
    9.2.1 Submaps
    9.2.2 Merging of Submaps
    9.2.3 Summary
    9.2.4 SJSAM on the Workshop Data
  9.3 Discussion
  9.4 Future Work
    9.4.1 EKF-SLAM
    9.4.2 SJSAM

10 Experimental Results With Laser Scan Registration
  10.1 Data Collection
  10.2 Iterative Closest Point
    10.2.1 Rotation
    10.2.2 Reducing the Number of Points
    10.2.3 Creating a Trajectory
  10.3 Normal Distributions Transform
    10.3.1 Grid Size
    10.3.2 Rotation
    10.3.3 Reducing the Number of Points
    10.3.4 Creating a Trajectory
  10.4 Discussion
  10.5 Future Work

11 SLAM With the Stereo Camera and the Laser Camera
  11.1 Laser Camera Only
  11.2 Laser and Stereo Camera
    11.2.1 Pixel Mapping
  11.3 Laser and Stereo Camera With Fused Depth Information
  11.4 Experimental Results
    11.4.1 Reference Results - Stereo Camera Only
    11.4.3 Stereo and Laser Camera With Fused Depths
  11.5 Discussion

12 Conclusion

A NDT


Chapter 1

Introduction

The available technologies of the twenty-first century offer great possibilities in the art of making maps. While the first maps were created with information from travelling merchants and explorers, modern maps are made from satellite images and aerial photographs. The advanced technologies have also made navigation much easier. It is no longer necessary to use sextants, stars and other tricks. Satellites orbit the earth and can provide anyone using a Global Positioning System (GPS) with an accurate position estimate in seconds.

A major shortcoming of the GPS system is that it does not work indoors. Neither is it possible to take overview photos of building interiors. Indoor mapping and localisation therefore require other techniques, and remain an open and active research area.

1.1 Simultaneous Localisation and Mapping

Simultaneous Localisation and Mapping (SLAM) deals with the problem of making maps of previously unknown environments while simultaneously performing localisation within these maps. SLAM is a very active research area and thus there exist a great number of different solutions to the problem.

A SLAM algorithm will need to keep track of both the moving platform itself and the world around it. The world is usually modelled using landmarks. A landmark is a distinct feature in the world, at a specific location, that can be used for navigation. An example of a real world landmark is a lighthouse.

There are many different parts in SLAM that need to be solved. Figure 1.1 illustrates an example that will be used to describe these parts. The figure shows a robot that moves through a corridor between time t = 1 and t = 2. The robot is equipped with a camera, so that it can see the environment in front of it. For each captured image, the following steps need to be performed:

• Feature extraction, which means extracting interesting points from the newly acquired images.

• Data association, that is, matching the new features to corresponding landmarks if possible.

• Pose estimation, which means that a new pose is calculated using the information from the new features and the data association.

Figure 1.1: A black robot moves in an unknown corridor and observes corners (marked with red circles).

At t = 1, an image is captured by the robot. This image is processed with a feature extraction system that finds corners, marked with red circles in Figure 1.1. Three corners are found. Since the world is completely unknown at t = 1, i.e. the robot's map is empty, these landmarks can be added to the map right away and no data association is needed. The pose estimation is also not needed now, since the robot has not moved yet.

Another image is captured at t = 2. The feature extraction system finds four corners. The data association system must now find out if any of these observed corners have been seen before, i.e. at t = 1. It realises that two of the corners have been seen before and returns two matches. The pose estimation system can now use these matches to estimate the relative movement between t = 1 and t = 2. This will also give an estimate of the speed of the robot and improved estimates of the matched corners' positions.
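The loop over captured images can be summarised schematically as below. This is only an illustration of the three steps; all function names are placeholders and not the implementation developed in this thesis.

    # Schematic SLAM front-end loop. All functions are placeholders for illustration only.
    landmarks = []                 # the map is empty at t = 1
    pose = initial_pose()          # current estimate of the platform pose

    for image in camera_stream():
        features = extract_features(image)                   # e.g. corners in the new image
        matches, unmatched = associate(features, landmarks)  # match against known landmarks
        pose = estimate_pose(pose, matches)                  # relative movement since last image
        landmarks += initialise_landmarks(unmatched, pose)   # unmatched features become landmarks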

1.2 A Current Application of SLAM

SLAM technology is already utilised around the world. The most common usage is in the field of autonomous robotic vacuum cleaners, where the robots are equipped with a wide-angle camera looking upwards. The SLAM systems then extract features from the ceiling and track these. It is then possible to map a whole apartment or house so that the robots can avoid cleaning the same spot twice during the same session.

Ceiling based visual SLAM (CV-SLAM) was first introduced in [20]. The somewhat fascinating results can be seen at http://www.youtube.com/watch?v=bq5HZzGF3vQ. A vacuum cleaner is mapping an entire apartment while at the same time cleaning it more or less optimally.

One of the most recent CV-SLAM vacuum cleaners that are available to customers is the Samsung Navibot. It has a wide-angle camera with a field of view of 167 degrees, capturing images at 30 frames per second. The robot makes a map and uses it to localise itself. It is possible for the robot to go home to its charging station and then go back to work at the place where it was before.

1.3 Motivation for Indoor SLAM

Imagine that a large building is on fire. Several firemen run around looking for people to rescue. A task force leader is standing outside of the building and needs to have an overview of where people are found and what the structure of the building is. There has been no time to get the blueprints of the building.

Each fireman has a small stereo camera attached to his or her helmet. All cameras send live video footage to a main computer in the fire truck. The main computer processes the image flows and calculates how each fireman is moving and what the corridors and rooms that he or she sees look like. A combined map is built on the go in real time and the task force leader can see it while it is created. When a fireman sees a person in need of rescue, he communicates this to the task force leader. The leader can now push a button and the main computer marks the location in the map.

The combined map and the trajectories of all the firemen can be used to see where no one has looked. It can also be used to plan the ongoing mission. Furthermore, it is easy to see that the same technology can also be used by e.g. the police or the military.

1.4 Laser Scan Registration

Two cameras are available for the SLAM systems in this thesis. The first one is a stereo camera. A stereo camera consists of two cameras mounted a small distance apart, but looking in the same direction. The benefit of this is the same as humans have with their two eyes: it is possible to know approximately how far away an object is.

The second camera is a laser camera. This camera gives accurate estimates of the range to each part of the observed scene. Chapter 2 contains more information about these sensors. The laser camera is a relatively new sensor. The resolution in pixels is very limited compared to a modern digital camera. The big difference for the laser camera is that each pixel measures the intensity and the amplitude of the reflected light from objects in the world. Its biggest advantage is that it can also measure the distance to objects in the world.

These measurements are performed by illuminating the environment with light of a known wavelength. By checking the phase of the incoming light the distance to the object depicted at a certain pixel can be calculated. A measurement of this type gives the ability to perform a different form of navigation.
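For a continuous-wave time-of-flight camera of this kind, the measured phase shift maps to distance as d = c · Δφ / (4π · fmod), where fmod is the modulation frequency. The snippet below only illustrates this relation; the modulation frequency is an assumed example value, not a CamCube specification.

    import numpy as np

    C = 299792458.0      # speed of light [m/s]
    F_MOD = 20e6         # assumed modulation frequency [Hz]; example value only

    def phase_to_distance(delta_phi):
        """Distance corresponding to a measured phase shift (continuous-wave ToF)."""
        return C * delta_phi / (4.0 * np.pi * F_MOD)

    print(phase_to_distance(np.pi / 2))   # a phase shift of pi/2 is roughly 1.9 m here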

Imagine that the camera looks at a completely flat wall 2.5 m away. In a perfect world this means that each pixel in the range image will measure a distance of 2.5 m. When moving the camera forward 0.5 m each pixel will measure a distance of 2.0 m. Now say that the movement of 0.5 m is unknown, which will be the case when just moving the camera around. Then there exist methods for calculating this motion in an efficient way.

It can be thought of as walking in a room and looking around. For each step that is taken, the change in the environment is used to deduce how long the step was and how the viewing angle changed. By taking a step forward and noting the length and direction each time, the complete path can be recreated easily. Doing this with a laser camera is called laser scan registration, and for this there exist multiple methods. In this thesis ICP and NDT will be used for laser scan registration.

1.5 Goal and Problem Definition

The goal of this thesis is to investigate different aspects of indoor mapping using the stereo camera and the laser camera. The outcome shall be a report of these aspects along with a SLAM system capable of making large indoor maps. More specifically the following questions shall be answered in this thesis:

1. Which SLAM techniques are suitable for making large high-quality maps of indoor environments?

2. How can the laser camera complement the stereo camera when making indoor maps?

3. What will the quality of the map be if the cameras are moved systematically compared to if they are moved more randomly?

4. What will the quality of the map be if the cameras are mounted to a robot compared to if they are handheld?

5. How do the two laser scan registration methods compare to each other?

1.6 Limitations

In this thesis there are some notable limitations. The first one is that all data from the stereo camera and the laser camera will be collected indoors. Collecting data indoors reduces problems with illumination, especially for the laser camera.

The second limitation is that there are no moving objects present in the images. Moving objects can be a problem since they affect the images in an unpredictable way. This means that there will be no people or moving machines etc. in the room when the data is collected.

Another limitation is that EKF-SLAM will not be used for the largest maps. The reason for this is that the computational time becomes a limiting factor. Computational time is also a limiting factor for the laser registration algorithms, and therefore these algorithms will only be applied to smaller data sets.

1.7 Method

Implementation of the SLAM algorithms is performed in steps of gradually increasing complexity. Gradually increasing the functionality of the algorithms makes it easier to spot errors that can be hard to detect later on. The implementation is done with the following steps:

1. 2D simulation with known data association.

2. 2D simulation with unknown data association.

3. Increase the dimensionality to three dimensions and repeat the two previous steps.

4. Extend the algorithm to use real world data.

This is an efficient way to detect problems with data association and other minor problems that can be almost impossible to detect with noisy data from the real world. Initial testing is done with the well-known EKF-SLAM method. Since SLAM algorithms become computationally demanding for large maps, a submap based approach will be used for the largest maps. The submaps will then be merged into one single map in an efficient way.

Real world data consists of features in the stereo images extracted by SURF [5]. Image processing is not the goal of this thesis and therefore an existing implementation of SURF will be used. The data is, if possible, collected with ground truth available. Data association is then performed with either of the following two methods:

• Nearest neighbour association.

• Mahalanobis distance association.

Laser scan registration is tested by using a set of positions marked with tape on the floor. The laser camera is moved between the points and collects a set of images at each. The two algorithms for registration are then compared to each other. As a final comparison, the laser camera will be moved along a trajectory to see which algorithm gives the most accurate result.


Chapter 2

Sensors

In this chapter the equipment used for the experiments will be described. The equipment consists of two cameras: the first one is a Bumblebee stereo camera and the second one is a 3D camera called CamCube. This chapter will describe both cameras and provide some sample images from them. Furthermore, the rig used for experiments will be described and illustrated using photos.

2.1 Bumblebee Stereo Camera

The Bumblebee stereo camera is developed by Point Grey Research. The camera is depicted in Figure 2.1. It has two sensors with a maximum resolution of 640 × 480 pixels. The resolution used in this thesis is 320 × 240 pixels since it provides sufficient data and the image processing time is shorter. Bumblebee is able to acquire images at 48 FPS according to the specifications. This is however not the case when using the Matlab interface on the current hardware. At best the camera reaches about 15 FPS.

The field of view (FOV) for Bumblebee is about 97° horizontally. This causes some rather serious fisheye effects on the images; however, the Software Development Kit (SDK) delivered with the camera can correct this distortion. Figure 2.2 provides an example of an image that is distorted together with an image that has been rectified using the SDK.

2.2 PMD Laser Camera

The camera used for obtaining 3D-scans is a PMD[vision] CamCube that can be seen in Figure 2.6. This camera has a resolution of 204 × 204 pixels, which gives a total of 41616 points in each scan. The SDK provided with the camera makes it possible to directly acquire multiple image types such as range image, intensity image, amplitude image and also a calibrated 3D point cloud. Figure 2.3 shows the three images and a 3D-scan of the same scene.


Figure 2.1: Point Grey Research Bumblebee stereo camera.

[Figure 2.2: (a) distorted image, (b) rectified image.]


[Figure 2.3: (a) range image, (b) amplitude image, (c) intensity image, (d) point cloud from the CamCube.]


Figure 2.4: Bumblebee mounted on a table.

According to the specifications the standard deviation is less than 3 mm at 2 m distance and the frame rate is specified to 25 FPS. In the Matlab implementation used to acquire images the frame rate seldom exceeded 12 FPS with the current hardware. The field of view is specified to 40° in both the horizontal and the vertical direction.

The 3D-scans are used for creating a trajectory. It is also possible to use any of the other images to detect points of interest and use the range image to acquire the distance. This approach will be tested in Chapter 11. CamCube has some advantages over an ordinary camera. One of them is being able to acquire images in total darkness.

2.3 The Rig

When using only the stereo camera a table with wheels was used. The camera was mounted on an arm attached to the table. For collecting data an ordinary laptop was used. This rig is depicted in Figure 2.4.

To be able to better follow ground truth the camera was also mounted on a tripod, which in turn was mounted on a frame with four wheels. Beneath the camera a rod that almost reached the floor was mounted. The only usage for the rod was to make it easier to follow a predefined trajectory taped on the floor. Using this rig the camera could be moved along a trajectory with quite good precision. Figure 2.5 shows the stereo camera mounted on the tripod.

When both cameras were used they were mounted on a tripod. In this setup the CamCube was mounted beneath the stereo camera. The stereo camera was approximately centered on top of the laser camera. The tripod with both cameras mounted can be seen in Figure 2.6.

Using this setup the cameras can see almost the same scene, apart from the fact that the stereo camera has a 97° FOV compared to 40° for the laser camera. Figure 2.7 shows the view from the laser camera as an overlay on the left image from the stereo camera. As can be seen in Figure 2.7 there is quite a large difference in FOV.

Figure 2.7: Image from Bumblebee's left camera in grey. The red overlay in the middle is what the CamCube laser camera sees. The image from the CamCube has been scaled to fit in the image from Bumblebee.

2.4 Camera Calibration

To be able to use the cameras together they have to be calibrated. For this the Camera Calibration Toolbox [33] for Matlab is used. This toolbox can be used to calibrate a single camera and a stereo pair. In this case the two cameras are a Bumblebee stereo camera and a CamCube laser camera. This results in a total of three cameras, so only the left camera of the stereo camera is utilised.

Calibration is done by first calibrating the cameras individually. The calibration is performed using a chessboard. The chessboard is tilted in various directions and for each pose an image from each of the cameras is saved. The four outer corners on the chessboard are then marked manually. The remaining corners are then marked automatically after specifying the number of squares on the board. Figure 2.8 shows how the corner extraction is performed. The green circles represent the four corners that are manually selected. Red crosses are the corners extracted by the calibration toolbox.
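The calibration in this thesis is done with the Matlab Camera Calibration Toolbox. Purely as an illustration of the same per-camera step, the sketch below shows a chessboard calibration with OpenCV; the board geometry, square size and file names are assumptions.

    import glob
    import cv2
    import numpy as np

    pattern = (8, 6)          # assumed number of inner corners on the board
    square = 30.0             # assumed square size [mm]
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

    obj_points, img_points = [], []
    for fname in glob.glob("chessboard_*.png"):           # assumed image file names
        gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    # Intrinsic parameters: focal length, principal point and distortion for one camera.
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    print(rms, K, dist)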

From the corners extracted from the images the toolbox then calculates the parameters for the camera. Table 2.1 shows the parameters for the Bumblebee camera together with the parameters given by the specifications.


Figure 2.8: Corner extraction in an image from Bumblebee.

Since the images from Bumblebee are rectified by the SDK, the distortion is set to zero. The results from the calibration are very promising for Bumblebee. All values are close to the specified values, and if the uncertainties are included the specified values lie in the intervals.

For CamCube there are no specifications available. The sensor has 204 × 204 pixels so the principal point can be approximated with 102 in both directions. Table 2.2 shows the parameters from the calibration of CamCube. As expected the principal point is close to 102. The other parameters are hard to verify, but it seems reasonable that f has the same value in both directions. Also, there is some distortion visible in the images acquired from CamCube. However, the distortion parameters are a bit uncertain and should not be relied on.

Calibrating the left stereo camera and the laser camera as a stereo pair gives the extrinsic parameters in Table 2.3. Measurements on the rig confirm that the parameters are accurate. It is of course hard to verify since it is unknown exactly where in the camera the sensor is located. Since the rotation is close to zero it is for simplicity approximated by zero.

Figure 2.9 shows the setup together with the different estimated poses of the chessboard. In total the chessboard has 14 different poses. Each pose is represented by a numbered plane in the figure. It is also obvious from the figure that the left camera has a wider field of view, which is correct according to the specifications.


Table 2.1: Parameters for Bumblebee.

Parameter          Specification   Estimation   Uncertainty
Focal Length       +142.7997       +142.4245    ±2.0323
                   +142.7997       +141.4339    ±2.0474
Principal Point    +160.4014       +159.7758    ±2.2912
                   +123.5766       +123.6706    ±2.1572
Skew               +0              +0           ±0
Distortion         +0              −0.0169      ±0.0571
                   +0              +0.1047      ±0.2649
                   +0              +0.0018      ±0.0057
                   +0              −0.0008      ±0.0058
                   +0              +0           ±0

Table 2.2: Parameters for CamCube.

Parameter          Estimation   Uncertainty
Focal Length       +283.8085    ±3.2114
                   +283.6723    ±3.2145
Principal Point    +101.0804    ±3.5411
                   +102.6485    ±3.2315
Skew               +0           ±0
Distortion         −0.3820      ±0.0650
                   −0.1316      ±0.6374
                   +0.0008      ±0.0021
                   +0.0004      ±0.0019
                   +0           ±0

Table 2.3: Position of right camera with respect to left camera.

x [mm] y [mm] z [mm] ψ [rad] θ [rad] φ [rad]

[Figure 2.9: Extrinsic parameters: the estimated chessboard poses shown relative to the left camera (Bumblebee) and the right camera (CamCube).]

Chapter 3

Feature Extraction

Feature extraction refers to methods for detecting interesting points in an image and describing them. There is no universal definition of what constitutes a feature, since this depends on the application. A loose definition is an "interesting" part of the image, which means a part that can be identified multiple times. First in this chapter, stereo vision, which uses traditional image processing, is treated and then feature extraction in laser data is described. The concept of features will be described for the respective cameras and then methods for extracting and matching features will be presented. As a brief summary the different methods will be tested and compared for a typical image or laser scan.

3.1 Feature Detection in Stereo Vision

In stereo vision a feature is usually a point of interest surrounded by a region that is selected by a region detector. A point of interest can be a corner or any other small distinctive object. It is important that the detector has a high ability to detect the same features in similar images, i.e. high repeatability; such a detector is crucial for the SLAM problem when using stereo vision. Some feature extractors will be briefly described in the next section and later on compared.

In the stereo vision case edges and corners are the most valuable features since they can easily be tracked between consecutive images. A very common detector is the Harris corner detector by Chris Harris et al. [17]. Other common modern detectors are SIFT and SURF, which are also invariant to scaling and rotation. More recent detectors are CenSurE [1] and SUSurE [15], which aim at real-time applications.

3.1.1 Scale Invariant Feature Transform (SIFT)

SIFT might be one of the most used feature detectors today. As the name suggests, this method [23] detects scale invariant interest points in images that can be used for reliable matching between images. Interest points extracted with SIFT are also invariant to rotation and quite robust against changes in illumination and viewpoint angle.

Interest points detected by SIFT are mostly corners. Another common type of interest point is single coloured surfaces. For a single coloured surface SIFT selects the centre as an interest point. This can be a problem for specular surfaces. Reflections on a specular surface can create an area that is lighter or in some other way distinctive. The problem with this is that this area moves when the camera moves, which results in an interest point that is "floating". As an example, consider a whiteboard. This is a specular surface and a reflection from e.g. a lamp in the room will move over the whiteboard as the position of the camera changes. This requires the measurement to be modelled with a rather large amount of noise.

With SIFT the initial image is convolved with Gaussians to create images separated by a scale factor k. These images are gathered in octaves. Each octave represents a doubling of the standard deviation σ. The number of pictures in each octave is s + 3, where s is an arbitrary number of intervals in each octave. From this the scale factor is computed as k = 2^(1/s). Let the difference-of-Gaussian function convolved with the image be

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ),

where D(x, y, σ) is the difference, G(x, y, σ) is the Gaussian, I is the image, and L(x, y, σ) is the image I convolved with the Gaussian G. When finding a point of interest, the interest point in D(x, y, σ) is compared to its eight neighbours in the current image and to the nine neighbours in each of the scales above and below. It is selected to be a feature only if it has the largest or smallest value among these 27 points. Points representing local extrema are called interest points. Each interest point is then assigned an orientation and a descriptor; refer to [23] for details. The descriptor is a 128-dimensional vector for each key point.
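A minimal sketch of the difference-of-Gaussian construction and the extremum test described above, using SciPy for the Gaussian filtering. It is only an illustration of the detection step, not a full SIFT implementation (no orientation assignment or descriptor).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_keypoints(image, sigma=1.6, s=3, threshold=0.02):
        """Interest point candidates from one octave of a difference-of-Gaussian pyramid."""
        k = 2.0 ** (1.0 / s)
        # s + 3 blurred images L(x, y, sigma * k^i), giving s + 2 difference images D.
        L = [gaussian_filter(image.astype(float), sigma * k ** i) for i in range(s + 3)]
        D = [L[i + 1] - L[i] for i in range(len(L) - 1)]

        points = []
        for i in range(1, len(D) - 1):
            for y in range(1, image.shape[0] - 1):
                for x in range(1, image.shape[1] - 1):
                    cube = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in D[i - 1:i + 2]])
                    v = D[i][y, x]
                    # Keep the point only if it is the largest or smallest of the 27 values.
                    if abs(v) > threshold and (v >= cube.max() or v <= cube.min()):
                        points.append((x, y, i))
        return points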

Since SIFT was introduced in 2004 there have been various implementations available in addition to Lowe's original implementation. In this thesis the implementation available in the open source library VLFeat [9] is used. VLFeat is implemented in C for efficiency and comes with a Matlab interface. Figure 3.1 shows a comparison between Lowe's and VLFeat's implementation.

3.1.2 Speeded Up Robust Features (SURF)

The goal with SURF [5] was to develop a novel scale- and rotation invariant interest point detector and descriptor which outperforms previously proposed schemes with respect to repeatability, distinctiveness and robustness, while still maintaining a computational cost as low as or even lower than SIFT. SURF can be used with different types of descriptors. For feature extraction in the stereo images the SURF-128 descriptor is used; this is an extended descriptor that is much more distinctive and not much slower to compute compared to the standard SURF-64 descriptor. Matching with SURF-128 is slower than with SURF-64 due to the double dimensionality. Another but not so useful descriptor is U-SURF, which is faster but not invariant to rotation.

Figure 3.1: VLFeat keypoints (blue) superimposed on D. Lowe's key points (red). As can be seen the results are almost identical. The image is retrieved from www.vlfeat.org.

SURF uses precomputed Hessian box filters to approximate second order Gaussian derivatives. The filters come in different sizes, e.g. 9 × 9, 15 × 15, 21 × 21 etc. Because of the use of integral images and box filters there is no need to iteratively apply the same filter to the output of a previously filtered layer. Instead a filter of any size can be applied directly and even in parallel. The difference between octaves is here the increase in size between filters, e.g. for the first octave the filter size increases by six, for the next octave it increases by twelve, etc.

Descriptors used by SURF are very similar to those used by SIFT, but with a little less complexity. The first step to reduce complexity is to create a reproducible orientation based on information from a circular region around the interest point, by calculating the Haar-wavelet response in the x and y directions. Then a square region is aligned to the selected orientation before the SURF descriptor is extracted from it. This square is divided into 4 × 4 subspaces and for every subspace the orientation is calculated using Haar-wavelets once again. This gives a four dimensional subspace descriptor in the format

v = (Σ dx, Σ dy, Σ |dx|, Σ |dy|),

where dx and dy are the wavelet responses over a subregion. This results in a descriptor that is 4 × 4 × 4 for each interest point. For the extended SURF-128 descriptor the sums are computed separately depending on sign, and the absolute values are split up according to sign, which gives a descriptor of size 4 × 4 × 8. There are various implementations available for SURF. One of them is part of the OpenCV Library, which is open source. Two other implementations are [6, 32], which are closed source but available for almost any platform. One of these implementations is Graphics Processing Unit (GPU) based. The GPU SURF implementation is described in [11]. This implementation runs entirely on the GPU, which leaves the CPU free for other computations. This makes it almost perfect for real-time SLAM.
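As an illustration of the descriptor layout described above, the sketch below sums responses over a 4 × 4 grid of subregions into (Σdx, Σdy, Σ|dx|, Σ|dy|) per subregion, giving a 64-dimensional vector. Plain image gradients are used in place of Haar-wavelet responses on an integral image, so this is not SURF itself, only the aggregation scheme.

    import numpy as np

    def surf_like_descriptor(patch):
        """64-D descriptor from a square patch around an interest point (4 x 4 subregions)."""
        dy, dx = np.gradient(patch.astype(float))   # stand-in for Haar-wavelet responses
        n = patch.shape[0] // 4
        desc = []
        for i in range(4):
            for j in range(4):
                sx = dx[i * n:(i + 1) * n, j * n:(j + 1) * n]
                sy = dy[i * n:(i + 1) * n, j * n:(j + 1) * n]
                desc += [sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()]
        desc = np.asarray(desc)
        return desc / (np.linalg.norm(desc) + 1e-12)   # normalise the descriptor vector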

3.1.3 CenSurE and SUSurE

Center Surround Extrema (CenSurE) as proposed by Agrawal et al. in [1] is yet another feature extractor. CenSurE is suited for real-time applications since it outperforms SIFT in repeatability and performance. The CenSurE detector consists of three steps. The first step is to compute the response to a bilevel Laplacian of Gaussian and filter weak responses. The second step is detection of local extrema and finally local extrema with strong corner response are detected using the Harris measure.

Speeded Up Surround Extrema (SUSurE) by Ebrahimi et al. [15] is based upon CenSurE, but takes simplifications even further to reduce computational cost without noticeable loss in performance. This makes SUSurE suitable for applications where computational power and memory are very limited, e.g. an unmanned aerial vehicle.

Due to the lack of an implementation of CenSurE and only one implementation of SUSurE available, these extractors will not be compared with SIFT and SURF. The implementation of SUSurE is available at http://www.cs.bris.ac.uk/Publications/pub_master.jsp?id=2001018. This is only a sample implementation, but an optimised version should be released soon according to the author. In its current state the application is only able to read a portable network graphics (PNG) file and write a new file with the extracted features as overlay. It is therefore not possible to use this implementation for SLAM.

3.2 Comparison of SIFT and SURF

In this thesis the SURF-128 descriptor is used. The main reason for this is that the SIFT descriptor is 128-dimensional. Using the same length for the SURF descriptors makes it possible to use VLFeat for matching between images. There is also a matching function that comes bundled with the implementation that works for standard and extended descriptors. This method is very simple and using VLFeat is preferable. Another reason for using the SURF-128 descriptor is that it has the best recognition rate, with an average recognition rate of 85.7 %, as shown in [5].

Both SIFT and SURF were run on a 3.00 GHz Pentium D with 2 GB of RAM using Windows XP. The number of features returned in each image is about 40−60. Since stable features are more important than many features when navigating, the thresholds were quite high. When using the default threshold for the SURF implementation it returns 3000 to 4000 features per image. Image resolution in this case is 640 × 480 pixels. Figure 3.2 shows the same image processed with SIFT and SURF. Both SIFT and SURF perform well when matching images from the stereo camera. Feature extraction in one image took about 30 ms with SIFT and 18 ms with SURF.

Another GPU-based implementation has also been tested and it yields about 85 frames per second (FPS) for an image with 640 × 480 pixels. This, however, shows that by running image processing on a GPU it is possible to perform feature extraction in real time. GPU-SURF depends on Nvidia's Compute Unified Device Architecture (CUDA) [22]. The GPU implementation was run on a 3.0 GHz Core 2 Quad with 8 GB of RAM and a Nvidia 9600 GT graphics card using 64-bit Gentoo Linux with Nvidia CUDA Toolkit 3.0. Since the GPU implementation uses a different data structure for descriptors, this implementation was only tested for performance comparison. A brief summary can be seen in Table 3.1. In one case SIFT extracted 151 features. This was because of the difficulty of determining a threshold that results in exactly 150 features.

Figure 3.2: The same image processed with SIFT from VLFeat and GPU-SURF. The image processed with GPU-SURF has an overlay and therefore uses a grey colour map.

From the computational times listed in the table it is obvious that SURF performs better than SIFT for a larger image. The time to extract features with SIFT seems to be proportional to the image size. Most notable is the GPU-SURF implementation, which has almost the same computational time for an image with twice the x and y resolution. This is because of the GPU's ability to perform parallel computations. The bottleneck when performing computations on a GPU is the time it takes for the data to reach the graphics card. This means that when changing the size of an image the texture data is handled differently to maintain performance. Also the number of threads can vary, which also affects the performance. The card used for this test has 64 cores, but newer cards from Nvidia have as many as 480 cores, which should be enough for quite large images.

3.2.1 Stereo Pair Features

For the stereo camera there are two images where features can be extracted. Since the images differ, the features are not necessarily the same for the two images, which means features have to be matched between the stereo pair. The previously mentioned package for feature extraction, VLFeat, comes with a function that identifies features that are present in both images. There is also a function for matching bundled with the SURF implementation.

Table 3.1: Computational time for different feature extractors.

Extractor   Time (ms)   Features   Image Resolution
SIFT        350         150        320 × 240
SURF-64     62          150        320 × 240
SURF-128    65          150        320 × 240
GPU-SURF    9.5         150        320 × 240
SIFT        1088        151        640 × 480
SURF-64     94          150        640 × 480
SURF-128    94          150        640 × 480
GPU-SURF    12          150        640 × 480
SIFT        4002        150        1280 × 960
SURF-64     219         150        1280 × 960
SURF-128    229         150        1280 × 960
GPU-SURF    20          150        1280 × 960

This matching function uses the descriptors to perform the matching. Let δl and δr be the descriptors for all features in the left and right image. By comparing each descriptor in the left image to all descriptors in the right image the best corresponding feature can be found. Finding the match for feature mi can be described as

mi = arg max_{j≠i} ⟨δi | δj⟩ / (||δi|| · ||δj||),   (3.1)

which means finding the descriptor δj that maximises the normalised scalar product ⟨δi | δj⟩. A feature is only considered to have a match if

max ⟨δi | δj⟩ / (||δi|| · ||δj||) ≥ α,

where α ∈ [0, 1]. Choosing a good value for α can be quite tricky since the descriptors are only rotation and scale invariant. This implies that there are other environmental conditions that affect the descriptors, e.g. illumination and viewing angle. Setting α ≈ 1 makes it hard to recognise features in different images and using a low value for α increases the possibility of false matches. This will be discussed further in Chapter 4.
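A minimal sketch of the matching rule in Equation (3.1): each descriptor in the left image is matched to the right-image descriptor with the largest normalised scalar product, and the match is accepted only if the score exceeds α. The value of α below is only an example.

    import numpy as np

    def match_descriptors(desc_left, desc_right, alpha=0.8):
        """desc_left: (n, 128) array, desc_right: (m, 128) array. Returns (i, j) index pairs."""
        L = desc_left / np.linalg.norm(desc_left, axis=1, keepdims=True)
        R = desc_right / np.linalg.norm(desc_right, axis=1, keepdims=True)
        scores = L @ R.T                      # all normalised scalar products
        matches = []
        for i, row in enumerate(scores):
            j = int(np.argmax(row))           # best corresponding feature in the right image
            if row[j] >= alpha:               # accept only sufficiently similar descriptors
                matches.append((i, j))
        return matches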

Figure 3.3 shows the result after matching features in the left image to features in the right image. Each feature is represented by a yellow circle. The yellow circle represents the scaling and rotation and is centred at the feature position.

Figure 3.3: Features matched in a stereo pair.

3.3 Laser Vision

There are various ways of extracting features from the CamCube. Since the camera provides several images, e.g. range and amplitude images, these images can be seen as ordinary images with a different scaling. It is therefore possible to extract features with SIFT or SURF from these images one at a time, or to use a combination of the different image types. Another way to extract features from the PMD camera is to use the point clouds to extract features in 3D, e.g. average range, volume, sphere approximation etc.

Although there is not much difference between the amplitude image and the intensity image, the features can differ greatly. Initial tests showed that SURF found zero features in the intensity image but quite many in the amplitude image. As a result of this, SIFT was applied to the images. This gave zero features in the amplitude image and hundreds in the intensity image.

By normalising the intensity image it was possible to extract features from the intensity image with SURF as well. Let Ii denote the intensity image; normalisation is then performed using

II = (Ii − min(Ii)) / max(Ii − min(Ii)).

Using this normalisation gives an intensity image with II ∈ [0, 1]. Figure 3.4 shows SURF features in the three images acquired from CamCube together with features in the normalised intensity image. It can be seen that there are no features in the intensity image. In the normalised intensity image there are some features but these seem to be placed at random. The range image gives the best features. The features are mainly placed on single coloured surfaces and edges.

Features in the amplitude image may also appear random at first. By differentiating the image in the x- and y-direction it can be seen that features are placed near sharp edges. The differentiated images together with the extracted features are shown in Figure 3.5. In the intensity image the features still appear to be quite random and the reason for this is unknown.

[Figure 3.4 panels: (a) intensity image, (b) normalised intensity image, (c) amplitude image, (d) range image.]

Figure 3.4: Features in the different images from CamCube. Features are marked with a red star.

[Figure 3.5 panels: (a) intensity image differentiated in the x-direction, (b) intensity image differentiated in the y-direction, (c) amplitude image differentiated in the x-direction, (d) amplitude image differentiated in the y-direction.]

Figure 3.5: Features overlayed on the differentiated version of the amplitude and intensity image.

3.4 Conclusion

Both SIFT and SURF give almost identical interest points when applied to stereo images, although SIFT is somewhat slower, which can clearly be seen in Table 3.1. This leads to the conclusion that SURF is more suitable for the problem at hand and will therefore be used for feature extraction. Most preferable would be to use GPU-SURF, but since there is no need for real-time SLAM at this point and there is no hardware available for mobile usage, the choice falls on SURF with extended descriptors, i.e. SURF-128. The extended descriptors offer a higher recognition rate at a slightly higher computational cost.

For features in the images from CamCube only the distance image can be considered to give stable features. The amplitude image will also be used, but only in combination with the distance image. The intensity image, however, will be discarded. The intensity image is discarded due to the fact that the features seem to be unstable. It is also the case that it is hard to determine a threshold for feature extraction in the intensity image.


Chapter 4

Data Association

This chapter describes the data association part of the SLAM problem. Data association is an important part of all SLAM algorithms, from the simple EKF-SLAM to the more advanced Tectonic SAM [26]. It is important to have a robust method for data association. This basically means being able to identify the same features in a sequence of images, see Chapter 3. By extracting the same feature in a sequence of images, rotation and translation of the camera can be calculated if the parameters of the camera are known. Section 4.1 will define a measurement and will be followed by two sections describing methods for associating measurements to landmarks.

4.1 Measurements

The definition of a measurement, z, is something that depends on the type of sensors available. It can for example be

z = (ax ay az)^T,   (4.1)

where a∗ is the acceleration measured by an Inertial Measurement Unit (IMU). In this case the sensors are a stereo camera and a laser camera. This gives the possibility to take measurements only from the stereo camera, only from the laser camera, or from a combination of both.

Given that only the stereo camera is used, two measurements are acquired in each sample, one for each camera in the stereo pair. A sketch of the geometry of a stereo camera is shown in Figure 4.2. Let the measurements from a stereo pair be z1 and z2. The measurements will then be

z1 = (u1 v1)^T,  z2 = (u2 v2)^T,   (4.2)

where u and v denote the pixel coordinates on the sensor for each extracted feature. Figure 4.1 shows how the coordinates (u, v) are represented in the image sensor.


Figure 4.1: Representation of pixels in the sensor.

The measurements from each sensor are then combined into one measurement and used in EKF-SLAM. The first sensor, the left camera, is chosen as the primary sensor. Measurements from the second sensor, the right camera, are only used to calculate the depth to the features present in both images. A measurement is then defined as

z = (u v d)^T,   (4.3)

where u and v represent the pixel coordinates in the image from the primary sensor for features present in both images. For each feature the depth d can be estimated given that the parameters of the stereo camera are known. In this case the baseline b and the focal length f are known. Hence the depth can be calculated as

d = f · b / (u1 − u2).   (4.4)

For the data association the measurement is then transformed into a measurement in world coordinates by using

m = K^{-1} (u v 1)^T d,   (4.5)

which gives the feature position in Cartesian world coordinates. Section 5.1.3 describes this more extensively. The data association will be performed in world coordinates.
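A small sketch of Equations (4.3)-(4.5): the depth is computed from the disparity and the pixel is back-projected to Cartesian coordinates. The numeric camera parameters below are placeholders, not the calibrated values from Chapter 2.

    import numpy as np

    f, cu, cv, b = 400.0, 160.0, 120.0, 0.1   # placeholder focal length, principal point, baseline
    K = np.array([[f, 0.0, cu],
                  [0.0, f, cv],
                  [0.0, 0.0, 1.0]])

    def measurement_to_point(u1, v1, u2):
        """Depth from disparity, Equation (4.4), and back-projection, Equation (4.5)."""
        d = f * b / (u1 - u2)                               # depth from the pixel disparity
        m = d * np.linalg.inv(K) @ np.array([u1, v1, 1.0])  # Cartesian coordinates of the feature
        return d, m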

4.2 Nearest Neighbour Association

Using the shortest Euclidean distance for data association is the simplest way to associate features. Association works by finding the distance between a measurement and a landmark. A landmark is defined as a distinct point in the world that corresponds to a feature in an image.


Figure 4.2: A sketch of the stereo camera.

[Figure 4.3: Illustration of a measurement and two landmarks.]


Figure 4.4: Gates used during association.

For nearest neighbour association the influence of measurement noise is not considered, so every measurement is regarded as being perfect. This is illustrated by Figure 4.3. When performing association the distance, ∆ik, from the measurement, mi, to all landmarks, lk, is calculated according to

∆ik = ||mi − lk||.

The distance ∆ik is then compared to two gates. The first is gA, which determines if the measurement is close enough to a landmark to be associated. The other gate, gR, decides whether the distance to all landmarks is large enough for the measurement to be a new landmark. Figure 4.4 shows how the gates work. The red dot is a landmark in the database. A measurement closer than gA is associated to the landmark. If the measurement is outside gR a new landmark will be created. All measurements in the grey area are discarded.

Figure 4.5 shows the map from a run with EKF-SLAM using the shortest distance. Landmarks are visualised by red stars. The surrounding red ellipses illustrate the 99.5 % confidence interval. The trajectory is given by the black dots and the blue line represents the final orientation. The result is good although there is no ground truth for verification. The result is quite sensitive to the parameters gA and gR. A large value for gA increases the possibility of false associations, so it is preferable to use a small value although this makes it harder to detect loop closure after drift. Drift means the accumulation of small errors in pose during a long sequence of measurements.

A good estimate for gA is the movement between two consecutive frames. Choosing a good value for gR can also be quite tricky since a large difference |gR − gA| leaves a big "dead zone" where features are discarded. Choosing a small value results in the introduction of landmarks close to already existing landmarks, which may lead to a larger risk of incorrect association.

Association according to this implementation is described in Algorithm 1. The algorithm starts by finding the distance between a measurement mi and all landmarks lk. The distances are then sorted so that the shortest distance is first. Then, for all sorted distances, find the shortest distance that corresponds to a landmark that is not already associated to a measurement. If no association is made and the smallest distance is larger than gR, a new landmark has been found and the measurement is added to the set of new landmarks.

[Figure 4.5: Map and trajectory from EKF-SLAM with nearest neighbour association (axes in metres).]

Algorithm 1: Associate measurements M to landmarks L using nearest neighbour.

Data: Current measurements M and all landmarks L
Result: Best matching landmarks X and new landmarks Xnew

begin
    forall measurements mi ∈ M do
        ∆ik = ||mi − lk|| for lk ∈ L
        dik = sort(∆ik)
        forall sorted distances dik do
            Find the first k ∉ X
            if k ∉ X and dik ≤ gA then
                X = X ∪ k
                break
            end
        end
        if di1 ≥ gR then
            Xnew = Xnew ∪ mi
        end
    end
end
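A minimal Python sketch of Algorithm 1, assuming that measurements and landmarks are given as Cartesian points in arrays; the gate values are examples only.

    import numpy as np

    def nn_associate(measurements, landmarks, g_a=0.2, g_r=1.0):
        """Nearest neighbour association with acceptance gate g_a and new-landmark gate g_r."""
        associated = {}              # measurement index -> landmark index
        new_landmarks = []
        used = set()                 # landmarks already claimed by a measurement
        for i, m in enumerate(measurements):
            dist = np.linalg.norm(landmarks - m, axis=1) if len(landmarks) else np.array([])
            for k in np.argsort(dist):                    # shortest distance first
                if int(k) not in used and dist[k] <= g_a:
                    associated[i] = int(k)
                    used.add(int(k))
                    break
            if dist.size == 0 or dist.min() >= g_r:
                new_landmarks.append(m)                   # far from all landmarks: new landmark
        return associated, new_landmarks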

4.3 Mahalanobis Association

One of the biggest problems with nearest neighbour association is that the nearest feature does not have to be the most likely. Figure 4.6 shows an example of this. The figure is the same as Figure 4.3, but now includes landmark uncertainty represented by the ellipses. As can be seen in the figure the landmark closest to the measurement is the one in (1, −1), but it can also be seen that the measurement is more likely to correspond to the landmark in (2, 2).

This can be solved by using the normalised innovation squared (NIS), also known as the Mahalanobis distance. Association using the Mahalanobis distance is performed using a measurement zi and the predicted observation ẑk = h(x, lk). The innovation can then be calculated as iik = zi − ẑk. Normalising with the covariance of the landmark, P, and squaring the innovation gives the NIS according to

Mik = iik^T P^{-1} iik.   (4.6)

If the innovation has a Gaussian distribution, which normally is the case for the EKF, the NIS forms a χ2 distribution. The number of degrees of freedom for the distribution is equivalent to the dimension of the innovation vector. Association is performed by finding Mik ≤ gA, where gA is an arbitrary gate. The integral of the χ2 distribution specifies the probability that association will be performed if zi is an observation of lk.


Figure 4.6: Illustration of a measurement and two landmarks with landmark uncertainty ellipses.

Applying the gate means computing the integral of the χ2 distribution of Mik from 0 to gA, i.e. the probability that association will be performed. Figure 4.7 shows the χ2 distribution for three different degrees of freedom, k. Common values for gA are 4, 9, and 16. Setting the gate gA to e.g. gA = 4 for three degrees of freedom means that the gate will accept about 74 % of the associations, as about 74 % of the χ2 probability mass lies between 0 and 4.

Table 4.1 shows the probability that association will be accepted for some different values of gA with three degrees of freedom. Figure 4.8 shows the probability as a continuous function of gA.

Table 4.1: Probability that association will be accepted.

gA Probability [%]

2 42.76

4 73.85

9 97.07

16 99.89
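The values in Table 4.1 can be reproduced numerically from the χ2 cumulative distribution function. The short sketch below uses SciPy for this and is only meant as an illustration of the relation between the gate gA and the acceptance probability.

from scipy.stats import chi2

dof = 3                          # three-dimensional innovations
for gate in (2, 4, 9, 16):
    prob = 100 * chi2.cdf(gate, df=dof)
    print(f"g_A = {gate:2d}: {prob:.2f} %")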

In Figure 4.6 the ellipses represent a 97.5 % confidence interval. Trying to perform data association with gA ≤ 9 would therefore result in the measurement not being associated to any of the landmarks. Increasing gA to 16 makes the landmark in (2, 2) the most likely association.

When there are multiple measurements z = {z1, . . . , zn} within the gate of a landmark l, a likelihood of association Λi can be defined for each zi ∈ z as

$\Lambda_i = \frac{1}{(2\pi)^{n/2}\sqrt{\det(P)}}\exp\left(-\frac{1}{2}\, i_i^T P^{-1} i_i\right)$,    (4.7)



Figure 4.7: Probability density function (PDF) for the χ2 distribution. The PDF is plotted for one, two, and three degrees of freedom. The integral of a χ2 distribution from 0 to gA is equivalent to the probability of association.


Figure 4.8: Cumulative distribution function (CDF) for the χ2 distribution. The CDF is plotted for one, two, and three degrees of freedom. The value at gA is the probability that an association will be accepted.



where n is the dimension of a measurement. By taking the negative log of Equation (4.7), the normalised distance Ni is obtained according to

$N_i = i_i^T P^{-1} i_i + \log(\det(P))$.    (4.8)

The goal is then to select the observation that minimises Ni, since this maximises the likelihood of association Λi. The normalised distance determines which measurement corresponds to which landmark. The NIS is then used to determine whether the association should be accepted or rejected by comparing it to the gate gA.
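For reference, a small Python sketch of Equations (4.6) and (4.8) is given below. The covariance P, the gate value, and the example numbers are placeholders, and the innovation is assumed to be given as a plain vector.

import numpy as np

def nis(innovation, P):
    # Normalised innovation squared (Mahalanobis distance squared), Eq. (4.6).
    return float(innovation @ np.linalg.solve(P, innovation))

def normalised_distance(innovation, P):
    # Normalised distance, Eq. (4.8): the negative log-likelihood up to constants.
    return nis(innovation, P) + np.log(np.linalg.det(P))

# Example: accept an association only if the NIS passes the gate.
innovation = np.array([0.3, -0.1, 0.2])
P = np.diag([0.05, 0.05, 0.10])
g_a = 9.0                         # gate for three degrees of freedom
accepted = nis(innovation, P) <= g_a

Among all candidates that pass the gate, the one with the smallest normalised distance is selected, since it has the largest likelihood Λi.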

Since the database with landmarks can become quite large the association can be sped up by first reducing the number of possible landmarks. Intuitively this can be thought of as walking along a trajectory. If the current position is approximately known there is no need to check if the current measurements correspond to landmarks that were seen 10 m away.

Using the Euclidean distance to reduce the number of landmarks can speed up the association, since it is a fast method compared to calculating the Mahalanobis distance. In the current implementation the maximum radius rmax, within which the search for suitable landmarks is performed, can be chosen arbitrarily. Using a large value just increases the computational time, since the motion between two consecutive frames is quite small. In Algorithm 2, which describes Mahalanobis association, the maximum association distance is called rmax.

An additional extension to the matching algorithm is the use of descriptors. The descriptors are used for association in the same way as they are used for matching between stereo pairs in Section 3.2.1. In this case the descriptor of the measurement is compared to the descriptors of the landmarks. If a measurement is close enough to be associated to a landmark the normalised scalar product between the descriptor of the measurement and the landmark is calculated. If it is larger than a predefined threshold, α, the association is considered to be correct. This is a verification that the measurement and the landmark really are the same.

Algorithm 2 describes data association with the Mahalanobis distance. The algorithm includes the use of descriptors and the ability to replace an existing association if a better match is found.

Using this method for association in the EKF-SLAM algorithm yields the result displayed in Figure 4.9, which is very similar to the previous test in Figure 4.5. Once again red stars represent landmarks, red ellipses represent the uncertainty, black dots represent the trajectory, and the blue line shows the final orientation. Due to the lack of ground truth there is no way to tell which trajectory is the most accurate. What can be said is that association with Mahalanobis distances requires less tuning and is better at loop closure.

Mahalanobis distance association is applicable for low and moderate noise levels. If the measurements are too accurate there will be problems with the association due to the inverse of the covariance matrix; no noise at all results in the Mahalanobis distance becoming infinite. Too much noise will create landmarks with very large uncertainty, and this will result in very few landmarks in total. Having few landmarks creates problems when trying to track position and movement.



Algorithm 2: Associate measurements M to landmarks L using Mahalanobis distances

Data: Current measurements M and landmarks L
Result: Best matching landmarks X and new landmarks Xnew

begin
    forall measurements mi ∈ M do
        Find all ||mi − lk|| ≤ rmax for lk ∈ L
        xi ← 0
        Mmin ← ∞
        Nmin ← ∞
        forall landmarks lk closer than rmax do
            Find and normalise the descriptors δm,i and δl,k
            Calculate the normalised innovation M (see Eq. 4.6)
            Calculate the normalised distance N (see Eq. 4.8)
            if M ≤ gA and N ≤ Nmin and ⟨δm,i, δl,k⟩ ≥ α then
                Nmin = N
                xi = k
            else if M ≤ Mmin then
                Mmin = M
            end
        end
        if xi ≠ 0 then
            if xi ∉ X then
                // Not previously associated
                X = X ∪ xi
            else
                // Check if better than the previous association
                Find the index l for which xi = Xl
                Find the normalised distance Nl corresponding to Xl
                if Nmin ≤ Nl then
                    Xl = xi
                    Nl = Nmin
                end
            end
        else if Mmin ≥ gR then
            // This is a new landmark
            Xnew = Xnew ∪ mi
        end
    end
end
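The control flow of Algorithm 2 can be summarised in the following Python sketch. It is a simplification under several assumptions: each landmark is represented by a position, a covariance P and a unit-norm descriptor, the predicted observation is taken to be the landmark position itself rather than h(x, lk), and the replacement step hands the landmark over to the better measurement. None of these details are claimed to match the thesis implementation exactly.

import numpy as np

def nis(innov, P):
    # Normalised innovation squared, Eq. (4.6).
    return float(innov @ np.linalg.solve(P, innov))

def norm_dist(innov, P):
    # Normalised distance, Eq. (4.8).
    return nis(innov, P) + np.log(np.linalg.det(P))

def mahalanobis_associate(measurements, landmarks, g_a=9.0, g_r=16.0, r_max=3.0, alpha=0.8):
    # measurements: list of dicts with keys 'pos' and 'desc' (unit-norm descriptor)
    # landmarks:    list of dicts with keys 'pos', 'P' and 'desc'
    X, N_assoc, new = {}, {}, []                       # associations, their distances, new landmarks
    for i, m in enumerate(measurements):
        x_i, m_min, n_min = None, np.inf, np.inf
        for k, l in enumerate(landmarks):
            if np.linalg.norm(m['pos'] - l['pos']) > r_max:
                continue                               # Euclidean prefilter
            innov = m['pos'] - l['pos']                # simplified predicted observation
            M, N = nis(innov, l['P']), norm_dist(innov, l['P'])
            if M <= g_a and N <= n_min and float(m['desc'] @ l['desc']) >= alpha:
                n_min, x_i = N, k                      # best gated candidate so far
            elif M < m_min:
                m_min = M
        if x_i is not None:
            prev = next((j for j, lm in X.items() if lm == x_i), None)
            if prev is None:
                X[i], N_assoc[i] = x_i, n_min
            elif n_min <= N_assoc[prev]:               # better than the earlier association
                del X[prev], N_assoc[prev]
                X[i], N_assoc[i] = x_i, n_min
        elif m_min >= g_r:
            new.append(i)                              # candidate new landmark
    return X, new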



Figure 4.9: EKF-SLAM with association using Mahalanobis distances.

4.4 Removal of Unstable Landmarks

When extracting features from images it often happens that a feature is present in just one single image. To reduce the number of landmarks introduced in the landmark database L a score is maintained for each landmark. For each landmark there is also a value defining in which sample the landmark was seen the first time. When a new landmark is introduced in the database L it receives the score s = 1. The score function is then monitored for a total of N samples. When the landmark is seen again the score is updated by s = s + 1.

If the score is lower than smin after N consecutive samples, the landmark is removed from the database L. If the score reaches smin, the landmark is considered stable and cannot be removed from the database. The values of smin and N can be chosen arbitrarily as long as smin ≤ N.

Choosing smin = N when N is large reduces the number of landmarks efficiently, since a landmark must then be seen in N consecutive samples. If it is undesirable to remove landmarks, this can be achieved by setting smin = N = 1, which is the same as considering a landmark stable when it has been seen only once. The benefit of removing unstable landmarks is that it reduces the risk of incorrect association. It also speeds up the SLAM algorithm since it reduces the dimensionality of the problem. Some testing has shown that smin = 3 and N = 6 reduce the amount of unstable landmarks efficiently while still keeping enough landmarks for the SLAM algorithm to succeed.
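One possible way to implement this bookkeeping is sketched below. The class name, the fields and the exact update policy are illustrative assumptions, not taken from the thesis code.

class LandmarkScore:
    # Track how often a landmark is re-observed during its first N samples.

    def __init__(self, first_seen, s_min=3, n_window=6):
        self.first_seen = first_seen   # sample index when the landmark was introduced
        self.score = 1                 # a new landmark starts with score s = 1
        self.s_min = s_min
        self.n_window = n_window

    def observed(self):
        # Called each time the landmark is associated to a measurement again.
        self.score += 1

    def is_stable(self):
        # A landmark that has reached s_min is kept permanently.
        return self.score >= self.s_min

    def should_remove(self, current_sample):
        # Remove if the score is still below s_min after N samples.
        return (current_sample - self.first_seen) >= self.n_window and not self.is_stable()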



4.5 Conclusion

Both methods for data association give good results, but the method using the Mahalanobis distance is more robust, especially when it comes to detecting loop closure after drift. Because of this, Mahalanobis association is the preferred method. Association using only the shortest distance is, however, by far the faster method of the two.

The nearest neighbour association used here is a very simple variant, and it is possible that it could be improved. Introducing descriptors should yield a significant improvement, first of all by reducing the number of incorrect associations. Since nearest neighbour association is faster than Mahalanobis association, it is also a good way of reducing the number of candidate landmarks before performing data association with Mahalanobis distances.
