
Institutionen för systemteknik

Department of Electrical Engineering

Master's Thesis

Visual Stereo Odometry for Indoor Positioning

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by

Fredrik Johansson LiTH-ISY-EX--12/4621--SE

Linköping 2012

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet

Visual Stereo Odometry for Indoor Positioning

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by

Fredrik Johansson LiTH-ISY-EX--12/4621--SE

Supervisor: Vasileios Zografos

isy, Linköpings universitet

Gert Johansson

Combitech

Examiner: Michael Felsberg

isy, Linköpings universitet

Division, Department: Division of Automatic Control, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2012-09-06
Language: English
Report category: Examensarbete (Master's thesis)
ISRN: LiTH-ISY-EX--12/4621--SE
URL for electronic version: http://www.control.isy.liu.se, http://www.ep.liu.se

Title: Visuell stereoodometri för inomhuspositionering (Visual Stereo Odometry for Indoor Positioning)
Author: Fredrik Johansson


Keywords: computer vision, visual odometry, stereo camera, feature detection, feature matching, triangulation, lens distortion, minimization problem

Abstract

In this master thesis a visual odometry system is implemented and explained. Visual odometry is a technique that can be used on autonomous vehicles to determine their current position, and it is preferably used indoors where GPS does not work. The only input to the system is the images from a stereo camera, and the output is the current location given as a relative position.

In the C++ implementation, image features are found and matched between the stereo images and the previous stereo pair, which gives 150-250 verified feature matches. The image coordinates are triangulated into a 3D point cloud. The distance between two subsequent point clouds is minimized with respect to rigid transformations, which gives the motion described by six parameters: three for the translation and three for the rotation.

Noise in the image coordinates gives reconstruction errors, which makes the motion estimation very sensitive. The results from six experiments show that the weakness of the system is its ability to distinguish rotations from translations. However, if the system has additional knowledge of how it is moving, the minimization can be done with only three parameters and the system can estimate its position with less than 5 % error.

Sammanfattning

Visual odometry is a technique that can be used on autonomous vehicles to determine their position in environments where GPS does not work. During this master thesis a prototype of such a system has been implemented in C++, and this report describes the techniques used and the outcome of the experiments. The input to the system is the images from a stereo camera and the output is its current position relative to the position where the estimation started.

Interest points in the stereo images and in the previous image pair are matched, giving on the order of 150-250 matches. The corresponding 3D points are triangulated and form point clouds. Minimizing the distance between two subsequent point clouds with respect to rigid-body motion gives the displacement. The minimization is done with six parameters, three for the rotation and three for the translation.

The results from the six experiments show that the system has difficulty distinguishing rotations from translations. With additional knowledge about the system's motion, however, the minimization can be done with only three parameters. The estimate of the system's position is then good, with an error of less than 5 %.

Acknowledgments

First of all I would like to thank Jonny Larsson who gave me the opportunity to do my master thesis at Combitech.

In addition to my helpful supervisor Gert Johansson at Combitech, I would also like to thank Anders Modén who provided the camera equipment and inspired and guided me throughout the thesis.

Furthermore, I would like to thank my supervisor Vasileios Zografos and examiner Michael Felsberg, both at Linköping University.

Finally, I would like to thank my family for their support, not only during this master thesis but throughout my education.

Contents

1 Introduction
    1.1 Background
    1.2 Combitech
    1.3 Thesis Objectives
    1.4 Thesis Outline
    1.5 Symbols and Notation

2 Related Research and Proposed Method
    2.1 Introduction
    2.2 Related Research
    2.3 My Contributions
    2.4 Algorithm Overview
    2.5 Hardware Platform
    2.6 Software Platform

3 Background Theory and System Description
    3.1 Pinhole Camera Model and Camera Calibration
    3.2 Stereo Vision
    3.3 Lens Distortion
    3.4 Detecting Features
        3.4.1 FAST
    3.5 Feature Descriptor
        3.5.1 BRIEF
    3.6 Match Features
    3.7 Verify Correct Matches
        3.7.1 Geometric Verification
        3.7.2 Epipolar Verification
    3.8 Triangulation
        3.8.1 Optimal Correction
        3.8.2 Reconstruction
    3.9 Point Cloud Fitting

4 Experiments and Results
    4.1 Overview
    4.2 Error Measure and Expected Error
        4.2.1 Error of Travelled Distance
        4.2.2 Estimation of Expected Error
    4.3 Experiment 1 - Standing Still
        4.3.1 Experiment Setup
        4.3.2 Results
    4.4 Experiment 2 - Rotation Motion
        4.4.1 Experiment Setup
        4.4.2 Results
    4.5 Experiment 3 - Forward Motion
        4.5.1 Experiment Setup
        4.5.2 Results
    4.6 Experiment 4 - Sideways Motion
        4.6.1 Experiment Setup
        4.6.2 Results
    4.7 Experiment 5 - Free Float
        4.7.1 Experiment Setup
        4.7.2 Results
    4.8 Experiment 6 - Synthetic Data
        4.8.1 Experiment Setup
        4.8.2 Results

5 Discussion and Conclusions
    5.1 Discussion
        5.1.1 Synthetic data
        5.1.2 Distinguish Rotation from Translation
        5.1.3 The Different Experiments Evaluated
        5.1.4 Wanted Performance of the System
    5.2 Conclusion
    5.3 Future Work

Chapter 1

Introduction

1.1 Background

Visual odometry is the process of determining the position and orientation of autonomous vehicles and robots from a continuous image sequence. There are several advantages of autonomous vehicles compared to human-navigated vehicles, often grounded in safety or economic reasons. For example, robots can be used in nuclear disaster rescue operations, and autonomous trucks may be used in mines and in warehouses. In order to know how to navigate correctly, the vehicles need to determine their current position. This is known as the localization problem.

GPS is successfully used as a position sensor outdoors. However, GPS-based navigation systems are not feasible indoors, due either to large obstacles preventing the signals from reaching the receivers or to reflections that make the GPS signals unreliable. Other solutions to the localization problem include wheel speed sensors and inertial measurement units (IMUs). However, wheel speed sensors require that the vehicle they are mounted on drives on a slip-free surface, and accurate IMUs are relatively expensive. As high-performance cameras have become cheaper and computational power has increased significantly in recent years, visual odometry has become a good alternative to these existing methods [16].

Ego-motion estimation is the estimation of the position and rotation of the vehicle itself. Using a camera to determine the ego-motion has several advantages. Visual odometry can handle both the relative position and the direction simultaneously in free space. In contrast to GPS, it works indoors, and it does not depend on a flat, slip-free surface like wheel speed sensors. Furthermore, the cameras may be used in many more applications than just odometry.

There are two main drawbacks with visual odometry: good light conditions are needed, and there is the drift problem. Since the method relies only on previous measurements, the accuracy of the estimated position decreases with time. That is, the estimated position will start to drift from the actual position. The drift problem can be handled either with synchronization points at known positions or by fusion with other sensors.

1.2 Combitech

Combitech AB is an independent consulting company within the Saab group which combines technology with environmental and security awareness. On behalf of a client, they are studying the possibilities of letting vehicles, preferably trucks in warehouses, work autonomously, and they are therefore interested in techniques that could be useful within such a project. Visual odometry is one possibility that is studied more closely. Through this master thesis they want to learn more about the technology and, hopefully, obtain a working prototype.

1.3 Thesis Objectives

This master thesis attempts to demonstrate how odometry can be solved with a stereo camera setup. The two cameras face approximately the same scene, but from slightly different angles. Based on point matches in the left and right cameras it is possible to form a 3D point cloud via triangulation. For each image pair, the rotation and translation which minimize the distance between two subsequent point clouds are estimated. The accumulated sum of rotations and translations is the current estimate of the position. The current position is given relative to the position where the process was started.

The approach is evaluated by implementing a prototype of the system. The strengths and weaknesses of the method are then discussed in this report, and the theoretical background is explained.

1.4 Thesis Outline

1. Introduction
   The first chapter gives the reader some background information and explains the purpose of the thesis. In addition, the notation and abbreviations used throughout the thesis are explained.

2. Related Research and Proposed Method
   The second chapter briefly goes through related research and explains the proposed method: how the problem was solved and implemented.

3. Background Theory and System Description
   This chapter explains the methods, concepts and algorithms used in the proposed method.

4. Experiments and Results
   This chapter contains the experimental setup on which the entire method has been evaluated, and the results from it.

5. Discussion and Conclusion
   In the final chapter a discussion of the results and a conclusion are presented, and future work is suggested.

1.5 Symbols and Notation

Letters with subscripts l and r refer specifically to the left and the right images. Subscripts t and t-1 refer to the time instances t and t-1 respectively.

x = image coordinates
X = (X, Y, Z) = world coordinates
P = camera matrix, projecting 3D points to image points, x ∼ P X, where P = K[R|t] and X is in homogeneous coordinates
K = intrinsic camera parameters
R = rotation matrix
t = translation vector
C = camera center
F = fundamental matrix
e = epipolar point (epipole) in an image
l = epipolar line in an image
a × b = [a]_× b, where, if a = [a_1 a_2 a_3]^T, then

$$[a]_\times = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix}$$

Chapter 2

Related Research and Proposed Method

2.1 Introduction

In order to solve the localization problem, a sequence of combined sub-steps constitutes the complete localization algorithm. All sub-steps have their pros and cons, either in terms of speed or in terms of robustness. In some cases, several different methods have been tested. Since it is desirable for a system such as this to run in real time, the faster alternative is evaluated first, and if sufficient accuracy is not reached a more robust method is chosen.

2.2 Related Research

Visual odometry is an exciting field of research which has resulted in a few real applications. One example is NASA's two Mars Exploration Rovers, where visual odometry provides position and orientation to each rover. In steep slopes and sandy terrain it corrects the wheel speed sensors, which have reduced capabilities in such areas [19].

Research based on one, two, three and even omnidirectional cameras has been carried out, with and without Simultaneous Localization And Mapping (SLAM), both indoors and outdoors [16, 20, 5]. More cameras give additional constraints; however, all methods suffer from the fact that small errors accumulate over time, unless additional measurements of the change of movement are registered and compensated for. Loop closing and the use of a digital map are two common ways to prevent the estimated position from drifting away from the correct one, and they are used in many papers, for instance [20, 5].

Another approach to avoid drift within the system is to use the Global Positioning System (GPS) or Inertial Measurement Units (IMUs) as additional input sensors. Cheaper IMUs are less accurate at small changes, but the output from the IMU can successfully be used if the changes are distinct. Also, if the rotation of the camera is so large that too few correlated features are visible at two subsequent time instances, the IMU is preferable over visual odometry. Visual odometry is very good at handling small changes and is therefore a good complement to IMUs.

GPS performs very well outdoors when no obstacles are present. However, if obstacles do exist, visual odometry can be used to help the GPS at such positions and, for shorter distances, function as the primary localization estimator until the GPS signal is good again.

There is a large variety of methods used within the context of visual odometry. For instance, there are a number of different feature extraction and feature tracking methods. There are also a number of different approaches to estimating the motion. One is to extract the motion from the estimated essential matrix, as in [8]; another is to first triangulate points and then solve a minimisation problem. The minimisation problem can be solved with respect to the homography between two subsequent frames, as in [12]. Finally, different kinds of Kalman filters are used to stabilize the measurements [8, 22].

Relatively much research has been done in the area of visual odometry. The state of the art is to start with some kind of feature extraction method, such as Harris corner detection, FAST or SIFT. A RANSAC or RANSAC-like approach is used in almost all papers, either to discard falsely matched features early, as in [12, 8], or in the final estimation of the movement, as in [7]. For instance, in [5] an LKT tracker is used to estimate an optical flow field on which the movement estimation is based. That article also provides an example of how a camera can be used for other things than just odometry: it detects and avoids hazards such as obstacles and precipices.

2.3 My Contributions

The purpose of this master thesis project is to construct a visual odometry algorithm and evaluate how well it works. The algorithm is composed by putting together standard methods. Even though other prototypes have been constructed and described briefly in research papers, Combitech wants to gain deeper knowledge about the technology by building a prototype of their own. My supervisor at Combitech, Gert Johansson, initially suggested a slightly different approach than the one presented in most of the research papers. The final proposed algorithm is presented in the next section and was the method we found most likely to produce the best results.

2.4 Algorithm Overview

Figure 2.1. An overview of the algorithm executed for each frame

Figure 2.1 depicts the pipeline for how the localization problem is solved. Features found in the stereo camera pair at one time instance are compared to features found in the stereo pair at an earlier time instance. A non-linear optimisation algorithm solves the minimization problem of how the point cloud is transformed from one time instance to another. It is therefore essential that the reconstructed 3D points at each time instance are correct in order to estimate the total movement. The methods used are explained in more detail in Chapter 3.

Lens distortion is present in all the captured images. This makes features detected near the edges of the images untrustworthy. We need to compensate for the distortion by correcting the detected feature coordinates to their undistorted positions. Therefore the image is remapped with Brown's distortion model [3], which is further explained in Section 3.3.

Feature points are found by the FAST detector [13] in each frame. In this context, a feature is something easily distinguishable, for example corners, lines or other natural variations in the images. How the FAST algorithm works is explained in Section 3.4.1. Each frame is divided into sub-frames, and each sub-frame has its own threshold. The thresholds are adjusted so that the features are approximately evenly spread over the frame and the total number of features is approximately constant. This allows for changes in light conditions, globally or in parts of the image, without algorithm failure.

The feature points are then characterized with the BRIEF descriptor [4], which describes the image structure in the neighbourhood of the feature point. Matching is carried out both in space, between the two stereo cameras, and in time, between two subsequent frames. Features visible in both frames at two subsequent time instances are linked together for further analysis, as seen in Figure 2.2.

Figure 2.2. An illustration of how the coordinate system is defined for two matched stereo images at time instances t and t-1.

It is then essential to verify that the found matches are valid before the next position estimate is generated from them. As seen in Figure 2.1, two different types of verification methods are applied: geometric and epipolar verification. Geometric verification involves two constraints: first, where features from the left camera are located in the right camera and vice versa, and second, assumptions about how the vehicle with the stereo camera moves, which enforce constraints on the location of the features at the next time instance. Epipolar verification means that the epipolar constraint, $x_r^T l_r = 0$, must be approximately fulfilled if we have a valid match. This is true for matches in space as well as matches in time. The fundamental matrix F is constant in the first case, since the left and the right cameras are fixed; it can be derived from the camera matrices and their relative positions. In the second case, when matches are made between subsequent time instances, F has to be estimated, and RANSAC [6] is used to find it. RANSAC is based on the assumption that most of the matches are correct and discards the falsely matched features.

The 3D points are reconstructed by singular value decomposition. First, however, the features are minimally corrected such that the epipolar constraint $x_r^T l_r = 0$ is fulfilled; here $|x_r^T l_r| < 10^{-6}$ is considered sufficient accuracy. This adjusts the features in the two images with subpixel accuracy, and their 3D coordinates can then be reconstructed. Finally, the rotation R and translation t are found by solving the non-linear minimization problem, which uses the current and previous reconstructed 3D points.

2.5 Hardware Platform

Figure 2.3. The stereo camera with a gyroscope mounted on top.

The stereo camera system used in this project consists of two Firefly MV cameras from Point Grey. Each monochrome camera has a resolution of 752×480 pixels, a global shutter and a frame rate of up to 60 Hz. Two 6.0 mm lenses are attached to the cameras, which are mounted next to each other 60 mm apart. The cameras are considered approximately parallel, since the alignment errors are below one degree. The cameras are synchronized and provided with an accurate IMU which measures how much the camera has moved between the images. Everything is saved into an .ibg file which can be read either from disk or directly into the system as the images are produced. However, in this project the ground truth of the travelled distance was determined using a ruler, with mm precision for the experiments on the sliding rail and with cm precision for the Free Float sequence.

Figure 2.4. Schematic image of the hardware.

2.6 Software Platform

The visual odometry application is implemented in the C++ programming language. The open source computer vision library OpenCV (version 2.4.0) [25] is used. It was originally developed by Intel, has become a standard tool in computer vision and is free for both academic and commercial use. Other libraries used include eigen [2], clapack [1] and levmar [17], which are used to solve numerical equations and minimization problems.

The system currently runs on a standard laptop but could easily be ported to an embedded platform. In the future, an embedded platform would be preferable with respect to speed and portability. The development environment was Visual C++ 2010 Ultimate on a Windows laptop equipped with an Intel Pentium dual-core 2.20 GHz processor and 4 GB of RAM.

Chapter 3

Background Theory and System Description

3.1 Pinhole Camera Model and Camera Calibration

The pinhole camera model is a simple way to describe the projection of a 3D world point onto an image plane. In the model a pinhole is used to focus light, and the model is a first-order approximation of the projection of 3D coordinates onto image coordinates using homogeneous coordinates.

$$x \sim P X \iff x \sim K[R\,|\,t]X \qquad (3.1)$$

World 3D coordinates are mapped to image coordinates by the camera matrix P = K[R|t]. The camera matrix consists of both intrinsic parameters K and extrinsic parameters [R|t]. The intrinsic parameters map the 3D point onto pixels, while the extrinsic parameters describe the camera position relative to a fixed coordinate system [11, 25]. Equation (3.2) shows the projection of a 3D point onto image coordinates with pixel size 1 and no skew parameter.

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (3.2)$$

X, Y, Z = coordinates of a 3D point in the world coordinate system
x, y = image coordinates of the projected 3D point, in pixels
c_x, c_y = location of the principal point, close to the image centre
f_x, f_y = focal lengths expressed in pixel-related units
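To make the mapping in (3.1)-(3.2) concrete, the short C++/OpenCV sketch below projects a single world point through K[R|t] and divides by the homogeneous coordinate. The focal lengths, principal point and test point are illustrative values only and are not taken from the thesis.

#include <opencv2/core/core.hpp>
#include <cstdio>

int main() {
    // Intrinsic parameters K: illustrative focal lengths and principal point.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 1000, 0, 376,
                                              0, 1000, 240,
                                              0,    0,   1);
    // Extrinsic parameters [R | t]: identity rotation and zero translation (the left camera).
    cv::Mat Rt = cv::Mat::eye(3, 4, CV_64F);
    // A homogeneous world point X = (X, Y, Z, 1).
    cv::Mat X = (cv::Mat_<double>(4, 1) << 0.2, 0.1, 2.0, 1.0);

    cv::Mat x = K * Rt * X;        // homogeneous image point, x ~ K [R | t] X
    x /= x.at<double>(2);          // divide by the third coordinate to get pixels
    std::printf("pixel coordinates: (%.1f, %.1f)\n", x.at<double>(0), x.at<double>(1));
    return 0;
}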

The intrinsic parameters are determined from a camera calibration procedure. This was done using OpenCV's calibration tools, including calibrateCamera(), which also gave the lens distortion parameters. However, with the OpenCV algorithm stereoCalibrate(), unrealistically large translation parameters (a factor of 10 or more) were obtained. Since it was known that the translation parameters should be close to T = [0.00, 0.00, 6.00] cm and that the rotation matrix in the extrinsic parameters should be similar to the identity matrix, these estimates were not what was expected. Afterwards, the question was raised whether the translation parameters were simply badly scaled, which cannot be excluded. Instead, the extrinsic parameters were derived through calculations involving the intrinsic parameters and corresponding points visible in both images.

The world coordinate system is defined with its centre at the left camera centre. The left and the right cameras are separated by a translation T and a rotation R relative to the left camera. R is approximately the identity matrix and T = [0.00, 0.00, 6.00] cm, but they are determined more accurately by observing image features located far away from the camera in the real world. Then T becomes relatively small compared to Z, and the back-projected rays from the image coordinates x to the features become approximately parallel. With this approximation it is possible to determine the angles φ, θ and ψ, which describe the rotations around the x, y and z axes in the rotation matrix, using equation (3.2) in a right-handed system.

For the left camera the extrinsic parameters equal the identity matrix concatenated with a zero vector, since there is no translation. For an image point (x, y, 1)^T located far away, with known intrinsic parameters, expressions for X, Y and Z can be determined. These are plugged into (3.2), but now with the unknown extrinsic parameters of the right camera. The three unknowns φ, θ and ψ are found by solving the equation system for X, Y and Z.

In the theory presented here the cameras are considered parallel, since the difference in R is less than 1°, but in the implementation of the system the real translation and rotation are used.

3.2 Stereo Vision

Figure 3.1 presents the geometry for two parallel stereo cameras, modelled as pinhole cameras and separated by a baseline distance T. The two cameras, with centres C_l and C_r, view the same point X at 2D coordinates x_l and x_r respectively, at a distance Z from the baseline.

Figure 3.1. A picture of two parallel stereo cameras.

Similar triangles give the relations

$$\frac{Z}{X_l} = \frac{f}{x_l} \implies X_l = \frac{Z x_l}{f} \qquad (3.3)$$

$$\frac{Z}{X_r} = \frac{f}{x_r} \implies X_r = \frac{Z x_r}{f} \qquad (3.4)$$

and given that

$$X_l - X_r = T \qquad (3.5)$$

the relation between the two cameras can be written as

$$\frac{Z x_l}{f} - \frac{Z x_r}{f} = T \qquad (3.6)$$

which gives

$$Z = \frac{T f}{u s} \qquad (3.7)$$

where s is the disparity, calculated as s = x_l − x_r (in pixels), and u is the size of one pixel.


Equation (3.7) thus gives the distance Z at which an object with disparity s is located in the 3D world. Objects located close to the camera have a greater disparity than objects far away. The camera is equipped with two lenses with focal length f = 6 mm, separated by the distance T = 60 mm at the baseline. The pixel size is u = 6·10⁻³ mm, and therefore the disparity at a distance Z = 3 meters is 20.0 pixels. From (3.7) the following error function is derived:

$$e = \frac{dZ}{ds} = \frac{T f}{u s^2} \qquad (3.8)$$

We know from (3.7) that $s = \frac{T f}{Z u}$, which gives

$$e = \frac{dZ}{ds} = \frac{Z^2 u}{T f} \qquad (3.9)$$

Plotting the error function shows a quadratic increase of the uncertainty with the distance Z; see Figure 3.2.

For example, the camera mentioned above gives 20.0 pixels of disparity at a distance of 3 meters. If the correct disparity is 20.0 pixels but the estimate is 19.0 pixels, this corresponds to an error of 15 cm in Z. The uncertainty increases quadratically with the distance to the object.

Figure 3.2. Uncertainty e at distance Z to the object.
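As a small numerical check of (3.7) and (3.9), the sketch below computes the depth and the depth sensitivity for the rig described above (f = 6 mm, T = 60 mm, pixel size u = 6 µm); it reproduces the 3 m / 15 cm-per-pixel example in the text.

#include <cstdio>

int main() {
    const double f = 6.0e-3;    // focal length [m]
    const double T = 60.0e-3;   // baseline [m]
    const double u = 6.0e-6;    // pixel size [m]

    double s = 20.0;                   // disparity [pixels]
    double Z = T * f / (u * s);        // (3.7): depth, 3.0 m for s = 20 pixels
    double e = Z * Z * u / (T * f);    // (3.9): depth change per pixel of disparity, ~0.15 m
    std::printf("Z = %.2f m, dZ/ds = %.2f m per pixel\n", Z, e);
    return 0;
}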

3.3 Lens Distortion

It is often assumed that there is a linear mapping between the world coordinates and the projected image coordinates. However, this is only true if there is no lens distortion. Radial lens distortion increases with shorter focal lengths, which give a larger field of view. In Figure 3.3 the radial distortion is apparent. Fortunately, lens distortion can be modelled and compensated for with Brown's distortion model [25, 11].

Figure 3.3. Lens distortion on a checkerboard.

According to Brown's distortion model, a distorted pixel's correct undistorted position can be calculated as

$$x_u = x_d + x_d(x_d - c_x)(Q_1 r^2 + Q_2 r^4 + Q_3 r^6 + O(r^8)) + 2L_1(x_d - c_x)(y_d - c_y) + L_2(r^2 + 2(x_d - c_x)^2) \qquad (3.10)$$

$$y_u = y_d + y_d(y_d - c_y)(Q_1 r^2 + Q_2 r^4 + Q_3 r^6 + O(r^8)) + L_1(r^2 + 2(y_d - c_y)^2) + 2L_2(y_d - c_y)(x_d - c_x) \qquad (3.11)$$

where
(x_d, y_d) is the distorted image position,
(x_u, y_u) is the undistorted image position,
(c_x, c_y) is the centre of the distortion,
r is the radial distance $\sqrt{(x_d - c_x)^2 + (y_d - c_y)^2}$, and
Q_n and L_n are coefficients describing the distortion at distance r from the distortion centre, which is commonly located close to the middle of the image.

The coefficients are determined offline by first solving (3.10) and (3.11) with measured distorted corner points and known undistorted points. The undistorted points are determined with a checkerboard pattern present in the scene, and the image is rectified in such a way that the corner positions are aligned into a square grid. Finally, the known coefficients are plugged into (3.10) and (3.11) and used online to rectify the images [11].

To be able to distinguish a rotation from a translation it is important that the 3D points are spread out as much as possible in the 3D world. This is achieved when a lens with a large field of view is used, but a broader field of view comes with an increased amount of lens distortion, which must be compensated for. The system uses fairly narrow lenses with focal lengths of 6 mm, so the lens distortion is noticeable on a checkerboard but hard to detect with the human eye in a normal image.
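A hedged sketch of how a frame can be undistorted with OpenCV, whose built-in distortion model is compatible with the Brown model in (3.10)-(3.11), is shown below. Here cameraMatrix and distCoeffs are assumed to come from cv::calibrateCamera(), and the function name is a placeholder; in a real pipeline the remap tables would be computed once and reused for every frame rather than recomputed per call.

#include <opencv2/imgproc/imgproc.hpp>

// Undistort one frame given the intrinsic parameters and distortion coefficients.
cv::Mat undistortFrame(const cv::Mat& distorted,
                       const cv::Mat& cameraMatrix,
                       const cv::Mat& distCoeffs) {
    cv::Mat map1, map2, undistorted;
    cv::initUndistortRectifyMap(cameraMatrix, distCoeffs, cv::Mat(), cameraMatrix,
                                distorted.size(), CV_32FC1, map1, map2);
    cv::remap(distorted, undistorted, map1, map2, cv::INTER_LINEAR);
    return undistorted;
}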

3.4 Detecting Features

Working with the entire images would slow down the system a lot, so a common procedure is to detect the parts of the image that are descriptive. This is done using a feature extraction method.

3.4.1 FAST

FAST (Features from Accelerated Segment Test) [13] is a feature extraction method. In this context a feature indicates a point of interest in the image, such as a corner. FAST is faster than traditional detection methods like SIFT [18] and Harris corner detection [9]. Experiments in [13] have shown that using FAST in combination with sub-pixel accuracy gives ten times faster calculations compared to the methods mentioned above. It is therefore particularly suited for real-time applications [13].

FAST works by evaluating pixel values on a circle of radius r, as seen in Figure 3.4. The circle is centred around the point of interest. Typically r is 3, which gives a circle of 16 pixels, numbered as in Figure 3.4. Instead of evaluating all pixels in the neighbourhood of the point of interest, only the pixels on the circle of radius r are evaluated. The method has an early exit, meaning that if the point is unlikely to be descriptive after a few examined pixels, it is discarded: only if at least three of the examined pixels 1, 5, 9 and 13 are brighter/darker than p ± t, where t is a threshold, does the algorithm proceed to check the remaining twelve pixels. If at least n adjoining pixels are brighter/darker than p ± t, the point of interest p is considered a feature. Non-maximal suppression is used to spread out and limit the number of interest points. FAST is deterministic and gives the same feature points every time under the same circumstances at different time instances. However, if for instance the light conditions change slightly, FAST will give another set of features, with most of the previously found features probably preserved [13].

Figure 3.4. FAST corner detector: The point p is in the center of a circle of 16 pixels.
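Below is a hedged sketch of the per-sub-frame FAST detection described in Section 2.4: the image is split into a grid, each cell keeps its own threshold, and the threshold is nudged so that every cell contributes roughly the same number of corners. The grid size, target count and adjustment rule are illustrative choices, not the thesis's actual values.

#include <opencv2/features2d/features2d.hpp>
#include <algorithm>
#include <vector>

// Detect FAST corners per grid cell, adapting each cell's threshold between frames.
void detectFastGrid(const cv::Mat& gray, std::vector<cv::KeyPoint>& keypoints,
                    std::vector<int>& thresholds, int rows, int cols, int target) {
    keypoints.clear();
    if ((int)thresholds.size() != rows * cols) thresholds.assign(rows * cols, 20);
    const int cw = gray.cols / cols, ch = gray.rows / rows;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            cv::Rect roi(c * cw, r * ch, cw, ch);
            std::vector<cv::KeyPoint> local;
            int& t = thresholds[r * cols + c];
            cv::FAST(gray(roi), local, t, true);          // non-maximal suppression on
            // Simple control loop: raise the threshold if the cell yields too many
            // corners, lower it if it yields too few.
            if ((int)local.size() > 2 * target)       ++t;
            else if ((int)local.size() < target / 2)  t = std::max(1, t - 1);
            for (size_t i = 0; i < local.size(); ++i) {   // shift back to full-image coordinates
                local[i].pt.x += (float)roi.x;
                local[i].pt.y += (float)roi.y;
                keypoints.push_back(local[i]);
            }
        }
    }
}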

3.5 Feature Descriptor

In order to match extracted features, there is a need to describe them. Matching features is one of the most time-consuming steps, and it is therefore crucial to optimize this step.

3.5.1 BRIEF

BRIEF is an efficient feature point descriptor which uses pixel intensity difference tests in a small patch around the feature centre. The individual bits are obtained by comparing pairs of points on the same line. The result is stored in a relatively small number of bits, and comparisons between features are made using the Hamming distance. Only 32 bytes are used, which in most situations is sufficient to describe the feature uniquely [4].

The Hamming distance between two features is the minimum number of bit substitutions needed to convert one feature into the other. The bit comparison can be made even faster if XOR and bit-count operations are available in the CPU instruction set [4].

A disadvantage of BRIEF is that it is not scale or rotation invariant, and therefore the differences between two subsequent images must not be too large. Due to its low computational time per feature it was chosen anyway.
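For completeness, this is what the XOR-and-popcount comparison mentioned above can look like for two 32-byte BRIEF descriptors. The GCC/Clang builtin __builtin_popcount is assumed to be available; in practice the matcher used in the next section performs this comparison internally.

#include <stdint.h>

// Hamming distance between two 32-byte descriptors: XOR each byte pair and count set bits.
int hammingDistance32(const uint8_t* a, const uint8_t* b) {
    int dist = 0;
    for (int i = 0; i < 32; ++i)
        dist += __builtin_popcount((unsigned)(a[i] ^ b[i]));
    return dist;
}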

3.6 Match Features

For each time instance, the descriptors of the features in the current left image are matched with the corresponding descriptors of the features in the other three images: current right, previous left and previous right. The number of features in each of the four images is N, with minor variations. If all N features were matched against all features in the other three frames, the possible number of matches would become N² for each of the three matched images, and the matching procedure is typically one of the most time-consuming steps. However, if the number of comparisons grows approximately linearly instead of quadratically, much time is gained. This is achieved by placing restrictions on how far objects can have moved from one image to the next, simply by only matching features within the same rectangular box. In this way only the descriptors of the features most likely to match are compared.
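A hedged sketch of the describe-and-match step with OpenCV 2.4-style classes is given below: 32-byte BRIEF descriptors (as above) are computed for two images and matched by brute force under the Hamming norm. The function name is a placeholder, and the restriction to the rectangular boxes described above is omitted for brevity.

#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Compute 32-byte BRIEF descriptors for two keypoint sets and match them with the
// Hamming distance. crossCheck = true keeps only mutual best matches.
void matchBrief(const cv::Mat& imgA, std::vector<cv::KeyPoint>& kpA,
                const cv::Mat& imgB, std::vector<cv::KeyPoint>& kpB,
                std::vector<cv::DMatch>& matches) {
    cv::BriefDescriptorExtractor brief(32);
    cv::Mat descA, descB;
    brief.compute(imgA, kpA, descA);
    brief.compute(imgB, kpB, descB);

    cv::BFMatcher matcher(cv::NORM_HAMMING, true);
    matcher.match(descA, descB, matches);
}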

3.7 Verify Correct Matches

Verification of the matches is essential before proceeding to the triangulation. Falsely matched features would limit the chances of correctly estimating the movement between two time instances. Therefore the matches are verified by two methods, starting with a simple geometric verification and continuing with an epipolar verification step.

3.7.1 Geometric Verification

When matching two stereo images, features are found at approximately the same y-coordinate, while the x-coordinate varies with the distance to the object. In applications where the movement is slow relative to the frame rate, a harder restriction on the acceptable range can be applied: for instance, the difference in the y-coordinates in the stereo pair is always less than 15 pixels, and in the image sequences features are never more than 50 pixels away from their position in the previous frame. Geometric verification is applied both in the spatial domain and in the time domain. In the latter case this is done to limit the number of outliers in the epipolar verification, since that verification method uses RANSAC [6].

3.7.2 Epipolar Verification

The geometry of stereo vision is called epipolar geometry and is based on the pinhole camera model. When two cameras are facing the same scene, there are a number of geometric relations between a 3D point and its projections onto the images.

Figure 3.5. Epipolar geometry explained. Slightly modified image, original image from

A point X in the 3D world is imaged in two views: x_l in the left image and x_r in the right image. The camera centres C_l and C_r define, together with X, a plane π.

The epipoles e_l and e_r are located at the intersections of the baseline joining the two camera centres with the image planes. An epipolar plane is a plane that contains the baseline, and the intersection of such a plane with the image planes defines the epipolar lines. The epipolar lines pass through their respective epipoles and can be expressed as

$$l_r = e_r \times x_r = [e_r]_\times x_r \qquad (3.12)$$

in Plücker coordinates. Epipolar geometry can be represented algebraically by the fundamental matrix F. According to [11], any point X on a plane Π not passing through the camera centres is projected onto the 2D points x_l and x_r such that there exists a homography mapping H_Π from x_l to x_r,

$$x_r = H_\Pi x_l \qquad (3.13)$$

Equations (3.12) and (3.13) give

$$l_r = [e_r]_\times x_r = [e_r]_\times H_\Pi x_l = F x_l, \qquad (3.14)$$

where $F = [e_r]_\times H_\Pi$. For each point x_l in the left image there exists an epipolar line l_r in the right image, and vice versa.

A point x_r on the line l_r satisfies the relation $x_r^T l_r = 0$. Therefore, the epipolar constraint (3.15) is derived from (3.14):

$$x_r^T F x_l = 0 \qquad (3.15)$$

The epipolar constraint is used to verify that two image points correspond to the same 3D point. If the relation $x_r^T F_{l/r}\, x_l < \epsilon$ is met, where ε is a small number, the points correspond to the same 3D point in the spatial domain. Similarly, if $x_{t-1}^T F_{t/t-1}\, x_t < \epsilon$ is met, they correspond to the same 3D point in the time domain.

The fundamental matrix can either be derived from the camera matrices and their known relative position or be estimated from point correspondences. F is a 3 × 3 matrix defined up to a scale factor, and thus has eight independent coefficients. Furthermore, F has rank two, which removes another degree of freedom; therefore F has seven degrees of freedom, and at least seven point correspondences are required to estimate it [11].


F Derived from the Camera Matrices

The matrix used to verify that all matches between the left/right image pairs are correct is denoted F_{l/r}. In the camera calibration, the two camera matrices P_l and P_r are determined, and the fundamental matrix can be derived from them. Since the calibration is done only once, the fundamental matrix between the left and the right cameras is the same for all time instances.

A point X is projected into the left image as x_l through the camera matrix P_l, $x_l \sim P_l X$. The pseudo-inverse of P_l, such that $P_l P_l^+ = I$, is denoted $P_l^+$. The back-projection of x_l is therefore $X \sim P_l^+ x_l$.

The 3D point X viewed from the right camera can then be written as

$$x_r \sim P_r X \sim P_r P_l^+ x_l \qquad (3.16)$$

The left camera centre viewed from the right camera is $P_r C_l$. The epipolar line l_r joining these two points is $l_r = (P_r C_l) \times (P_r P_l^+ x_l)$. Thus $l_r = [e_r]_\times (P_r P_l^+)\, x_l = F_{l/r}\, x_l$, with $F_{l/r} = [e_r]_\times P_r P_l^+$.
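The derivation above translates almost directly into code. The hedged sketch below computes F_{l/r} = [e_r]_× P_r P_l^+ with OpenCV, assuming (as in this thesis) that the left camera centre is the world origin, so that C_l = (0, 0, 0, 1)^T, and that P_l and P_r are 3×4 matrices of type CV_64F; cv::invert with DECOMP_SVD is used for the pseudo-inverse.

#include <opencv2/core/core.hpp>

// Fundamental matrix of a fixed stereo pair from its two 3x4 camera matrices.
cv::Mat fundamentalFromCameras(const cv::Mat& Pl, const cv::Mat& Pr) {
    cv::Mat Cl = (cv::Mat_<double>(4, 1) << 0, 0, 0, 1);   // left camera centre (world origin)
    cv::Mat er = Pr * Cl;                                  // epipole in the right image
    cv::Mat erx = (cv::Mat_<double>(3, 3) <<
                        0,            -er.at<double>(2),   er.at<double>(1),
          er.at<double>(2),                           0,  -er.at<double>(0),
         -er.at<double>(1),   er.at<double>(0),                          0);
    cv::Mat Plpinv;
    cv::invert(Pl, Plpinv, cv::DECOMP_SVD);                // 4x3 pseudo-inverse of Pl
    return erx * Pr * Plpinv;                              // F_l/r = [e_r]x Pr Pl^+
}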

F Estimated with RANSAC

The matrix used to verify that all matches between two subsequent frames in time are correct is denoted F_{t/t-1}. Just as there is an epipolar relation between the features in the current left and right images, there is an epipolar relation between two subsequent images in time from the same camera. However, since the motion between two time instances is unknown, this epipolar relation is unknown as well. Thus, RANSAC [6] is used to find the fundamental matrix. Using RANSAC requires that most of the matches are inliers for it to converge within a reasonably short amount of time.

The system considers a feature in one image to be an inlier if its corresponding feature is located within 2 pixels of the corresponding epipolar line in the other image. All other correspondences are rejected. The epipolar relation between current left and previous left is equivalent to the relation between current right and previous right, and image points in these images are connected by the fundamental matrix denoted F_{t/t-1}.

Moreover, it is possible to match features diagonally, meaning that, for instance, features in the current left and the previous right image are connected by the relation $x_{r/t-1}^T F_{l,t/r,t-1}\, x_{l/t} = 0$. This means that not only features between current left and previous left are used in the RANSAC verification, but features from all four combinations.
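In OpenCV the time-domain verification described above can be done with a single call; a hedged sketch is shown below. The 2-pixel epipolar distance matches the inlier threshold in the text, while the 0.99 confidence level and the function name are illustrative choices.

#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

// Estimate F between two frames with RANSAC and mark inliers/outliers.
cv::Mat epipolarVerify(const std::vector<cv::Point2f>& prevPts,
                       const std::vector<cv::Point2f>& currPts,
                       std::vector<unsigned char>& inlierMask) {
    // 2.0 = maximum distance to the epipolar line in pixels, 0.99 = RANSAC confidence.
    cv::Mat F = cv::findFundamentalMat(prevPts, currPts, cv::FM_RANSAC, 2.0, 0.99, inlierMask);
    // inlierMask[i] == 0 marks match i as rejected.
    return F;
}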

3.8 Triangulation

Triangulation involves both the reconstruction of 3D points and the correction of the 2D features. The correction is made with a method called optimal correction.

3.8.1 Optimal Correction

Back-projecting the image coordinates x_l and x_r creates rays that start at the camera centres and pass through the 2D features in the image planes. If there were no errors in the 2D points, the rays would meet in a 3D point in the world. However, if there are errors in the measurements, a correction method is needed to decide where the rays meet. With the mid-point method, in the case where the two rays do not intersect, the reconstructed point is placed at the midpoint of the line segment orthogonal to the two back-projected rays [15]; see Figure 3.6.

Figure 3.6. Triangulation: (a) the mid-point method and (b) optimal correction. Image from [15] with permission.

Hartley and Sturm [10] came up with a better method that became the standard tool for triangulation. In Hartley-Sturm, x_l and x_r are adjusted slightly so that the epipolar constraint $x_r'^T F_{l/r}\, x_l' = 0$ is fulfilled for all points, and the rays then meet in 3D points in the world.

Kanatani introduced a third method which works in a similar way and gives the same results as Hartley-Sturm for all points except the epipoles, for which Hartley-Sturm has singularities while Kanatani's method does not. Kanatani's optimal correction is much faster than Hartley-Sturm and is therefore better suited for practical use. The points are corrected minimally, in other words with

$$\epsilon = \|x_l - \bar{x}_l\|^2 + \|x_r - \bar{x}_r\|^2 \qquad (3.17)$$

3.8.2 Reconstruction

In the reconstruction, the 3D points are calculated given their image coordinates and the camera matrices. A 3D point is projected into an image point as x = PX. Given that x × PX = 0, this can be rewritten with the cross-product operator

$$[x]_\times = \begin{bmatrix} 0 & -1 & x_2 \\ 1 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{bmatrix} \qquad (3.18)$$

as

$$\begin{bmatrix} [x_l]_\times P_l \\ [x_r]_\times P_r \end{bmatrix} X = 0 \implies A X = 0 \qquad (3.19)$$

The equation AX = 0 is solved by singular value decomposition (SVD) to find the nearest approximation of A. X is found in the last column of the orthogonal matrix V from the SVD.
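Below is a hedged sketch of the reconstruction step for one correspondence: the standard two-rows-per-view form of (3.19) is stacked into a 4×4 matrix A and the homogeneous 3D point is read from the null space of A (the last row of V^T in the SVD). The function name is a placeholder, and Pl and Pr are assumed to be 3×4 matrices of type CV_64F.

#include <opencv2/core/core.hpp>

// Triangulate one 3D point from corrected image points xl, xr and camera matrices Pl, Pr.
cv::Point3d triangulateDLT(const cv::Point2d& xl, const cv::Point2d& xr,
                           const cv::Mat& Pl, const cv::Mat& Pr) {
    cv::Mat A(4, 4, CV_64F);
    cv::Mat r;
    r = xl.x * Pl.row(2) - Pl.row(0);  r.copyTo(A.row(0));
    r = xl.y * Pl.row(2) - Pl.row(1);  r.copyTo(A.row(1));
    r = xr.x * Pr.row(2) - Pr.row(0);  r.copyTo(A.row(2));
    r = xr.y * Pr.row(2) - Pr.row(1);  r.copyTo(A.row(3));

    cv::SVD svd(A, cv::SVD::FULL_UV);
    cv::Mat X = svd.vt.row(3).t();     // null space of A: homogeneous 3D point
    X /= X.at<double>(3);              // normalise so that the last element is 1
    return cv::Point3d(X.at<double>(0), X.at<double>(1), X.at<double>(2));
}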

3.9 Point Cloud Fitting

All reconstructed features at one time instance form a 3D point cloud. By minimizing the distance between the currently reconstructed points and the previously reconstructed point cloud over rigid transformations, the sought parameters are found.

The parameters describe the translation and the rotation between subsequent image pairs: three parameters for the translation and three for the rotation. The rotation is described with the Euler angles φ, θ and ψ, which describe the rotations around the x, y and z axes respectively, as in (3.20).

$$R = \begin{bmatrix} \cos\psi & \sin\psi & 0 \\ -\sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & \sin\phi \\ 0 & -\sin\phi & \cos\phi \end{bmatrix}$$
$$= \begin{bmatrix} \cos\theta\cos\psi & \sin\phi\sin\theta\cos\psi + \cos\phi\sin\psi & -\cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi \\ -\cos\theta\sin\psi & -\sin\phi\sin\theta\sin\psi + \cos\phi\cos\psi & \cos\phi\sin\theta\sin\psi + \sin\phi\cos\psi \\ \sin\theta & -\sin\phi\cos\theta & \cos\phi\cos\theta \end{bmatrix} \qquad (3.20)$$

The minimization is made using the Levenberg-Marquardt (LM) algorithm, a standard tool for minimizing the sum of squares of non-linear functions. Its strength is that it works as a steepest-descent method when the current solution is far from a minimum and behaves like a Gauss-Newton method when the current solution is close to a minimum [17]. There is no guarantee that the method will find the global minimum.

Like all other non-linear optimization methods, the LM algorithm is iterative and tries to find the parameter vector p that minimizes $\epsilon = \|x - \hat{x}\|^2$ with $\hat{x} = f(p)$. Six parameters are found for each image, which together form the update matrix M_update as in (3.21).

$$M_{update} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \qquad (3.21)$$

$$M_{current} = M_{previous} M_{update} \qquad (3.22)$$

The current position matrix M_current is obtained when the update matrix is multiplied with the previous position matrix M_previous. The matrices are given in homogeneous form.
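A hedged sketch of the cost function behind the fit is given below: given the six parameters p = (φ, θ, ψ, t_x, t_y, t_z), the previous point cloud is rotated with R from (3.20) and translated, and the vector of point-wise differences to the current cloud is what a Levenberg-Marquardt routine (levmar in this project) would drive towards zero. The function names, the parameter ordering and the direction of the mapping are illustrative assumptions, not the thesis's actual code.

#include <opencv2/core/core.hpp>
#include <cmath>
#include <vector>

// Rotation matrix from Euler angles, composed as in equation (3.20): R = Rz(psi) Ry(theta) Rx(phi).
cv::Mat eulerToR(double phi, double theta, double psi) {
    cv::Mat Rx = (cv::Mat_<double>(3, 3) << 1, 0, 0,
                                            0, std::cos(phi), std::sin(phi),
                                            0, -std::sin(phi), std::cos(phi));
    cv::Mat Ry = (cv::Mat_<double>(3, 3) << std::cos(theta), 0, -std::sin(theta),
                                            0, 1, 0,
                                            std::sin(theta), 0, std::cos(theta));
    cv::Mat Rz = (cv::Mat_<double>(3, 3) << std::cos(psi), std::sin(psi), 0,
                                            -std::sin(psi), std::cos(psi), 0,
                                            0, 0, 1);
    return Rz * Ry * Rx;
}

// Residuals of the rigid-body fit: transform the previous cloud and compare with the current one.
// An LM solver minimises the squared norm of the residual vector res (length 3 * prev.size()).
void rigidResidual(const double* p, double* res,
                   const std::vector<cv::Point3d>& prev,
                   const std::vector<cv::Point3d>& curr) {
    cv::Mat R = eulerToR(p[0], p[1], p[2]);
    cv::Mat t = (cv::Mat_<double>(3, 1) << p[3], p[4], p[5]);
    for (size_t i = 0; i < prev.size(); ++i) {
        cv::Mat Xp = (cv::Mat_<double>(3, 1) << prev[i].x, prev[i].y, prev[i].z);
        cv::Mat d = R * Xp + t;
        res[3 * i + 0] = d.at<double>(0) - curr[i].x;
        res[3 * i + 1] = d.at<double>(1) - curr[i].y;
        res[3 * i + 2] = d.at<double>(2) - curr[i].z;
    }
}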

Chapter 4

Experiments and Results

4.1 Overview

In order to evaluate the system, it has to be tested rigorously. The system is tested in four different isolated experiments, each limited to at most one motion: standing still, rotation, forward motion and sideways motion. In addition, a free-float motion, meaning that the camera is not restricted to any axis but moves freely in space, is tested, as well as a motion based only on synthetic data. All six experiments are listed in Table 4.1.

Since the cameras capture images at a high frame rate and the movements of the stereo camera are slow during the recording, it is possible to skip images and, for instance, use only every 25th or 5th image. The travelled distance and rotation between images then become approximately 1 cm / 0.5 degrees instead of just a fraction of that. Otherwise all experiments are made with the same settings, unless something else is mentioned for a specific experiment.

The number of found features is usually in the range 400-700. Of these, a great number are discarded, and in most cases 150-250 verified matches are used per image frame. The number of processed images per second varies with the number of features, but the processing time is about one second per frame. However, this number could be improved significantly by doing fewer comparisons, avoiding printing feedback in the console window and not drawing visual feedback on the screen.

The results of the experiments do not have any randomness, except for the synthetic data, for which random errors were generated. Therefore, all experiments give the same results each time they are repeated.

Num. | Exp. name       | Num. of images | Description
1    | Standing Still  | 1493           | Standing still with minor light fluctuation and shakes.
2    | Rotation Motion | 860            | Rotation 90 degrees around the y-axis (counter-clockwise).
3    | Forward Motion  | 2669           | Moving 0.9 m forward in the direction of the camera.
4    | Sideways Motion | 2143           | Moving 0.9 m sideways with the camera facing forward.
5    | Free Float      | 2364           | A walk with the camera involving all four motions. Δx=0.23 m, Δy=0.10 m, Δz=-4.01 m.
6    | Synthetic Data  | 150            | 20 cm forward, 45° around y, 20 cm forward, 45° around y, 20 cm sideways to the right.

Table 4.1. The experiments made in the master thesis.

4.2 Error Measure and Expected Error

The error measure is given as a percentage of the travelled distance in each direction. A low value indicates that the system works well. An estimate of the expected error is derived from the geometric properties of the stereo camera system. It could be used to weight different 3D points, but is currently only used to get an understanding of which features are good and which are not. The average geometric error gives an indication of how well the system is expected to work.

4.2.1 Error of Travelled Distance

To evaluate how close the estimated final position is to the correct final position, the absolute difference between the estimated position and the actual position is divided by the travelled distance:

$$\epsilon_{distance} = |x - x'| \Big/ \sum_{i=1}^{t} |x_{i+1} - x_i| \qquad (4.1)$$

where x is the actual position and x' is the estimated position in the x-direction. Corresponding values for y and z are calculated in the same way. The mean of $\epsilon_{distance}$ over all three directions is used when, for instance, Forward Motion is compared with Sideways Motion. Unfortunately this error measure cannot be applied to the first two experiments, since the actual travelled distance there is 0.00 m.
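A hedged sketch of (4.1) in code: the end-point error in one coordinate divided by the path length travelled in that coordinate. The ground-truth path is assumed to be sampled per frame, and the function name is a placeholder.

#include <cmath>
#include <vector>

// Relative end-point error in one direction, as in (4.1).
double distanceError(const std::vector<double>& truePositions, double estimatedFinal) {
    double travelled = 0.0;
    for (size_t i = 1; i < truePositions.size(); ++i)
        travelled += std::fabs(truePositions[i] - truePositions[i - 1]);
    if (travelled == 0.0) return 0.0;   // undefined for Standing Still and Rotation Motion
    return std::fabs(truePositions.back() - estimatedFinal) / travelled;
}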

4.2.2 Estimation of Expected Error

The geometric error in the stereo camera system is calculated to determine the certainty of each 3D point. It is denoted ε_geometric and is defined as the distance by which each 3D point is expected to vary along the z-axis when it is reconstructed; its average is taken over all points and all time instances in a sequence.

Before two corresponding features are reconstructed into 3D points, they are minimally corrected with the optimal correction method, from x to x', as described in Section 3.8.1. Equation (3.8) gives the depth error (along the z-axis) of the stereo system for a given disparity s. Since we know how much each feature was corrected before it was reconstructed, it is possible to determine how exactly each 3D point is reconstructed. Reconstructed points far away are likely to have a greater geometric error than points reconstructed closer to the camera.

$$\epsilon_{geometric} = \frac{T f}{u (s_1 + s_2)^2} \qquad (4.2)$$

where s_1 is the corrected distance of a point in the left image and s_2 is the equivalent corrected distance in the right image [11].

Typically, a reconstructed point located 2 meters away has an uncertainty of 6 cm. This makes the measurements themselves quite uncertain; recall that the distance moved between two frames is commonly a little less than 1 cm. However, since the error in image space is assumed to be approximately Gaussian, the total average error decreases as the number of 3D points increases. Currently the geometric error is not used to weight the different 3D points, but this could easily be done in the future.

4.3 Experiment 1 - Standing Still

The first experiment is called Standing Still and evaluates how sensitive the system is to noise.

4.3.1 Experiment Setup

The camera is mounted facing a static scene, as seen in Figure 4.1. The camera is not moving; however, it vibrates slightly and small light fluctuations are present in the scene. The results from this image sequence show how sensitive the system is to measurement noise. The ground truth movement is no movement at all. Every 25th image in the sequence is used, which gives a total of 59 images.

Figure 4.1. A picture of the scene in Standing Still.

4.3.2 Results

The first column, Ground Truth, displays the correct values. The second column, Rotation and Translation, is the result when no prior assumption about the motion is made. In the third column, Translation, only the translation is estimated while the rotation is held fixed, and in the fourth column, Rotation, the translation is held fixed and only the rotation is estimated. These settings are applied to all the experiments. The results from the first experiment are presented in Table 4.2.

  | Ground Truth   | Rotation and Transl. | Translation      | Rotation
x | 0.00 m (0.00°) | -0.049 m (4.82°)     | -0.003 m (0.00°) | 0.00 m (-0.03°)
y | 0.00 m (0.00°) | 0.102 m (2.41°)      | -0.005 m (0.00°) | 0.00 m (-0.03°)
z | 0.00 m (0.00°) | -0.024 m (-2.57°)    | -0.017 m (0.00°) | 0.00 m (0.10°)

Table 4.2. Experiment results from Standing Still.

4.4 Experiment 2 - Rotation Motion

4.4.1 Experiment Setup

In the Rotation sequence the camera makes a 90-degree rotation counter-clockwise around the y-axis. Within the scene there is a window with bright light, which complicates things for the feature extraction algorithm. The Rotation sequence contains 860 images, four of which are depicted in Figure 4.2. Every 5th image in the sequence is used for the calculation of the rotation; the reason for not using every 25th image is that this gives more correctly matched features. There is neither translation nor rotation in any direction other than around the y-axis. In the beginning of the sequence most of the features are located near the camera, while later in the sequence they are further away.

Figure 4.2. Four pictures from the scene in Rotation. The rightmost image is the first image and the leftmost image is the image at which the rotation is complete.

4.4.2 Results

The results from the second experiment are presented in Table 4.3. The first value is the translation and the value in brackets is the rotation, given in degrees around that axis.

  | Ground Truth    | Rotation and Translation | Rotation
x | 0.00 m (0.00°)  | -1.227 m (7.54°)         | 0.00 m (-1.05°)
y | 0.00 m (90.00°) | 0.672 m (47.91°)         | 0.00 m (90.31°)
z | 0.00 m (0.00°)  | 2.147 m (10.21°)         | 0.00 m (5.85°)

Table 4.3. Experiment results from Rotation Motion.

Figure 4.3 shows the change in degrees around each axis for the two settings Rotation and Translation, and Rotation. The expected graph for the x- and z-axes is constant at zero degrees, while the expected result for y is an almost linear line from 0.00 degrees at the start of the motion to 90.00 degrees at the end.

Figure 4.3. (a) Rotation and Translation, (b) Rotation.

4.5 Experiment 3 - Forward Motion

4.5.1 Experiment Setup

The camera is mounted on a sliding rail and is moved along it, facing the direction of motion. At the start it is located at one end of the rail, and at the finish it is at the other end. The movement is thus limited to the z-direction; with the coordinate system defined as in Figure 2.2, the camera moves in the negative z-direction. It moves forward at an approximately constant velocity, makes a short stop at z = -0.50 m, then continues and finally stops at z = -0.90 m. The first and last of the 2669 frames are seen in Figure 4.4, and every 25th image is used to calculate the movement between frames. A moving person is visible in the background of the images, but features on him are rejected by RANSAC and do not affect the measurements.

Figure 4.4. Two pictures from the scene in Forward Motion. The left image is the first image and the right image is the last image.

4.5.2 Results

The results from the third experiment are presented in Table 4.4. The values in brackets are ε_distance.

  | Ground Truth | Rotation and Translation | Translation
x | 0.00 m       | 0.056 m (6.26%)          | -0.024 m (2.72%)
y | 0.00 m       | 0.023 m (2.55%)          | -0.024 m (2.67%)
z | -0.90 m      | -0.858 m (4.66%)         | -0.893 m (0.73%)

Table 4.4. Experiment results from Forward Motion.

Figure 4.5 shows the estimated travelled path for the settings Rotation and Translation, and Translation.

Figure 4.5. Estimated travelled path for the different settings, viewed from above: (a) Rotation and Translation, (b) Translation.

4.6 Experiment 4 - Sideways Motion

4.6.1 Experiment Setup

The camera is mounted on the same sliding rail as in Forward Motion, with the only difference that the camera now faces sideways as it moves along the rail. Since the direction of the camera in the first image defines the world coordinate system, the movement is limited to the negative x-direction. The camera moves sideways at an approximately constant velocity, makes a short stop at x = -0.50 m, then continues and finally stops at x = -0.90 m. The first and last images of the 2143-image sequence are seen in Figure 4.6. Only every 25th image is used, so only 86 small movements are calculated, which makes the average movement just over 1 cm per image. In the images a bookshelf and a person are present, and most features are found at a distance of 1.5-2.0 m.

Figure 4.6. Two pictures from the scene in Sideways Motion. The left image is the first image and the right image is the last image.

4.6.2 Results

The results from the fourth experiment are presented in Table 4.5. The values in brackets are ε_distance.

  | Ground Truth | Rotation and Translation | Translation
x | 0.90 m       | 1.055 m (17.22%)         | 0.925 m (2.77%)
y | 0.00 m       | -0.167 m (18.55%)        | -0.013 m (1.44%)
z | 0.00 m       | 0.120 m (13.33%)         | 0.184 m (20.04%)

Table 4.5. Experiment results from Sideways Motion.

Figure 4.7 shows the estimated travelled path for the settings Rotation and Translation, and Translation.

Figure 4.7. Estimated travelled path for the different settings, viewed from above: (a) Rotation and Translation, (b) Translation. The path starts at x=y=z=0.00 m, makes a short stop at x=0.40 m and finishes at x=0.90 m.

In Figure 4.8 a graph of the geometric error is displayed. The geometric error is quite stable around 6 cm, but is slightly lowered as the camera finds more features closer to the camera, in particular those located on the person at the beginning of the sequence. Even though the geometric error varies a lot between individual features, there are only small changes in the average geometric error between all five experiments, because the average distance to the features is approximately constant. The geometric error should be interpreted as the certainty with which the 3D-points are reconstructed; in this case it is within 6 cm, because the image coordinates are quite noisy.

Figure 4.8. The mean geometric error plotted against the image numbers in the image sequence.


4.7 Experiment 5 - Free Float

4.7.1 Experiment Setup

The fifth experiment was a free float movement, a combination of all movements evaluated in the experiments above. Basically it is a forward motion, but the motion is not limited to the z-axis, as seen in Figure 4.9, and the camera is rotating at the same time, which makes it very hard for the system to calculate the correct motion. The exact movement along the path is not known with sufficient accuracy, but we know that the starting position is x=y=z=0.00 m and the final position is x=0.23 m, y=0.10 m and z=-4.01 m. The camera is hand-held and the images are therefore quite shaky, so all 2364 images are used to estimate the movement.

Figure 4.9. Four pictures from the scene in Free Float. The leftmost image is one of the first images in the sequence and the rightmost image is one of the last images.

4.7.2 Results

In Table 4.6 the results for the experiment Free Float are displayed.

     Ground Truth   Rotation and Translation   Translation
x    0.23 m         -1.814 m                   0.509 m
y    0.10 m         -1.286 m                   -2.748 m
z    -4.01 m        -9.076 m                   -15.856 m

Table 4.6. Experiment results from Free Float.

Figure 4.10 shows how the estimated position is updated for the image sequence Free Float.


Figure 4.10. Estimated travelled path for the settings (a) Translation and (b) Rotation and Translation, viewed from above.


4.8 Experiment 6 - Synthetic Data

4.8.1 Experiment Setup

In addition to the experiments with real data, synthetic data was used to verify that the solution works well with noise-free data. A sequence of 150 frames containing forward motion, rotations and sideways motion was constructed, as depicted in Figure 4.11. The total motion was: 20 forward motions of 1 cm each, followed by 45 rotations of 1° around the y-axis. This was performed twice, and finally 20 sideways motions to the right of 1 cm each concluded the motion.

Synthetic 3D-points in the world were projected through the camera matrices into image coordinates. In addition to the centre point, located at (0.0, 0.0, -2.0), 26 other 3D-points were spread out in all directions, forming a 3*3*3 cube with 25 cm between each point and its closest neighbours. The 3D-points for the previous position were calculated by hand in such a way that the motion between two time instances became the wanted one. Apart from using synthetic data instead of detected feature points, the same system was used to determine the motion.

Later, noise with a uniform distribution between -0.5 and 0.5 pixels was added to each feature's image coordinates. Since the noise is random, a different path was obtained for each run. The result from one of them, still representative of all of them, is included in Table 4.7 and its path is shown in Figure 4.11.
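To make the setup concrete, the C++ sketch below generates the same kind of synthetic data: a 3*3*3 cube of 3D-points centred at (0.0, 0.0, -2.0) with 25 cm spacing, projected with a pinhole model and disturbed by uniform noise of at most half a pixel. The intrinsic parameters (focal length, principal point) and the sign convention are assumptions for illustration only, not the calibration or camera model used in the thesis implementation.

    #include <random>
    #include <vector>

    struct Point3 { double x, y, z; };
    struct Pixel  { double u, v; };

    // Pinhole projection with the camera looking down the negative z-axis,
    // so that points in front of the camera have z < 0. The intrinsics
    // fx, fy, cx, cy are placeholder values, not the thesis calibration.
    Pixel project(const Point3& X, double fx, double fy, double cx, double cy)
    {
        return { fx * X.x / (-X.z) + cx,
                 fy * X.y / (-X.z) + cy };
    }

    int main()
    {
        // 3*3*3 cube of synthetic 3D-points centred at (0.0, 0.0, -2.0),
        // with 25 cm between neighbouring points.
        std::vector<Point3> cloud;
        const double d = 0.25;
        for (int ix = -1; ix <= 1; ++ix)
            for (int iy = -1; iy <= 1; ++iy)
                for (int iz = -1; iz <= 1; ++iz)
                    cloud.push_back({ ix * d, iy * d, -2.0 + iz * d });

        // Project every point and add uniform noise in [-0.5, 0.5] pixels
        // to each image coordinate, as in the noisy synthetic experiment.
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> noise(-0.5, 0.5);
        std::vector<Pixel> observations;
        for (const Point3& X : cloud) {
            Pixel p = project(X, 800.0, 800.0, 320.0, 240.0);
            p.u += noise(rng);
            p.v += noise(rng);
            observations.push_back(p);
        }
        return 0;
    }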

4.8.2 Results

In Table 4.7 the results for the experiment Synthetic Data are displayed.

     Ground Truth   Rotation and Translation   Rotation and Translation + noise
x    -0.1421 m      -0.1409 m (0.20%)          -0.1562 m (2.35%)
y    0 m            -0.0022 m (0.37%)          -0.0890 m (14.83%)
z    -0.5414 m      -0.5417 m (0.05%)          -0.6335 m (15.35%)

Table 4.7. Experiment results from Synthetic Data.

Figure 4.11 shows how the estimated position is updated for the image sequence based on synthetic data.


Figure 4.11. Estimated travelled path for the Rotation and Translation setting, (a) without noise and (b) with noise added to the image coordinates. The correct path is within millimetres of the path in the left image; the right image shows the same path but with added noise on the image coordinates.


Chapter 5

Discussion and Conclusions

5.1 Discussion

The experiments in the previous chapter show varying results: in some cases the system works well, while in other situations it does not. The results will therefore be discussed with the aim of pointing out where the system fails and which parts should be improved to obtain a fully working system.

5.1.1 Synthetic Data

Synthetic data was used to verify that the system works well on noise-free data. The experiments showed that with synthetic data the reconstruction of the 3D-points was very exact, with an average reconstruction error of 3.1 · 10^-6 m. The motion for each frame can therefore be determined with a similar accuracy; for instance, 1 cm of forward motion is estimated with an accuracy of 7.23 · 10^-6 m. After 150 mixed motions the error is less than 0.2 mm.

However, with added noise the reconstruction is significantly worse, and as a consequence the motion estimation is also affected. The noise can be interpreted as either reading errors, rounding errors to the nearest pixel, or even minor matching errors. The added noise is at most half a pixel off the correct value, yet even with several feature points per estimation the motion estimates become poor. Luckily, the errors cancel out to some extent, and the estimate of the final position is not too bad, even if the path reveals that the final result is better than the individual estimations. The error is approximately in the same range as for the experiments with real data.

From the experiments with synthetic data we can conclude two things. Firstly, the system gives very good results provided that the image coordinates are correct. Secondly, the system is very sensitive to image noise, and therefore it does not work as well with real data as it does with noise-free data.

Even small errors in the image coordinates give fairly large errors in the estimated motion. A feature detector with sub-pixel accuracy is therefore probably needed.


5.1.2 Distinguish Rotation from Translation

Throughout all experiments the setting Translation performs better than the setting Rotation and Translation. This is because in Translation only the translation is estimated, while in Rotation and Translation both the rotation and the translation are estimated. In the experiments with the setting Translation the rotation is assumed to be fixed throughout the sequence, and in Rotation and Translation six parameters have to be found instead of just three.
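To make the difference between the two settings concrete, the C++ sketch below spells out the cost functions the minimizer works with. It assumes that the three rotation parameters are Euler angles and that the rigid transform is applied to the current point cloud to align it with the previous one; the actual parameterization, transform direction and minimizer in the implementation may differ.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point3 { double x, y, z; };

    // Rotate a point by Euler angles (rx, ry, rz) about the x-, y- and z-axes.
    // This is one possible parameterization of the three rotation parameters.
    Point3 rotate(const Point3& p, double rx, double ry, double rz)
    {
        Point3 a { p.x,
                   std::cos(rx) * p.y - std::sin(rx) * p.z,
                   std::sin(rx) * p.y + std::cos(rx) * p.z };   // about x
        Point3 b { std::cos(ry) * a.x + std::sin(ry) * a.z,
                   a.y,
                  -std::sin(ry) * a.x + std::cos(ry) * a.z };   // about y
        return { std::cos(rz) * b.x - std::sin(rz) * b.y,
                 std::sin(rz) * b.x + std::cos(rz) * b.y,
                 b.z };                                         // about z
    }

    // Rotation and Translation setting: six parameters,
    // params = {tx, ty, tz, rx, ry, rz}.
    double costRigid(const std::vector<Point3>& prev,
                     const std::vector<Point3>& curr,
                     const double params[6])
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < prev.size(); ++i) {
            Point3 q = rotate(curr[i], params[3], params[4], params[5]);
            double dx = prev[i].x - (q.x + params[0]);
            double dy = prev[i].y - (q.y + params[1]);
            double dz = prev[i].z - (q.z + params[2]);
            sum += dx * dx + dy * dy + dz * dz;
        }
        return sum;
    }

    // Translation setting: the rotation is kept fixed and only the three
    // translation parameters are estimated.
    double costTranslation(const std::vector<Point3>& prev,
                           const std::vector<Point3>& curr,
                           const double t[3])
    {
        const double rigid[6] = { t[0], t[1], t[2], 0.0, 0.0, 0.0 };
        return costRigid(prev, curr, rigid);
    }

Minimizing costRigid over six parameters corresponds to the Rotation and Translation setting, while minimizing costTranslation over three parameters corresponds to the Translation setting.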

Errors made early in the estimation will, for both settings, never be compensated for, since the current position is just an accumulation of previous estimations. How an error affects the final result depends on what kind of error it is. If it is a translation error, meaning that a position update is wrongly scaled or even in the wrong direction, it will only affect the final result in the same way as it affected that particular position update. However, an error in the rotation parameters will not only affect that particular position update but also all succeeding updates. A wrongly estimated rotation at the beginning of an image sequence will therefore affect the result more than the same error made at the end.
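The accumulation can be written out explicitly. The following is one common way to express it (the conventions in the implementation may differ), which makes the asymmetry between the two error types visible:

    \begin{align}
      R_k &= R_{k-1}\,\Delta R_k, \\
      \mathbf{p}_k &= \mathbf{p}_{k-1} + R_{k-1}\,\Delta\mathbf{t}_k,
    \end{align}

where $(\Delta R_k, \Delta\mathbf{t}_k)$ is the estimated motion between frames $k-1$ and $k$. An error in $\Delta\mathbf{t}_k$ shifts $\mathbf{p}_k$ and all later positions by one fixed offset, whereas an error in $\Delta R_k$ corrupts $R_{k-1}$ in every later update and therefore rotates all subsequent translation increments into the wrong direction.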

The most important thing for the Rotation and Translation setting is that the minimizer can correctly distinguish the rotation a from the translation b, as illustrated in Figure 5.1 on page 41. However, the measurements are noisy, which leads to incorrectly reconstructed 3D-points, and it therefore happens now and then that translations are taken for rotations or the other way around. The reconstructed 3D-points contain noise, but the noise level is not so high that it becomes difficult to distinguish, for instance, a translation in x from a translation in z. However, when six rotation and translation parameters are estimated, the noise level is too high. The noise in the 3D-points affects the Rotation and Translation setting more than the Translation setting. The only difference between the two settings is the number of parameters, and it is obvious that the minimizer cannot distinguish them at all times.
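The confusion can be illustrated with a small-angle argument (an illustration under simplifying assumptions, not an analysis taken from the implementation). For a feature near the optical axis at depth $Z$, a small rotation $\theta$ around the y-axis and a small sideways translation $t_x$ give the approximate image displacements

    \begin{align}
      \Delta u_{\mathrm{rotation}} &\approx f\,\theta, &
      \Delta u_{\mathrm{translation}} &\approx \frac{f\,t_x}{Z},
    \end{align}

so the two motions produce nearly the same image displacement when $t_x \approx Z\theta$. With most features at 1.5-2.0 m, a rotation of one degree is almost indistinguishable from a sideways translation of roughly 3 cm, and sub-pixel noise in the image coordinates is then enough to tip the minimizer from one explanation to the other.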

Even though the setting Translation performs much better than Rotation and Translation, it cannot be used alone in practice as the latter can. In most situations the system depends on something that keeps track of in which direction it is moving. For instance, a gyroscope could advantageously be used to either estimate the direction directly or support the visual estimation of the rotation. Since the Translation part performed well, it is probable that the setting Translation combined with a gyroscope would improve the system compared to the Rotation and Translation setting.
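As an illustration of how such support could be used (a sketch under the assumption that the gyroscope can supply the inter-frame rotation as a rotation matrix; the thesis does not specify an integration scheme), the translation that minimizes the point-to-point distance for a given rotation has a simple closed form: the difference between the centroid of the previous point cloud and the rotated centroid of the current one.

    #include <cstddef>
    #include <vector>

    struct Point3 { double x, y, z; };

    // Apply a 3x3 rotation matrix (row-major) to a point.
    Point3 applyRotation(const double R[9], const Point3& p)
    {
        return { R[0] * p.x + R[1] * p.y + R[2] * p.z,
                 R[3] * p.x + R[4] * p.y + R[5] * p.z,
                 R[6] * p.x + R[7] * p.y + R[8] * p.z };
    }

    // With the inter-frame rotation Rgyro taken from a gyroscope, the
    // translation minimizing the sum of squared point-to-point distances is
    // the difference between the centroid of the previous cloud and the
    // rotated centroid of the current cloud.
    Point3 estimateTranslation(const std::vector<Point3>& prev,
                               const std::vector<Point3>& curr,
                               const double Rgyro[9])
    {
        Point3 cPrev { 0.0, 0.0, 0.0 };
        Point3 cCurr { 0.0, 0.0, 0.0 };
        for (std::size_t i = 0; i < prev.size(); ++i) {
            cPrev.x += prev[i].x;  cPrev.y += prev[i].y;  cPrev.z += prev[i].z;
            Point3 q = applyRotation(Rgyro, curr[i]);
            cCurr.x += q.x;        cCurr.y += q.y;        cCurr.z += q.z;
        }
        const double n = static_cast<double>(prev.size());
        return { (cPrev.x - cCurr.x) / n,
                 (cPrev.y - cCurr.y) / n,
                 (cPrev.z - cCurr.z) / n };
    }

This keeps the three translation parameters from the Translation setting but removes its fixed-rotation assumption, which is what limits that setting in practice.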
