
Degree project, 30 credits. May 2019

Improving a stereo-based visual odometry prototype with global optimization

Felix Verpers


Abstract

Improving a stereo-based visual odometry prototype with global optimization

Felix Verpers

In this degree project, global optimization methods for a previously developed software prototype of a stereo odometry system were studied. The existing software estimates the motion between stereo frames and builds up a map of selected stereo frames, which accumulates increasing error over time. The aim of the project was to study methods to mitigate the error accumulated in the step-wise motion estimation.

One approach based on relative pose estimates and another based on reprojection optimization were implemented and evaluated for the existing platform. The results indicate that optimization based on relative keyframe estimates is promising for real-time usage. The second strategy, based on reprojection of stereo-triangulated points, proved useful as a refinement step, but its relatively small error reduction comes at an increased computational cost. This approach therefore requires further improvements before it becomes applicable in situations where corrections are needed in real time.

The results also show that the global optimization primarily improves the absolute trajectory error.

Subject reader: Nataša Sladoje

Supervisor: Jimmy Jonsson


Summary

The past decade has seen a clear trend towards autonomous vehicles, which have the potential to make processes more efficient, eliminate or reduce risks, and ease the workload of people who previously had to control the vehicles entirely manually. This has been made possible by increased computing power and improved algorithms. An important component of autonomous vehicles is an accurate representation of the surroundings, needed in order to navigate them. Today, such a representation is obtained with the help of various sensors.

Cameras are cheap sensors which, together with ever cheaper processors, have enabled great progress in research and industry. A calibrated pair of cameras, a so-called stereo pair, can together with algorithms imitate the human visual system and provide detailed depth information about the surroundings, something that is exploited today in, for example, drones, cars and underwater vehicles.

This project started from an existing software and hardware prototype of a camera system for 3D mapping and navigation, based on just such a stereo pair. The existing system estimated the motion between images from the camera pair while depth information was saved from selected time steps to build up a 3D map. Since the motion estimation between image time instances is not perfect, a global error accumulates, which makes the 3D map inconsistent over time.

The work in this report was done in parallel with another project; in combination, the two aimed to improve the software part of the existing prototype. This work focuses on corrections to the 3D map after the global error has somehow been observed.

Two methods for correcting the mapping result were implemented and investigated for the prototype. The methods were evaluated quantitatively on open data collected from drones in two different environments, and qualitatively on data collected from the prototype hardware. Convergence time was also investigated. The results show that the method which minimizes the error in relative position-orientations is well suited for real-time corrections. A second method, based on reprojection of triangulated points, was also implemented and investigated. The second method improved the absolute error further but is more computationally heavy and less suitable for real-time corrections of the map.


Contents

1 Introduction
2 Background
  2.1 Rotation representations
    2.1.1 Euler angles
    2.1.2 Rotation matrices
    2.1.3 Quaternions
  2.2 3D poses
  2.3 Camera projection
  2.4 Image point matching
  2.5 Stereo vision
    2.5.1 Triangulation
  2.6 Visual SLAM overview
    2.6.1 Motion estimation
    2.6.2 Loop closure detection and optimization
    2.6.3 Visual SLAM vs odometry
  2.7 Global optimization
    2.7.1 Maximum likelihood and least squares
    2.7.2 Solving non-linear least squares
    2.7.3 Convergence for Gauss-Newton and Levenberg-Marquardt methods
    2.7.4 Solving the linearized system
    2.7.5 Pose parametrization
3 Overview of the system
4 Implementation
  4.1 Pose Graph Optimization
  4.2 Building the graph
  4.3 Reprojection error optimization
  4.4 Gauge freedom
  4.5 Modeling the uncertainty
5 The g2o framework
6 Evaluation
  6.1 Qualitative evaluation
  6.2 Quantitative evaluation
    6.2.1 Optimization configurations
    6.2.2 The EuRoC MAV Dataset
    6.2.3 Quantitative metrics
    6.2.4 Relative Pose Error
    6.2.5 Absolute Trajectory Error
  6.3 Convergence
7 Adaption of calibration parameters and coordinates
  7.1 Intrinsic parameters
  7.2 Extrinsic parameters
8 Differing coordinate frame conventions
  8.1 The existing software prototype
  8.2 EuRoC
  8.3 Camera coordinate conventions
  8.4 Conversion of extrinsic parameters
  8.5 Relating the coordinate frames
  8.6 Using ground truth as loop closure constraint
9 Results
  9.1 Original trajectories
  9.2 ATE for optimized trajectories
  9.3 RPE for optimized trajectories
  9.4 Convergence plots
  9.5 Qualitative results
    9.5.1 EuRoC trajectory plots
    9.5.2 Garage sequences
10 Discussion
  10.1 Future work
A Multivariate Gaussian distribution


List of Figures

1 An example of a reconstruction of a garage environment. The upper-left corner shows the last image captured in the sequence. The green outline shows the estimated position trajectory.
2 The pinhole camera model.
3 Epipolar geometry.
4 Stereo triangulation.
5 Pose graph illustration.
6 Illustration of the idea behind the reprojection optimization.
7 Graph visualization for the reprojection optimization.
8 Relation between focal length and field of view in the x dimension.
9 Camera coordinate system assumed in the SLAM software.
10 Camera coordinate system assumed in EuRoC.
11 Convergence of PGO for V1-02 (66-78).
12 Convergence of reprojection optimization after PGO for V1-02 (66-78).
13 Un-optimized MH03 output.
14 Optimized MH03 output.
15 Un-optimized V2-01 output.
16 Optimized V2-01 output.
17 Sequence 001 un-optimized.
18 Sequence 001 after PGO.
19 Sequence 001 un-optimized.
20 Sequence 001 after reprojection optimization.
21 Sequence 007 un-optimized.
22 Sequence 007 after a combination of PGO and reprojection optimization.
23 Sequence 007 un-optimized.
24 Sequence 007 after a combination of PGO and reprojection optimization.

List of Tables

1 ATE and RPE for the trajectories estimated by the existing algorithm.
2 ATE results for Gauss-Newton.
3 ATE results for Levenberg-Marquardt.
4 ATE results for covariance-scaled Gauss-Newton.
5 ATE results for covariance-scaled Levenberg-Marquardt.
6 ATE results for RO alone.
7 RPE results for Gauss-Newton.
8 RPE results for Levenberg-Marquardt.
9 RPE results for covariance-scaled Gauss-Newton.
10 RPE results for covariance-scaled Levenberg-Marquardt.
11 RPE results for RO alone.


Abbreviations

Short   Expanded
SLAM    Simultaneous Localization and Mapping
GPS     Global Positioning System
IMU     Inertial Measurement Unit
FOV     Field Of View
PGO     Pose Graph Optimization
RPE     Relative Pose Error
ATE     Absolute Trajectory Error
RMSE    Root Mean Squared Error


1 Introduction

Development of autonomous and unmanned vehicles has been a topic of great interest in academia and industry in recent years. Potential benefits have been recognized in terms of reduced risk and inconvenience for humans and increased operational efficiency.

For these types of vehicles it is important to have accurate knowledge of the location of the vehicle and of the previously unknown environment in which it is moving. A good representation of the world in which the vehicle operates is important if the information is to be used, for example, for surveying or for navigation tasks like path planning. The location can be estimated by incrementally estimating motion from sensor data; this is known as odometry. The incremental estimates will not be perfect and will eventually lead to accumulated error, known as drift.

As a relevant example, Figure 1 shows the effect of drift in one of the maps produced by the system being considered in this work. The figure shows the reconstruction of a right-angled room, shown in the upper-left corner. As can be seen, the reconstruction does not correspond well with the real world.

Figure 1: An example of a reconstruction of a garage environment. The upper- left corner shows the last image captured in the sequence. The green outline shows the estimated position trajectory.

A system which simultaneously estimates location and environment structure is known as a Simultaneous Localization and Mapping (SLAM) system. The system considered in this work is a prototype of such a system, developed by SAAB, where the sensors used are a pair of cameras.

The goal of the work is to investigate how global optimization can be utilized to make corrections to the estimated trajectory and reconstruction of the environment, reducing the effects of drift in the said prototype. This contributes to getting the prototype into a state which is closer to a full-fledged SLAM system. More specifically, optimization based on relative poses and optimization based on reprojection errors are considered. An additional challenge is that, prior to this work, the system had only been run and evaluated in demo settings, and the software had only been run on data without any ground truth for the estimated motion. Therefore, an effort was also put into creating a suitable method for quantitative evaluation, to measure the possible improvements.

The following section provides context for the work that has been done. This includes a brief description of visual odometry and visual SLAM and the particular implementation which the work aims to improve. Later sections describe the methods and implementations for global optimization and finally present and evaluate, both quantitatively and qualitatively, the achieved results.

2 Background

This section aims to give relevant background theory to understand the existing prototype and the proposed improvements.

2.1 Rotation representations

Multiple representations of 3D rotations are used in this work. This section describes them and relates them to each other.

2.1.1 Euler angles

One way to represent 3D rotations is with Euler angles, which represent the rotation as a composition of three rotations about the axes of a fixed coordinate system. There are multiple conventions for Euler angles, which differ in the order of the three rotations and the axes about which they are performed.

In this work, the so-called yaw-pitch-roll convention for Euler angles is used, which is consistent with what is used in the existing SLAM software. In this convention the rotations are applied in a z-y-x intrinsic order. This is related to the rotation matrix representation in the next section.

2.1.2 Rotation matrices

The orientation can also be described by a $3 \times 3$ rotation matrix $R \in SO(3)$, for which $R^T = R^{-1}$ and $\det R = 1$. This matrix transforms a three-dimensional vector without altering its length.

The yaw-pitch-roll convention used in this work can be realized by combining the three basic rotations around the x, y and z axes, whose matrices are here denoted $R_x$, $R_y$, $R_z$. The basic rotations are given by

$$R_x(\gamma) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{pmatrix}, \quad
R_y(\beta) = \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix}, \quad
R_z(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

and the yaw-pitch-roll orientation can be expressed with the rotation matrix

$$R(\alpha, \beta, \gamma) = R_z(\alpha) R_y(\beta) R_x(\gamma) .$$
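As a small illustration of the composition above, the sketch below builds $R(\alpha, \beta, \gamma) = R_z(\alpha)R_y(\beta)R_x(\gamma)$ with NumPy (the function names are illustrative, not from the prototype):

```python
import numpy as np

def rot_x(gamma):
    """Basic rotation about the x axis."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(beta):
    """Basic rotation about the y axis."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(alpha):
    """Basic rotation about the z axis."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_ypr(alpha, beta, gamma):
    """Yaw-pitch-roll rotation: R(alpha, beta, gamma) = Rz Ry Rx."""
    return rot_z(alpha) @ rot_y(beta) @ rot_x(gamma)
```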

2.1.3 Quaternions

A third way to represent orientation is by using quaternions, which are an extension to the complex numbers. A quaternion is a number on the form

q = qw+ qxi + qyj + qzk

where i, j and k are the fundamental quaternion units satisfying i2= j2= k2= ijk = −1 .

Quaternions can be denoted (qw, qx, qy, qz) where qw is called the scalar part of the quaternion and (qx, qy, qz) the vector part. The quaternion corresponding to a rotation by angle θ around axis a = (ax, ay, az) can be written

 cosθ

2, sinθ 2a



A 3D vector p = (px, py, pz) is rotated by a quaternion according to p0 = qqpq

where qpis the quaternion (0, px, py, pz). A rotation matrix can be derived from a quaternion by

R(q) =

(1 − 2q2y− 22z) 2(qxqy+ qwqz) 2(qxqz− qwqy) 2(qxqy− qwqz) (1 − 2q2x− 22z) 2(qyqz− qwqx) 2(qxqz− qwqy) 2(qyqz+ qwqx) (1 − 2qx2− 22y)

2.2 3D poses

A 3D pose refers to the position and orientation of a rigid body in a reference coordinate frame and consists of a rotation and a translation. The rotation and translation make up a transformation which transforms the representation of a point $\mathbf{x}_B$, expressed in the coordinate frame of the rigid body, to a representation of the same point $\mathbf{x}_{ref}$ expressed in the reference frame,

$$\mathbf{x}_{ref} = R\mathbf{x}_B + \mathbf{t}$$

where $R \in SO(3)$ is a rotation matrix and $\mathbf{t} \in \mathbb{R}^3$ is a translation vector.

Using homogeneous coordinates, such that points are expressed by a 4-element vector $\mathbf{x} = (x, y, z, 1)^T$, and a $4 \times 4$ transformation matrix $T \in SE(3)$, the same relation can be written as

$$\mathbf{x}_{ref} = T\mathbf{x}_B$$

where

$$T = \begin{pmatrix} R & \mathbf{t} \\ \mathbf{0} & 1 \end{pmatrix} .$$

In other words, T represents the pose of rigid body B relative to the reference frame.

Here, in most cases, the pose of interest is a camera pose. The reference frame is either a world frame or the coordinate frame of another camera.

The term relative pose is most often used to refer to the pose of a rigid body in one time instance relative to the pose of the same rigid body in a different time instance.

For example, if the pose of a camera, relative to the world frame W, is known in two time instances $k$ and $k+1$,

$$\mathbf{x}_W = T_k \mathbf{x}_k, \qquad \mathbf{x}_W = T_{k+1} \mathbf{x}_{k+1}$$

which implies

$$T_k \mathbf{x}_k = T_{k+1} \mathbf{x}_{k+1},$$

then

$$\mathbf{x}_k = T_k^{-1} T_{k+1} \mathbf{x}_{k+1} .$$

That is, the pose of the camera in time instance $k+1$, relative to the camera in time instance $k$, can be represented by $T_k^{-1} T_{k+1}$.
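In code, the relative pose $T_k^{-1} T_{k+1}$ can be computed directly from the homogeneous matrices. A minimal sketch, with illustrative helper names:

```python
import numpy as np

def inv_se3(T):
    """Invert a 4x4 rigid transform using R^T and -R^T t,
    avoiding a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def relative_pose(T_k, T_k1):
    """Pose of the camera at time k+1 relative to time k: T_k^{-1} T_{k+1}."""
    return inv_se3(T_k) @ T_k1
```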

2.3 Camera projection

Since the system is based on the use of cameras, it is appropriate to explain the underlying theory of camera projection and image formation. This is helpful in order to understand how the existing prototype works and also how the optimization procedures work. The theory also underlies the conversion of parameters needed to use an open dataset in the evaluation.

The pinhole camera model is one of the most common camera projection models and an illustration of it can be found in Figure 2.



Figure 2: The pinhole camera model.

A commonly used mathematical model is given by

$$\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} =
\begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} R & \mathbf{t} \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \tag{1}$$

and describes how a 3D point $(x, y, z)$, expressed in homogeneous coordinates as $(x, y, z, 1)^T$, is projected into a point $(u, v)$, expressed as $\lambda(u, v, 1)^T$, on the image plane. The parameters $c_x$ and $c_y$ are offset parameters which define where the principal axis intersects the image plane. Ideally, the intersection is in the center of the image plane, but due to not perfectly centered imaging sensors it is likely to be slightly off-centered. The parameters $f_x$, $f_y$ are the focal length parameters and take the distance from the principal point to the optical center, as well as the photoreceptor spacing, into account. The offset and focal length parameters together make up what is known as the intrinsic parameters of the modeled camera. The illustration in Figure 2 is simplified and shows the case for which the imaging sensor is perfectly centered.

It is common to extend the model by modelling added distortion. One common type of distortion to account for is radial distortion, which gives the effect of straight lines appearing bent in the image. The "true" image position $(x, y)$ can be modeled to be distorted to $(x', y')$ according to

$$x' = x L(r), \qquad y' = y L(r) \tag{2}$$

where $L(r)$ is the distortion factor, which is a function of the radial distance $r$ from the center of distortion. A common choice for the distortion factor, which is used in this work, is

$$L(r) = 1 + k_1 r^2 + k_2 r^4 .$$

The correction can be done using

$$\hat{x} = x_c + L(r)(x' - x_c), \qquad \hat{y} = y_c + L(r)(y' - y_c)$$

where $(x_c, y_c)$ is the center of distortion [7]. Here the center of distortion is assumed to coincide with the principal point.

The projection model (1) and the distortion model (2) can be utilized to predict where a 3D point in the world, expressed in the coordinate frame of the camera, will end up in the image. The parameters making up $(R\ \mathbf{t})$ in (1), which define the transformation from the world coordinate frame to the camera coordinate frame, are referred to as the extrinsic parameters of the camera. The resulting transformation is essentially the inverse of the pose of the camera relative to the world. This means that knowledge of the extrinsic parameters is equivalent to knowledge of the camera pose.

The intrinsic, extrinsic and distortion parameters are found through camera calibration, which is often done by capturing many images of a known object, finding corresponding points between the images and solving an optimization problem to find the parameters. The intrinsic, extrinsic and distortion parameters are together referred to as the calibration parameters. Since the quantitative and qualitative evaluations in Section 6 use data obtained from different hardware, different parameters are used in the two cases.
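The projection and distortion models can be sketched together in a few lines. This is a hedged sketch, not the prototype's exact model: applying $L(r)$ to the normalized image coordinates is one common convention and an assumption here:

```python
import numpy as np

def project(X_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Pinhole projection with radial distortion in the spirit of (1) and (2).
    X_cam is a 3D point already expressed in the camera frame."""
    x, y = X_cam[0] / X_cam[2], X_cam[1] / X_cam[2]  # normalized image plane
    r2 = x * x + y * y
    L = 1.0 + k1 * r2 + k2 * r2 * r2                 # distortion factor L(r)
    u = fx * x * L + cx
    v = fy * y * L + cy
    return np.array([u, v])
```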

2.4 Image point matching

The correspondence problem is the problem of finding corresponding image points in two images, originating from the same 3D point. One method of solving it is as follows: given an image point in one image, a small image patch, sometimes called a template, is created from the image point and its neighbouring image points. To find the corresponding point in the second image, the patch is shifted over a search area in the second image, computing the similarity between the patch and the image for each location of the patch. The matching image point is the one where the intensity difference is the smallest.

A simple similarity measure for this is the sum of absolute differences (SAD). Given an image $I_0(x, y)$ and an image patch $I_1(x, y)$, the SAD value at location $(u, v)$ in $I_0$ is given by

$$\mathrm{SAD}(u, v) = \sum_i |I_0(x_i + u, y_i + v) - I_1(x_i, y_i)|$$

where $i$ is an index over the pixels in the image patch.
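A minimal sketch of exhaustive SAD template matching over a rectangular search area follows; the function and parameter names are illustrative:

```python
import numpy as np

def sad_match(image, template, search_origin, search_size):
    """Find the top-left position in `image` with the smallest SAD
    against `template`, searching a window of `search_size` (rows, cols)
    starting at `search_origin`."""
    th, tw = template.shape
    r0, c0 = search_origin
    best_sad, best_pos = np.inf, None
    for r in range(r0, r0 + search_size[0] - th + 1):
        for c in range(c0, c0 + search_size[1] - tw + 1):
            patch = image[r:r + th, c:c + tw].astype(np.int32)
            sad = np.abs(patch - template.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_pos = sad, (r, c)
    return best_pos, best_sad
```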

2.5 Stereo vision

A description of the main ideas utilized in stereo vision is appropriate partly in order to understand the procedure resulting in the trajectory to be corrected, and partly to understand the methods used when correcting the trajectory.


A stereo vision system typically consists of a pair of time-synced cameras which are fixed relative to each other. Time-syncing means that a pair of images (one image from each camera) are captured at the frame rate of the camera system, such that the two images are from the same time instance. The fact that the cameras are fixed relative to each other can be utilized for simplifying point matching between the cameras and triangulation of 3D points. The geometry defined by the fixed relative position and orientation is known as the epipolar geometry.


Figure 3: Epipolar geometry.

The epipolar geometry implies that, given an image point in the left camera, the corresponding point will be found somewhere along a line in the image from the right camera. This line is the projection, onto the right image plane, of the ray defined by the projected 3D point and the center of projection of the left camera. This relationship is illustrated in Figure 3, where a 3D point $P_1$ is projected into the left image point $p_{l1}$. The corresponding right image point can be found along a line referred to as the epipolar line.

A transformation of the right and left images can be found such that the projection of a 3D point in one image ends up on the same y image coordinate in the second image. Transforming the images in this way is known as rectification.

The x coordinates of the two projections will be different and the difference is called disparity. That the projections end up in the same y coordinate means that the correspondence search is in one dimension instead of two and allows for quick feature matching. The feature matching problem is constrained further by the fact that a feature should only appear once in each image and the order of features should be the same. This allows for the creation of a dense disparity map, which ideally contains the disparity for each pixel in an image captured by the left camera.

Producing accurate disparity maps is a research topic on its own and is often referred to as stereo matching.

2.5.1 Triangulation

Triangulation is used in the underlying software and in the reprojection optimization to get information about where a feature seen in a stereo frame is located in 3D space. The idea in stereo triangulation is that matching image points can be used to infer the 3D point which was projected into the two images, resulting in the image points. The projected image points, together with the centers of projection, define two ray lines, which ideally intersect.

A line can be represented by an equation which relates a point $\mathbf{r} \in \mathbb{R}^3$ on the line to a scalar $\lambda$,

$$\mathbf{r}(\lambda) = \mathbf{p} + \lambda\mathbf{v} \tag{3}$$

where $\mathbf{p}$ is another point on the line and $\mathbf{v}$ is a vector specifying the direction of the line.

Denote two lines $\mathbf{r}_0(\lambda_0) = \mathbf{p}_0 + \lambda_0\mathbf{v}_0$ and $\mathbf{r}_1(\lambda_1) = \mathbf{p}_1 + \lambda_1\mathbf{v}_1$. If the rays intersect, the 3D intersection point can be retrieved by knowing that there exist $\lambda_0$ and $\lambda_1$ such that

$$\mathbf{r}_0(\lambda_0) = \mathbf{r}_1(\lambda_1), \qquad \mathbf{p}_0 + \lambda_0\mathbf{v}_0 = \mathbf{p}_1 + \lambda_1\mathbf{v}_1 .$$


Figure 4: Stereo triangulation.

Figure 4 shows an illustration of stereo triangulation where the ray lines intersect at the world point of interest.

Generally, however, the rays do not intersect, and the system of equations above then has no solution. Instead, the closest points between the non-intersecting lines are sought. That is, the scalars $\lambda_0$ and $\lambda_1$ that minimize the squared distance between the points on the lines defined by the scalars are sought:

$$\min_{\lambda_0, \lambda_1} \|\mathbf{r}_0(\lambda_0) - \mathbf{r}_1(\lambda_1)\|^2 \equiv \min_{\lambda_0, \lambda_1} \mathbf{d}^T\mathbf{d}$$

where

$$\mathbf{d} = \mathbf{r}_0(\lambda_0) - \mathbf{r}_1(\lambda_1) = (\mathbf{p}_0 + \lambda_0\mathbf{v}_0) - (\mathbf{p}_1 + \lambda_1\mathbf{v}_1) = \mathbf{p}_0 - \mathbf{p}_1 + \lambda_0\mathbf{v}_0 - \lambda_1\mathbf{v}_1 .$$

The partial derivatives with respect to $\lambda_0$ and $\lambda_1$ are

$$\frac{\partial\,\mathbf{d}^T\mathbf{d}}{\partial\lambda_0} = 2\left(\frac{\partial\mathbf{d}}{\partial\lambda_0}\right)^T\mathbf{d} = 2\mathbf{v}_0^T\mathbf{d}
\qquad\text{and}\qquad
\frac{\partial\,\mathbf{d}^T\mathbf{d}}{\partial\lambda_1} = 2\left(\frac{\partial\mathbf{d}}{\partial\lambda_1}\right)^T\mathbf{d} = -2\mathbf{v}_1^T\mathbf{d} .$$

Setting the derivatives to zero gives

$$\mathbf{v}_0^T\mathbf{d} = 0, \qquad \mathbf{v}_1^T\mathbf{d} = 0,$$

which is equivalent to both rays being perpendicular to the distance vector. This in turn is equivalent to

$$\mathbf{v}_0^T(\mathbf{p}_0 - \mathbf{p}_1) + \lambda_0\mathbf{v}_0^T\mathbf{v}_0 - \lambda_1\mathbf{v}_0^T\mathbf{v}_1 = 0$$
$$\mathbf{v}_1^T(\mathbf{p}_0 - \mathbf{p}_1) + \lambda_0\mathbf{v}_1^T\mathbf{v}_0 - \lambda_1\mathbf{v}_1^T\mathbf{v}_1 = 0 .$$

Introducing new notation for simplicity,

$$p = \mathbf{v}_0^T\mathbf{v}_1 = \mathbf{v}_1^T\mathbf{v}_0, \quad q = \mathbf{v}_0^T(\mathbf{p}_0 - \mathbf{p}_1), \quad r = \mathbf{v}_1^T(\mathbf{p}_0 - \mathbf{p}_1), \quad s = \mathbf{v}_0^T\mathbf{v}_0, \quad t = \mathbf{v}_1^T\mathbf{v}_1,$$

we can write

$$q + \lambda_0 s - \lambda_1 p = 0, \qquad r + \lambda_0 p - \lambda_1 t = 0$$

which gives

$$\lambda_0 = \frac{\lambda_1 p - q}{s}, \qquad \lambda_1 = \frac{\lambda_0 p + r}{t} .$$

Substitution gives

$$\lambda_0 = \frac{\lambda_0 p^2 + rp - qt}{st}, \qquad \lambda_1 = \frac{\lambda_1 p^2 - qp + rs}{st}$$

and finally

$$\lambda_0 = \frac{rp - qt}{st - p^2}, \qquad \lambda_1 = \frac{rs - qp}{st - p^2} . \tag{4}$$

Alternatively, we can write the distance vector by concatenating $\mathbf{v}_0$ and $-\mathbf{v}_1$ and introducing $\boldsymbol{\lambda} \equiv (\lambda_0, \lambda_1)^T$:

$$\mathbf{d} = \mathbf{p}_0 - \mathbf{p}_1 + \begin{pmatrix}\mathbf{v}_0 & -\mathbf{v}_1\end{pmatrix}\begin{pmatrix}\lambda_0 \\ \lambda_1\end{pmatrix} \equiv \mathbf{p}_0 - \mathbf{p}_1 + \mathbf{V}\boldsymbol{\lambda} .$$

Taking the derivative with respect to $\boldsymbol{\lambda}$,

$$\frac{\partial\,\mathbf{d}^T\mathbf{d}}{\partial\boldsymbol{\lambda}} = 2\left(\frac{\partial\mathbf{d}}{\partial\boldsymbol{\lambda}}\right)^T\mathbf{d} = 2\mathbf{V}^T\mathbf{d} .$$

Setting the derivative to zero gives

$$2\mathbf{V}^T(\mathbf{p}_0 - \mathbf{p}_1 + \mathbf{V}\boldsymbol{\lambda}) = 0$$

which is equivalent to

$$\boldsymbol{\lambda} = \left(\mathbf{V}^T\mathbf{V}\right)^{-1}\mathbf{V}^T(\mathbf{p}_1 - \mathbf{p}_0) . \tag{5}$$

When $\lambda_0$ and $\lambda_1$ have been found using either Equation (4) or (5), the points on the lines which minimize the distance can then be obtained using Equation (3).

One way to determine the triangulation point is to let it be the midpoint between the found points. However, in the triangulation procedure used in this work, the point found on the left ray line is used.
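The closed-form solution (4) translates directly into code. A minimal sketch, assuming rays given by a point and a direction in a common world frame:

```python
import numpy as np

def triangulate(p0, v0, p1, v1, use_left_ray=True):
    """Closest-point triangulation of two (generally skew) rays
    r_i(lambda_i) = p_i + lambda_i * v_i, implementing Equation (4)."""
    p = v0 @ v1
    q = v0 @ (p0 - p1)
    r = v1 @ (p0 - p1)
    s = v0 @ v0
    t = v1 @ v1
    denom = s * t - p * p          # zero only if the rays are parallel
    lam0 = (r * p - q * t) / denom
    lam1 = (r * s - q * p) / denom
    x0 = p0 + lam0 * v0            # closest point on the left ray
    x1 = p1 + lam1 * v1            # closest point on the right ray
    # The thesis uses the point on the left ray; the midpoint is an option.
    return x0 if use_left_ray else 0.5 * (x0 + x1)
```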

2.6 Visual SLAM overview

This section gives an overview of the different parts and problems of a typical visual SLAM system, such that the actual system being considered can be put into context. This helps in categorizing the system and understanding motivations for decisions being made, as well as the discussion at the end.

The general problem of simultaneously estimating motion and environment structure using internal sensors is known as Simultaneous Localization and Mapping (SLAM). The fact that information is inferred from internal sensors means the method is desirable in environments where external systems, such as the Global Positioning System (GPS), are not available. Examples of such situations include autonomous vacuum cleaners, vehicles operating on other planets and underwater vehicles.

In the past, the main focus was on SLAM methods based on laser sensors, but since the 2000s there have been major advancements in the field of visual SLAM. The term "visual" in visual odometry [11] and visual SLAM conveys that estimates are made from data obtained using RGB cameras. The main reason why cameras are an attractive sensor alternative is that they are cheaper and lighter than laser sensor systems. Notable works in this sub-field include MonoSLAM [4], LSD-SLAM [5] and ORB-SLAM [10]. The advancements can partly be attributed to increasing computational power, which increases the applicability of more computationally heavy algorithms.

One example of an application where visual odometry is used is the Mars rovers [9]. In this particular case the vision-based method has an advantage over performing odometry estimates using wheel rotation information, which results in incorrect estimates when the wheels slip.

The foundation of visual SLAM is a method for estimating motion from frames captured at two different time instances. Multiple estimations of motion add up to a global trajectory of the motion of the system the camera(s) are part of. Estimating motion using cameras in this way is commonly referred to as visual odometry and is explained briefly in the next section. The difference between visual odometry and visual SLAM is discussed in a later section.

Two main categories of visual SLAM are mono-based methods and stereo-based methods. In the stereo case, two cameras fixed relative to each other are used, while mono-based methods assume a single camera.

There are methods combining RGB data with Inertial Measurement Unit (IMU) data and depth data from infrared sensors, which are sometimes referred to as visual inertial SLAM. However, this work focuses on RGB-based data exclusively.

2.6.1 Motion estimation

The aim of motion estimation in visual SLAM is to estimate, for each new measurement, the position and orientation of the sensor system relative to its previously estimated pose. The relative pose is typically estimated between a pair of frames captured relatively close in time, using detected and matched features between the frames. Feature-based methods are sometimes also referred to as indirect methods. Another category of methods is referred to as direct methods. For these methods the motion is estimated by minimizing the photometric error between images [5] instead of looking at a sparse set of features.

There are three main methods within the indirect approach category:

• 2D-to-2D: estimation of motion from corresponding 2D image points

• 3D-to-3D: minimization of distance between corresponding 3D points

• 3D-to-2D: minimization of reprojection errors of corresponding 3D points

Individual motion estimates are concatenated to make up the estimated trajectory and to provide an estimate of the pose at the time instance of the last obtained measurement.

Regardless of the method used, there are uncertainties in the camera measurements, together with possible errors in the feature matching methods, that make the individual motion estimates noisy. Since the estimates are added up, the error is also added up, resulting in so-called drift of the estimated position and orientation.

2.6.2 Loop closure detection and optimization

If there were no drift, visual odometry could be relied on alone to obtain an accurate trajectory. However, in the situation described in the previous section, errors are continuously accumulated in the estimate of the position and orientation.

Visual SLAM aims to overcome this problem by using information additional to the odometry to improve the estimates. Whereas visual odometry is only concerned with the current and previous pose, visual SLAM uses information from earlier time instances as well.

The concept of detecting where the sensor system is, relative to the map, is sometimes referred to as place recognition. The part of SLAM known as loop closure detection and relocalization deals with the problem of automatic place recognition, by estimating the position and orientation relative to the estimated map. This can be used to evaluate the amount of drift and to correct the current estimate of the pose. A comparison of different methods for place recognition can be found in [17] and a popular method is explained in [12].

After the loop detection step, the pose relative to the map is typically estimated using a procedure similar to the one used when estimating the relative pose in the motion estimation.

Global optimization in SLAM is applied to not only correct the current estimate but also to correct previously estimated poses in the trajectory and the map. When performing global optimization after loop closure detection, it is often referred to as loop closing.

2.6.3 Visual SLAM vs odometry

With the previous section in mind, a distinction can be made between pure visual odometry methods and visual SLAM methods. Visual SLAM methods can be described as extensions of visual odometry methods, including loop closure detection and global optimization as additional features. The visual odometry part of the system may be referred to as the front-end, and the loop closure detection and optimization as belonging to the back-end of the system.

2.7 Global optimization

As global optimization is the main focus of this work, this section aims to describe it more completely than the brief description provided in previous sections.

The optimization being performed to obtain the odometry estimates can be seen as local optimization, where the focus is on motion estimation in a small time frame involving only the most recent poses. Global optimization can be seen in contrast to this local optimization. The intended meaning of global here is that the optimization is on a larger scale and involves a larger number of poses. The purpose of the optimization is to change the parameters of the poses in the trajectory to make the trajectory better. Whether or not the result is a better trajectory is discussed in Section 6.

Two main methods for global optimization are studied. One optimizes the relative pose error, which is commonly referred to as Pose Graph Optimization (PGO), and one optimizes the reprojection error of covisible points between estimated poses; in this work the latter will be referred to as reprojection optimization. With the assumptions stated in later sections, PGO can be seen as a nonlinear weighted least squares problem and the reprojection optimization as a nonlinear ordinary least squares problem.

The main case being considered is a single-loop situation where the stereo pair starts in a pose, moves along a trajectory which is estimated through odometry, and returns to a pose close to the initial one, such that the relative pose between the starting and ending poses is known or can be estimated (hopefully with a higher certainty than the relative pose given by the odometry estimates). Here, this relative pose is called the loop closure constraint or loop closure estimate.

Since both methods are based on non-linear least squares, the following section contains a general explanation of non-linear least squares. In later sections, connections between the general description and the specific optimization schemes are made.


2.7.1 Maximum likelihood and least squares

The optimization schemes used are based on non-linear least squares. This section provides a general explanation of the least squares problem, based on a maximum likelihood point of view. The ultimate goal in maximum likelihood estimation is to find the set of model parameters which make a set of observations the most likely.

Assume that the $i$:th observation $\mathbf{z}_i \in \mathbb{R}^D$ follows the model $\mathbf{x} \in \mathbb{R}^p$ but also includes random noise $\boldsymbol{\varepsilon}_i \in \mathbb{R}^D$:

$$\mathbf{z}_i = h_i(\mathbf{x}) + \boldsymbol{\varepsilon}_i$$

where the observation model $h_i : \mathbb{R}^p \to \mathbb{R}^D$ defines how observation $\mathbf{z}_i$ is related to the model $\mathbf{x}$. By assuming Gaussian distributed noise with zero mean,

$$\boldsymbol{\varepsilon}_i \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_i),$$

the equation for the multivariate Gaussian distribution (see Appendix A) gives the conditional probability of observation $i$:

$$p(\mathbf{z}_i \mid \mathbf{x}) \sim \mathcal{N}(h_i(\mathbf{x}), \boldsymbol{\Sigma}_i)
= \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{z}_i - h_i(\mathbf{x}))^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{z}_i - h_i(\mathbf{x}))\right)
\equiv \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left(-\frac{1}{2}\mathbf{e}_i(\mathbf{x})^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x})\right)$$

where $\mathbf{z}_i - h_i(\mathbf{x})$ has been denoted by $\mathbf{e}_i(\mathbf{x})$. What is sought is the model that maximizes the joint probability of the observations, which in turn are assumed to be independent, so

$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}}\, p(\mathbf{z} \mid \mathbf{x})
= \arg\max_{\mathbf{x}} \prod_i p(\mathbf{z}_i \mid \mathbf{x})
= \arg\max_{\mathbf{x}} \log\prod_i p(\mathbf{z}_i \mid \mathbf{x})
= \arg\max_{\mathbf{x}} \sum_i \log p(\mathbf{z}_i \mid \mathbf{x})$$
$$= \arg\max_{\mathbf{x}} \left(-\sum_i \mathbf{e}_i(\mathbf{x})^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x})\right)
= \arg\min_{\mathbf{x}} \sum_i \mathbf{e}_i(\mathbf{x})^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x}) .$$

Our cost function can then be written

$$f(\mathbf{x}) \equiv \sum_i \mathbf{e}_i(\mathbf{x})^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x}) = \sum_i f_i(\mathbf{x}) . \tag{6}$$


2.7.2 Solving non-linear least squares

If the observation model $h_i(\mathbf{x})$ is linear in the parameters $\mathbf{x}$, the problem is known as a linear least squares problem, and the optimization problem then has a closed-form solution.

In this work the observation model $h_i(\mathbf{x})$, and thus the measurement error $\mathbf{e}_i(\mathbf{x})$, is nonlinear in the parameters, and the optimization problem needs to be solved iteratively. Therefore $\mathbf{e}_i(\mathbf{x})$ is linearized around a point $\mathbf{x}_0$ in the model parameter space using a first-order Taylor expansion for vector fields:

$$\mathbf{e}_i(\mathbf{x}_0 + \Delta\mathbf{x}) \approx \mathbf{e}_i(\mathbf{x}_0) + \mathbf{J}_i\Delta\mathbf{x}$$

The Jacobian $\mathbf{J}_i$ represents how the measurement error $\mathbf{e}_i$ changes when the parameters of the model $\mathbf{x}$ are changed. If $\mathbf{x} \in \mathbb{R}^p$ and $\mathbf{e}_i \in \mathbb{R}^D$, where $p$ is the number of parameters in the model and $D$ is the dimension of an observation,

$$\mathbf{J}_i(\mathbf{x}) = \begin{pmatrix}
\frac{\partial e_1}{\partial x_1} & \cdots & \frac{\partial e_1}{\partial x_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial e_D}{\partial x_1} & \cdots & \frac{\partial e_D}{\partial x_p}
\end{pmatrix}_i \in \mathbb{R}^{D \times p}$$

and is in this work computed numerically. Using the properties of the transpose for matrices and scalars, each term in the sum of (6) can now be written

$$f_i(\mathbf{x}_0 + \Delta\mathbf{x}) = \mathbf{e}_i(\mathbf{x}_0 + \Delta\mathbf{x})^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x}_0 + \Delta\mathbf{x})
\approx (\mathbf{e}_i(\mathbf{x}_0) + \mathbf{J}_i\Delta\mathbf{x})^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{e}_i(\mathbf{x}_0) + \mathbf{J}_i\Delta\mathbf{x})$$
$$= \mathbf{e}_i(\mathbf{x}_0)^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x}_0) + \mathbf{e}_i(\mathbf{x}_0)^T\boldsymbol{\Sigma}_i^{-1}\mathbf{J}_i\Delta\mathbf{x} + \Delta\mathbf{x}^T\mathbf{J}_i^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x}_0) + \Delta\mathbf{x}^T\mathbf{J}_i^T\boldsymbol{\Sigma}_i^{-1}\mathbf{J}_i\Delta\mathbf{x}$$
$$= \underbrace{\mathbf{e}_i(\mathbf{x}_0)^T\boldsymbol{\Sigma}_i^{-1}\mathbf{e}_i(\mathbf{x}_0)}_{c_i}
+ 2\underbrace{\mathbf{e}_i(\mathbf{x}_0)^T\boldsymbol{\Sigma}_i^{-1}\mathbf{J}_i}_{\mathbf{b}_i^T}\Delta\mathbf{x}
+ \Delta\mathbf{x}^T\underbrace{\mathbf{J}_i^T\boldsymbol{\Sigma}_i^{-1}\mathbf{J}_i}_{\mathbf{H}_i}\Delta\mathbf{x}
= c_i + 2\mathbf{b}_i^T\Delta\mathbf{x} + \Delta\mathbf{x}^T\mathbf{H}_i\Delta\mathbf{x} .$$

The cost function can then be written

$$f(\mathbf{x}_0 + \Delta\mathbf{x}) \approx \sum_i \left(c_i + 2\mathbf{b}_i^T\Delta\mathbf{x} + \Delta\mathbf{x}^T\mathbf{H}_i\Delta\mathbf{x}\right)
= \sum_i c_i + \sum_i 2\mathbf{b}_i^T\Delta\mathbf{x} + \sum_i \Delta\mathbf{x}^T\mathbf{H}_i\Delta\mathbf{x}
\equiv c + 2\mathbf{b}^T\Delta\mathbf{x} + \Delta\mathbf{x}^T\mathbf{H}\Delta\mathbf{x} .$$

Taking the derivative

$$\frac{\mathrm{d}f(\mathbf{x}_0 + \Delta\mathbf{x})}{\mathrm{d}\Delta\mathbf{x}} = 2\mathbf{b} + 2\mathbf{H}\Delta\mathbf{x}$$

and setting it to zero gives the linearized system to be solved for $\Delta\mathbf{x}$:

$$\mathbf{H}\Delta\mathbf{x} = -\mathbf{b} \tag{7}$$

Choosing an initial solution for $\mathbf{x}$ and iteratively linearizing and solving for the increment in parameter space gives what is known as the Gauss-Newton method for solving non-linear least squares, where $\mathbf{H}$ in (7) corresponds to the approximate Hessian of the cost function. In each iteration, the linearized system is constructed using the Jacobian evaluated at the current solution, to find the parameter increment to be used as the solution in the next iteration. An alternative method, the Levenberg-Marquardt method, solves a damped version of this:

$$(\mathbf{H} + \lambda\mathbf{I})\Delta\mathbf{x} = -\mathbf{b}$$

This avoids the problem of rank-deficiency of the Jacobian, which can occur for the Gauss-Newton method and which can result in a parameter increment in a direction giving an increase, instead of a decrease, in the cost function. If the parameter $\lambda$ is large, the search direction approaches the direction of steepest descent, which gives slower but more stable convergence.
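The iteration can be sketched generically. This is a hedged, simplified sketch: unit information matrices are assumed, the Jacobian is computed numerically as in the thesis, and a fixed damping term stands in for the adaptive $\lambda$ of a full Levenberg-Marquardt implementation:

```python
import numpy as np

def gauss_newton(residual, x0, iters=20, damping=0.0, h=1e-6):
    """Minimize sum_i e_i(x)^T e_i(x). `residual` maps parameters x
    (shape (p,)) to the stacked error vector e(x) (shape (D,)).
    damping > 0 gives a Levenberg-Marquardt-style damped step."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        e = residual(x)
        # Numerical Jacobian by central differences.
        J = np.empty((e.size, x.size))
        for j in range(x.size):
            d = np.zeros_like(x)
            d[j] = h
            J[:, j] = (residual(x + d) - residual(x - d)) / (2 * h)
        H = J.T @ J + damping * np.eye(x.size)  # (damped) approximate Hessian
        b = J.T @ e
        dx = np.linalg.solve(H, -b)             # the system H dx = -b in (7)
        x += dx
        if np.linalg.norm(dx) < 1e-10:          # converged
            break
    return x
```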

2.7.3 Convergence for Gauss-Newton and Levenberg-Marquardt methods

The Gauss-Newton and Levenberg-Marquardt optimization methods can under certain conditions guarantee global convergence, meaning that a solution is found regardless of the starting point. However, since the optimization problems here are non-convex, the found solution may only be locally optimal. Thus, the initialization must be close to the optimal solution. The Gauss-Newton iterations may even fail to converge to a solution, due to the possible rank-deficiency problem. The damping in Levenberg-Marquardt helps against this but may be slower.

2.7.4 Solving the linearized system

The linearized system in (7) could be solved by inverting the coefficient matrix. However, matrix decompositions are typically used to avoid the matrix inversion, which is computationally expensive. Common matrix decompositions are the QR, SVD and Cholesky decompositions. In this work, the Cholesky decomposition is used.

Cholesky decomposition uses the fact that the approximated Hessian is symmetric and positive semi-definite to perform the decomposition $\mathbf{H} = \mathbf{L}\mathbf{L}^T$, where $\mathbf{L}$ is a lower triangular matrix. This allows a solution to the system to be obtained by first solving $\mathbf{L}\mathbf{y} = -\mathbf{b}$ for $\mathbf{y}$ and then solving $\mathbf{L}^T\Delta\mathbf{x} = \mathbf{y}$ for $\Delta\mathbf{x}$.

Additionally, if the coefficient matrix contains a large number of zeros it is said to be sparse, and the calculations can be streamlined further by avoiding unnecessary operations on zeros. This is known as sparse Cholesky decomposition.

In both methods being studied, each observation associates two poses and each pose is involved in a small number of observations. This means that the Jacobians, and thus also the approximated Hessian, will contain a large portion of zeros. This typical structure can be exploited to reduce computation time. In this work sparse Cholesky decomposition was used.
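A small dense illustration of the Cholesky-based solve of $\mathbf{H}\Delta\mathbf{x} = -\mathbf{b}$, using SciPy's `cho_factor`/`cho_solve` (the thesis uses a sparse solver; the random Jacobian here is only a stand-in):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
J = rng.standard_normal((40, 12))   # stand-in Jacobian
e = rng.standard_normal(40)         # stand-in stacked residual
H = J.T @ J                         # approximate Hessian (SPD here)
b = J.T @ e

factor = cho_factor(H)              # H = L L^T
dx = cho_solve(factor, -b)          # solve L y = -b, then L^T dx = y
assert np.allclose(H @ dx, -b)
```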


Optimization on manifolds

In general, the solution sought in optimization problems is a point in an n-dimensional Euclidean space $\mathbb{R}^n$. In this work, we are optimizing poses, which belong to the Special Euclidean group SE(3), a smooth manifold (a special kind of topological space). The solution for a single pose is a point on this 6-dimensional manifold, where the neighbourhood of each point on the manifold resembles 6D Euclidean space.

The procedures used here take this into account by using a Euclidean-space representation of the poses, here referred to as a parameter block, in the solution of the linear system for finding the increment. The increment is then mapped and applied using the manifold representation.

This essentially means that increments to a parameter block in $\mathbf{x}$ are specified in a minimal representation (a point in 6-dimensional Euclidean space) and converted to the matrix representation to compute the new parameter block. Finally, $\mathbf{x}$ is mapped back to Euclidean space for use in the linear system, such that a new increment can be found.

2.7.5 Pose parametrization

How the poses are parametrized affects the structure of the parameter blocks in the optimization. Section 2.2 describes a natural way to represent poses using a 4 × 4 matrix, which is essentially a composition of a 3 × 3 rotation matrix and a 3 × 1 translation vector. Instead of parametrizing the poses with 12 parameters, they can be parametrized using six parameters, since the translation and rotation have three degrees of freedom each. Parametrizing the rotation using Euler angles has the problem of gimbal lock, where one degree of freedom is lost under certain circumstances. In the framework used here, the rotation part of the pose is parametrized by the vector part of the quaternion representation of the rotation (see Section 2.1.3), which leads to a minimal representation of the rotation. This parametrization is used in the computation of Jacobians and when applying the increments in the iterative steps of the optimization.

Quaternions are determined by four elements $q_w$, $q_x$, $q_y$ and $q_z$, but knowing that the quaternion has unit length,

$$q_w = \sqrt{1 - \|\mathbf{q}_{xyz}\|^2} .$$

This way the full unit quaternion can be recovered. Using a rotation matrix directly in the parametrization would mean having to ensure orthonormality of the rotation matrix. Here, only the unit magnitude of the unit quaternion has to be ensured.

The translation part of the pose is simply represented by a vector $\mathbf{t} \in \mathbb{R}^3$. This means a pose can be parametrized by six parameters. Since we are seeking a set of poses, say $n$ poses, we seek the model $\mathbf{x} \in \mathbb{R}^{6n}$.
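A minimal sketch of this six-parameter block, recovering $q_w$ from the vector part and mapping the block to a matrix in SE(3). This illustrates one possible convention and is not necessarily the exact parametrization used by the framework in the thesis:

```python
import numpy as np

def unquat(q_xyz):
    """Recover the full unit quaternion from its vector part,
    assuming a non-negative scalar part q_w."""
    q_xyz = np.asarray(q_xyz, dtype=float)
    q_w = np.sqrt(max(0.0, 1.0 - q_xyz @ q_xyz))
    return np.array([q_w, *q_xyz])

def pose_from_block(block):
    """Map a 6-parameter block (t, q_xyz) to a 4x4 matrix in SE(3)."""
    t, (qw, qx, qy, qz) = block[:3], unquat(block[3:])
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qw*qz),     2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),     2*(qy*qz + qw*qx),     1 - 2*(qx*qx + qy*qy)],
    ])
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```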

3 Overview of the system

This section aims to provide a description and categorization of the specific SLAM system considered in this work. References are made to the previous sections.


The system being considered belongs to the stereo-based category and uses 3D correspondences for motion estimation. As there are no capabilities for loop closure detection or relocalization, as described in Section 2.6.2, it can be seen as a visual SLAM front-end missing a back-end, as described in Section 2.6.3. This means that it mainly performs visual odometry.

For each new stereo frame, a dense disparity map is computed. Random points are generated across the left image. Image points, for which no disparity information was obtained, are removed. For the remaining points, matching points are sought in the subsequent frame. This is done by assuming that the motion between the frames is small, such that the matching points can be found in a close neighbourhood to the generated points.

The kept points and their found matches are triangulated in both stereo frames to obtain two point clouds. The triangulation is done by finding one of the closest points between the lines defined by the image points in the left and right camera and the centers of projection. Estimating the relative pose between two stereo frames is done by aligning the two triangulated point clouds.

Aligning the point clouds is done by solving a minimization problem in which the distance between corresponding 3D points is minimized.

Assume we know the pose $T_k \in SE(3)$ of the stereo pair in frame $k$. If the points are expressed in the respective local coordinate frames, the relative pose between frames $k$ and $k+1$ can be written

$$\hat{T}_{k,k+1} = \arg\min_{T_{k,k+1}} \sum_i \left\| \mathbf{x}_i^k - T_{k,k+1}\mathbf{x}_i^{k+1} \right\|^2 .$$

The relative pose between frames $k$ and $k+1$ is then concatenated with the pose estimate of frame $k$ to get the pose estimate of frame $k+1$:

$$T_{k+1} = T_k \hat{T}_{k,k+1}$$

The minimization problem above can be seen as the Procrustes algorithm for finding rigid body motion. Making connections to Section 2.6.1, this way of estimating the motion belongs to the 3D-to-3D category of methods.
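The rigid Procrustes alignment has a standard SVD-based closed-form solution (the Kabsch algorithm). A minimal sketch, assuming corresponding points as rows of N x 3 arrays; this illustrates the technique rather than the prototype's exact implementation:

```python
import numpy as np

def procrustes_align(X_k, X_k1):
    """Estimate T_{k,k+1} minimizing sum_i ||x_i^k - T x_i^{k+1}||^2
    for corresponding 3D point clouds (rows of the N x 3 arrays)."""
    mu_k, mu_k1 = X_k.mean(axis=0), X_k1.mean(axis=0)
    A = (X_k - mu_k).T @ (X_k1 - mu_k1)             # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(A)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    R = U @ D @ Vt
    t = mu_k - R @ mu_k1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```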

The map consists of a set of stereo frames with their corresponding data. The data consists of the images from the stereo pair at the particular time instance, the computed disparity map and the pose estimate at that time instance. The disparity map and the color information from the images are used to generate a mesh which represents the reconstruction of the environment. The set of stereo frames that are saved to represent the map are referred to as the keyframes. A new keyframe is added to the map when the camera is estimated to have translated or rotated more than a threshold. The default thresholds are 70 cm and 5 degrees.

What this work deals with is correcting the trajectory of keyframes using global optimization. Only the pose parameters are changed in the optimization steps; the disparity maps are not altered.

In parallel to this work, a second thesis project at SAAB dealt with methods for place recognition and relocalization for the same system. In the quantitative evaluation of loop closure global optimization in this work, it is assumed that a perfect (in the sense of ground truth) relocalization and loop closure estimate is available, and focus is put on the global optimization after a loop closure. In the qualitative evaluation, the SLAM system has recorded data starting in a pose and ending in the same pose.


4 Implementation

4.1 Pose Graph Optimization

One of the methods being studied is what in the literature is referred to as Pose Graph Optimization (PGO). The reason for the name is that the optimization problem can be seen as a graph in which each node represents a pose with its corresponding parameters and each edge represents an observation between two poses. The goal of the optimization is then to find the best node configuration based on the edges. Although it is not necessary to view the optimization problem as a graph, doing so can help in understanding and visualizing its structure.

The optimization problem has the non-linear least squares form explained in Section 2.7.1. Here an observation is an estimated relative pose between two camera poses, and the observation model (in a non-minimal representation) for observation $i$ between pose A and pose B can be written

$$h_i(\mathbf{x}) = T_A^{-1} T_B$$

where $T_A$ and $T_B$ are the matrix representations of the respective poses, as explained in Section 2.2. The 6D minimal representation is used after and in between computations.

From a maximum likelihood point of view, what is sought in PGO is the set of poses which is most probable given the gathered observations. The observations partly consist of estimates obtained with the motion estimation method explained in Section 2.6.1. The observations used in PGO additionally include a loop closure estimate, as mentioned in Section 2.7.

Since an observation in this optimization is a relative pose, and the pose parametrization described in Section 2.7.5 is used, it follows that $\mathbf{z}_i, \mathbf{e}_i \in \mathbb{R}^6$.
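To make the edge error concrete, here is a hedged sketch of one way to evaluate a 6D residual for a relative-pose edge, expressing the rotation error by the vector part of the error quaternion. The exact error function and parametrization used by the framework in the thesis may differ:

```python
import numpy as np

def pgo_residual(T_a, T_b, Z_ab):
    """6D error of one relative-pose edge: the predicted relative pose
    h(x) = T_A^{-1} T_B compared with the observation Z_ab."""
    E = np.linalg.inv(Z_ab) @ np.linalg.inv(T_a) @ T_b  # identity if consistent
    R, t_err = E[:3, :3], E[:3, 3]
    qw = 0.5 * np.sqrt(max(1e-12, 1.0 + np.trace(R)))   # assumes trace(R) > -1
    q_xyz = np.array([R[2, 1] - R[1, 2],
                      R[0, 2] - R[2, 0],
                      R[1, 0] - R[0, 1]]) / (4.0 * qw)
    return np.concatenate([t_err, q_xyz])               # e_i in R^6
```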

4.2 Building the graph

The set of parameters being estimated, in both PGO and reprojection optimization, is the set of parameters which define the keyframe poses. In this work the initial solution used is the set of keyframe poses estimated by the existing visual odometry method. The initial solution could also include nodes that are part of the result of a previous optimization.

In the single-loop case, the only observation or edge which contributes to the cost function in the first iteration is the loop closure estimate, since the initial model parameters are derived from the visual odometry estimates. This means that the model will coincide with all observations except one. What effectively happens in the optimization is that the discrepancy between the model in the first iteration and the loop closure estimate is smoothed out across all relative poses, which gives the corrected trajectory.

The discrepancy between the trajectory model and the loop closure estimate represents the amount of drift that has been accumulated in the motion estimates.

Figure 5 shows a simplified graph representation of the Pose Graph Opti- mization.


Figure 5: Pose graph illustration. Nodes T0 through T4 are connected by odometry-estimate edges; a non-odometry estimate edge closes the loop.

Result: PGO graph
foreach new keyframe do
    add node for keyframe pose parameters;
    add edge representing odometry observation between two nodes;
    if non-odometry observation then
        add edge representing non-odometry observation between two nodes;
    end
end

Algorithm 1: PGO graph construction.

4.3 Reprojection error optimization

Another method considered is reprojection optimization. Here, instead of finding the set of keyframe poses which best fits a set of relative pose observations, the set of keyframe poses which best fits a set of image point observations is sought.

In the literature treating sparse reconstruction methods, bundle adjustment has been a popular method for reprojection-based global optimization [16]. In bundle adjustment, reprojection errors are optimized and the parameters to be optimized are the pose parameters as well as the sparse reconstructed points. Here we optimize reprojection errors, but the pose parameters are the only parameters in the optimization.

This optimization procedure is, like Pose Graph Optimization, a non-linear least squares problem with the same model but a different observation model. Here an observation between two keyframes is the observed image points of a common 3D point, resulting in matching points in the left images of the two keyframes. The observation model is given by the 3D point triangulated in the first keyframe, reprojected into the second keyframe:

$$h_i(\mathbf{x}) = \mathrm{reproject}_B(\mathrm{triangulate}_A(\mathbf{p}_i))$$

The set of observations used in this optimization is obtained by looking for image point matches between each pair of consecutive keyframes. Just as described in Section 3, random points are generated in the first keyframe. Point matches are sought in the second keyframe by using the estimated relative pose between the keyframes: image points in the first keyframe are triangulated, transformed according to the relative pose and projected into the second keyframe. This simulates the second keyframe seeing the scene from the same pose. Thereafter correlation, as explained in Section 2.4, is used to find point matches. The image coordinates in the original image can then be obtained by going backwards. A sketch of the resulting residual is given below.
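The observation model can be sketched as follows, under simplifying assumptions: ideal pinhole intrinsics K with no distortion, and the triangulated point summarized by its depth in the first keyframe (the real system triangulates from the disparity). The function and parameter names are illustrative:

```python
import numpy as np

def reprojection_residual(T_a, T_b, p_a, depth_a, p_obs_b, K):
    """2D error of one point observation between keyframes A and B:
    h(x) = reproject_B(triangulate_A(p))."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = p_a
    # Back-project the image point in A to a 3D point in A's camera frame.
    X_a = depth_a * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Express the point in B's camera frame using the two keyframe poses.
    X_b = (np.linalg.inv(T_b) @ T_a @ np.array([*X_a, 1.0]))[:3]
    # Reproject into B and compare with the observed match.
    pred = np.array([fx * X_b[0] / X_b[2] + cx,
                     fy * X_b[1] / X_b[2] + cy])
    return pred - np.asarray(p_obs_b)                   # e_i in R^2
```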

Figure 6: Illustration of the idea behind the reprojection optimization (keyframe poses $T_{k-2}$, $T_{k-1}$, $T_k$).

This optimization problem can also be seen as a graph where each node is a keyframe with its corresponding parameters. The edges, instead of being relative poses, are image point observations as described above. In the graph representation, keyframes are pairwise connected by the observations, and changing the parameters of one keyframe will affect the cost function contribution of the observations it is involved in. Figure 7 shows a simplified graph visualization of the optimization problem, where the edges represent all the observations corresponding to the point matches found between the keyframes. Note that, in contrast to Figure 5, the edges represent a variable number of point correspondence observations instead of single odometry estimate observations.

Since the same model structure is sought, the linear system being solved in each iteration is the same size as in PGO, but the larger number of observations means more computations are needed to construct the linear system in each step. Additionally, the more complex observation model, involving triangulation and reprojection of 3D points, further increases the computational cost compared to Pose Graph Optimization.

In this optimization we have the same model $\mathbf{x} \in \mathbb{R}^{6n}$, but the observations are expressed in pixel coordinates, and thus $\mathbf{z}_i, \mathbf{e}_i \in \mathbb{R}^2$.


One of the differences between this method and Pose Graph Optimization is that loop closure information is not needed in order to get something valuable out of the optimization. A valid optimization problem can be constructed from only a pair of keyframes, which would result in a possibly refined odometry estimate. However, if the goal is to remedy a gap in a 3D reconstruction, there need to be common points such that the graph is connected all the way around the loop in the trajectory.

Figure 7: Graph visualization for the reprojection optimization (keyframe nodes $T_0$ through $T_4$ connected by a variable number of point observations).

Result: Reprojection optimization graph
foreach new keyframe do
    add node for keyframe pose parameters;
    find point correspondences between new keyframe and last keyframe;
    foreach point correspondence do
        add edge for observation relating the two keyframe nodes;
    end
end

Algorithm 2: Reprojection optimization graph construction.

4.4 Gauge freedom

For both methods, the parameters of one pose are fixed in the optimization procedure to deal with the arbitrariness in the choice of coordinate system. This means that if the optimization graph is connected, the non-fixed parameters will be changed relative to the fixed parameters. If the parameters of one pose were not fixed, all parameters could be changed without affecting the cost function. Here, the parameters of the first keyframe pose in the trajectory are fixed.

4.5 Modeling the uncertainty

In the literature, the inverse of the covariance matrix, $\boldsymbol{\Sigma}^{-1}$ in (6), is referred to as the information matrix. How it is chosen for each added observation will affect the outcome of the optimization problem.

Since $\boldsymbol{\Sigma}$ corresponds to the covariance of the supposed error, it may be chosen using any available knowledge about the error and uncertainty in the observations used in the optimization. In the case of reprojection optimization, we simply assume that the image point observations are independent, and that the image coordinates within a particular point observation are independent, meaning all observations have the same uncertainty represented by a diagonal, scalar covariance matrix, in this case an identity matrix.

For the Pose Graph Optimization we have reason to believe that the uncertainty of some observations is larger than that of others. Each inter-frame motion estimation problem will be easier or harder depending on the conditions when the images were captured: illumination, motion blur and texture, for example. Also, and more importantly, there is a variable number of these noisy estimates in between keyframes, which means the error between keyframes is the result of error propagation over multiple inter-frame estimates.

For the linear relationship

$$\mathbf{y} = \mathbf{M}\mathbf{x} + \mathbf{b},$$

where $\mathbf{x}$ is a random vector, the covariance of $\mathbf{y}$ can be written

$$\boldsymbol{\Sigma}_y = \mathbf{M}\boldsymbol{\Sigma}_x\mathbf{M}^T .$$

In this work, the relation between the pose for one frame and the pose for the next frame is given by

$$T_{k+1} = T_k T_{k,k+1}$$

and the first-order Taylor approximation of the covariance is [14]

$$\boldsymbol{\Sigma}_{k+1} = \mathbf{J}_{T_k}\boldsymbol{\Sigma}_k\mathbf{J}_{T_k}^T + \mathbf{J}_{T_{k,k+1}}\boldsymbol{\Sigma}_{k,k+1}\mathbf{J}_{T_{k,k+1}}^T$$

where $\mathbf{J}_{T_k}$ and $\mathbf{J}_{T_{k,k+1}}$ are the Jacobians of $T_{k+1}$ with respect to $T_k$ and $T_{k,k+1}$, respectively.

Making full use of the first-order approximation would increase the computational complexity, and here only the fact that the uncertainty increases with each inter-frame estimation is used. The number of inter-frame estimates is used by weighting the covariance matrix according to the number of motion estimates that a relative keyframe pose estimate consists of. By doing this it is assumed that a keyframe estimate produced from a greater number of Procrustes estimates is more uncertain. That the diagonal elements of the covariance matrix get more weight means that these estimates are suppressed during optimization and contribute less to the cost function.
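A sketch of this simple weighting scheme follows; `sigma0` is an assumed base standard deviation per inter-frame step and the function name is illustrative:

```python
import numpy as np

def edge_information(num_interframe_estimates, sigma0=1.0):
    """Information matrix Sigma^{-1} for a relative keyframe-pose edge,
    scaled by the number of inter-frame Procrustes estimates the edge
    is composed of."""
    cov = (sigma0 ** 2) * num_interframe_estimates * np.eye(6)
    return np.linalg.inv(cov)   # more inter-frame steps -> less weight
```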

5 The g2o framework

The g2o (general graph optimization) framework [8] was used to implement the Pose Graph Optimization and the reprojection optimization. The optimization problem for the Pose Graph Optimization is built continuously as new keyframes are added. The g2o framework is written with the graph structure of the optimization problem in mind. As a new keyframe is added, a new graph node is added along with an edge corresponding to the estimate of the relative pose between the added keyframe and the previously added keyframe.

For the sequences obtained by starting and ending in the same pose, the loop closure constraint is between the first and last frames. For the sequences used for quantitative evaluation in this study (see Section 6.2.2), the loop closure constraint was added between the first and last frames which had corresponding ground truth data. The optimization problem for the reprojection optimization is created by going through the estimated trajectory of keyframes, finding common points between each subsequent pair of keyframes, and adding an edge for each common point observation.

A custom type of vertex and edge was implemented for the reprojection error optimization, to reflect what is described in Section 4.3.

As explained in the next section, optimization with both Gauss-Newton and Levenberg-Marquardt was evaluated. For this, the g2o classes OptimizationAlgorithmGaussNewton and OptimizationAlgorithmLevenberg were used.

6 Evaluation

There are a number of aspects in a visual SLAM system that might be of interest for evaluation.

• How "correct" is the estimated trajectory?

• How "correct" is the reconstruction produced?

• How quickly can the solution(s) be obtained? Can the system be run in real-time?

The first and second points are addressed here by looking at the improvements to the reconstructed map through quantitative and qualitative evaluation of the methods discussed. The third point is addressed with a brief discussion of the added computation time and by looking at convergence times.

The qualitative evaluation is done by inspection of the produced mesh of the environment, before and after correction of the trajectory. The data for this evaluation was obtained using the hardware for which the initial software prototype was developed.

The quantitative evaluation is done using an open dataset which provides ground truth trajectories to compare with.

6.1 Qualitative evaluation

The sequences used for qualitative evaluation were obtained using the intended hardware inside a garage. The sequences were captured such that the stereo rig starts and ends in the same pose, thus providing information to be used as the non-odometry loop closure constraint in the Pose Graph Optimization.

The qualitative evaluation criterion is, by inspecting the mesh, how well the reconstruction of the environment corresponds to the real world.

6.2 Quantitative evaluation

For evaluating possible improvements resulting from the optimization, stereo input and pose ground truth data were needed in order to get a quantitative measure of the improvement. There are not many dataset alternatives for evaluating visual 3D SLAM systems; mainly two datasets are used in the literature for stereo-based systems: KITTI [6] and EuRoC [2]. The KITTI dataset includes sequences in which data was obtained from a car driving in an urban environment.
