Visual-Inertial Odometry for Autonomous Ground Vehicles

AKSHAY KUMAR BURUSA

KTH ROYAL INSTITUTE OF TECHNOLOGY
STOCKHOLM, SWEDEN 2017

Visual-Inertial Odometry for Autonomous Ground Vehicles

Akshay Kumar Burusa
burusa@kth.se

Master's Thesis at the School of Computer Science and Communication

KTH Supervisor: John Folkesson
Scania Supervisor: Zhan Wang
Examiner: Patric Jensfelt

September 2017

Abstract

Monocular cameras are prominently used for estimating the motion of Unmanned Aerial Vehicles. With growing interest in autonomous vehicle technology, the use of monocular cameras in ground vehicles is on the rise. This is especially favorable for localization in situations where the Global Navigation Satellite System (GNSS) is unreliable, such as open-pit mining environments. However, most monocular-camera-based approaches suffer from obscure scale information. Ground vehicles pose a greater difficulty due to their high speeds and fast movements. This thesis aims to estimate the scale of monocular vision data by using an inertial sensor in addition to the camera. It is shown that the simultaneous estimation of pose and scale in autonomous ground vehicles is possible by the fusion of visual and inertial sensors in an Extended Kalman Filter (EKF) framework. However, the convergence of scale is sensitive to several factors, including the initialization error. An accurate estimation of scale allows an accurate estimation of pose. This facilitates the localization of ground vehicles in the absence of GNSS, providing a reliable fall-back option.

Sammanfattning

Monocular cameras are often used for motion estimation of unmanned aerial vehicles. With the increased interest in autonomous vehicles, the use of monocular cameras in ground vehicles has also grown. This is particularly advantageous in situations where satellite navigation (Global Navigation Satellite System, GNSS) is unreliable, for example in open-pit mines. Most systems that use monocular cameras have problems estimating scale. This estimation becomes even harder due to a ground vehicle's higher speeds and faster movements. The aim of this thesis is to estimate the scale of image data from a monocular camera by complementing it with data from inertial sensors. It is shown that simultaneous estimation of position and scale for a ground vehicle is possible through the fusion of image and inertial sensor data using an Extended Kalman Filter (EKF). The convergence of the estimate depends on several factors, including the initialization error. An accurate estimate of the scale also enables an accurate estimate of the position. This makes localization of ground vehicles possible in the absence of GNSS and thereby offers increased redundancy.

Acknowledgements

This thesis was completed at the department of Autonomous Transport Solution, REPF, at Scania CV AB.

I would like to thank Zhan Wang for providing me the opportunity to work on this thesis. I thank him for his continuous guidance and for giving me great insight into how to approach problems. I thank Lars Hjorth for supporting and encouraging me to complete the thesis. I thank Bengt Boberg for providing me the necessary materials that helped me achieve a greater understanding of the concepts. I also thank Patricio Valenzuela for all the interesting and significant discussions.

I would like to thank my supervisor at KTH, John Folkesson, for encouraging me and for all his suggestions, which kept me focused on the important problems.

Finally, I thank my dear family and amazing friends, who were very supportive of my decisions and helped me throughout my studies. This thesis would not have been possible without them.

Akshay Kumar Burusa Stockholm, September 2017

Contents

Abstract
Sammanfattning
Acknowledgements
Glossary

1 Introduction
1.1 Problem Statement
1.2 Objectives
1.3 Organization

2 Background
2.1 Quaternions
2.2 Sensors
2.2.1 Pose from Visual Sensor
2.2.2 Pose from IMU
2.3 Fusion of IMU and Vision
2.3.1 Bayesian Inference
2.3.2 Kalman Filter
2.3.3 Extended Kalman Filter
2.3.4 Error-State Kalman Filter

3 Related Work
3.1 Visual Odometry
3.1.1 Feature-based Approach
3.1.2 Direct Approach
3.1.3 Hybrid Approach
3.2 Visual-Inertial Odometry
3.2.1 Loosely-coupled
3.2.2 Tightly-coupled

4 Implementation
4.1 Coordinate Frames
4.2 Vector Notation
4.3 Visual-Inertial System
4.4 Extended Kalman Filter design
4.4.1 Continuous-time Nonlinear State Model
4.4.2 Discrete-time Nonlinear State Model
4.4.3 Prediction Step
4.4.4 Measurement Model
4.4.5 Update Step

5 Results
5.1 Malaga Dataset
5.2 Experimental Setup
5.2.1 Malaga Track 7 Results
5.2.2 Malaga Track 8 Results
5.3 Summary

6 Conclusion

Bibliography

Appendices
A Sustainability, Ethics and Social Impact
B Detailed Results
B.1 Malaga Track 7
B.2 Malaga Track 8

Glossary

Pose The combination of position and orientation of an object with respect to a coordinate frame.

Odometry The use of data from sensors to estimate the egomotion of an object over time. This estimation using cameras is termed visual odometry.

Scale The ratio between the size of the estimated vehicle trajectory and the size of the true trajectory.

Epipolar Geometry The projective geometry between two different camera views.

Photometric Error The error in pixel intensity between a pixel in one image and its corresponding projection in another image. Pixels are matched based on intensity.

Reprojection Error The geometric error between the 2D location of a feature in an image and the projection of the corresponding 3D point into that image.

1 Introduction

Experiments in developing autonomous vehicles started as early as the 1920s [15]. Almost a century later, autonomous vehicle technology has come a long way and is almost ready for large-scale commercial production.

In 1995, the NavLab 5 team at Carnegie Mellon University successfully developed an autonomous vehicle [21] which drove from Pittsburgh to San Diego with little human intervention. However, further development was required to make the vehicles safe and robust in all conditions. These vehicles are required to navigate through the complex environment of urban streets. A major contribution to autonomous vehicles came from the robotic technology developed for the DARPA challenge [28], an autonomous vehicle competition funded by the US Department of Defense that started in 2004.

Soon, all major automotive companies started adopting this technology, and today driverless vehicles have gained huge popularity, although concerns about safety remain.

Autonomous vehicle technology is not yet a solved problem, in the sense that such vehicles are not completely ready for the roads. They fail to handle environments such as crowded streets or bad weather conditions. Regarding the safety of human life, there are several ethical issues, such as the famous "Trolley problem" [27] and how an autonomous vehicle would react under such circumstances.

1.1 Problem Statement

Autonomous vehicles rely heavily on their sensors for information about the environment. One obvious drawback here is that the failure of one sensor can result in serious consequences. This is addressed by having multiple sensors that can perceive similar information. For example, obstacle detection can be achieved using cameras and ultrasonic sensors. The combined use of these sensors is highly desirable for higher accuracy and also to handle cases of sensor failure.

This thesis focuses on the problem of localization and odometry for autonomous ground vehicles, in particular autonomous trucks. Global Navigation Satellite Systems (GNSS) are most commonly used for this purpose, but GNSS is also prone to failure. It is very common for GNSS receivers to lose satellite signals in remote locations such as open-pit mining environments. A high-accuracy GNSS sensor is very expensive, and hence using multiple GNSS sensors is not practical. This implies that we require another set of sensors that can perform the same role. Lately, cameras have been used for localization and odometry in Unmanned Aerial Vehicles (UAVs).

The use of a monocular camera in ground vehicles introduces challenges such as fast movements and obscure scale information. This thesis investigates the possibility of simultaneously estimating the vehicle position and recovering the scale information by combining data from a monocular camera and an Inertial Measurement Unit (IMU).

1.2 Objectives

• Investigate a robust real-time visual odometry algorithm for autonomous trucks in an open-pit mining scenario.

• Investigate the possibility of accurate scale recovery with the fusion of monocular camera and IMU.

• Analyze the algorithm on real data that closely represents an open-pit mining scenario.

1.3 Organization

Chapter 2: Background An introduction to odometry using cameras and IMU is given, followed by an overview of probabilistic state estimation and sensor fusion techniques.

Chapter 3: Related Work This chapter focuses on the several methods available for sensor fusion, particularly in visual-inertial odometry. Some state-of-the-art works are introduced.


Chapter 4: Implementation An algorithm for visual-inertial odometry for autonomous vehicles is implemented using an Extended Kalman Filter (EKF) framework.

Chapters 5 and 6: Results and Conclusion The algorithm was tested on a publicly available dataset, and the estimation of scale for monocular visual data was analyzed.

2 Background

This chapter begins with an introduction to quaternions, which are widely used to represent the 3D orientation of autonomous vehicles. This is followed by techniques for pose estimation using visual and inertial data. Estimation theory is then introduced, leading up to the filter that is implemented. Details of the implementation follow in Chapter 4.

2.1 Quaternions

Quaternions are a set of hypercomplex numbers that are most prominently used in 3D mechanics. They are suitable for attitude representation in 3-dimensional space and are used to represent orientation in pose estimation.

There are several conventions for quaternions with their own set of rules and formulas. The Hamilton convention is followed in this thesis, as explained by Kuipers [14].

A quaternion is a 4-dimensional vector in the space of quaternions \mathbb{H} of the form

$$ q = q_w + q_x i + q_y j + q_z k \in \mathbb{H}, \qquad (2.1) $$

where the imaginary components i, j, k follow the rules

$$ i^2 = j^2 = k^2 = ijk = -1, \qquad (2.2) $$

$$ ij = -ji = k, \quad jk = -kj = i, \quad ki = -ik = j. \qquad (2.3) $$

A quaternion can also be represented as

$$ q = \begin{bmatrix} q_w \\ q_v \end{bmatrix} = \begin{bmatrix} q_w \\ q_x \\ q_y \\ q_z \end{bmatrix}, \qquad (2.4) $$

where q_w is the real or scalar part and q_v = q_x i + q_y j + q_z k is the imaginary or vector part of the quaternion.

Properties

• Sum

The quaternion sum is given by

$$ p \pm q = \begin{bmatrix} p_w \\ p_v \end{bmatrix} \pm \begin{bmatrix} q_w \\ q_v \end{bmatrix} = \begin{bmatrix} p_w \pm q_w \\ p_v \pm q_v \end{bmatrix}. \qquad (2.5) $$

The sum is commutative and associative:

$$ p \pm q = q \pm p, \qquad (2.6) $$

$$ p \pm (q \pm r) = (p \pm q) \pm r. \qquad (2.7) $$

• Product

The quaternion product is given by

$$ p \otimes q = \begin{bmatrix} p_w q_w - p_v^T q_v \\ p_w q_v + q_w p_v + p_v \times q_v \end{bmatrix}. \qquad (2.8) $$

The product is not commutative, but it is associative:

$$ p \otimes q \neq q \otimes p, \qquad (2.9) $$

$$ p \otimes (q \otimes r) = (p \otimes q) \otimes r. \qquad (2.10) $$

• Rotation matrix

The quaternion q_{BA} between two coordinate frames represents the rotation from frame A to frame B. The corresponding rotation matrix C_{q_{BA}} rotates a vector expressed in frame B into frame A, that is, x_A = C_{q_{BA}} x_B.

$$ C_q = \begin{bmatrix} q_w^2 + q_x^2 - q_y^2 - q_z^2 & 2(q_x q_y - q_w q_z) & 2(q_x q_z + q_w q_y) \\ 2(q_x q_y + q_w q_z) & q_w^2 - q_x^2 + q_y^2 - q_z^2 & 2(q_y q_z - q_w q_x) \\ 2(q_x q_z - q_w q_y) & 2(q_y q_z + q_w q_x) & q_w^2 - q_x^2 - q_y^2 + q_z^2 \end{bmatrix} \qquad (2.11) $$

• Quaternion product matrices

The matrices below give the matrix representation of the multiplication between a quaternion q and an angular velocity ω:

$$ \Omega(\omega) = \begin{bmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{bmatrix}, \qquad (2.12) $$

$$ \bar{\Omega}(q) = \begin{bmatrix} -q_x & -q_y & -q_z \\ q_w & -q_z & q_y \\ q_z & q_w & -q_x \\ -q_y & q_x & q_w \end{bmatrix}. \qquad (2.13) $$
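As a concrete illustration of the conventions above (not part of the thesis implementation), the following minimal numpy sketch implements the Hamilton product of Equation (2.8) and the rotation matrix of Equation (2.11). The function names and the [q_w, q_x, q_y, q_z] storage order are choices made for this example.

```python
import numpy as np

def quat_mult(p, q):
    """Hamilton product p (x) q for quaternions stored as [qw, qx, qy, qz], Eq. (2.8)."""
    pw, pv = p[0], p[1:]
    qw, qv = q[0], q[1:]
    return np.concatenate((
        [pw * qw - pv @ qv],                       # scalar part
        pw * qv + qw * pv + np.cross(pv, qv)))     # vector part

def quat_to_rot(q):
    """Rotation matrix C_q of Equation (2.11) from a unit quaternion."""
    qw, qx, qy, qz = q
    return np.array([
        [qw*qw + qx*qx - qy*qy - qz*qz, 2*(qx*qy - qw*qz),             2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),             qw*qw - qx*qx + qy*qy - qz*qz, 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),             2*(qy*qz + qw*qx),             qw*qw - qx*qx - qy*qy + qz*qz]])
```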

2.2 Sensors

Estimation of motion can be achieved by several sensors. The sensors that are of interest in this thesis are camera and Inertial Measurement Unit (IMU). The estimation of pose (position + orientation) using these sensors is discussed in this section.

2.2.1 Pose from Visual Sensor

A camera is an optical device that captures incident light rays through an optical system, while a light-sensitive electronic sensor stores this information in a digital format. This digital image is then processed by computer vision algorithms for applications such as motion estimation. Estimating the motion of a vehicle using visual cues from cameras attached to it is called Visual Odometry (VO).

Figure 2.1: General Pipeline for Incremental Structure from Motion.

Feature-based methods for visual odometry are a well-researched topic [7] [18] [20] [22] [25]. The general pipeline for these methods is shown in Figure 2.1 and explained below. It involves determining the epipolar geometry between two image frames, estimating the relative pose between the frames, followed by the optimization of the estimated pose using bundle adjustment. These steps require functions such as feature extraction and tracking, finding feature correspondences, and minimization of errors. Initially, these processes required heavy computation, which forced the optimization to be done offline. Klein et al. [11] proposed parallel threads for tracking and mapping, which made online optimization in a bundle adjustment framework possible. Most algorithms today continue to use this parallel approach.

Visual odometry is the estimation of camera motion from visual data and Structure from Motion (SfM) involves the simultaneous estimation of motion and reconstruction of a 3D environment from visual cues. These are briefly introduced here. More details can be found in the description by Cipolla [22].

STEP 1: Image Features

Image features are uniquely identifiable areas in the image that are associated with some 3D object in the world. They should be recognizable even as the scene changes. A feature detector extracts interesting features in the image, such as corners or textured areas. Feature descriptors are vectors that are useful to differentiate between two features, that is, they can be considered as numerical fingerprints.

Together, feature detectors and descriptors help in storing the most relevant and unique information about a scene. They are also important to solve the correspondence problem, that is, identifying the same set of features between multiple views of the same scene.
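For illustration only, the sketch below shows how feature detection, description, and matching are typically done with OpenCV's ORB detector and a brute-force Hamming matcher; the image file names are placeholders, and this is not the detector pipeline used in the thesis.

```python
import cv2

img1 = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)           # detector + binary descriptor
kp1, des1 = orb.detectAndCompute(img1, None)   # keypoints and descriptors, image 1
kp2, des2 = orb.detectAndCompute(img2, None)   # keypoints and descriptors, image 2

# Brute-force matching with Hamming distance, keeping only cross-checked matches
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```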

STEP 2: Epipolar Geometry

This is also known as two-view geometry, which defines the relation between a 3D point and its projection onto two different image frames, as illustrated in Figure 2.2.

The epipolar constraint can be obtained from the essential matrix E, which relates the corresponding projections in two image frames. Consider a point X with respect to the center of projection C; its location X' with respect to the center of projection C' is then

$$ X = R X' + t, \qquad (2.14) $$

Figure 2.2: When the projection of a 3D point X in one image is known to be x, the projection x' in the second image is restricted to the epipolar line m'. The 3D point X and the centers of projection C and C' lie in the same plane.

where R and t are the rotation and translation between the coordinates of C and C'. Pre-multiplying Equation (2.14) by X^T [t]_\times, where [t]_\times is the skew-symmetric matrix of t defined below, and noting that X^T [t]_\times X = 0 and [t]_\times t = 0, gives

$$ X^T [t]_\times R X' = 0. \qquad (2.15) $$

The essential matrix is therefore defined as E = [t]_\times R, which implies

$$ X^T E X' = 0. \qquad (2.16) $$

This also holds for the projected image points, which gives

$$ x^T E x' = 0. \qquad (2.17) $$

For a vector a = (a_x, a_y, a_z)^T, the skew-symmetric matrix [a]_\times is

$$ [a]_\times = \begin{bmatrix} 0 & -a_z & a_y \\ a_z & 0 & -a_x \\ -a_y & a_x & 0 \end{bmatrix}. $$

The essential matrix applies only to calibrated cameras. If the cameras are not calibrated, the intrinsic parameters of the cameras are required to determine the epipolar geometry. This is captured by the fundamental matrix F,

$$ F = K'^{-T} E K^{-1}, \qquad (2.18) $$

where K and K' are the camera calibration matrices.

Once the epipolar geometry is determined, it is possible to estimate the relative pose (rotation and translation) between two camera frames when a set of feature correspondences are known.

STEP 3: Motion Reconstruction

The relative pose between two frames can be obtained from the singular value decomposition of the essential matrix associated with the two frames, as proposed by Hartley [9],

$$ E = U \Sigma V^T, \qquad (2.19) $$

where U and V are orthogonal matrices and Σ is a rectangular diagonal matrix. The rotation R and translation t are given by

$$ R = U \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} V^T, \qquad (2.20) $$

$$ [t]_\times = U \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^T. \qquad (2.21) $$

The camera projection matrix, which describes the mapping of a camera from 3D world points to 2D image points, can be obtained from the rotation and translation. Taking the projection matrix P associated with the first frame to be

$$ P = K [\, I \mid 0 \,], \qquad (2.22) $$

the projection matrix P' associated with the second frame is

$$ P' = K' [\, R \mid t \,]. \qquad (2.23) $$

There are four possible solutions for P' due to the choice of signs for R and t. One way to determine the correct solution is to triangulate points with each of the four solutions and evaluate the reprojection error.
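As a hedged illustration of Steps 2 and 3 (not the thesis implementation), OpenCV can estimate the essential matrix from matched points and resolve the four-fold (R, t) ambiguity by checking which solution places the triangulated points in front of both cameras; pts1, pts2, and the calibration matrix K are assumed to be available from the previous steps.

```python
import numpy as np
import cv2

# pts1, pts2: Nx2 arrays of matched pixel coordinates; K: 3x3 calibration matrix (assumed known)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)

# recoverPose selects the valid (R, t) among the four decompositions of E
# by triangulating the inliers; t is returned only up to scale.
n_inliers, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
```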

STEP 4: Triangulation and Reprojection error

Triangulation refers to calculating the location of a 3D point in space using the projections of the point onto two or more images. In an ideal case, the 3D point will lie on the intersection of the back-projected rays. However, almost always these rays do not intersect at the true 3D point due to the presence of noise and measurement errors, as shown in Figure 2.3.

Figure 2.3: Best estimate of the triangulated 3D point X.

So we estimate the location of the 3D point that minimizes the reprojection error, that is, the squared error between all the estimated and measured image points. This is formulated as a minimization problem,

$$ X^* = \arg\min_{X} \sum_{i} \| x_i - \hat{x}_i(P_i, X) \|^2. \qquad (2.24) $$

At the end of this step, we will have a good estimate of the relative pose between two image frames. When this process is applied to two consecutive frames of a camera, we can estimate the motion of the camera.
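The following sketch (illustrative only) triangulates matched points with OpenCV and evaluates the reprojection error of Equation (2.24); P1, P2, pts1, and pts2 are assumed to come from the previous steps.

```python
import numpy as np
import cv2

# P1, P2: 3x4 projection matrices K[I|0] and K'[R|t]; pts1, pts2: Nx2 float arrays of matched points
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous 3D points
X = (X_h[:3] / X_h[3]).T                              # Nx3 Euclidean 3D points

def reprojection_error(P, X, x):
    """Mean squared distance between measured points x and the projections of X."""
    X_hom = np.hstack([X, np.ones((len(X), 1))])      # to homogeneous coordinates
    proj = (P @ X_hom.T).T
    proj = proj[:, :2] / proj[:, 2:3]                 # perspective division
    return np.mean(np.sum((proj - x) ** 2, axis=1))
```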

STEP 5: Bundle Adjustment

Bundle adjustment optimizes the 3D geometry, the relative pose and also the intrinsic camera parameters at the same time. The optimization formulation is based on minimizing the weighted sum of squared reprojection errors.

Suppose there are p 3D points observed in q different views, and let x_{ij} be the projection of the i-th point onto the j-th frame. The minimization is then

$$ \min_{X_i,\, P_j} \sum_{i=1}^{p} \sum_{j=1}^{q} b_{ij} \, \| x_{ij} - \hat{x}(P_j, X_i) \|^2, \qquad (2.25) $$

where b is a binary visibility matrix in which b_{ij} is 1 if point i is visible from frame j.

A local bundle adjustment applies the optimization to a set of co-visible frames, while the rest of the frames remain fixed. A global bundle adjustment optimizes all frames and 3D points, assuming only the initial frame to be fixed.
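To make the objective of Equation (2.25) concrete, the sketch below feeds a reprojection residual to a generic least-squares solver. For brevity it refines only the 3D points while keeping the camera matrices fixed; a full bundle adjustment would also stack the camera parameters into the optimization vector. This is an illustrative sketch, not the optimization used in the thesis.

```python
import numpy as np
from scipy.optimize import least_squares

def reproject(P, X):
    """Project Nx3 points X with a 3x4 projection matrix P."""
    X_h = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ X_h.T).T
    return x[:, :2] / x[:, 2:3]

def ba_residuals(params, n_pts, Ps, obs):
    """Residuals of Eq. (2.25) for structure-only refinement.
    obs is a list of (point_index, frame_index, measured_xy); Ps are held fixed."""
    X = params.reshape(n_pts, 3)
    res = []
    for i, j, xy in obs:
        res.append(reproject(Ps[j], X[i:i+1])[0] - xy)
    return np.concatenate(res)

# X0: initial Nx3 structure from triangulation, Ps: list of 3x4 camera matrices
# result = least_squares(ba_residuals, X0.ravel(), args=(len(X0), Ps, obs))
```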

2.2.2 Pose from IMU

An Inertial Measurement Unit (IMU) measures the linear acceleration, angular velocity, and magnetic field acting on a body. A combination of an accelerometer, a gyroscope, and a magnetometer is used for this purpose.

Accelerometer

An accelerometer is a sensor that measures linear accelerations. It has 3 degrees of freedom.

The measured accelerations are influenced by noise and bias. Suppose that a_t is the true acceleration, free of any external disturbance, and a_m is the measured acceleration from the sensor; then a_t and a_m are related by

$$ a_m = C_q^{-1}(a_t - g) + b_a - n_a, \qquad (2.26) $$

where b_a is the acceleration bias, n_a is the measurement noise modelled as Gaussian white noise, g is the vector corresponding to the acceleration due to gravity, and C_q is the rotation matrix that records how the body is oriented with respect to a stationary coordinate system.

The acceleration bias is not constant but varies with time. It is modelled as a random walk with bias noise n_{b_a}, modelled as zero-mean Gaussian white noise,

$$ \dot{b}_a = n_{b_a}. \qquad (2.27) $$

Gyroscope

A gyroscope is a sensor that measures angular velocities. It has 3 degrees of freedom.

Similar to the accelerometer, a gyroscope measurement is also influenced by noise and bias. Suppose that ω_t is the true angular velocity, free of any external disturbance, and ω_m is the measured angular velocity from the sensor; then ω_t and ω_m are related by

$$ \omega_m = \omega_t + b_\omega - n_\omega, \qquad (2.28) $$

where b_ω is the angular velocity bias and n_ω is the measurement noise modelled as zero-mean Gaussian white noise.

The angular velocity bias is not constant but varies with time. It is modelled as a random walk with bias noise n_{b_\omega}, modelled as Gaussian white noise,

$$ \dot{b}_\omega = n_{b_\omega}. \qquad (2.29) $$

Coordinate Frames

There are two different coordinate frames that need to be introduced when working with IMUs.

• Body frame (B) or IMU frame (I): The body coordinate frame is rigidly attached to the IMU. When an IMU is attached to a vehicle, this coor- dinate frame moves according to the motion of the vehicle.


• Navigation frame (N) or World frame (W): The navigation coordinate frame is fixed to the earth's surface. Usually, its origin is the location where the vehicle starts. The motion of the vehicle is registered with respect to this frame.

Navigation Equations

The motion of the IMU is modelled with a set of navigation equations. Consider the position of the IMU in the World frame p_w^i, the velocity of the IMU in the World frame v_w^i, the attitude quaternion describing the rotation from the World frame to the IMU frame q_w^i, and the accelerometer and gyroscope biases b_a and b_ω. The IMU kinematics and bias model are as follows,

$$ \dot{p}_w^i = v_w^i, \qquad (2.30) $$

$$ \dot{v}_w^i = C_{q_w^i}(a_m - b_a + n_a) - g, \qquad (2.31) $$

$$ \dot{q}_w^i = \tfrac{1}{2}\, \Omega(\omega_m - b_\omega + n_\omega)\, q_w^i, \qquad (2.32) $$

$$ \dot{b}_a = n_{b_a}, \qquad (2.33) $$

$$ \dot{b}_\omega = n_{b_\omega}, \qquad (2.34) $$

where C_{q_w^i} represents the rotation matrix that transforms a vector from the IMU (inertial) frame to the World frame, Ω(ω) is the quaternion product matrix defined in Section 2.1, and g is the vector corresponding to the acceleration due to gravity. The pose of the vehicle is represented by [p_w^i, q_w^i].

Integrating these differential equations allows us to estimate the pose from IMU measurements.
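The sketch below shows one explicit Euler integration step of Equations (2.30)-(2.32), reusing the quaternion helpers quat_mult() and quat_to_rot() sketched in Section 2.1. The noise terms are omitted and the variable names are illustrative, so this is only a rough picture of IMU dead reckoning, not the propagation model implemented in Chapter 4.

```python
import numpy as np

# Reuses quat_mult() and quat_to_rot() from the sketch in Section 2.1.

def imu_propagate(p, v, q, a_m, w_m, b_a, b_w, g, dt):
    """One explicit Euler step of Equations (2.30)-(2.32), noise terms omitted.
    q is a unit quaternion [qw, qx, qy, qz]; g is the gravity vector."""
    C = quat_to_rot(q)                           # IMU-to-world rotation, Eq. (2.11)
    a_world = C @ (a_m - b_a) - g                # bias-corrected acceleration in world frame
    p_new = p + v * dt                           # Eq. (2.30)
    v_new = v + a_world * dt                     # Eq. (2.31)
    w = w_m - b_w                                # bias-corrected angular rate
    dq = np.concatenate(([1.0], 0.5 * w * dt))   # small-angle quaternion increment
    q_new = quat_mult(q, dq)                     # Eq. (2.32) integrated over dt
    q_new /= np.linalg.norm(q_new)               # keep the quaternion normalized
    return p_new, v_new, q_new
```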

2.3 Fusion of IMU and Vision

The goal of sensor fusion is to estimate the state x of a system given observations z from different sensors and possibly a control sequence u acting on the system.

Generally, several sensors are used to gather information of interest about a system. These sensor observations are influenced by noise, and it is not straightforward to determine a model that accurately maps the observations to the system state. Moreover, the noise acting on the system and the uncertainty of the model are of a stochastic nature.

This situation is dealt with by formulating an approximate system model. The parameters of this model are estimated using observed data such that the noise and uncertainty in the system are minimized.

Consider a general discrete-time state-space system,

$$ x_k = f(x_{1:k-1}, u_{1:k-1}, w_{1:k-1}), \qquad (2.35) $$

$$ z_k = h(x_{1:k}, v_{1:k}). \qquad (2.36) $$

Here, x is the state vector describing the system, u is the control input, w is the noise related to the uncertainty in the system, z is the measurement vector, v is the measurement noise, and k is the time step.

2.3.1 Bayesian Inference

Bayesian inference is an important tool for statistical estimation and for integrating disparate information in an optimal way. The process takes the prior belief about the system state, combines it with the current observations, and determines an optimal belief regarding the current system state.

Mathematically, this is captured by the conditional probability density p(x_k | z_{1:k}), where x_k is the state at time k and z_{1:k} are all the observations made up to time k. The Markov property is assumed here.

The probability density of the current state given the previous observations depends on the probability associated with the transition from state x_{k-1} to state x_k and on the probability density of the previous state. This is the prior belief,

$$ p(x_k \mid z_{1:k-1}) = \int p(x_k \mid x_{k-1}) \, p(x_{k-1} \mid z_{1:k-1}) \, dx_{k-1}. \qquad (2.37) $$

The probability density of the current state after the current observation has been made depends on the measurement likelihood and the prior belief,

$$ p(x_k \mid z_{1:k}) = \frac{p(z_k \mid x_k) \, p(x_k \mid z_{1:k-1})}{p(z_k \mid z_{1:k-1})}. \qquad (2.38) $$

The term p(z_k | z_{1:k-1}) is determined by

$$ p(z_k \mid z_{1:k-1}) = \int p(z_k \mid x_k) \, p(x_k \mid z_{1:k-1}) \, dx_k. \qquad (2.39) $$

Equations (2.37) to (2.39) give us a recursive Bayesian estimator for the conditional probability density p(x_k | z_{1:k}).
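For a discrete (histogram) state, the recursion of Equations (2.37)-(2.39) reduces to a few matrix-vector operations; the generic sketch below is purely illustrative and not part of the thesis.

```python
import numpy as np

def bayes_update(prior, transition, likelihood):
    """One cycle of Equations (2.37)-(2.39) for a discrete state x with N values.
    prior: p(x_{k-1} | z_{1:k-1});  transition[i, j] = p(x_k = i | x_{k-1} = j);
    likelihood: p(z_k | x_k) evaluated for the current measurement."""
    predicted = transition @ prior          # Eq. (2.37): prediction (Chapman-Kolmogorov)
    posterior = likelihood * predicted      # numerator of Eq. (2.38)
    return posterior / posterior.sum()      # normalize by p(z_k | z_{1:k-1}), Eq. (2.39)
```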


2.3.2 Kalman Filter

The Kalman filter is a special case of a recursive Bayesian filter for multivariate normal distributions where the system is discrete-time linear and the noise parameters are assumed to be additive white Gaussian. The filter is named after Rudolf E. Kálmán who contributed towards its development.

The filter consists of two parts, prediction and correction. The prediction step uses the system equations to determine a prior estimate of the state at the current time step. The correction step updates the prior estimate based on observations made. The Kalman filter follows Bayesian inference and estimates the states based on the uncertainties in the system and measurement models.

Consider the system described by Equations (2.35) and (2.36). The linear discrete-time system can be described as,

$$ x_k = A_{k-1} x_{k-1} + B_{k-1} u_{k-1} + w_{k-1}, \qquad (2.40) $$

$$ z_k = H_k x_k + v_k, \qquad (2.41) $$

where A_k is the state transition model, B_k is the control input model, and H_k is the sensor observation model.

The system and measurement noises w_k and v_k are assumed to be additive white Gaussian. The uncertainty introduced by these noises is captured by the covariance matrices Q_k and R_k respectively, where

$$ Q_k = E[w_k w_k^T], \qquad (2.42) $$

$$ R_k = E[v_k v_k^T], \qquad (2.43) $$

and E[X] denotes the expected value of X.

The error covariance matrix P_k captures the uncertainty in the estimated state \hat{x}_k and is defined as

$$ P_k = E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]. \qquad (2.44) $$

The recursive algorithm is as follows,

• Prior estimation of the state:

An estimate of the state at time step k is determined using the system model, based on the state and control input at time k − 1,

$$ \hat{x}_{k|k-1} = A_{k-1} \hat{x}_{k-1|k-1} + B_{k-1} u_{k-1}. \qquad (2.45) $$

• Prior estimation of the error covariance:

The uncertainty of the state changes as the system evolves, due to model uncertainty and external noise. The error covariance is scaled and transformed according to the state transition matrix and the system noise covariance matrix,

$$ P_{k|k-1} = A_{k-1} P_{k-1|k-1} A_{k-1}^T + Q_{k-1}. \qquad (2.46) $$

• Predicted measurement:

The observed quantity can be estimated from the current system state. This is given by

$$ \hat{z}_k = H_k \hat{x}_{k|k-1}. \qquad (2.47) $$

The uncertainty related to this estimate is H_k P_k H_k^T.

• Refining the estimate with measurements:

There are two distributions: the predicted measurement with mean and covariance (H_k \hat{x}_k, H_k P_k H_k^T) and the observed measurement with mean and covariance (z_k, R_k). The best state estimate combines these two. Their combination gives the distribution

$$ \text{Mean:} \quad H_k \hat{x}_{k|k} = H_k \hat{x}_{k|k-1} + K_k' (z_k - H_k \hat{x}_{k|k-1}), \qquad (2.48) $$

$$ \text{Covariance:} \quad H_k P_{k|k} H_k^T = H_k P_{k|k-1} H_k^T - K_k' H_k P_{k|k-1} H_k^T, \qquad (2.49) $$

where K_k' depends on the predicted and observed measurement covariances and is given by K_k' = H_k P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}.

• Updating the state estimate:

From Equations (2.48) and (2.49), we can determine the new state estimate and the associated error covariance as

$$ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (z_k - H_k \hat{x}_{k|k-1}), \qquad (2.50) $$

$$ P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1}, \qquad (2.51) $$

$$ K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}, \qquad (2.52) $$

where K_k is the Kalman gain. The algorithm is then repeated with the updated values.


Summary

The Kalman filter algorithm can be summarized as follows.

Prediction step:

$$ \hat{x}_{k|k-1} = A_{k-1} \hat{x}_{k-1|k-1} + B_{k-1} u_{k-1}, \qquad (2.53) $$

$$ P_{k|k-1} = A_{k-1} P_{k-1|k-1} A_{k-1}^T + Q_{k-1}. \qquad (2.54) $$

Update step:

$$ \tilde{z}_k = z_k - H_k \hat{x}_{k|k-1}, \qquad (2.55) $$

$$ K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}, \qquad (2.56) $$

$$ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{z}_k, \qquad (2.57) $$

$$ P_{k|k} = (I - K_k H_k) P_{k|k-1} (I - K_k H_k)^T + K_k R_k K_k^T. \qquad (2.58) $$
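A minimal numpy sketch of this algorithm is given below for illustration; it follows Equations (2.53)-(2.58) directly, with the Joseph form used for the covariance update as in Equation (2.58).

```python
import numpy as np

def kf_predict(x, P, A, B, u, Q):
    """Prediction step, Equations (2.53)-(2.54)."""
    x = A @ x + B @ u
    P = A @ P @ A.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Update step, Equations (2.55)-(2.58)."""
    y = z - H @ x                            # innovation, Eq. (2.55)
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain, Eq. (2.56)
    x = x + K @ y                            # state update, Eq. (2.57)
    I_KH = np.eye(len(x)) - K @ H
    P = I_KH @ P @ I_KH.T + K @ R @ K.T      # Joseph-form covariance, Eq. (2.58)
    return x, P
```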

2.3.3 Extended Kalman Filter

An optimal solution is reached when all noise is additive white Gaussian with zero mean and the system is perfectly linear. Real-world systems are almost always non-linear and are better represented in the form

$$ x_k = f(x_{k-1}, u_{k-1}) + w_{k-1}, \qquad (2.59) $$

$$ z_k = h(x_k) + v_k. \qquad (2.60) $$

The general Kalman filter can be applied in this case if the system can be approximately linearized at every time step. The extended Kalman filter linearizes the system about an estimate of the current mean. Using the first-order Taylor expansion of f(.) and h(.) about the mean gives

$$ f(x_{k-1}, u_{k-1}) \approx f(\hat{x}_{k-1}, u_{k-1}) + F_{k-1} (x_{k-1} - \hat{x}_{k-1}), \qquad (2.61) $$

$$ h(x_k) \approx h(\hat{x}_k) + H_k (x_k - \hat{x}_k), \qquad (2.62) $$

where F_{k-1} = \partial f(x_{k-1}, u_{k-1}) / \partial x_{k-1} evaluated at \hat{x}_{k-1} and H_k = \partial h(x_k) / \partial x_k evaluated at \hat{x}_k are the Jacobians of the state and measurement models.

The prior estimate of the state is determined by the expected value of f(x_{k-1}, u_{k-1}) conditioned on z_{k-1},

$$ E[f(x_{k-1}, u_{k-1}) \mid z_{k-1}] \approx f(\hat{x}_{k-1}, u_{k-1}) + F_{k-1} E[e_{k-1} \mid z_{k-1}], \qquad (2.63) $$

where e_{k-1} = x_{k-1} - \hat{x}_{k-1}. Since E[e_{k-1} | z_{k-1}] = 0, the prior state estimate is

$$ \hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_{k-1}). \qquad (2.64) $$

The state error gives us

$$ e_k = x_k - \hat{x}_k = f(x_{k-1}, u_{k-1}) + w_{k-1} - f(\hat{x}_{k-1}, u_{k-1}) \approx F_{k-1} e_{k-1} + w_{k-1}. \qquad (2.65) $$

Using Equation (2.65), we can determine the prior state covariance as

$$ P_{k|k-1} = E[e_k e_k^T] = F_{k-1} E[e_{k-1} e_{k-1}^T] F_{k-1}^T + E[w_{k-1} w_{k-1}^T] = F_{k-1} P_{k-1|k-1} F_{k-1}^T + Q_{k-1}. \qquad (2.66) $$

Summary

The extended Kalman filter algorithm can be summarized as follows.

Prediction step:

$$ \hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_{k-1}), \qquad (2.67) $$

$$ P_{k|k-1} = F_{k-1} P_{k-1|k-1} F_{k-1}^T + Q_{k-1}. \qquad (2.68) $$

Update step:

$$ \tilde{z}_k = z_k - h(\hat{x}_{k|k-1}), \qquad (2.69) $$

$$ K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}, \qquad (2.70) $$

$$ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{z}_k, \qquad (2.71) $$

$$ P_{k|k} = (I - K_k H_k) P_{k|k-1} (I - K_k H_k)^T + K_k R_k K_k^T. \qquad (2.72) $$
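The sketch below illustrates one EKF cycle following Equations (2.67)-(2.72). The Jacobians are approximated by finite differences purely for compactness; in practice analytic Jacobians are usually preferred. This is an illustrative sketch, not the filter design used in this thesis.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x (for illustration only)."""
    fx = func(x)
    J = np.zeros((len(fx), len(x)))
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (func(x + dx) - fx) / eps
    return J

def ekf_step(x, P, u, z, f, h, Q, R):
    """One EKF cycle, Equations (2.67)-(2.72)."""
    F = numerical_jacobian(lambda s: f(s, u), x)
    x = f(x, u)                                # Eq. (2.67)
    P = F @ P @ F.T + Q                        # Eq. (2.68)
    H = numerical_jacobian(h, x)
    y = z - h(x)                               # Eq. (2.69)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)             # Eq. (2.70)
    x = x + K @ y                              # Eq. (2.71)
    I_KH = np.eye(len(x)) - K @ H
    P = I_KH @ P @ I_KH.T + K @ R @ K.T        # Eq. (2.72)
    return x, P
```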


2.3.4 Error-State Kalman Filter

In most Kalman filter implementations involving pose estimation, it is common to use the error-state Kalman filter formulation. This formulation is used to overcome the drawbacks of a full-state Kalman filter, as discussed by Roumeliotis et al. [23].

Most systems exhibit high-frequency non-linear dynamics that are best described only by a non-linear model. Accurate linearization of such a system is not possible. However, the error-state Kalman filter considers only the perturbations of the state, which can be linearly integrated, remain close to zero mean, and are suitable for linear Gaussian filtering. This leads to better accuracy in error-state Kalman filters.

Another motivation for choosing the error-state filter is the additional constraint on the quaternion used to represent the orientation. The quaternion is a four-element representation, whereas orientation in space has only three degrees of freedom. The fourth element depends on the other three elements through a constraint. This makes the corresponding covariance matrix singular and causes numerical instability [26].

The error-state is simply the arithmetic difference between the true state x and the estimated state \hat{x}, that is, \tilde{x} = x - \hat{x}. This rule does not apply to the quaternion, whose error-state is defined through a quaternion difference instead of an arithmetic one. Hence, an error quaternion δq is defined as

$$ \delta q = q \otimes \hat{q}^{-1} = \begin{bmatrix} \delta q_w \\ \delta q_v \end{bmatrix} \qquad (2.73) $$

$$ = \begin{bmatrix} \cos(\delta\theta/2) \\ k \sin(\delta\theta/2) \end{bmatrix} \qquad (2.74) $$

$$ \approx \begin{bmatrix} 1 \\ \tfrac{1}{2} k\,\delta\theta \end{bmatrix}, \qquad (2.75) $$

where q_v is the imaginary part and q_w is the real part of the quaternion q. The error quaternion is approximated by k δθ, a three-dimensional vector that represents the error angle about the rotation axis k.

The nominal state and the error-state are predicted in parallel, as shown by Equations (2.76) and (2.77). The nominal state ignores all noise in the system, while the error-state tracks the perturbations and noise. Note that the mean of the error-state is always reset to zero after each correction.

$$ \hat{x}_{k+1}^n = F_k^n(x, u) \, \hat{x}_k^n, \qquad (2.76) $$

$$ \delta\hat{x}_{k+1} = F_k^x(x, u, w) \, \delta\hat{x}_k. \qquad (2.77) $$

Corrections are made to the error-state using measurement data. The error-state mean is then added to the nominal state to determine the corrected state estimate,

$$ \hat{x}_k = \hat{x}_k^n + \delta\hat{x}_k, \qquad (2.78) $$

where \hat{x}_k is the state estimate, x_k^n is the nominal state and δx_k is the error-state. F_k^n and F_k^x are the Jacobians of the nominal and error-state models respectively.

Based on the discussion above, the error-state Kalman filter is chosen in this thesis for implementing the fusion of visual and inertial data.
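To illustrate the injection and reset logic implied by Equations (2.73)-(2.78), the sketch below applies a corrected error-state to a simplified nominal state of position, velocity, and orientation (biases omitted). It uses the error-quaternion definition δq = q ⊗ q̂⁻¹ from Equation (2.73) and the quat_mult() helper sketched in Section 2.1, and is only a schematic, not the filter designed in Chapter 4.

```python
import numpy as np

# Reuses quat_mult() from the sketch in Section 2.1.

def inject_error_state(p, v, q, dx):
    """Inject a corrected error-state dx = [dp, dv, dtheta] into a simplified
    nominal state (position, velocity, orientation), following Eq. (2.78).
    The orientation correction uses the small-angle error quaternion of Eq. (2.75)
    with the global error definition of Eq. (2.73), i.e. q = dq (x) q_hat."""
    dp, dv, dtheta = dx[0:3], dx[3:6], dx[6:9]
    p_new = p + dp
    v_new = v + dv
    dq = np.concatenate(([1.0], 0.5 * dtheta))   # [1, 0.5*k*dtheta], Eq. (2.75)
    q_new = quat_mult(dq, q)                     # error applied on the global side
    q_new /= np.linalg.norm(q_new)
    dx_reset = np.zeros_like(dx)                 # error-state mean is reset to zero
    return p_new, v_new, q_new, dx_reset
```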

3 Related Work

This chapter introduces several visual odometry and visual-inertial odometry techniques. The need for visual-inertial fusion is discussed and the strengths of these techniques are analyzed, which provides motivation for the implemented method.

3.1 Visual Odometry

Estimating the motion of a vehicle using visual cues from cameras attached to it is Visual Odometry (VO). It has applications in the field of robotics, and recently it has been used in autonomous vehicles for localization. We analyze three approaches to visual odometry.

3.1.1 Feature-based Approach

The feature-based method is a classical approach to the visual odometry problem. It relies on unique interest points in the image, as illustrated by Figure 3.1. These points are tracked and matched over consecutive images. The general pipeline for this was introduced in Section 2.2.1.

Figure 3.1: Feature extraction and tracking from an image frame. (Reprinted from ORB-SLAM [20]).

ORB-SLAM is a visual odometry and mapping algorithm. Mur-Artal et al. [18] [20] use Oriented FAST and Rotated BRIEF (ORB) features proposed by Rublee et al. [24]. ORB features are rotation invariant but not scale invariant. ORB-SLAM runs tracking, local mapping, and loop closing in separate threads. The tracking thread is responsible for feature matching between consecutive frames and for estimating the pose by minimizing the reprojection error of matched features. The mapping thread performs a local optimization of camera poses to achieve an optimal reconstruction of the surroundings. The loop closure thread constantly searches for loops, fuses duplicate data, and performs a pose graph optimization to achieve global consistency. The ORB-SLAM method is quite robust but loses tracking during pure rotations in the monocular case.

Another feature-based method is LibViso, proposed by Geiger et al. [7]. This method focuses on 3D reconstruction of the environment using stereo cameras. Features are extracted using blob and corner masks and are matched with a circular matching approach between consecutive stereo images. The camera motion is estimated by minimizing the reprojection error using the Gauss-Newton method.

3.1.2 Direct Approach

Feature-based methods involve a process of feature extraction and matching. The direct method suggests an approach that compares pixels directly, hence skipping the feature extraction step. Direct methods use pixel intensity information to estimate the motion between two frames, achieved through photometric error minimization such as the one used by Engel et al. [5]. This means that every pixel introduces a constraint in the optimization problem, leading to an accurate pose and also a dense map of the environment.

These methods work on the assumption that the projection of a point in both frames has the same intensity. This assumption often fails due to lighting changes, sensor noise, pose errors, and dynamic objects. Hence, direct methods require a high frame rate, which minimizes the intensity changes. Lighting changes can also be handled by normalizing the global intensity, as done by Gonçalves and Comport [8]. Another issue is the high computational cost due to the use of all pixels over all frames. This is addressed by only using pixels with sufficient information, that is, with high intensity gradient, as proposed by Engel et al. [5]. This produces a semi-dense map that mostly contains edges and corners, as illustrated by Figure 3.2.

Figure 3.2: Extraction of pixel intensity data at corners and edges. (Reprinted from LSD-SLAM [5]).

Concha et al. [2] proposed regularizing the image intensity error in addition to minimizing the photometric error. This enforces a smooth solution and aids in reconstructing low-gradient pixels. However, this method is based on the assumption that homogeneous color regions from a super-pixel are planar, which works decently only for indoor scenarios and handheld motions. Direct methods are also more sensitive to camera calibration than feature-based methods [4].

3.1.3 Hybrid Approach

A hybrid approach can be used to combine the benefits of feature-based and direct methods. Hybrid methods require moderate computation, can handle large inter-frame motion, and produce a semi-dense map. Forster et al. [6] use a 4x4 patch around each feature and estimate the camera pose by minimizing the photometric error of these feature patches. To refine the result, the reprojection error of each feature is determined with respect to the keyframe that observed the feature at roughly the same angle. This produces a reprojection residual which is minimized in order to estimate the camera pose.

A combined feature-based and direct method for stereo visual odometry was proposed by Krombach et al. [13]. This method initializes a keyframe using stereo correspondence. Small motions are then estimated using the feature-based method. When there is a fast or large motion, the direct method is used to assign a keyframe, which is optimized with stereo measurements as well as the depth propagated from the previous keyframe. The method focuses on combining the speed of feature-based methods with the dense map of direct methods to fill the voids caused by large motions.


3.2 Visual-Inertial Odometry

Visual odometry methods are well established; however, they suffer from some drawbacks.

• Loss of Tracking

Visual odometry performs best when the frame rate of the camera is high, which means there is sufficient overlap of features between two images. This is not always practical since it requires high computational speed. When there is a sudden movement of the camera, there is not enough overlap of features due to the low frame rate, which might cause loss of tracking.

• Scale Ambiguity

When using a single camera for visual odometry, the scale of the scene is unknown to the algorithm. Hence, it generates a result that is accurate but not scaled correctly, as shown by Figure 3.3. Scale here refers to the ratio between the size of the vehicle trajectory estimated from visual odometry and the size of the true trajectory; a small numerical sketch of this ratio is given below.

An IMU provides good odometry information for large sudden movements across a small time interval. The information from IMU is also scaled correctly.

These factors motivate the use of an Inertial Measurement Unit (IMU) along with the visual sensor. On the downside, IMU data is prone to noise and drift. The fusion of visual and inertial sensors is therefore a suitable choice.
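The sketch below illustrates the scale ratio referred to above (and in Figure 3.3) by comparing the path length of a monocular VO trajectory with that of the ground truth; this is a simple diagnostic for illustration, not the scale estimator developed in this thesis.

```python
import numpy as np

def path_length(traj):
    """Total length of a trajectory given as an Nx3 array of positions."""
    return np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))

def trajectory_scale(vo_traj, gt_traj):
    """Scale of a monocular VO trajectory relative to ground truth, as the ratio of
    path lengths; e.g. 0.5 means the VO estimate is half the true size (a 1:2 ratio)."""
    return path_length(vo_traj) / path_length(gt_traj)
```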

3.2.1 Loosely-coupled

A loosely-coupled approach for visual-inertial systems keeps the visual and inertial frameworks as independent entities. Weiss and Siegwart [31] proposed such a loosely-coupled implementation, where the metric scale estimation remains independent of the visual algorithm. Kelly and Sukhatme [10] implemented a loosely-coupled visual-inertial approach using an Unscented Kalman Filter (UKF), which is better suited to handling non-linearity in the system than the Extended Kalman Filter (EKF). Their solution self-calibrates the transformation between the camera and the IMU, while simultaneously localizing the body and mapping the environment. Lynen et al. [17] designed a multi-sensor fusion in an EKF framework with Micro Aerial Vehicle (MAV) data that is robust in long-term missions. All these filter-based implementations consider the visual odometry algorithm as a black box and only use the pose and covariance estimates provided by it. The general framework is shown in Figure 3.4.

Figure 3.3: The trajectory estimated by monocular visual odometry is accurate up to a scale. Here, the true scale 0.5 indicates that the visual odometry data is in the ratio 1:2 with the ground truth.

Figure 3.4: General framework for loosely-coupled visual-inertial odometry.

The loosely-coupled framework is efficient as it requires less computation, which is also beneficial for real-time operation. It is also easier to extend the system with additional sensors. However, it loses some accuracy because it neglects the cross-coupling between the visual and inertial parameters.

3.2.2 Tightly-coupled

The tightly-coupled approach for visual-inertial systems combines the visual and inertial parameters under a single optimization problem and their states are jointly estimated. Leutenegger et al. [16] proposed one such formulation,

$$ J(x) = \underbrace{\sum_{i=1}^{I} \sum_{k=1}^{K} \sum_{j \in \mathcal{J}(i,k)} e_r^{i,j,k\,T} W_r^{i,j,k} e_r^{i,j,k}}_{\text{visual}} + \underbrace{\sum_{k=1}^{K-1} e_s^{k\,T} W_s^{k} e_s^{k}}_{\text{inertial}}, \qquad (3.1) $$

where J(x) is the cost function containing the weighted reprojection errors e_r and the weighted temporal (inertial) errors e_s from the IMU; i is the camera index, k is the frame index, and j is the image feature index.

Figure 3.5: General framework for Tightly-coupled Visual Inertial odometry.

Such an approach can be formulated for direct visual odometry methods too, as proposed by Concha et al. [3]. The optimization function minimizes the IMU residual error, IMU pose error and photometric error. The optimization between two frames assumes the previous frame to be constant and adjusts the current frame. A similar energy function that considers IMU and Image errors is also proposed by Usenko et al. [29], which is used to estimate the camera pose and translational velocity along with IMU biases.


A tightly-coupled approach considers the correlation between the two sensors, leading to better accuracy. However, it is harder to debug and computationally more costly than the loosely-coupled method.

Choice of method

A loosely-coupled visual-inertial odometry framework is chosen for implementation since it is modular, flexible, and computationally less demanding. This framework requires a visual odometry algorithm and a sensor fusion algorithm.

For the visual odometry algorithm, the feature-based method is preferred over the direct method, since the direct method is computationally expensive and unable to handle large inter-frame motions. Ground vehicle applications involve high speeds, and it is crucial for the visual odometry algorithm to handle fast movements.

4 Implementation

In this chapter, the implementation of visual-inertial odometry using an error-state Kalman filter is discussed. The chapter begins with a brief description of the coordinate frames involved and goes on to detail the models for the inertial and visual sensor data.

4.1 Coordinate Frames

Figure 4.1: The different coordinate frames in the system (World, Vision, IMU, and Camera) and the transformations that relate them.
