
Linköping studies in science and technology. Thesis. No. 1500

Visual Inertial Navigation and Calibration

Martin A. Skoglund


Division of Automatic Control Department of Electrical Engineering Linköping University, SE-581 83 Linköping, Sweden

http://www.control.isy.liu.se ms@isy.liu.se


This is a Swedish Licentiate’s Thesis.

Swedish postgraduate education leads to a Doctor’s degree and/or a Licentiate’s degree. A Doctor’s Degree comprises 240 ECTS credits (4 years of full-time studies).

A Licentiate’s degree comprises 120 ECTS credits, of which at least 60 ECTS credits constitute a Licentiate’s thesis.

Linköping studies in science and technology. Thesis. No. 1500

Visual Inertial Navigation and Calibration Martin A. Skoglund

ms@isy.liu.se www.control.isy.liu.se Department of Electrical Engineering

Linköping University SE-581 83 Linköping

Sweden

ISBN 978-91-7393-126-7 ISSN 0280-7971 LiU-TEK-LIC-2011:39 Copyright © 2011 Martin A. Skoglund


Abstract

Processing and interpretation of visual content is essential to many systems and applications. This requires knowledge of how the content is sensed and also what is sensed. Such knowledge is captured in models which, depending on the application, can be very advanced or simple. An application example is scene reconstruction using a camera; if a suitable model of the camera is known, then a model of the scene can be estimated from images acquired at different, unknown, locations, yet, the quality of the scene model depends on the quality of the camera model. The opposite is to estimate the camera model and the unknown locations using a known scene model. In this work, two such problems are treated in two rather different applications.

There is an increasing need for navigation solutions less dependent on external navigation systems such as the Global Positioning System (GPS). Simultaneous Localisation and Mapping (slam) provides a solution to this by estimating both navigation states and some properties of the environment without considering any external navigation systems.

The first problem considers visual inertial navigation and mapping using a monocular camera and inertial measurements, which is a slam problem. Our aim is to provide improved estimates of the navigation states and a landmark map, given a slam solution. To do this, the measurements are fused in an Extended Kalman Filter (ekf) and then the filtered estimates are used as a starting solution in a nonlinear least-squares problem which is solved using the Gauss-Newton method. This approach is evaluated on experimental data with accurate ground truth for reference.

In Augmented Reality (ar), additional information is superimposed onto the surrounding environment in real time to reinforce our impressions. For this to be a pleasant experience it is necessary to have good models of the ar system and the environment.

The second problem considers calibration of an Optical See-Through Head Mounted Display system (osthmd), which is a wearable ar system. We show and motivate how the pinhole camera model can be used to represent the osthmd and the user's eye position. The pinhole camera model is estimated using the Direct Linear Transformation algorithm. Results are evaluated in experiments which also compare different data acquisition methods.


Populärvetenskaplig sammanfattning (Popular Science Summary)

Processing and interpretation of visual content is essential to many systems and applications. This requires knowledge of how the content is measured and also of what is measured. Such knowledge can be captured in models which, depending on the application, can be very advanced or simple. One application example is scene reconstruction with a camera; if a suitable model of the camera is known, then a model of the scene can be estimated from images taken from different, unknown, locations. However, the quality of the scene model depends on the quality of the camera model. The opposite is to estimate the camera model and the unknown locations using a known scene model. In this work, two such problems are treated in two different applications.

There is an increasing need for navigation solutions that are less dependent on external navigation systems such as the Global Positioning System (GPS). Simultaneous Localisation and Mapping (slam) offers a solution by estimating both the navigation states and some properties of the surrounding environment without any external navigation system.

The first problem concerns visual inertial navigation and mapping with a monocular camera and inertial sensors. Our aim is to provide improved estimates of the navigation states and a landmark map, given a solution from slam. To do this, the measurements are fused in an Extended Kalman Filter (ekf) and the filtered estimates are then used as an initial solution in a nonlinear least-squares problem, which in turn is solved with the Gauss-Newton method. This strategy is evaluated in experiments where a reliable reference is known.

In Augmented Reality (AR) applications, information is overlaid on the surrounding environment in real time to reinforce our impressions. For this to be a pleasant experience it is necessary to have good models of the AR system and the environment. The second problem concerns calibration of an Optical See-Through Head Mounted Display system (OSTHMD), which is a wearable AR system. We show and motivate why the pinhole camera model can be used to represent the OSTHMD and the user's eye position. The pinhole camera model is estimated using the so-called Direct Linear Transformation algorithm. The results are evaluated in experiments which also compare different data acquisition methods.


Acknowledgments

First of all I would like to thank my supervisor Prof. Fredrik Gustafsson for your 24/7 inspiring expert guidance, how do you manage? My co-supervisor Dr Thomas Schön, I admire your sound approach to research and your never ceasing positiveness. Dr David Törnqvist, thanks for sharing so much of your knowledge and time, and for finding the bargains that no one else can.

Thanks Prof. Lennart Ljung and Fredrik for inviting me to join the Automatic Control group, it’s been a great ride and I hope it will continue this way. Prof. Svante Gunnarsson heads the group, and Ninna Stensgår makes this (and many other things) a lot easier and for this I am truly grateful, thanks! The former head Lennart, well supported by Ulla Salaneck and later Åsa Karmelind, did a terrific job and created a pleasant environment!

There are many people at Automatic Control I would like to thank but I have only time to mention a few. In no particular order: Prof. Mikael Nörrlöf, sorry for killing Marvin, thanks for not killing me! Lic. Zoran Sjanic (The Zoran), it's always interesting and fun to collaborate with you and I really like your weird sense of "humour". Magnus Axholt, thanks for showing me the strange world of Augmented Reality, and also thanks for introducing me to all the nice people at ITN. Lic. Jeroen Hol, thanks for sharing your expertise, I like the interesting discussions we have when you visit Linköping. Dr Ragnar Wallin, this is for you: P?. Special thanks go to Lic. Jonas Callmer, Lic. Karl Granström and Lic. Christian Lundquist for the TAFWJP we spent together. I also want to thank the super hacker geniuses Dr Henrik Tidefelt and Dr Gustaf Hendeby for all the excellent LaTeX advice and Shapes support. To all the people I did not mention; remember, to me you are all equally unimporta..., eeeeh sorry, equally important!

I would also like to thank LINK-SIC Industry Excellence Centre for the financial support, thanks!

There are many people outside the Automatic Control group who mean a lot to me, but I will only mention a few since LiU-Tryck wants me to finish this now. My second family in Falkenberg; Lena, Kurt, Pär, Lotta, Anna, Bertil, Amanda and all the others, your hospitality is incredible and I truly enjoy when we meet! My brother Johan; I'm amazed how much fun we have together, let it stay this way! My father Bernt; our family has had some really tough years and you have always taken the biggest load, I cannot find words to thank you for it!

Finally; Maria, time flies when we are together and I enjoy every second of it! Thanks for being supportive, loving, caring and being you!

Solebo, WGS 84 − decimal : 58.329808 15.734846, May 2011 Martin Skoglund


Contents

Notation

I Background

1 Introduction
  1.1 Problem Formulation
    1.1.1 Visual Inertial Navigation and Mapping
    1.1.2 Optical See-Through Head Mounted Display Calibration
  1.2 Publications
  1.3 Contributions
  1.4 Thesis outline

2 Models
  2.1 Kinematics
    2.1.1 Notation
    2.1.2 Translational Kinematics
    2.1.3 Rotational Kinematics
    2.1.4 Quaternions
  2.2 Dynamics
    2.2.1 Translational Dynamics
    2.2.2 Rotational Dynamics
    2.2.3 Dynamic Models
  2.3 Inertial Measurements
    2.3.1 Gyroscopes
    2.3.2 Accelerometers
    2.3.3 Discrete-time Dynamic Models
  2.4 The Pinhole Camera Model
    2.4.1 Camera Coordinates
    2.4.2 World to Pixel Coordinates

3 Estimation
  3.1 Sensor Fusion
  3.2 Calibration
  3.3 Nonlinear Estimation
    3.3.1 Extended Kalman Filter
    3.3.2 Nonlinear Least-Squares

4 Visual Inertial Navigation and Mapping
  4.1 Visual and Inertial ekf-slam
    4.1.1 Prediction
    4.1.2 Measurement Update
  4.2 Nonlinear Least-Squares and slam
    4.2.1 Remarks

5 Optical See-Through Head Mounted Display Calibration
  5.1 System Overview
    5.1.1 Head Mounted Display
    5.1.2 Tracker
  5.2 Calibration
    5.2.1 Tracker Measurements
    5.2.2 Direct Linear Transformation
    5.2.3 Decomposition
    5.2.4 Remarks

6 Concluding Remarks
  6.1 Conclusions
  6.2 Future Work

Bibliography

II Publications

A A Nonlinear Least-Squares Approach to the SLAM Problem
  1 Introduction
  2 Problem Formulation
  3 Models
    3.1 Dynamics
    3.2 Landmark State Parametrisation
    3.3 Camera Measurements
  4 Solution
    4.1 Initialisation
    4.2 Nonlinear Least-Squares Smoothing
  5 Experiments
    5.1 Experimental Setup
    5.2 Results
  6 Conclusions and Future Work

B Parameter Estimation Variance of the Single Point Active Alignment Method in Optical See-Through Head Mounted Display Calibration
  1 Introduction
  2 Background and Related Work
    2.1 Structure of Transformations
    2.2 Camera Resectioning Method
    2.3 Sensitivity of the DLT Algorithm
    2.4 Previous SPAAM Evaluations
    2.5 Other OSTHMD Calibration Methods
  3 Experiment
    3.1 Hardware Setup
    3.2 Software Setup
    3.3 Participants
    3.4 Procedure
    3.5 Task
    3.6 Independent Variables
    3.7 Dependent Variables
  4 Experiment Results
    4.1 Eye Point Estimation
    4.2 Parameter Variance
    4.3 Condition Number
  5 Simulation
    5.1 Independent Variables
    5.2 Dependent Variables
  6 Simulation Results
    6.1 Condition Number
    6.2 Parameter Variance
  7 Conclusions
  8 Discussion and Future Work

Notation

Abbreviations

Abbreviation   Meaning
ar             Augmented Reality
dof            Degrees of Freedom
ekf            Extended Kalman Filter
hmd            Head Mounted Display
imu            Inertial Measurement Unit
kf             Kalman Filter
mems           Microelectromechanical Systems
ost            Optical See-Through
osthmd         Optical See-Through Head Mounted Display
pose           Position and Orientation
sift           Scale-Invariant Feature Transform
slam           Simultaneous Localisation and Mapping
vr             Virtual Reality


Part I


1 Introduction

Processing and interpretation of visual content is essential to many systems and applications. This requires knowledge of how the content is sensed and also what is sensed. Such knowledge is captured in models which, depending on the application, can be very advanced or simple. An application example is scene reconstruction using a camera; if a suitable model of the camera is known, then a model of the scene can be estimated from images acquired at different unknown locations, yet the quality of the scene model depends on the quality of the camera model. The opposite is to estimate the camera model and the unknown locations using a known scene model. The aim of this work is to approach two such problems in two rather different applications.

1.1 Problem Formulation

This thesis considers two problems which are seemingly unrelated but actually have many things in common. The first problem is visual inertial navigation and mapping using a camera and inertial measurements. The second problem is calibration of an Optical See-Through Head Mounted Display system (osthmd) modelled as a pinhole camera.

1.1.1 Visual Inertial Navigation and Mapping

Navigation is the problem of estimating the position, attitude and motion of an object with respect to an inertial frame. The problem is fundamental to many applications and reliable solutions are necessary. In aircraft applications the navigation is often provided using inertial sensors and a compass in combination with a Global Navigation Satellite System (gnss), such as the Global Positioning System (gps). In some environments gnss is unavailable, e.g., underwater, or unreliable due to multipath effects, (un)intentional jamming, among others. The mapping problem, on the other hand, seeks to describe properties of the environment, given data. The purpose of the Simultaneous Localisation and Mapping (slam) problem is to estimate both properties of the environment and the navigation. In this thesis, we consider the slam problem using inertial sensors and a monocular camera. To solve this problem, mathematical models which capture the most essential parts of the sensor measurements are needed for statistical inference. The slam problem can be solved using an Extended Kalman Filter (ekf) or similar methods, yet without any guarantee of convergence. We treat the slam estimates given by the ekf as initial parameters for a nonlinear least-squares problem which we solve using the Gauss-Newton method.

1.1.2 Optical See-Through Head Mounted Display Calibration

We use Augmented Reality (ar) to denote visual information overlaid, augmented, on human vision. Optical See-Through Head Mounted Displays (osthmd) are wearable ar systems, which means that the graphics displayed in the osthmd moves with the user. To augment the real world with graphics in such a system, three main problems need to be solved. First, the position and rotation of the osthmd with respect to the inertial frame need to be known, i.e., the navigation. Second, the calibration problem, which constitutes the static transformations of the relative position and rotation of the semitransparent graphics screen with respect to the user's eye and an inertial frame. Third, a model of the environment where the graphics should be superimposed must be found. In this thesis, the first two of the three problems are addressed. The first problem, navigation, is solved using a visual tracking system. The second problem, calibration, is solved using the navigation and user acquired measurements.

1.2 Publications

Published work of relevance to this thesis is listed below in chronological order.

M. Axholt, M. A. Skoglund, S. D. O'Connell, M. D. Cooper, S. R. Ellis, and A. Ynnerman. Parameter Estimation Variance of the Single Point Active Alignment Method in Optical See-Through Head Mounted Display Calibration. In Proceedings of the IEEE Virtual Reality Conference, pages 27-34, Singapore, Republic of Singapore, March 2011.

Background: The author was invited by Magnus Axholt to join his research in 2010. Magnus has great experience of various display systems and had already been working on this project for more than a year. This work reflects some labour-intensive engineering, where several months were spent on the ar system which consists of several hardware components interacting through several software layers. We adopt the theoretical framework for camera calibration founded in the computer vision and photogrammetry domains for osthmd calibration. The calibration problem is rather ill-posed since the measurements are few, and fairly noisy, compared to the parameter space.


Z. Sjanic, M. A. Skoglund, T. B. Schön, and F. Gustafsson. A Nonlinear Least-Squares Approach to the SLAM Problem. In Proceedings of the IFAC World Congress, Milan, August - September 2011. To appear.

Background: This work started as a mini project by the author and Zoran Sjanic in the graduate course Dynamic Vision, which was given by Thomas Schön in 2008, and some parts were also done within the graduate course Robot Modeling and Control, given by Mikael Norrlöf. The initial ideas actually emerged during the author's Master's Thesis project in 2008. Parts of this work have also been used in Nilsson (2010); Nilsson et al. (2011), where the sensor is a forward looking infrared (IR) camera attached to a car, and in Karlsson and Bjärkefur (2010), where the authors use a stereo camera and a laser camera in an indoor environment.

Published work not included in this thesis:

J. Callmer, M. Skoglund, and F. Gustafsson. Silent Localization of Underwater Sensors Using Magnetometers. EURASIP Journal on Advances in Signal Processing, 2010, 2010. doi: 10.1155/2010/709318. URL http://dx.doi.org/10.1155/2010/709318. Article ID 709318.

M. Axholt, M. Skoglund, S. D. Peterson, M. D. Cooper, T. B. Schön, F. Gustafsson, A. Ynnerman, and S. R. Ellis. Optical See-Through Head Mounted Display Direct Linear Transformation Calibration Robustness in the Presence of User Alignment Noise. In Proceedings of the 54th Annual Meeting of the Human Factors and Ergonomics Society, volume 54, pages 2427-2431, San Francisco, CA, USA, September - October 2010. Human Factors and Ergonomics Society.

1.3 Contributions

The main contributions in this thesis are listed below.

• In Sjanic et al. (2011) we provide a solution to the slam problem using a camera and inertial sensors. The individual contributions of the author and Zoran Sjanic are equal and consist of deriving the proposed solution, carrying out the experiments on which the algorithms were evaluated, and documenting the work.

• In Axholt et al. (2011) the Single Point Active Alignment Method (spaam) used for osthmd calibration is evaluated in both experiments and simulations. Previously, much effort has been spent on analysing the effects of the user's interaction, which is known as human factors, in ar systems. The individual contributions of the author to this work are parts of the experiment design, derivation and verification of mathematical models, and implementation in both software and hardware. Interdisciplinary research collaboration can be challenging but also very rewarding. The author believes an important contribution is due to this clash. It has led to a change of focus from the analysis of errors possibly introduced by humans to the analysis of different experiment designs and the tracker data. Most of the documentation was carried out by Magnus Axholt, to which the author only had a minor contribution.

1.4 Thesis outline

The thesis is divided into two parts, with edited versions of published papers in the second part. The first part starts with Chapter 2, which introduces kinematics, dynamics, sensors and model structures. Some estimation techniques relevant to this thesis are presented in Chapter 3. Chapter 4 describes some details not covered in Paper A and Chapter 5 describes the ar system and some details of the calibration procedure used in Paper B. The first part ends with Chapter 6, which presents conclusions and directions for future work. The second part contains an edited version of Sjanic et al. (2011) in Paper A and of Axholt et al. (2011) in Paper B. Paper A is about the slam problem using a camera and inertial sensors; a solution is provided using nonlinear least-squares and the results are evaluated on experimental data. Paper B is about osthmd calibration, where the data is collected using spaam and the calibration is solved using the Direct Linear Transformation algorithm. The parameter variance for different acquisition methods using spaam is evaluated in both experiments and simulations.


2 Models

In this chapter we outline the most important model structures which are used in the publications in Part II. The models describe rigid body kinematics and dynamics in three dimensions, the pinhole camera and inertial measurements.

2.1 Kinematics

Kinematics describes the motion of bodies without considering the forces causing the motion. Kinematic transformations between coordinate frames consist of length preserving translations and rotations, which is illustrated in Figure 2.1.

Figure 2.1: Rigid body transformations. (a) A rotation is a displacement of the axes while the origin is kept fixed. (b) A translation is a displacement of the origin while the axes are kept aligned.


2.1.1 Notation

Throughout the thesis, superscript lower case letters are used to denote the associated coordinate frame. Subscript lower case letters denote either a coordinate and/or a time instant. The common convention is to use bold font lower case letters to denote vectors and upper case letters to denote matrices. An example is

v^c_t = [v^c_{x,t}, v^c_{y,t}, v^c_{z,t}]^T = R^{cw} v^w_t,  (2.1)

where the time dependent vector v^w_t in coordinate frame w is expressed in the coordinate frame c by a matrix multiplication with R^{cw}.

2.1.2 Translational Kinematics

A translation is a displacement of the body origin c while the axes are kept aligned. The translation is a vector, cw, from the origin of the frame w to the coordinate c expressed in the w frame and it is also called the position.

2.1.3 Rotational Kinematics

Proper rotations are maps that preserve length and orientation, i.e., the handedness of the basis. These maps are linear, hence rotations can be represented by matrices. There are many other parametrisation alternatives to rotation matrices and perhaps the most common ones are Euler angles and the unit quaternion, see Section 2.1.4.

2.1 Definition (Special Orthogonal Group SO(3)). A matrix R ∈ R^{3×3} belongs to the special orthogonal group SO(3) if and only if

R^T R = I_3,  (2.2a)
\det R = 1.  (2.2b)

Figure 2.1a describes a rotation from the w frame to the c frame. This can be parametrised by the rotation matrix R^{cw}. The inverse rotation, using (2.2a), is (R^{cw})^{-1} = (R^{cw})^T, hence (R^{cw})^T = R^{wc}.

2.2 Example: Rotation

A vector in the a-frame, x^a ∈ R^3, is rotated to the b-frame by R^{ba} ∈ SO(3) according to

x^b = R^{ba} x^a,  (2.3)

and using (2.2a)

||x^b||^2 = ||R^{ba} x^a||^2  ⟺  (2.4a)
(x^b)^T x^b = (x^a)^T (R^{ba})^T R^{ba} x^a  ⟺  (2.4b)
||x^b||^2 = ||x^a||^2,  (2.4c)

hence, the length is preserved. Note that using (2.2b) we have

\det (R^{ba})^T = \det R^{ba}  ⟹  (\det R^{ba})^2 = 1,  (2.5)

and the positive root of (2.5) assures R^{ba} ∈ SO(3).

2.1.4 Quaternions

Quaternions were invented by Sir William Rowan Hamilton (Hamilton, 1844) as a tool to extend the imaginary numbers. The unit quaternion is widely used in aerospace industry as well as computer graphics since it allows a computationally efficient and singularity free rotation representation.

2.3 Definition (Quaternion). The quaternion is a four-tuple of real numbers q ∈ R^4. An alternative definition, as in Kuipers (2002, page 104), is to define a scalar part, q_0, and a real valued vector, \mathbf{q} = e_1 q_1 + e_2 q_2 + e_3 q_3, where e_1, e_2, e_3 is the standard orthonormal basis in R^3. With Kuipers' definition we have

q = q_0 + \mathbf{q},  (2.6)

which is a sum of a vector and a scalar, which is not defined in ordinary linear algebra. The full vector representation is then

q = [q_0, q_1, q_2, q_3]^T = \begin{bmatrix} q_0 \\ \mathbf{q} \end{bmatrix}.  (2.7)

Rotation using Quaternions

A quaternion can be used to represent a rotation in R^3, but in order to do so it must be constrained to the unit sphere, i.e., q^T q = 1, hence the name unit quaternion. Let

q^{ba} = \begin{bmatrix} \cos\delta \\ \sin\delta\, n \end{bmatrix},  (2.8)

which is a unit quaternion describing a rotation of an angle 2δ around the unit vector n ∈ R^3. Then a rotation using this unit quaternion is

\tilde{x}^b = q^{ba} ⊙ \tilde{x}^a ⊙ q^{ab},  (2.9)

where \tilde{x} = [0, x] are the vectors' quaternion equivalents and ⊙ denotes the quaternion multiplication, see e.g., Törnqvist (2008). The unit quaternion can also be used to construct a rotation matrix, see for instance Shuster (1993, page 462), where the superscript ab is omitted for the sake of readability,

R(q^{ab}) = \begin{bmatrix} q_0^2+q_1^2-q_2^2-q_3^2 & 2(q_1 q_2 + q_0 q_3) & 2(q_1 q_3 - q_0 q_2) \\ 2(q_1 q_2 - q_0 q_3) & q_0^2-q_1^2+q_2^2-q_3^2 & 2(q_2 q_3 + q_0 q_1) \\ 2(q_1 q_3 + q_0 q_2) & 2(q_2 q_3 - q_0 q_1) & q_0^2-q_1^2-q_2^2+q_3^2 \end{bmatrix}  (2.10)

= \begin{bmatrix} 2q_0^2+2q_1^2-1 & 2(q_1 q_2 + q_0 q_3) & 2(q_1 q_3 - q_0 q_2) \\ 2(q_1 q_2 - q_0 q_3) & 2q_0^2+2q_2^2-1 & 2(q_2 q_3 + q_0 q_1) \\ 2(q_1 q_3 + q_0 q_2) & 2(q_2 q_3 - q_0 q_1) & 2q_0^2+2q_3^2-1 \end{bmatrix},  (2.11)

where the last equality is established using q_0^2 + q_1^2 + q_2^2 + q_3^2 = 1. Throughout the thesis R(q^{ab}) and R^{ba} will be used interchangeably.
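As a concrete illustration of (2.10)-(2.11), the following Python sketch builds R(q) from a unit quaternion and checks the orthogonality properties (2.2a)-(2.2b). The function name and the example rotation are only illustrative; this is not code from the thesis.

import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix R(q) from a unit quaternion q = [q0, q1, q2, q3], cf. (2.10)-(2.11)."""
    q0, q1, q2, q3 = q / np.linalg.norm(q)   # enforce the unit norm constraint
    return np.array([
        [q0**2 + q1**2 - q2**2 - q3**2, 2*(q1*q2 + q0*q3),             2*(q1*q3 - q0*q2)],
        [2*(q1*q2 - q0*q3),             q0**2 - q1**2 + q2**2 - q3**2, 2*(q2*q3 + q0*q1)],
        [2*(q1*q3 + q0*q2),             2*(q2*q3 - q0*q1),             q0**2 - q1**2 - q2**2 + q3**2],
    ])

# A rotation of 2*delta = 90 degrees about the z-axis
delta = np.pi / 4
q = np.array([np.cos(delta), 0.0, 0.0, np.sin(delta)])
R = quat_to_rotmat(q)
assert np.allclose(R.T @ R, np.eye(3))      # (2.2a): R^T R = I
assert np.isclose(np.linalg.det(R), 1.0)    # (2.2b): det R = 1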

2.2 Dynamics

As rigid bodies are studied, a fixed coordinate frame can be assigned to the local body frame to reveal the time-evolution of the body with respect to another coordinate frame.

2.2.1 Translational Dynamics

A body, in the b frame, with mass will be influenced by the Earth's gravitational field. Assuming that the gravity component, g^e = [0, 0, -9.81]^T m/s^2, is constant, we have

\dot{v}^e_t = a^e_t - g^e = R^{eb} a^b_t - g^e,  (2.12)

meaning that the e-referenced acceleration depends on the orientation of the body. If there are errors in the rotation, the gravity subtraction will cause \dot{v}^e_t to deviate from the truth and this will result in velocity and position errors if (2.12) is used to resolve these quantities. This is commonly known as gravity leakage and it contributes to the inherent drift in inertial navigation systems.

2.2.2 Rotational Dynamics

The differential form of the unit quaternion is bilinear in the angular velocity and is given by

\dot{q}^{be}_t = \frac{1}{2} \underbrace{\begin{bmatrix} 0 & -\omega^b_x & -\omega^b_y & -\omega^b_z \\ \omega^b_x & 0 & \omega^b_z & -\omega^b_y \\ \omega^b_y & -\omega^b_z & 0 & \omega^b_x \\ \omega^b_z & \omega^b_y & -\omega^b_x & 0 \end{bmatrix}}_{S(\omega^b_t)} \begin{bmatrix} q^{be}_{0,t} \\ q^{be}_{1,t} \\ q^{be}_{2,t} \\ q^{be}_{3,t} \end{bmatrix} = \frac{1}{2} \underbrace{\begin{bmatrix} -q^{be}_{1,t} & -q^{be}_{2,t} & -q^{be}_{3,t} \\ q^{be}_{0,t} & -q^{be}_{3,t} & q^{be}_{2,t} \\ q^{be}_{3,t} & q^{be}_{0,t} & -q^{be}_{1,t} \\ -q^{be}_{2,t} & q^{be}_{1,t} & q^{be}_{0,t} \end{bmatrix}}_{\tilde{S}(q^{be}_t)} \begin{bmatrix} \omega^b_x \\ \omega^b_y \\ \omega^b_z \end{bmatrix},  (2.13)

where \omega^b_t is the instantaneous angular velocity of the moving body frame b relative to the fixed e frame, expressed in the b frame. For a complete derivation see Törnqvist (2008).


2.2.3 Dynamic Models

Assume that the acceleration, a^b_t, the gravity, g^e, and the angular velocity, \omega^b_t, are known. This leaves us a state vector resolved in the e frame,

x^e_t = [p^e_t, v^e_t, q^{be}_t]^T,  (2.14)

where p^e_t, v^e_t and q^{be}_t represent the position, velocity and the rotation (attitude) of the b frame with respect to the e frame, respectively. Obviously

\ddot{p}^e_t = \dot{v}^e_t = a^e_t,  (2.15)

and by using (2.13) and (2.12) the dynamic model becomes

\dot{x}^e_t = \begin{bmatrix} \dot{p}^e_t \\ \dot{v}^e_t \\ \dot{q}^{be}_t \end{bmatrix} = \begin{bmatrix} 0 & I & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \tfrac{1}{2} S(\omega^b_t) \end{bmatrix} \begin{bmatrix} p^e_t \\ v^e_t \\ q^{be}_t \end{bmatrix} + \begin{bmatrix} 0 \\ R(q^{be}_t)^T a^b_t - g^e \\ 0 \end{bmatrix}.  (2.16)

2.3 Inertial Measurements

Inertial sensors measure specific force and angular velocity and should ideally be modeled as states which are measured. However, to reduce the dimension of the state space, the inertial measurements are here considered to be inputs.

2.3.1 Gyroscopes

The output of the gyroscopes is

\omega^b_t = u^b_{\omega,t} + w_{\omega,t},  (2.17)

where u^b_{\omega,t} is used to stress that the angular velocity is considered to be an input. Furthermore, w_{\omega,t} is noise. Using (2.13) the quaternion dynamics becomes

\dot{q}^{be}_t = \frac{1}{2} S(u^b_{\omega,t} + w_{\omega,t}) q^{be}_t = \frac{1}{2} S(u^b_{\omega,t}) q^{be}_t + \frac{1}{2} \tilde{S}(q^{be}_t) w_{\omega,t}.  (2.18)

Due to unmodeled effects in the sensors, a bias, \delta_{\omega,t}, could be added to (2.17), giving

\omega^b_t = u^b_{\omega,t} + \delta_{\omega,t} + w_{\omega,t},  (2.19)

and \delta_{\omega,t} could also be included in the state vector.

2.3.2 Accelerometers

Contrary to what the name implies, an accelerometer does not only measure acceleration when used on earth, since it is subject to the gravitational field. Hence, accelerometers measure the specific force, which is a sum of the free acceleration and the gravity field. The accelerometer is also treated as a known signal and the output becomes

u^b_{a,t} = a^b_t - g^b + w^b_{a,t} = R^{be}(a^e_t - g^e + w^e_{a,t}),  (2.20)

where w^b_{a,t} is measurement noise. It should be clear that the measured specific force depends on the attitude R^{be}. If the noise is assumed to be zero mean Gaussian with the same variance in all components, (2.20) has the same statistical properties as

u^b_{a,t} = R^{be}(a^e_t - g^e) + w^b_{a,t},  (2.21)

i.e., the noise does not depend on the coordinate frame.

2.3.3 Discrete-time Dynamic Models

The sensors deliver sampled data and the dynamic model (2.16) therefore needs to be expressed in discrete time. This can be done in many ways. Assume that the angular velocity is constant over the sampling interval T and free from noise, then (2.18) integrates to

q^{be}_{t+1} = e^{\frac{T}{2} S(u^b_{\omega,t})} q^{be}_t,  (2.22)

and approximating the exponential results in

q^{be}_{t+1} ≈ \left(I + \frac{T}{2} S(u^b_{\omega,t})\right) q^{be}_t.  (2.23)

Assuming that the noise from (2.16) accounts for the exponential approximation as well as the true sensor noise, we get

q^{be}_{t+1} = q^{be}_t + \frac{T}{2} S(u^b_{\omega,t}) q^{be}_t + \frac{T}{2} \tilde{S}(q^{be}_t) w_{\omega,t}.  (2.24)

The discrete-time dynamic models for the position and velocity can be computed in a similar fashion. Assuming that \dot{a}^e_t = 0, from (2.15) we then have

\begin{bmatrix} \dot{p}^e_t \\ \dot{v}^e_t \\ \dot{a}^e_t \end{bmatrix} = \begin{bmatrix} 0 & I & 0 \\ 0 & 0 & I \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} p^e_t \\ v^e_t \\ a^e_t \end{bmatrix}.  (2.25)

Assuming that the noise input is constant during the sampling interval, the discrete-time translational dynamics is

\begin{bmatrix} p^e_{t+1} \\ v^e_{t+1} \\ a^e_{t+1} \end{bmatrix} = \begin{bmatrix} I & T I & \frac{T^2}{2} I \\ 0 & I & T I \\ 0 & 0 & I \end{bmatrix} \begin{bmatrix} p^e_t \\ v^e_t \\ a^e_t \end{bmatrix}.  (2.26)


Figure 2.2: Pinhole projection with the image plane placed in front of the camera centre.

Using (2.21), (2.24) and (2.26) we get the full discrete-time dynamic model

\begin{bmatrix} p^e_t \\ v^e_t \\ q^{be}_t \end{bmatrix} = \begin{bmatrix} I_3 & T I_3 & 0 \\ 0 & I_3 & 0 \\ 0 & 0 & I_4 \end{bmatrix} \begin{bmatrix} p^e_{t-1} \\ v^e_{t-1} \\ q^{be}_{t-1} \end{bmatrix} + \begin{bmatrix} \frac{T^2}{2} I_3 & 0 \\ T I_3 & 0 \\ 0 & \frac{T}{2} I_4 \end{bmatrix} \begin{bmatrix} R(q^{be}_{t-1})^T u^b_{a,t} + g^e \\ S(u^b_{\omega,t}) q^{be}_{t-1} \end{bmatrix} + \begin{bmatrix} \frac{T^2}{2} I_3 & 0 \\ T I_3 & 0 \\ 0 & \frac{T}{2} \tilde{S}(q^{be}_{t-1}) \end{bmatrix} \begin{bmatrix} w^b_{a,t} \\ w_{\omega,t} \end{bmatrix},  (2.27)

which is also to be found in Section 3.1 in Paper A as (5) but is included here for completeness. It is also worth noting that this model integrates the errors in the sensor measurements directly, whereas if the sensor readings are modeled as measurements we would get low-pass filtering, as pointed out in Callmer (2011).
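The discrete-time model above can be iterated directly for dead reckoning. The Python sketch below illustrates the noise-free part of (2.27), with the accelerometer and gyroscope readings treated as inputs; the sampling time, the function names and the renormalisation step are illustrative assumptions, not the implementation used in Paper A.

import numpy as np

def quat_to_R(q):
    """R(q) as in (2.11)."""
    q0, q1, q2, q3 = q
    return np.array([
        [2*q0**2 + 2*q1**2 - 1, 2*(q1*q2 + q0*q3),     2*(q1*q3 - q0*q2)],
        [2*(q1*q2 - q0*q3),     2*q0**2 + 2*q2**2 - 1, 2*(q2*q3 + q0*q1)],
        [2*(q1*q3 + q0*q2),     2*(q2*q3 - q0*q1),     2*q0**2 + 2*q3**2 - 1]])

def S(w):
    """4x4 matrix S(omega) from (2.13)."""
    wx, wy, wz = w
    return np.array([[0, -wx, -wy, -wz],
                     [wx,  0,  wz, -wy],
                     [wy, -wz,  0,  wx],
                     [wz,  wy, -wx,  0]], dtype=float)

def propagate(p, v, q, u_a, u_w, T=0.01, g=np.array([0.0, 0.0, -9.81])):
    """One noise-free step of (2.27): accelerometer u_a and gyroscope u_w as inputs."""
    a = quat_to_R(q).T @ u_a + g              # e-frame acceleration, cf. (2.21)
    p_new = p + T * v + 0.5 * T**2 * a
    v_new = v + T * a
    q_new = q + 0.5 * T * (S(u_w) @ q)        # first-order quaternion update, cf. (2.23)
    return p_new, v_new, q_new / np.linalg.norm(q_new)

Renormalising the quaternion after every step keeps it on the unit sphere, in line with the discussion in Section 4.1.2.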

2.4 The Pinhole Camera Model

The pinhole camera is a mathematical model describing how points in R^3 relate to points in R^2 on the image plane through a central projection. The model only contains five intrinsic parameters and for most systems it is an approximation since it does not model illumination, contrast, lenses and lens distortion, among others. Additionally, the position and orientation (pose) of the camera centre with respect to some reference frame is also needed if the camera is only a part of a larger system of interest. This adds another 6 degrees of freedom (dof) to the pinhole camera model.

2.4.1 Camera Coordinates

Figure 2.2 illustrates a frontal pinhole projection. In the pinhole camera model the coordinate frames are:


• c - Camera frame with the optical centre as the origin where its z-axis coincides with the optical axis, also known as the principal axis.

• w - World frame.

• i - Image frame. The image plane is orthogonal to the optical axis and has the principal point h as its origin. The distance between the c and the i frame is called the focal length f .

• n - Normalised image frame.

• p - Pixel frame with center in the upper left corner as viewed from the optical centre c.

A point m^c = [x^c, y^c, z^c]^T in the c-frame relates to a point m^n = [x^n, y^n]^T in the n-frame as

\begin{bmatrix} x^n \\ y^n \end{bmatrix} = \frac{f}{z^c} \begin{bmatrix} x^c \\ y^c \end{bmatrix}.  (2.28)

It is convenient to express (2.28) as a linear system, which can be done by appending unit elements to the vectors, giving the homogeneous coordinate representation

\begin{bmatrix} x^n \\ y^n \\ 1 \end{bmatrix} \propto z^c \begin{bmatrix} x^n \\ y^n \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{\Pi_0} \begin{bmatrix} x^c \\ y^c \\ z^c \\ 1 \end{bmatrix}.  (2.29)

With focal length f = 1 the normalised image frame n and the image frame i coincide. The projection (2.29) can be expressed as

\lambda m^n = \Pi_0 m^c,  (2.30)

where \lambda ∈ R^+ is an arbitrary scale factor.

Intrinsic Parameters

The digitalisation of the image plane i is the pixel plane p,

\begin{bmatrix} x^p \\ y^p \end{bmatrix} = \begin{bmatrix} f_x & s_\alpha \\ 0 & f_y \end{bmatrix} \begin{bmatrix} x^i \\ y^i \end{bmatrix} + \begin{bmatrix} h_x \\ h_y \end{bmatrix},  (2.31)

or in homogeneous coordinates

\begin{bmatrix} x^p \\ y^p \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f_x & s_\alpha & h_x \\ 0 & f_y & h_y \\ 0 & 0 & 1 \end{bmatrix}}_{K} \begin{bmatrix} x^i \\ y^i \\ 1 \end{bmatrix},  (2.32)

where K is referred to as the intrinsic parameter matrix or simply the calibration matrix. The parameters of the calibration matrix are the focal lengths in pixel units, f_x = s_x/f and f_y = s_y/f, where f is the focal length expressed in meters per pixel and s_x, s_y are the pixel sizes in metric units. The center of the image plane, h = [h_x, h_y]^T, is the principal point coordinate, and s_\alpha = f_x \tan\alpha is a skew parameter. In most cameras it is reasonable to assume s_\alpha ≈ 0 (Hartley and Zisserman, 2004).

Figure 2.3: Camera to world transformation.

Distortion

Most cameras suffer from non-negligible distortion due to the optical lenses involved. This can be compensated for by tangential and radial distortion models (Ma et al., 2004; Zhang, 2000), which are provided in camera calibration software, see e.g., Bouguet (2010).

Extrinsic Parameters

Points in another frame, say w, can be expressed in the camera frame c. In the pinhole camera model, such a transformation is called extrinsic since it does not depend on the intrinsic camera calibration matrix K. Figure 2.3 describes the relation between a point m^w in world coordinates w expressed in camera coordinates c,

m^c = R^{cw}(m^w - c^w) = R^{cw} m^w - R^{cw} c^w,  (2.33)

which can be written in homogeneous coordinates as

\begin{bmatrix} x^c \\ y^c \\ z^c \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} R^{cw} & -R^{cw} c^w \\ 0 & 1 \end{bmatrix}}_{T^{cw}} \begin{bmatrix} x^w \\ y^w \\ z^w \\ 1 \end{bmatrix}.  (2.34)


2.4.2 World to Pixel Coordinates

Combining the extrinsic and intrinsic parameters, coordinates in the world frame can be expressed in pixel coordinates as

\begin{bmatrix} x^p \\ y^p \\ 1 \end{bmatrix} \propto z^c \begin{bmatrix} x^p \\ y^p \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f_x & s_\alpha & h_x \\ 0 & f_y & h_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R^{cw} & -R^{cw} c^w \\ 0 & 1 \end{bmatrix}}_{P} \begin{bmatrix} x^w \\ y^w \\ z^w \\ 1 \end{bmatrix},  (2.35)

where P is called the camera matrix or projection matrix. In compact notation (2.35) is

\lambda x^p = P x^w.  (2.36)
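To make the world-to-pixel mapping concrete, a small Python sketch of (2.35)-(2.36) with zero skew is given below; the intrinsic values and the camera pose are made-up placeholders, not parameters from the thesis experiments.

import numpy as np

def project(m_w, K, R_cw, c_w):
    """Project a world point m_w (3,) to pixel coordinates with P = K [R_cw | -R_cw c_w]."""
    P = K @ np.hstack([R_cw, -R_cw @ c_w.reshape(3, 1)])   # 3x4 camera matrix, cf. (2.35)
    x = P @ np.append(m_w, 1.0)                            # homogeneous pixel coordinates, cf. (2.36)
    return x[:2] / x[2]                                    # divide out the scale factor lambda

# Hypothetical intrinsics (pixels) and a camera placed at the world origin
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
print(project(np.array([0.1, 0.0, 2.0]), K, np.eye(3), np.zeros(3)))   # approx [360, 240]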


3 Estimation

Estimation is the problem of using measured data, y_t, to reveal statistical properties, x_t and θ, of systems which are of interest. Basic ingredients are model structures representing the assumptions made on the interaction and time evolution of x_t, θ and y_t. As tools for solving estimation problems, state space models are commonly used model structures, and a fairly general description is

\dot{x}_t = f(x_t, u_t, θ, w_t),  (3.1a)
y_t = h(x_t, θ, e_t),  (3.1b)

where f(·) and h(·) are general nonlinear functions, x_t are states, u_t are known inputs, θ are unknown or partially known parameters, and w_t and e_t are noise representing everything that cannot be predicted yet still affects the system. However, in this work, an important special case is considered where the noise is assumed additive and Gaussian. Then the discrete-time version of (3.1) is

x_t = f(x_{t-1}, u_t, θ) + w_t,  (3.2a)
y_t = h(x_t, θ) + e_t.  (3.2b)

The functions f(·) and h(·) are assumed well behaved, meaning they are at least one time continuously differentiable and smooth in the domain.
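As a toy instance of the discrete-time model structure (3.2), the sketch below defines f and h for a scalar random-walk state observed through a nonlinear measurement and simulates it with additive Gaussian noise; the particular model is invented here purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def f(x, u):            # (3.2a): state dynamics, here a random walk driven by the input
    return x + u

def h(x):               # (3.2b): nonlinear measurement of the state
    return np.sin(x)

x, xs, ys = 0.0, [], []
for t in range(100):
    u = 0.05
    x = f(x, u) + rng.normal(scale=0.01)   # process noise w_t
    y = h(x) + rng.normal(scale=0.1)       # measurement noise e_t
    xs.append(x); ys.append(y)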

3.1 Sensor Fusion

Sensor fusion is the process of combining multi-sensory data in a clever way to obtain a state estimate \hat{x}_t or a state sequence estimate \{\hat{x}_t\}_{t=0}^{N}. Model parameters, θ, are usually not of interest. The sensor fusion concept is illustrated in Figure 3.1, where the input sensors could as well be sensor fusion nodes.


Figure 3.1: Sensor data is processed in a sensor fusion node to obtain state estimates.

An example of sensor fusion: an imu in combination with a vector magnetometer in a single unit facilitates estimation of the absolute attitude of the unit with respect to the local magnetic field and the local gravity field, assuming the fields are constant and the sensor is still.

In Chapter 4 sensor fusion of imu and camera data is considered. Cameras and imus are somewhat complementary sensors, meaning that together they provide information which is not readily available from either sensor alone. As imu measurements are noisy they only allow short term attitude estimation; actually, the absolute orientation cannot be obtained from an imu alone. On the other hand, cameras allow long term attitude estimation (Montiel and Davison, 2006) in case of moderate movement of the camera (not causing too much motion blur), assuming robust detection and matching of interest points.

3.2 Calibration

We refer to calibration as the problem of estimating the parameter vector θ associated with the model structures (3.2a) and (3.2b). Some calibration problems also require solving sensor fusion and parameter estimation jointly. In the System Identification literature calibration is usually called Grey-box Identification, see e.g., Ljung (1999).

In Chapter 5 calibration of an osthmd is considered, where the states, x_t, are given by an electro-optical tracking device and the measurements, y_t, are gathered by humans in what is called a boresighting exercise. The calibration is solved using the Direct Linear Transformation (DLT) algorithm.


3.3 Nonlinear Estimation

Methods of handling nonlinearities in estimation problems can broadly be divided into two classes. The first is nonlinear methods not requiring explicit linearisation, where some examples are Unscented Kalman Filters (ukf) (Julier and Uhlmann, 1997) and particle filters (pf) (Gordon et al., 1993; Schön, 2006); the second is to linearise the nonlinearities (at the current estimate) and use linear methods, e.g., Extended Kalman Filters (ekf).

3.3.1 Extended Kalman Filter

The celebrated Kalman Filter (kf) (Kalman, 1960) provides unbiased estimates with minimum variance if applied to linear systems which are subject to Gaussian noise. For nonlinear systems, a natural extension is to linearise the nonlinear functions by a Taylor expansion at the current estimate and then propagate the mean and the covariance approximations. This can be done in an Extended Kalman Filter (ekf), which is described in many books, see e.g., Kailath et al. (2000). The ekf works in a two step procedure summarised in Algorithm 1 below, where the parameters, θ, are omitted and the sampling time is T = 1.

Algorithm 1: Extended Kalman Filter

Require an initial state, \hat{x}_{0|0}, and an initial state covariance, P_{0|0}, and use the models (3.2).

1. Time Update
\hat{x}_{t|t-1} = f(\hat{x}_{t-1|t-1}, u_t),  (3.3a)
P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t.  (3.3b)

2. Measurement Update
K_t = P_{t|t-1} H_t^T (H_t P_{t|t-1} H_t^T + R_t)^{-1},  (3.4a)
\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - h(\hat{x}_{t|t-1})),  (3.4b)
P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1}.  (3.4c)

In Algorithm 1 we have defined

F_t \triangleq \frac{\partial f(x_t, u_t)}{\partial x_t} \Big|_{(x_t, u_t) = (\hat{x}_{t-1|t-1}, u_t)}, \qquad H_t \triangleq \frac{\partial h(x_t)}{\partial x_t} \Big|_{x_t = \hat{x}_{t|t-1}},  (3.5)

furthermore, Q_t and R_t are the covariance matrices of w_t and e_t, respectively.
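Algorithm 1 translates almost line by line into code. The Python sketch below performs one generic ekf iteration under the additive-noise model (3.2); the Jacobians F and H are supplied by the caller (evaluated at the appropriate estimates), and nothing here is specific to the slam implementation in Paper A.

import numpy as np

def ekf_step(x, P, u, y, f, h, F, H, Q, R):
    """One EKF iteration: time update (3.3) followed by measurement update (3.4)."""
    # Time update
    x_pred = f(x, u)
    P_pred = F @ P @ F.T + Q
    # Measurement update
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain, (3.4a)
    x_new = x_pred + K @ (y - h(x_pred))       # (3.4b)
    P_new = P_pred - K @ H @ P_pred            # (3.4c)
    return x_new, P_new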

3.3.2 Nonlinear Least-Squares

The objective in nonlinear least-squares is to estimate parameters, θ, by minimising the residuals, ε_t(θ) = y_t - h_t(θ), see for instance Nocedal and Wright (2006). The residuals are minimised in a least-squares sense,

V(θ) = \frac{1}{2} \sum_{t=1}^{N} ||ε_t(θ)||_2^2 = \frac{1}{2} \sum_{t=1}^{N} ε_t(θ)^T ε_t(θ).  (3.6)


Define ε(θ) = [ε_1(θ), ε_2(θ), ..., ε_N(θ)]^T and the Jacobian J(θ) = \frac{dε(θ)^T}{dθ}, then the gradient and the Hessian of (3.6) with respect to the parameters are given by

\frac{dV(θ)}{dθ} = \sum_{t=1}^{N} \frac{dε_t(θ)^T}{dθ} ε_t(θ) = J(θ) ε(θ),  (3.7)

and

\frac{d^2 V(θ)}{dθ^2} = J(θ) J(θ)^T + \sum_{t=1}^{N} ε_t(θ) \frac{d^2 ε_t(θ)}{dθ^2},  (3.8)

respectively. The extension to multivariable residuals is easily obtained by stacking the vectorisation of the individual residuals, which again gives a scalar cost function, see Gustafsson (2010, page 49).

The Gauss-Newton method is a popular second order method used for solving nonlinear least-squares problems. It is computationally simple since the Hessian of the objective function is approximated as

\frac{d^2 V(θ)}{dθ^2} ≈ J(θ) J(θ)^T,  (3.9)

significantly reducing computations. The Gauss-Newton method is summarised in Algorithm 2.

Algorithm 2: Gauss-Newton

1. Require an initial estimate \hat{θ}.
2. Compute the Jacobian J(\hat{θ}).
3. Compute a search direction p by solving
J(\hat{θ}) J(\hat{θ})^T p = -J(\hat{θ}) ε(\hat{θ}).  (3.10)
4. Compute a step length, α, such that the cost (3.6) is decreased. This can be done using line search, see e.g., Nocedal and Wright (2006, page 297) or Boyd and Vandenberghe (2004, page 464).
5. Update the parameters
\hat{θ} := \hat{θ} + αp.  (3.11)
6. Terminate if a stopping criterion is met. Such criteria can be: the change in cost is small, the number of iterations has exceeded some threshold, among others.
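A compact Gauss-Newton loop corresponding to Algorithm 2 is sketched below. Note two assumptions that differ from the text above: the Jacobian is here taken as dε/dθ (the transpose of the convention in Algorithm 2, so the normal equations appear as JᵀJ p = -Jᵀε), and the step length α is fixed instead of being found by a line search. The curve-fitting example is invented purely to exercise the routine.

import numpy as np

def gauss_newton(theta, residual, jacobian, n_iter=20, alpha=1.0, tol=1e-9):
    """Minimise 0.5*||eps(theta)||^2 as in (3.6) with Gauss-Newton steps."""
    for _ in range(n_iter):
        eps = residual(theta)
        J = jacobian(theta)                          # d eps / d theta, shape (N, p)
        p = np.linalg.solve(J.T @ J, -J.T @ eps)     # search direction, cf. (3.10)
        theta = theta + alpha * p                    # fixed step length instead of line search
        if np.linalg.norm(p) < tol:                  # simple stopping criterion
            break
    return theta

# Tiny usage example: fit y = exp(a*t) to noisy data (a is the single parameter)
t = np.linspace(0.0, 1.0, 50)
y = np.exp(0.7 * t) + 0.01 * np.random.default_rng(1).normal(size=t.size)
res = lambda th: y - np.exp(th[0] * t)
jac = lambda th: (-t * np.exp(th[0] * t)).reshape(-1, 1)
print(gauss_newton(np.array([0.0]), res, jac))       # should approach [0.7]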


4 Visual Inertial Navigation and Mapping

Simultaneous Localisation and Mapping (slam) emerged in the field of mobile robotics as a tool to enhance navigation and to infer properties of the surrounding environment by means of the onboard sensors. A lot of attention was spent on indoor platforms equipped with wheel encoders for odometry and line sweeping laser range scanners measuring the environment. Today, all sorts of platforms and sensors are used in slam applications, e.g., submarines (Eustice et al., 2006), monocular cameras (Davison et al., 2007; Davison, 2003; Eade, 2008; Klein and Murray, 2007), aerial platforms (Karlsson et al., 2008; Caballero et al., 2009; Bryson and Sukkarieh, 2009; Lupton and Sukkarieh, 2009, 2008; Skoglund, 2008) and even domestic robots, Neato Robotics (2011). In this thesis, the sensors are an imu and a camera contained in a single unit.

4.1 Visual and Inertial ekf-slam

EKF-SLAM is probably the most common slam method and it is often straightforward to implement as described in e.g., Durrant-Whyte and Bailey (2006); Smith et al. (1990). For a thorough treatment, the book by Thrun et al. (2005) serves as a standard reference. In feature based slam, coordinates in the global frame are explicitly represented as landmarks, l, which are part of the state vector. The standard assumption is that the landmarks are stationary but alternatives exist; in Bibby and Reid (2007) a model structure allowing dynamic objects is considered. Some feature characteristics of the landmarks have to be detected and associated to the measurements. This is known as the data association problem in slam, and it depends on the sensors and the environment, among others. For a camera sensor, the data association is the 2D/3D correspondence between detected features in images and landmarks in the landmark map. We solve this using the Scale-Invariant Feature Transform (SIFT), Lowe (1999), see Section 4.1 in Paper A.

In our application the camera and imu deliver data at different rates. This should be distinguished in the equations as well as in the implementation. Denote the time when camera measurements arrive by t_k, then the dynamic, the landmark and the measurement models are given by

x_t = f(x_{t-1}, u_t) + B_w w_t,  (4.1a)
l_t = l_{t-1},  (4.1b)
y_{t_k} = h(x_{t_k}, l_{t_k}) + e_{t_k},  (4.1c)

which is also found as (1) in Paper A. The ekf given in Algorithm 1 applies to (4.1) with just a few modifications.

4.1.1 Prediction

The prediction step in ekf-slam is given by

\hat{x}_{t|t-1} = f(\hat{x}_{t-1|t-1}, u_t),  (4.2a)
\hat{l}_{t|t-1} = \hat{l}_{t-1|t-1},  (4.2b)
P^{xx}_{t|t-1} = F_t P^{xx}_{t-1|t-1} F_t^T + B_w(\hat{x}_{t-1|t-1}) Q B_w^T(\hat{x}_{t-1|t-1}),  (4.2c)

where

F_t \triangleq \frac{\partial f(x_{t-1}, u_t)}{\partial x_{t-1}} \Big|_{(x_{t-1}, u_t) = (\hat{x}_{t-1|t-1}, u_t)},  (4.3)

and

Q = \begin{bmatrix} Q_a & 0 \\ 0 & Q_\omega \end{bmatrix}.  (4.4)

Note that only the vehicle state covariance is updated. The full covariance matrix is

P_t = \begin{bmatrix} P^{xx}_t & P^{xl}_t \\ P^{lx}_t & P^{ll}_t \end{bmatrix}.  (4.5)

Also, when new landmarks are initialised they are appended to the state vector.
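To make the block structure of (4.2)-(4.5) concrete, a small Python sketch is given below. It follows the equations above literally (only the vehicle block P^xx is propagated) and shows one simple way of appending a newly initialised landmark; the function names, dimensions, and the block-diagonal initial covariance of the new landmark are illustrative assumptions, not the implementation evaluated in Paper A.

import numpy as np

def ekf_slam_predict(x, l, Pxx, f, F, Bw, Q):
    """Prediction step (4.2): propagate the vehicle state and its covariance,
    keep the landmark estimate unchanged (landmarks are assumed stationary)."""
    x_pred = f(x)                                   # (4.2a)
    l_pred = l                                      # (4.2b)
    Pxx_pred = F @ Pxx @ F.T + Bw @ Q @ Bw.T        # (4.2c)
    return x_pred, l_pred, Pxx_pred

def append_landmark(state, P, l_new, P_new):
    """Append a newly initialised landmark to the state vector and grow the
    covariance with an (illustrative) block-diagonal initial uncertainty."""
    state = np.concatenate([state, l_new])
    P = np.block([[P, np.zeros((P.shape[0], P_new.shape[0]))],
                  [np.zeros((P_new.shape[0], P.shape[0])), P_new]])
    return state, P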

4.1.2 Measurement Update

Assuming the data association is given, the measurement update for ekf-slam is given by

K_t = P_{t|t-1} H_t^T (H_t P_{t|t-1} H_t^T + R_t)^{-1},  (4.6a)

\begin{bmatrix} \hat{x}_{t|t} \\ \hat{l}_{t|t} \end{bmatrix} = \begin{bmatrix} \hat{x}_{t|t-1} \\ \hat{l}_{t|t-1} \end{bmatrix} + K_t (y_t - h(\hat{x}_{t|t-1}, \hat{l}_{t|t-1})),  (4.6b)

P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1},  (4.6c)

where

H_t \triangleq [\partial_{x_t} h(x_t, l_t) \quad \partial_{l_t} h(x_t, l_t)] \big|_{(x_t, l_t) = (\hat{x}_{t|t-1}, \hat{l}_{t|t-1})}.  (4.7)

When the unit quaternion is used for rotation parametrisation in an ekf it has to be normalised,

q_{t|t} := q_{t|t} / ||q_{t|t}||,  (4.8)

preferably after each measurement update. An alternative is to introduce the quaternion's deviation from unity as a fictive measurement.

4.2 Nonlinear Least-Squares and slam

The basic concept of nonlinear least-squares for slam is to acquire an initial estimate, linearise the models around the estimate and minimise all the measurement errors and the state trajectory errors in a least-squares sense.

ekf-slam (Section 4.1) provides an initial trajectory, x^0_{0:N}, and an initial landmark estimate, l^0_N. The linearised dynamics around x^0_t at time t is then

x^0_t + δx_t = f(x^0_{t-1}, u_t) + F_t δx_{t-1} + \underbrace{B_w w_t}_{\tilde{w}_t},  (4.9)

where δx_t is a small deviation from x^0_t and F_t is the derivative of f(x_{t-1}, u_t) with respect to x_{t-1} evaluated in x^0_{t-1}. The linearised measurement around l^0_N and x^0_{t_k} for the measurements at time t_k is

y_{t_k} = h(x^0_{t_k}, l^0_N) + H_{t_k} δx_{t_k} + J_{t_k} δl_N + e_{t_k},  (4.10)

where δl_N is a small deviation from l^0_N, H_{t_k} is the derivative of h(x_{t_k}, l_N) with respect to x_{t_k} evaluated at x^0_{t_k} and l^0_N, and J_{t_k} is the derivative of h(x_{t_k}, l_N) with respect to l_N evaluated at x^0_{t_k} and l^0_N.

By rearranging the terms in (4.9) and (4.10) we can formulate the problem of finding the trajectory and map deviations that minimise the noise variances as

\min_{δx_t, δl_N} \sum_{t=1}^{N} ||\tilde{w}_t(δx_t)||^2_{\tilde{Q}_t^{-1}} + \sum_{k=1}^{K} ||e_{t_k}(δx_{t_k}, δl_N)||^2_{R_{t_k}^{-1}},  (4.11)

where \tilde{Q}_t = B_w Q_t B_w^T and ||x||^2_{P^{-1}} = x^T P^{-1} x. This is in principle a (weighted) nonlinear least-squares problem which gives an increment for the parameters, θ = [x_{0:N}^T, l_N^T]^T, and the Gauss-Newton algorithm (Algorithm 2) applies.

4.2.1 Remarks

There are, at least, two problems associated to (4.11) which are briefly discussed below.


Process Noise

The process noise covariance, \tilde{Q}_t, is singular, see (2.27). This is because the velocity is affected by the same noise as the position and the quaternion only has 3 dof but 4 states. It turns out that a solution does not have to be consistent with the dynamic model. A solution to this is to add the linearised dynamic model as a constraint to the problem (4.11), which results in a constrained quadratic program.

Quaternion

The unit quaternion is constrained to the unit sphere, q^T q = 1, which is a non-convex constraint that should also be handled. A suggestion is to use a local 3-parameter approximation of the unit quaternion, Triggs et al. (2000); Kümmerle et al. (2011).


5 Optical See-Through Head Mounted Display Calibration

The Optical See-Through Head Mounted Display (osthmd) calibration process is introduced by giving an overview of the ar system and its geometry, explaining the measurement procedure and the Direct Linear Transformation used for calibration. This chapter also explains some parts not covered in Paper B.

Figure 5.1 shows the Kaiser ProView 50ST osthmd which was used in the experiments, Axholt et al. (2011). The red lights are tracker markers which are measured by the tracker system.

5.1 System Overview

The ar system consists of an osthmd and a motion tracking system, briefly described below.

5.1.1 Head Mounted Display

In a pinhole camera model the light passes through a lens system and then hits the camera sensor. In an osthmd there is no camera sensor, yet an optical combiner serves the same purpose when used in a pinhole camera model. A simplified illustration of how the graphics is created in the osthmd can be seen in Figure 5.2. The plane-to-plane homography of the pixels from the beamer screen, through a collimator and onto the ost semitransparent screen (optical combiner), is one of the paths light travels before reaching the eye, and this is how the virtual overlay is created onto the scene. The other path of light comes from the scene, and does not pass through the beamer or the collimator. Note that the pixels are distorted by the relative rotation of the planes involved.


Figure 5.1: osthmd, the red lights are used by the optical tracker system consisting of several cameras.

5.1.2 Tracker

The optical tracker system, which is a PhaseSpace IMPULSE motion capture system, measures the position of tracker marks (red diodes) and provides 6 dof pose at a maximum of 200 Hz, but due to software limitations the calibration application only runs at 60 Hz. The tracker is used in two ways. Firstly, 11 tracker marks are attached to the hmd, defining a rigid body for which the tracker provides the 6 dof pose. Secondly, a single tracker mark is mounted on a tripod and its position is also reported by the tracker. This single tracker mark is called the background marker.

5.2 Calibration

Traditional camera calibration considering electro-optical cameras is a well-studied problem with many effective solutions and freely available software providing very accurate calibration, almost automatically, see e.g., Bouguet (2010) or OpenCV. In osthmd calibration good results are difficult to obtain for various reasons; for example, the equipment is fairly non-standard and requires special solutions, data acquisition is time consuming and errors are easily introduced (the screen shifts about 30-40 pixels by just frowning!).

The osthmd in Figure 5.1 is quite a complex device and with few calibration data we are forced to approximations reducing the parameter space.

Figure 5.2: Projection from an image on the beamer, b, to the screen s onto the image plane, i. (a) 3D illustration of how the graphics from the beamer screen b is displayed on the virtual image i as the reflected light falling on the semitransparent screen s. The black dot represents the beamer light source. The c coordinate frame represents the user's eye, which corresponds to the camera centre in the pinhole camera model. (b) 2D view of the beamer screen (left), ost display (middle) and the image plane (right). The beamer is rotated around the x_b axis while the screen is rotated around its x_s and y_s axes, resulting in a distorted image. This is because projective transformations do not necessarily preserve angles. The distance from the c frame to the s frame is the focal length, f.

With Figure 5.2 in mind, some assumptions motivating the pinhole camera model as an approximate model structure for the osthmd are:

5.1 Assumption. The rotation between the frames b and s is small, i.e., R^{bs} ≈ I; hence, the pixels projected onto the screen are rectangular.

5.2 Assumption. The pixel plane is perpendicular to the optical axis. Without this assumption another rotation has to be estimated, Robinett and Rolland (1991).

5.3 Assumption. The pixels, as perceived by the user, are points. This is true for most users with uncorrected vision because the VGA resolution (640 × 480 pixels) of the ost display and the distance of the optical combiner to the user's eye make the pixel shape almost indistinguishable.

5.2.1 Tracker Measurements

The aim of the measurement procedure is to gather N correspondences between the background marker x^w and pixel coordinates x^p; we denote this by x^p_i ↔ x^w_i, i = 1, ..., N. The correspondences are gathered manually in a so-called boresighting exercise where the user aligns the background marker with a crosshair displayed on the ost display. This is done N times, as described in Section 3.5 in Paper B.

The tracker provides the position of the tracker markers referenced in the tracker frame t. The markers attached to the hmd are defined as a rigid body with its origin defined in one of the markers, for which the tracker reports the full 6 dof pose. To simplify calculations the hmd is used as the base frame w and it is considered to be inertial. Since the background marker is also measured by the tracker it can easily be referenced in the hmd frame. Let T^{wt} denote the homogeneous representation of the Euclidean transformation from the t frame to the w frame, then the background marker x^t is referenced in the w frame as

x^w = T^{wt} x^t,  (5.1)

where the vectors are homogeneous.

This is a great advantage since we would otherwise need to know the exact location of the background marker in a global frame with respect to the tracker frame. Furthermore, this is basically the same idea as in the Single Point Active Alignment Method (SPAAM) of Tuceryan and Navab (2000), where their background marker is the approximate tracker origin.

Using (5.1) and the pinhole camera model (2.35), the screen coordinates are related to the background marker as

\lambda x^p = P x^w.  (5.2)


5.2.2 Direct Linear Transformation

The Direct Linear Transformation (DLT) (Abdel-Aziz and Karara, 1971) is a parameter estimation method which takes N point correspondences x^p_i ↔ x^w_i, i = 1, ..., N, given by

\lambda_i x^p_i = P x^w_i,  (5.3)

where all \lambda_i are unknown scale factors. The sought parameters are the entries in the 3 × 4 camera matrix P. In our setup, P defines the pinhole camera model as in (2.35), where x^p_i are pixel coordinates and x^w_i coordinates in the w-frame. The unknown scale factors \lambda_i are removed by applying the cross product

\underbrace{\lambda_i x^p_i \times x^p_i}_{=0} = x^p_i \times P x^w_i,  (5.4)

which can be rearranged into

\begin{bmatrix} 0^T & -x^{wT}_i & y^p_i x^{wT}_i \\ x^{wT}_i & 0^T & -x^p_i x^{wT}_i \\ -y^p_i x^{wT}_i & x^p_i x^{wT}_i & 0^T \end{bmatrix} \underbrace{\begin{bmatrix} p^1 \\ p^2 \\ p^3 \end{bmatrix}}_{p} = 0,  (5.5)

where p^i, i = 1, 2, 3 are the transposed row vectors of P, i.e., the vectorisation of P. There are only 2 linearly independent equations in (5.5), so it is possible to leave one out. Since there will be measurement errors, in either x^p or x^w, or both, the right hand side of (5.5) will not be exactly zero. It is therefore preferable to collect more than the six correspondence points which is the minimum required to solve (5.5) for P. Taking the N correspondences and stacking the entries in a tall 3N × 12 matrix A we obtain

A p = ε,  (5.6)

where p = [p^1, p^2, p^3]^T and ε are the residual errors associated with the correspondences. The residual errors can be minimised by the following optimisation problem,

\min_p ||Ap|| \quad \text{subject to} \quad ||p|| = 1,  (5.7)

which has a non-degenerate solution if the correspondence point configuration is general, see Section 2.3 in Paper B. The solution corresponds to the smallest nonzero singular value of A. The camera matrix has only 11 dof since the scale is unknown, hence, A has only 11 nonzero singular values. We must therefore add a constraint to p since otherwise the solution would be p = 0. The unit singular vector associated with the smallest nonzero singular value gives the minimising argument \hat{p} and hence the camera matrix \hat{P} is obtained. For a more complete discussion on DLT, see Hartley and Zisserman (2004); Sutherland (1974).


5.2.3 Decomposition

The 3 × 4 camera matrix P can be decomposed into the extrinsic parameters, R^{cw} and c^w, and the upper triangular intrinsic parameter matrix K. These parameters are needed for analysis but are also required to assemble the OpenGL matrices for display configuration of the ar graphics in VR Juggler. The decomposition is obtained from (2.36) as

P = K [R^{cw} \;\; -R^{cw} c^w] = K R^{cw} [I \;\; -c^w].  (5.8)

Since R^{cw} is orthogonal (R^{cw} (R^{cw})^T = I) and K is upper triangular, the matrices can be found as the RQ-decomposition of the left 3 × 3 sub-matrix of P. The ambiguity in the decomposition can be removed by requiring that K has positive diagonal entries (Hartley and Zisserman, 2004). Also, K is normalised by its last entry,

K := K / K_{33}.  (5.9)

The camera centre (eye point location) is in the right null space of P,

P \begin{bmatrix} c^w \\ 1 \end{bmatrix} = 0,  (5.10)

and c^w can be recovered as

c^w = -P^{\dagger} p_4,  (5.11)

where p_4 is the last column of P and † denotes the pseudo inverse.
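The decomposition in (5.8)-(5.11) can be carried out numerically with an RQ factorisation, for which scipy provides a routine. The sketch below is a generic illustration of this step under the sign convention mentioned above; the camera centre is obtained directly from (5.10) rather than via a pseudo inverse, and the function is an assumption-laden example, not the code used in the experiments.

import numpy as np
from scipy.linalg import rq

def decompose(P):
    """Split P = K [R | -R c] into intrinsics K, rotation R and camera centre c, cf. (5.8)-(5.11)."""
    K, R = rq(P[:, :3])                      # RQ-decomposition of the left 3x3 block
    signs = np.sign(np.diag(K))              # force positive diagonal entries of K
    K, R = K * signs, (R.T * signs).T        # (an overall sign flip of P may be needed so that det(R) = +1)
    K = K / K[2, 2]                          # normalisation (5.9)
    c = -np.linalg.solve(P[:, :3], P[:, 3])  # camera centre from P [c; 1] = 0, cf. (5.10)
    return K, R, c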

5.2.4 Remarks

The DLT minimises the error, ||Ap||, which is called the algebraic error and is usually difficult to analyse geometrically. Actually, we would like to maximise the probability of the parameters given the correspondences, p(P; x^p_i, x^w_i). This can more easily be done using Maximum Likelihood, p(x^p_i, x^w_i; P), where error measures have to be derived. These methods are called Gold Standard methods in Hartley and Zisserman (2004) and usually result in nonlinear least-squares problems which are initialised with an estimate from, e.g., DLT. Under the assumption of zero mean independent and identically distributed Gaussian noise in the pixels,

\min_P \sum_{i=1}^{N} ||x^p_i - P x^w_i||^2,  (5.12)

is the Maximum Likelihood estimator for P. With the experimental data in Paper B and minimisation of (5.12) using Levenberg-Marquardt no further improvement is obtained. We believe this is because of a poor initial estimate. A more reasonable assumption, given our setup, is that there are errors in the background marker coordinates, x^w_i, as well. This is due to the fact that the background marker is measured, hence containing errors, and these measurements are referenced by T^{wt} (5.1), which is also uncertain.


Care should be taken interpreting the camera parameters as Euclidean from the decomposition (Section 5.2.3). Even if the lens system in the osthmd is ideal there is no guarantee that the ost display is perpendicular to the optical axis.


