Indoor Visual Localization of the NAO Platform
Lokalisering i inomhusmiljö med NAO-plattformen


Degree project in Computer Science, Second cycle

Lokalisering i inomhusmiljö med NAO-plattformen

Isak Tjernberg (isaktj@kth.se)

Master's Thesis at CSC
Supervisor: Petter Ögren
Examiner: Danica Kragic


1 Introduction
2 Background
  2.1 The NAO robot
  2.2 SLAM
  2.3 Previous work
    2.3.1 SLAM solutions with the NAO robot
    2.3.2 SLAM with RGB-D sensor
    2.3.3 Detectors and Descriptors
    2.3.4 Monocular SLAM using an Extended Kalman Filter
    2.3.5 Multi-robot SLAM using particle filter
    2.3.6 RANSAC
3 Exploring Performance of SLAM on the NAO
  3.1 SLAM with EKF
    3.1.1 Result
  3.2 SLAM with structure from motion
  3.3 Discussion of results
4 Using Kinect to Build a Map for Localization
  4.1 Microsoft Kinect sensor
  4.2 Acquiring a pose graph
  4.3 Feature extraction
5 Global Localization Using Vision and a Map
  5.1 Feature matching
  5.2 Finding a transformation
6 Filtering of the Noisy Pose Measurements
  6.1 Extended Kalman filter
    6.1.1 Prediction
    6.1.2 Correction
    6.1.3 Outlier rejection
    6.1.4 Initializing the filter
7 Experiments
  7.1 System overview
  7.2 Implementation
  7.3 Results
8 Conclusion and Future Work
  8.1 Discussion of results
  8.2 Possible Improvements and Future Work
  8.3 Conclusion


Introduction

With a possible future involving autonomous robots helping out with household tasks, their ability to map and localize will be important. If a robot is to navigate an indoor environment (that is, go from point A to point B), it must not only know what the environment looks like (i.e. have a map) but also know where it is located with respect to that map (i.e. do localization). The mapping becomes even more important as objects can occasionally move, and the map must then be updated to reflect these changes. The robot might otherwise crash into objects which are not correctly mapped, which raises safety concerns. The topics of localization and mapping have received much attention (see for example [1]), but most techniques rely on high-performance and relatively expensive sensors such as laser range finders, as in [2]. One can argue that to make household robots available to the general public, one must not use expensive hardware but instead rely on lower-cost components, such as those available in, for example, the NAO robot [3]. One application of a multi-agent task which relies on agent localization using such hardware is the RoboCup soccer league [4].

This thesis is focused on studying the capabilities of the NAO platform with respect to indoor mapping and localization. In particular, we investigate the performance impact of using different levels of engineering of the environment. With the level of engineering of the environment we mean how much the environment must be modified to enable a given solution. We investigate three different cases: first, the case of a robot with no external equipment; second, the case where the map is built by external means and fed to the robot; and finally, the case where an external positioning system provides localization. The level of preparation increases over these cases. The first requires the least preparation, as the robot is set off in a completely unknown environment, but by adding some engineering effort the map can be given to the robot before it encounters the environment. The last case, with an external positioning system, involves more engineering of the environment, as now the environment itself needs to be modified.

is then in Chapter 3 tested and evaluated on the NAO platform. Chapter 4 then explains how the Kinect sensor can be used to build a 3D map. With a 3D map available, Chapter 5 presents a solution for how to use the NAO camera to localize in such a map. To also incorporate odometry measurements from the NAO robot, a Kalman filter for data fusion is set up, as explained in Chapter 6. Finally, results from experiments are presented, followed by a discussion of the results and a conclusion.

Background

This chapter gives a description of the background knowledge and techniques that were needed for the solution of the project.

2.1 The NAO robot

The NAO is a 58 cm tall humanoid robot developed by the French company Aldebaran Robotics [3], see Figure 2.1. It is equipped with ultrasound sonars, a microphone array, IR sensors, an inertial measurement unit, tactile sensors and two cameras. Onboard processing is done on an Intel Atom 1.6 GHz CPU. The robot comes shipped with functionality for a number of tasks, for example walking and sound detection. Stereo vision is not possible in practice, due to the fact that the two cameras have only a small overlap. One of the cameras points slightly downwards, with the other one facing straight ahead, providing a 61° horizontal field of view (HFOV) at 1.22 MPix. One can access VGA images at 15 fps, and higher resolutions at lower frame rates. Among other applications, the NAO robot is used as the hardware in the RoboCup Soccer Standard Platform League [4].

2.2 SLAM

SLAM, short for Simultaneous Localization and Mapping, is the problem of building a map of an unknown environment while at the same time estimating the robot's pose within that map.

Figure 2.1. The NAO robot developed by Aldebaran Robotics.

2.3 Previous work

Earlier studies in the area are introduced here, and some of the ideas in these studies are explained further below. The work presented in this section was performed by the cited authors.

2.3.1 SLAM solutions with the NAO robot

camera and, moreover, image points at a larger angle from the direction of motion will move more between frames and can thus give more information on the robot movement. The authors chose not to use a Kalman filter approach for the visual SLAM, with the argument that the processing power of the NAO is too limited for such an algorithm. The notion of observation lines is instead introduced, where an observation line is the line directed towards where a keypoint was observed, in the absolute frame. If a given keypoint has only been observed once, the keypoint can lie anywhere along the observation line. However, when the same keypoint is observed a second time, its position in the absolute reference frame is given as the intersection of the two observation lines. This holds as long as the observation lines are not collinear. In the case of more than two observations of a certain keypoint, its position is determined as the average of the intersections. One issue presented in the paper is that of false matches of keypoints. This is an issue because when two keypoints from two different frames, corresponding to two different real-world points, are matched, the position estimate of that real-world point will not make sense. To avoid this issue, the authors of [6] note that one relatively easy check is to see if the observations are physically possible. For example, the estimated position of a keypoint cannot be behind the robot or even out of the field of view of the robot. This reduces the number of false matches but does not make them all go away. The conclusion from the work is that "..., a visual SLAM with generic keypoints does not seem reasonably feasible on a NAO robot.". This is of course a matter of interpretation and of the requirements the SLAM process has to meet.

One paper similar to [6] is [7], which uses the NAO camera to build a map, also known as monocular SLAM. Monocular SLAM is the process of performing SLAM by the use of a single camera, i.e. when no stereo vision is available. As estimating a 3D map from only a 2D image is hard, one approach for monocular SLAM is to track how objects in the scene have moved between two image frames. Paper [7] uses a Structure from Motion (SfM) approach to solve the SLAM problem. SfM is the technique of matching features (or keypoints) between image frames as the camera is moving. By knowing how 3D world points are projected onto the image frame, one can construct a 3D model of the world together with the camera poses from which the images were acquired. The work in [7] makes use of an SfM implementation known as SfM-Seqv2 [8], with slight modifications. Their modification allows for iterative updates, which means the camera poses are estimated and updated for every newly acquired image. In other words, the map is built iteratively during the process instead of only once all images have been acquired.

For the experiments, [7] used both a simulated environment in Webots and a real-world run. In the simulated environment, the NAO robot was ordered to follow a circular path, with images acquired every 1 m of walking. The mean pose error from this experiment is reported as being only 0.076 m when the robot has completed a circular path of about 10 m in radius. As the real-world experiment lacks ground truth measurements, the results from this run are hard to interpret. The conclusion from the work is "... the proposed algorithms are able to safely navigate the robot ... artificial markers."

Feature          Detector   Descriptor
Harris Corner    Yes        No
SIFT             Yes        Yes
SURF             Yes        Yes
BRIEF            No         Yes

Table 2.1. List of feature detectors and/or descriptors, from [10].

One other study involving the NAO robot is [9], which notes that the odometry on the NAO robot "... deviates greatly from the ground truth". The suggested approach to improve on the poor odometry is to also include data from the camera images. The solution depends on a pre-built 3D model of the environment in which the robot operates. The 3D model consists of image features which are later used to localize the robot as it moves in the environment. The main difference between [7] and [9] is that in the latter it is no longer the NAO which builds the map; instead it uses the pre-built map to localize itself. This is similar to what we do in Chapter 5.

2.3.2 SLAM with RGB-D sensor

In [10], the problem of map building for indoor environments is studied. The work builds on the use of a Microsoft Kinect sensor, which is a relatively low-cost sensor for capturing RGB and depth data. Moreover, [10] uses Graph SLAM, an offline (post-processing) algorithm, to solve the problem. As this work focuses on real-time application, the details of the Graph SLAM solution will not be further discussed here. Even though a Kinect sensor is used for the map building, [10] uses tracking of 2D image features and provides a good summary of the concept. Tracking 2D image features involves two parts: a feature detector and a feature descriptor. The detector is used to find interesting image areas, where interesting means that the area contains a corner or an edge. Typically, an image of a white wall has a low number of features, while an image of, for example, a computer keyboard has many more. The detector finds these small interesting patches, and the descriptor is then used to give a description of the image patch so that it can later be identified. Many different detectors and descriptors have been proposed, see Table 2.1 taken from [10].
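To make the detector/descriptor distinction concrete, the short sketch below detects SURF keypoints in an image and computes their descriptors with OpenCV. It is only a minimal illustration, not code from [10]; it assumes an OpenCV build with the xfeatures2d contrib module, and the image file name is made up.

// Minimal detector/descriptor sketch (assumes OpenCV built with the xfeatures2d contrib module).
#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <iostream>

int main() {
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // hypothetical input image
    if (img.empty()) return 1;

    // Detector + descriptor in one object; the Hessian threshold controls
    // how "interesting" a patch must be to become a keypoint.
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create(400.0);

    std::vector<cv::KeyPoint> keypoints;  // output of the detector
    cv::Mat descriptors;                  // one 64-float row per keypoint (the descriptor)
    surf->detectAndCompute(img, cv::noArray(), keypoints, descriptors);

    std::cout << "Detected " << keypoints.size() << " keypoints, descriptor length "
              << descriptors.cols << " floats each" << std::endl;
    return 0;
}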

2.3.3 Detectors and Descriptors


Figure 2.2. From image gradients around a keypoint to a feature descriptor in the SIFT algorithm (courtesy of David Lowe).

The difference-of-Gaussian (DoG) images act as band-pass filters for edges in the images. That is, for a certain DoG only edges of a certain "magnitude" will survive the filtering. Typical SIFT implementations blur the image with 5 Gaussian kernels of increasing width, thus producing 4 DoG images. When this is done, the original image is subsampled to produce an image of half the size and the process is repeated. Doing this for subsampled images is denoted as different octaves. In short, each octave consists of a number of DoG images.

The SIFT features are finally detected as the extrema of the DoG images, for all octaves.

The SIFT descriptor is designed to create a feature descriptor which is invariant to rotation of the image. In order to do this, the main orientation of the keypoint is first found (the major gradient orientation) and all remaining computations are thereafter done relative to this orientation. The descriptor is finally stored as a vector of 128 numbers; this vector comes from a histogram of gradients around the keypoint. In more detail, the area around the keypoint is divided into a square grid (4x4) and in each of these grid cells an 8-bin histogram of the image gradients is formed. This makes the descriptor 4 · 4 · 8 = 128 items long. The concept is illustrated with a 2x2 grid in Figure 2.2.

When matching SIFT keypoints between frames, one can simply match a feature vector in the previous frame to the closest one in the current frame using a nearest neighbor approach, with possible restrictions on keypoint location in the image.

Another feature descriptor is the SURF (Speeded Up Robust Features) [12] descriptor, which is built on the same concepts as the SIFT descriptor but is quicker to compute. The speed up comes from the use of integral images and also from approximating the computation of second order image derivatives. As noted in [10] SURF is several times quicker than SIFT.

The BRIEF descriptor [13] uses binary strings to describe a feature and uses the Hamming distance for matching. This makes computation very fast; the authors note that "Computing the Hamming distance between two binary vectors corresponds to a simple bitcount on the result of a binary XOR operation between the vectors". All in all, if one suffers from long computation times with the SIFT and SURF descriptors, one can switch to the BRIEF descriptor.

2.3.4 Monocular SLAM using an Extended Kalman Filter

We now turn to a full solution of the SLAM problem with the use of only a standard digital camera, running in real time. Davison et al. [5] use the classical approach of an Extended Kalman Filter (EKF) to solve the SLAM problem, but using image features as landmarks. A full description of the EKF can be found in [1]; it will only be summarized here. A standard EKF for robot localization tries to estimate the robot state vector and its corresponding covariance matrix. The standard Kalman filter makes use of a motion model and a measurement model. The motion model describes how the state changes in an iteration, and the measurement model gives the probability of a measurement given a robot state. The filter iterates over two steps, the prediction and the update. The prediction uses the motion model to predict where the state will go, and the update step uses a measurement to correct the state belief. In the SLAM scenario, the state vector is augmented to not only hold the robot/camera pose but also the positions of the landmarks in the map. The state vector then looks like

\[
\hat{x} = \begin{pmatrix} \hat{x}_v \\ \hat{y}_1 \\ \hat{y}_2 \\ \vdots \end{pmatrix}
\]

where x̂_v represents the camera pose and each ŷ_k represents the 3D Cartesian coordinates of landmark k. Specifically, x̂_v is

\[
\hat{x}_v = \begin{pmatrix} r^W \\ q^{WR} \\ v^W \\ \omega^R \end{pmatrix}
\]

which, as said, describes the camera state: r is a 3D position vector, q the orientation quaternion, v the (3D) velocity vector and ω the angular velocity vector. This makes x̂_v 13 elements long in total. The superscripts W and R denote which reference frame the coordinates belong to, world or robot respectively.

The EKF consists of a prediction and an update step. The prediction step is responsible for predicting where the camera will go in the next time step; the update step then corrects this "guess" by incorporating sensor measurements, which in this case is a freshly taken camera image. In the prediction step, the predicted state looks like

\[
\begin{pmatrix} r^W_{new} \\ q^{WR}_{new} \\ v^W_{new} \\ \omega^R_{new} \end{pmatrix} =
\begin{pmatrix} r^W + (v^W + V^W)\Delta t \\ q^{WR} \times q\!\left((\omega^R + \Omega^R)\Delta t\right) \\ v^W + V^W \\ \omega^R + \Omega^R \end{pmatrix}
\]

where V^W and Ω^R are velocity and angular velocity impulses respectively, caused by unknown accelerations modeled as zero-mean Gaussian noise. q((ω^R + Ω^R)Δt) here means the quaternion made up from the rotation vector (ω^R + Ω^R)Δt.

Image features (landmarks) used in [5] are simple 11 × 11 pixel image patches. In order to efficiently track these between image frames, the EKF prediction is used to estimate where in the image previously seen features should be searched for. Given the estimated camera position r^W and a landmark position y_i^W, the relative position of the landmark from the camera point of view is

\[
h^R_L = R^{RW}\left(y_i^W - r^W\right)
\]

R^{RW} is here the 3D rotation matrix from the world to the robot (or camera) frame. The expected image coordinates at which a given landmark can be found are then given as

\[
h_i = \begin{pmatrix} u \\ v \end{pmatrix} =
\begin{pmatrix} u_0 - f k_u \, h^R_{L,x} / h^R_{L,z} \\ v_0 - f k_v \, h^R_{L,y} / h^R_{L,z} \end{pmatrix}
\]

where (u_0, v_0) is the point of intersection (in pixels) of the camera optical axis and the image plane, f is the focal length of the camera, and k_u, k_v are the scale factors between metric distance and pixels for the two image axes. In order to know in which region around this point to search for the landmark, the information about the current uncertainty in the estimated positions (Σ) can be transformed into the camera frame. In a standard EKF the measurement covariance is usually expressed as

\[
H \Sigma H^T + R
\]

The covariance matrix Σ is in this case very large, as it also includes the uncertainty of all landmarks. H is the Jacobian of the measurement function (h_i) above. Instead of doing the calculation of the measurement covariance for all landmarks at the same time, it can be done on a per-landmark basis according to the following formula:

\[
S_i = \frac{\partial h_i}{\partial x_v} P_{xx} \frac{\partial h_i}{\partial x_v}^{T}
    + \frac{\partial h_i}{\partial x_v} P_{x y_i} \frac{\partial h_i}{\partial y_i}^{T}
    + \frac{\partial h_i}{\partial y_i} P_{y_i x} \frac{\partial h_i}{\partial x_v}^{T}
    + \frac{\partial h_i}{\partial y_i} P_{y_i y_i} \frac{\partial h_i}{\partial y_i}^{T}
    + R
\]

S_i is a 2 × 2 matrix representing the measurement uncertainty in image pixels, x_v is the camera part of the state, y_i the position of landmark i, and the P terms are the corresponding blocks of the full covariance matrix. R is the covariance matrix of the measurement noise when identifying a feature in the image and must be adjusted with the image resolution. Each of the terms in the above equation is also a 2 × 2 matrix, for example

\[
\frac{\partial h_i}{\partial x_v} P_{xx} \frac{\partial h_i}{\partial x_v}^{T} =
\begin{pmatrix}
\frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} & \frac{\partial u}{\partial z} \\
\frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} & \frac{\partial v}{\partial z}
\end{pmatrix}
\cdot P_{xx} \cdot
\begin{pmatrix}
\frac{\partial u}{\partial x} & \frac{\partial v}{\partial x} \\
\frac{\partial u}{\partial y} & \frac{\partial v}{\partial y} \\
\frac{\partial u}{\partial z} & \frac{\partial v}{\partial z}
\end{pmatrix}
\]

which results in a 2 × 2 matrix. Note that the matrices in the above expression are evaluated at the current estimated state.

S_i defines an elliptic search region for where to look for a given landmark, which can be varied in size by adjusting how many standard deviations are allowed. Davison et al. use a value of 3σ.
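As a worked illustration of this gating step, the sketch below evaluates the innovation covariance S_i for a single landmark from its (assumed precomputed) Jacobian blocks and tests an observed pixel against the 3σ ellipse. All numbers are invented placeholders and the code is not the implementation of [5]; Eigen is used for the matrix algebra.

// Sketch: per-landmark innovation covariance and 3-sigma gating (Eigen; example values only).
#include <Eigen/Dense>
#include <iostream>

int main() {
    using namespace Eigen;

    // Jacobian blocks of the measurement function h_i, assumed computed elsewhere
    // (filled here with arbitrary example values).
    Matrix<double, 2, 3> Hx;  // dh_i / dx_v (position part of the camera state)
    Matrix<double, 2, 3> Hy;  // dh_i / dy_i (landmark position)
    Hx << 200, 0, -40,   0, 200, -25;
    Hy << -200, 0, 40,   0, -200, 25;

    // Covariance blocks taken from the full SLAM covariance matrix (example values).
    Matrix3d Pxx = Matrix3d::Identity() * 1e-3;
    Matrix3d Pyy = Matrix3d::Identity() * 5e-3;
    Matrix3d Pxy = Matrix3d::Zero();
    Matrix2d R   = Matrix2d::Identity() * 4.0;  // image measurement noise (pixels^2)

    // S_i = Hx Pxx Hx' + Hx Pxy Hy' + Hy Pxy' Hx' + Hy Pyy Hy' + R
    Matrix2d S = Hx * Pxx * Hx.transpose()
               + Hx * Pxy * Hy.transpose()
               + Hy * Pxy.transpose() * Hx.transpose()
               + Hy * Pyy * Hy.transpose()
               + R;

    // Predicted pixel position of the landmark and an observed candidate match.
    Vector2d h_pred(320.0, 240.0);
    Vector2d z_obs(325.0, 243.0);

    // Squared Mahalanobis distance of the innovation; inside the gate if below 3^2.
    Vector2d innovation = z_obs - h_pred;
    double d2 = innovation.dot(S.inverse() * innovation);
    std::cout << "squared Mahalanobis distance: " << d2
              << (d2 < 9.0 ? " (inside 3-sigma gate)" : " (outside gate)") << std::endl;
    return 0;
}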

As the 3D position of a landmark cannot be estimated from a single image, landmarks go through an initialization step before their 3D position is added to the current map. When an 11 by 11 image patch is first selected as a landmark, it is initialized as a finite line from 0.5 to 5 m in the direction of where it was observed. The line is represented by a set of particles. When a new image is acquired, the particles' positions are projected into the new image (using the same procedure with Jacobians as above) and the landmark is searched for. The particle set is then resampled so that particles which better match the position of the landmark in the new image are more likely to survive. Once the particle set has narrowed in distribution so that the ratio between the standard deviation of the particle set and the estimated depth of the landmark falls below a certain threshold, the landmark is added to the map and thereafter tracked by the EKF.

As the complexity of the EKF is directly related to the dimension of the state vector, and thereby the number of landmarks, one must be careful not to add too many landmarks as it will make the algorithm slower. Davison et al. use an approach where they try to keep the number of landmarks visible in each frame between 6 and 10, due to their requirement of the algorithm running at 30 Hz. With a less demanding requirement on speed, one can of course use more landmarks. If fewer landmarks than the threshold specifies can be observed in a given frame, new possible landmarks are searched for using the Shi and Tomasi image interest operator [14] in a search box in the image. This search box is placed so that it does not overlap with existing features and so that newly detected features should not leave the field of view in the near future.

The experiments performed mainly resulted in videos proving the concept. One of the videos shows a humanoid robot walking in a circular shape and performing loop closure. Features in the scene are tracked robustly, even if not that many features are used. More tracked features can be traded for algorithm speed.

2.3.5 Multi-robot SLAM using particle filter

The multi-robot SLAM approach of [2] builds on a Rao-Blackwellized particle filter, in which the estimation problem is divided into a particle filter part and a Gaussian estimation part. In the FastSLAM algorithm [1] the SLAM problem is solved by a particle filter where each particle runs a separate Kalman filter for each landmark. The argument for not running a full EKF SLAM is its complexity O(N²), while an efficient implementation of FastSLAM is only O(log N) in the number of features. Moreover, the use of a non-parametric part in the filter allows it to track multiple hypotheses, while for example EKF SLAM can only track the most likely one.

One assumption made in [2] is that the robots can measure their relative poses at the start of the process. This is a fairly limiting assumption as the robots may not be able to see each other without moving, however, this issue is sidestepped in [2] by the introduction of virtual robots. At first, the mapping only uses information from one robot, once the robot has encountered and measured the pose of another robot then that other robot’s measurements are incorporated in the map. At this time a virtual robot is also introduced (with the pose of the other robot), this virtual robot then travels backwards in time replaying the recorded measurements from the robot backwards. The real robot’s measurements in future time are also used of course. When any of these real robots encounters a third robot the same procedure is repeated with this robot. By using these virtual robots and playing back their measurements backwards, the assumption of the relative initial robot poses being known will eventually hold true.

Technically, before any robots have encountered each other, the filter has a set of particles ⟨x₁, m, w⟩, and once a second robot is encountered the state is augmented to track ⟨x₁, x₂, x̄₂, m, w⟩, where x₂ is the pose of the second robot traveling forwards in time and x̄₂ is the pose of the virtual second robot traveling backwards in time. The letters m and w represent the current map and the particle weight.

One important thing to note is that the algorithm can be run distributed on the robots. Each robot will then run the full setup of the filter, and the robots will not necessarily come up with the same map, as they are all running separate particle filters and particle filters are not deterministic.

2.3.6 RANSAC

lower error, then that model is stored as the best one. The algorithm will not necessarily find the optimal solution (i.e. the lowest possible global error), but it is highly likely to find a better solution than we would have by naively using all samples to calculate the model.

Algorithm 1 RANSAC algorithm

function RANSAC(Measurements)
    lowestError ← ∞
    it ← 0
    while it < nr_iterations do
        it ← it + 1
        Samples ← select k random samples from Measurements
        [θ1, θ2, ..., θn] ← calculate model params. from Samples
        Inliers ← Samples
        for each m in Measurements and not in Inliers do
            if calculate_error(m, [θ1, θ2, ..., θn]) < thresh_error then
                Inliers.add(m)
        if size(Inliers) > thresh_inliers then
            [θ1, θ2, ..., θn] ← calculate model params. from Inliers
            errorSum ← 0
            for each m in Inliers do
                errorSum ← errorSum + calculate_error(m, [θ1, θ2, ..., θn])
            if errorSum < lowestError then
                bestModel ← [θ1, θ2, ..., θn]
                lowestError ← errorSum
    return bestModel
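For concreteness, the sketch below instantiates Algorithm 1 for the toy problem of fitting a line y = ax + b to 2D points containing outliers; the model, the sample size k = 2 and the thresholds are invented for the example and are not taken from the thesis.

// RANSAC sketch: fit y = a*x + b to points with outliers (illustration of Algorithm 1).
#include <vector>
#include <random>
#include <cmath>
#include <limits>
#include <iostream>

struct Point { double x, y; };
struct Line  { double a, b; };  // y = a*x + b

// Model from two samples (the minimal sample set, k = 2, for a line).
static Line fitTwo(const Point& p, const Point& q) {
    double a = (q.y - p.y) / (q.x - p.x);
    return {a, p.y - a * p.x};
}
static double error(const Point& p, const Line& l) {
    return std::abs(l.a * p.x + l.b - p.y);  // vertical distance, kept simple
}

Line ransacLine(const std::vector<Point>& pts, int iterations,
                double threshError, std::size_t threshInliers) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);
    Line best{0, 0};
    double lowestError = std::numeric_limits<double>::infinity();

    for (int it = 0; it < iterations; ++it) {
        std::size_t i = pick(rng), j = pick(rng);
        if (i == j) continue;
        Line model = fitTwo(pts[i], pts[j]);

        // Collect all measurements consistent with the candidate model.
        std::vector<Point> inliers;
        for (const Point& m : pts)
            if (error(m, model) < threshError) inliers.push_back(m);

        if (inliers.size() > threshInliers) {
            // Refit the model on the inliers (least squares) and score it.
            double sx = 0, sy = 0, sxx = 0, sxy = 0, n = static_cast<double>(inliers.size());
            for (const Point& m : inliers) { sx += m.x; sy += m.y; sxx += m.x * m.x; sxy += m.x * m.y; }
            Line refined;
            refined.a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            refined.b = (sy - refined.a * sx) / n;

            double errorSum = 0;
            for (const Point& m : inliers) errorSum += error(m, refined);
            if (errorSum < lowestError) { lowestError = errorSum; best = refined; }
        }
    }
    return best;
}

int main() {
    // Five points on y = 2x + 1 plus two gross outliers.
    std::vector<Point> pts = {{0, 1}, {1, 3}, {2, 5}, {3, 7}, {4, 9}, {2, -10}, {3, 20}};
    Line l = ransacLine(pts, 200, 0.5, 4);
    std::cout << "a = " << l.a << ", b = " << l.b << std::endl;  // close to a = 2, b = 1
    return 0;
}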

Figure 2.3. The RANSAC algorithm applied to possible robot positions. Blue dots indicate ...

Exploring Performance of SLAM on the NAO

Many different approaches for solving the SLAM problem have been proposed [1, 2, 5, 10]. Here we will explore solutions that can be used with the hardware available on the NAO platform, and determine how well these methods perform. The SLAM methods applied to the NAO robot presented in Section 2.3 mainly relied on images acquired with the head camera as input data to the SLAM algorithm. With unreliable odometry and the remaining sensors being ultrasound and IR, this is the most straightforward solution. With the limited processing power available on the NAO platform, the options for visual SLAM are either to use an Extended Kalman filter, such as in [5], or to use the more computer vision-inspired approach known as structure from motion, as in [7].

3.1 SLAM with EKF

One limitation of using an EKF for SLAM is the increased complexity when adding landmarks to the map. The reason for keeping the number of viewed landmarks in a given image between 6 and 10 is to not suffer from this increased complexity. This puts a direct limitation on how detailed the built map will be. In order to get a more detailed map, the obvious solution is to track more features in each frame. However, this does not come without its issues. Not only does the complexity of the EKF increase with an increased map size, generating the measurements fed to the filter is also more computationally heavy. Measurements are generated by searching for the tracked features in each frame, where the search is performed inside a region whose size depends on how well the 3D position of the feature is known. This means that with more features tracked, more features must also be searched for in every frame, and this increases the computational cost. In the limit, when computation time increases above the time difference between two frames (i.e. one or more frames are skipped), the filter will have a harder time tracking features, as the camera has moved a longer distance between two frames.

3.1.1 Result

The technique described in [5] was tested using an implementation from [16]. Around 100 images of a scene were recorded with one of the cameras (the top one) on the NAO robot. This was done by moving the robot by hand back and forth along a bookshelf, while sampling images from the camera. The robot was moved slowly while trying to avoid jerks and other quick movements. The images acquired were stored on a computer for offline processing and evaluation. The implementation in [16] provides a GUI for visualization of the EKF SLAM while it is running. Shown in Figure 3.1 is an example of what the program looks like while running. The key parts of Figure 3.1 are on the right-hand side. The bottom part shows the current image with landmark positions and their uncertainties (ellipses) overlaid. The top right part shows a 3D map of where landmarks are located, where each landmark is represented by an ellipsoid of uncertainty. The size and shape of each ellipsoid come directly from the covariance estimated in the Kalman filter. From Figure 3.1 it is clear that the ellipsoids are elongated in the direction away from the camera; this makes sense knowing that the 2D image provides no depth information, which can only be obtained when the camera moves along the two other axes (i.e. up/down or left/right). The landmarks shown in Figure 3.1 have different colors depending on whether they were matched correctly or not in the current frame: red is a good match and blue is not.

Figure 3.1. Testing EKF SLAM with the SceneLib2 implementation. Ellipses in the right part of the figure represent landmarks and their uncertainty in position.

3.2 SLAM with structure from motion

3.3 Discussion of results

The approaches described above, namely EKF SLAM and structure from motion, work as expected in an environment with many distinguishable details. Moreover, the EKF SLAM approach uses very few landmarks in the map and it is hard to get an understanding of what the environment really looks like. If the goal is to only track the pose of the camera, the detail level of the map is irrelevant and the approach is in this case fine.

However, if the task of the NAO robots is a bit more advanced (navigation, search and rescue, cooperation tasks), a more detailed map is of significant importance. One example is a search-and-rescue task with the goal of finding, for example, a set of keys known to be located on some table. In order for the robot to know where in the world to look, it would first need to find possible table tops in the map, and then search these areas. Finding possible table tops is of course harder in a map with few details. Another aspect worth noting is that many of the feature-rich objects in a scene are typically located along the walls of an indoor environment, for example paintings, wallpapers, windows and doors, and thus the mapping becomes easier in these areas, which are not necessarily the working region of the robot.

Using Kinect to Build a Map for Localization

As the ability of the NAO to build a map on its own has been deemed unsatisfactory due to the low resolution achieved in the map, we now investigate the problem of performing global localization, when provided with a 3D map of the environment.

The solution to this problem is to first acquire a visual 3D map of the environment (by the use of external means) and then let the NAO use its camera to localize in this map. In order to create the 3D map of the indoor environment, we use a system developed by the authors of [10, 17, 18]. In these works, an RGB-D sensor is used to acquire an image sequence of an indoor environment. Associated with each RGB image captured is also a depth image, essentially providing the 3D position (relative to the camera) of each image pixel. These RGB and depth images are then used in an offline algorithm to create a graph of poses (position and orientation) of where each image was acquired. By knowing the poses of where images were acquired, one can reconstruct a model of the environment.

The NAO robot can then localize in this map by acquiring images with its camera and matching these to the world model.

4.1 Microsoft Kinect sensor

Figure 4.1. The Microsoft Kinect RGB-D sensor.

4.2 Acquiring a pose graph

This section describes how data recorded with the Kinect sensor can be used to first calculate a set of poses of where images were taken, and then how these poses are used in order to recreate a 3D model of the environment. The process of building a 3D map of the robot's environment starts out with moving the Kinect sensor around the area while capturing and saving both RGB and depth image pairs. As one can calculate the 3D position (in the Kinect's coordinate system) of each pixel in these pairs of images, the only thing needed for a full 3D map is to find the transformation (rotation + translation) between each captured pair of images. This is where the work described in [10, 17, 18] comes in handy. The system described in these works takes RGB-D frames and, by tracking features between frames, determines the transformation between consecutive frames. By also using loop closure and optimization techniques, the result is a set of transformations describing the pose of the Kinect relative to a world coordinate system for each frame. Each transformation, R_{4×4}, is of the form

\[
R_{4\times4} =
\begin{pmatrix}
 & & & t_x \\
 & R_{3\times3} & & t_y \\
 & & & t_z \\
0 & 0 & 0 & 1
\end{pmatrix}
\qquad (4.1)
\]

where R_{3×3} is a classical 3D rotation matrix while t_x, t_y and t_z are translations along the three coordinate axes. This transformation is designed to be used with homogeneous coordinates. By using homogeneous coordinates we can describe both a rotation and a translation with the same transformation matrix. By knowing the set of these transformations we can, for each image acquired with the Kinect, calculate the world position of every pixel.

Figure 4.2. Coordinate systems with the Kinect sensor. The left part shows coordinate axes in the image frame and the right part how they relate to the 3D world.

The transformation from a pixel (u, v) with depth z to 3D coordinates in the Kinect's coordinate system is as follows:

\[
x_K = \frac{(u - c_x)\, z}{f_x}, \qquad
y_K = \frac{(v - c_y)\, z}{f_y}, \qquad
z_K = z
\]

where the equations for x_K and y_K come from the pinhole camera model. Subscript K here means coordinates in the Kinect coordinate system (see Figure 4.2). c_x and c_y are the center point of the image, and f_x and f_y relate pixel distance to real-world metric distance.

To find the world coordinates, we use the corresponding transformation from the Kinect pose to world coordinates. Thus,

\[
\begin{pmatrix} x_W \\ y_W \\ z_W \\ 1 \end{pmatrix}
= R_{4\times4}
\begin{pmatrix} x_K \\ y_K \\ z_K \\ 1 \end{pmatrix}.
\qquad (4.2)
\]
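A small sketch of this back-projection chain (the pinhole model followed by Eq. 4.2) is given below, using Eigen and invented intrinsics and pose values.

// Sketch of Eq. (4.2): pixel + depth -> Kinect frame -> world frame (Eigen; example values only).
#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
    using namespace Eigen;

    // Example intrinsics (cx, cy, fx, fy) -- placeholder values, not calibrated ones.
    const double cx = 319.5, cy = 239.5, fx = 525.0, fy = 525.0;

    // One pixel (u, v) with a depth reading z in meters.
    const double u = 400.0, v = 200.0, z = 2.0;

    // Pinhole back-projection into the Kinect coordinate system (homogeneous coordinates).
    Vector4d pK((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0);

    // Example pose R_4x4 of the Kinect in the world: a rotation about the vertical axis plus a translation.
    Matrix4d R44 = Matrix4d::Identity();
    R44.block<3,3>(0,0) = AngleAxisd(M_PI / 6.0, Vector3d::UnitY()).toRotationMatrix();
    R44.block<3,1>(0,3) = Vector3d(1.0, 0.0, 0.5);

    Vector4d pW = R44 * pK;  // Eq. (4.2)
    std::cout << "world coordinates: " << pW.head<3>().transpose() << std::endl;
    return 0;
}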


Figure 4.3. 3D map of a typical robot environment, created with the Kinect sensor.

4.3 Feature extraction

Figure 4.4. 3D map of a typical robot environment, where each point represents a SURF feature.

Algorithm 2 Creating a world model based on SURF features

function Create_Map(Images, Poses)
    for i ← 1, nr_images do
        Img ← Images[i]
        Pose ← Poses[i]
        Features ← Extract_SURF(Img)
        for all Feature in Features do
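A sketch of how the complete map-building loop could look is given below: for each RGB/depth frame and its pose, SURF features are extracted, their pixels are back-projected with the depth image and Eq. 4.2, and the resulting 3D position is stored together with the descriptor. This is an assumed reading of Algorithm 2, whose remaining lines are not reproduced above, rather than the thesis implementation; it assumes OpenCV with the xfeatures2d module, Eigen, and a C++17 compiler.

// Sketch of Algorithm 2: build a world model of (3D position, SURF descriptor) pairs.
// Assumes OpenCV with xfeatures2d, Eigen, and C++17; loading of images and poses is omitted.
#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <Eigen/Dense>
#include <vector>

struct MapPoint {
    Eigen::Vector3d positionW;  // 3D position in world coordinates
    cv::Mat descriptor;         // 1 x 64 SURF descriptor
};

std::vector<MapPoint> createMap(const std::vector<cv::Mat>& rgbImages,
                                const std::vector<cv::Mat>& depthImages,   // depth in meters (CV_32F)
                                const std::vector<Eigen::Matrix4d>& poses, // one R_4x4 per frame, Eq. (4.1)
                                double cx, double cy, double fx, double fy) {
    std::vector<MapPoint> map;
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create(400.0);

    for (std::size_t i = 0; i < rgbImages.size(); ++i) {
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat descriptors;
        surf->detectAndCompute(rgbImages[i], cv::noArray(), keypoints, descriptors);

        for (int k = 0; k < static_cast<int>(keypoints.size()); ++k) {
            const double u = keypoints[k].pt.x, v = keypoints[k].pt.y;
            const double z = depthImages[i].at<float>(cv::Point(static_cast<int>(u), static_cast<int>(v)));
            if (z <= 0.0) continue;  // no valid depth reading for this pixel

            // Pinhole back-projection into the Kinect frame, then Eq. (4.2) into the world frame.
            Eigen::Vector4d pK((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0);
            Eigen::Vector4d pW = poses[i] * pK;

            map.push_back({pW.head<3>(), descriptors.row(k).clone()});
        }
    }
    return map;
}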


Global Localization Using Vision and a Map

With a map of feature points built, the task remaining for the NAO is to use its camera to measure its position and orientation relative to the map. As the map was built with image features (SURF), the NAO robot also extracts SURF features from its current view and then tries to find, for each of them, the closest match among the map features. Using these 2D to 3D point matches to calculate the camera position and orientation is known as the perspective n-point problem (PnP) [19].

5.1 Feature matching

Each SURF feature extracted from an image is described by a vector of 64 floating point numbers. The problem of matching consists of starting from one feature vector and finding the most similar one in a set of other feature vectors. The set of feature vectors to search in can be large, making the task more complex. One way to perform the search is by brute force, that is, going through each potentially matching feature and computing a score. The feature with the best score is then the closest match. This straightforward approach comes at the cost of being time-consuming. Presented in [20] are methods which improve on the speed, but the results are then only approximate and may not be the same as would be achieved with the brute-force approach. Related to [20] is a software library called FLANN (Fast Library for Approximate Nearest Neighbors), which the computer vision library OpenCV makes use of to quickly match, for example, SURF features. An example of 2D to 2D SURF feature matching with OpenCV and the FLANN library can be seen in Figure 5.1.
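A minimal sketch of such FLANN-based matching with OpenCV is given below. The ratio test against the second-nearest neighbour is a common way to discard weak matches (as proposed in [11]) and is an addition of this sketch, not a step prescribed by the thesis.

// Sketch: match two sets of SURF descriptors with OpenCV's FLANN-based matcher.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::DMatch> matchDescriptors(const cv::Mat& descQuery, const cv::Mat& descTrain) {
    cv::FlannBasedMatcher matcher;                   // approximate nearest-neighbour search (FLANN)
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descQuery, descTrain, knn, 2);  // two nearest neighbours per query feature

    // Keep a match only if it is clearly better than the second-best candidate
    // (Lowe's ratio test); this discards many of the faulty matches visible in Figure 5.1.
    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.7f * m[1].distance)
            good.push_back(m[0]);
    return good;
}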

Figure 5.1. SURF feature matching with the FLANN library between two different images of the same scene. One can easily spot one of the matches being faulty, but the majority are correct.

One case of such 2D to 3D feature matching (from 2D image points to 3D world points) is shown in Figure 5.2.

5.2 Finding a transformation

Once matches from 2D image coordinates to 3D world coordinates have been found, it remains to calculate a pose (position and orientation) of the camera in world coordinates. As stated above, this is known as the perspective n-point problem (PnP) [19]. A simple visualization of the geometry involved in the PnP problem is shown in Figure 5.3, here with only 2 point correspondences.

As stated in [19] different solutions to the PnP problem have been proposed. Presented in [21] is one low complexity algorithm which also gives more accurate results than many others. This algorithm is a suitable solution to our problem as we are interested in real time performance, but are also suffering from some noisy correspondences which the algorithm is said to handle. Moreover, the solution in [21] is available in OpenCV which is suitable from an implementation point of view.

An overview of the solution in [21] is as follows. By finding the coordinates of the 3D points in the camera reference system, one can align this set of points to the set of world coordinates and in that way find the rotation and translation of the camera relative to the world origin. In order to find the 3D coordinates in the camera reference system one must know the internal camera calibration matrix, a matrix which holds camera parameters such as focal lengths, principal point and skew between axes. If one has access to the camera, this matrix can be found empirically by taking pictures of a so-called calibration pattern. This work made use of a software toolbox for Matlab [22] in order to estimate this matrix for the NAO camera.
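As an aside, OpenCV provides calibration functionality equivalent to the Matlab toolbox used here; a rough sketch is shown below, where the chessboard dimensions and square size are example values, not those used in the thesis.

// Sketch: estimating the camera calibration matrix from chessboard images with OpenCV
// (an alternative to the Matlab toolbox used in the thesis; board size and images are examples).
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat calibrateFromChessboards(const std::vector<cv::Mat>& images,
                                 cv::Size boardSize = cv::Size(9, 6), float squareSize = 0.025f) {
    std::vector<std::vector<cv::Point2f>> imagePoints;
    std::vector<std::vector<cv::Point3f>> objectPoints;

    // Ideal corner positions of one board, reused for every view.
    std::vector<cv::Point3f> board;
    for (int r = 0; r < boardSize.height; ++r)
        for (int c = 0; c < boardSize.width; ++c)
            board.emplace_back(c * squareSize, r * squareSize, 0.0f);

    for (const cv::Mat& img : images) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }

    cv::Mat K, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, images.front().size(), K, distCoeffs, rvecs, tvecs);
    return K;  // the 3x3 internal calibration matrix A
}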

Figure 5.2. SURF feature matching between the NAO image and the world model. Each green circle in the upper right image represents a SURF feature extracted from the NAO camera image. Red blobs in the world model represent matching SURF features which were extracted with the Kinect sensor when building the map. Unfortunately, this visualization contains no connections between matching green and red blobs.

In the approach of [21], four control points c^w_j are selected, and barycentric coordinates α_ij are used in a weighted sum to express the world coordinates as in

\[
p^w_i = \sum_{j=1}^{4} \alpha_{ij}\, c^w_j
\qquad \text{with} \qquad
\sum_{j=1}^{4} \alpha_{ij} = 1
\qquad (5.1)
\]

By using the camera calibration matrix (denoted A), the following relationship between 3D coordinates and image coordinates holds:

\[
\forall i, \quad
w_i \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix}
= A_{3\times3}\, p^c_i
= A \sum_{j=1}^{4} \alpha_{ij}
\begin{pmatrix} x^c_j \\ y^c_j \\ z^c_j \end{pmatrix}
\qquad (5.2)
\]

where superscript c denotes the camera frame. Note that one of these equations can be set up for each i, that is, for each 2D to 3D point correspondence. It is important to understand that the α_ij which appear in both Eq. 5.1 and Eq. 5.2 are the

same in both coordinate systems. The unknowns in Eq. 5.2 are the n scalars w_i and the 12 control point coordinates {(x^c_j, y^c_j, z^c_j)}, j = 1...4. The details of how to solve this system and find the 12 control point coordinates can be found in [21].

The end result achieved in the OpenCV implementation of [21] is a rotation and translation matrix P (homogeneous coordinates) transforming coordinates in the world coordinate system into the camera coordinate system, as in

\[
\begin{pmatrix} x_c \\ y_c \\ z_c \\ w \end{pmatrix}
= P x_w =
\begin{pmatrix}
 & & & t_x \\
 & R_{3\times3} & & t_y \\
 & & & t_z \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x_w \\ y_w \\ z_w \\ 1 \end{pmatrix}
\]

As we are interested in finding the camera position in world coordinates (that is, describing the origin of the camera system relative to the world), we can invert the equation as follows:

\[
c_w = P^{-1} \cdot \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}
\qquad (5.3)
\]

with c_w being the camera position in homogeneous coordinates. With the transformation matrix P being of the special form combining a rotation and a translation (thanks to the homogeneous coordinates), the inverse has a closed form:

\[
P^{-1} =
\begin{pmatrix}
R^T & -R^T t \\
0 \;\; 0 \;\; 0 & 1
\end{pmatrix}.
\]

The letter t here denotes the 3 × 1 translation vector.
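A sketch of this step using OpenCV (version 3 style API) is given below: the EPnP method of [21] is selected via a flag, RANSAC handles the faulty correspondences, and the resulting pose is inverted as in Eq. 5.3 to obtain the camera position in world coordinates. The RANSAC parameter values are illustrative only.

// Sketch: camera pose from 2D-3D correspondences with EPnP + RANSAC (OpenCV 3 style),
// then inversion as in Eq. (5.3) to get the camera position in world coordinates.
#include <opencv2/opencv.hpp>
#include <vector>

bool cameraPositionInWorld(const std::vector<cv::Point3f>& worldPoints,  // matched 3D map points
                           const std::vector<cv::Point2f>& imagePoints,  // matching 2D NAO image points
                           const cv::Mat& K,                             // 3x3 calibration matrix A
                           const cv::Mat& distCoeffs,
                           cv::Mat& cameraPosW,                          // output: 3x1 position in the world frame
                           int& nrInliers) {                             // output: inlier count (used for Q_t later)
    cv::Mat rvec, tvec;
    std::vector<int> inliers;
    bool ok = cv::solvePnPRansac(worldPoints, imagePoints, K, distCoeffs, rvec, tvec,
                                 false,   // no extrinsic guess
                                 200,     // RANSAC iterations (illustrative)
                                 8.0,     // reprojection error threshold in pixels (illustrative)
                                 0.99,    // confidence
                                 inliers,
                                 cv::SOLVEPNP_EPNP);
    if (!ok) return false;

    cv::Mat R;
    cv::Rodrigues(rvec, R);       // rotation vector -> 3x3 rotation matrix (world -> camera)
    cameraPosW = -R.t() * tvec;   // camera origin in world coordinates, cf. the closed-form P^{-1} above
    nrInliers = static_cast<int>(inliers.size());
    return true;
}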

Figure 5.3. Geometry of the PnP problem. c is the camera center; p_i, p_j are ...

Filtering of the Noisy Pose Measurements

As the given pose measurements are fairly noisy, and the NAO robot can use odometry to calculate its movement between subsequent pose measurements, a filter is introduced to fuse the data sources and come up with an estimate of the robot pose. Given the performance requirements (real time) and the direct measurements of the robot pose, an extended Kalman filter is used in the implementation. The non-linear motion model of the robot is linearized in accordance with the extended Kalman filter and then used in the prediction step, while the measurements are fed to the correction step.

6.1 Extended Kalman filter

The extended Kalman filter was discussed in Section 2.3.4 with the application of solving the SLAM problem. A simpler use case of the EKF is pose tracking, which is presented here. One limitation of the EKF is that it cannot handle global estimation, i.e. when the state to be estimated is completely unknown at first. The EKF must first be provided an initial guess before we can set it off to work; how this initial guess is created is discussed in Section 6.1.4 below.

The EKF iterates over two main steps known as prediction and correction. The prediction step is meant to provide a “guess” of where the state has moved in the last time step. The correction step incorporates the latest measurement and adjusts the current state estimate accordingly.

The state vector in the EKF at time t is set up as

\[
\mu_t = \begin{pmatrix} x_t \\ y_t \\ \theta_t \end{pmatrix}
\]

where (x_t, y_t) is the 2D position of the robot and θ_t its heading.

6.1.1 Prediction

When the NAO robot moves around by walking, the motion can be estimated by dead reckoning; that is, by knowing how leg motion relates to the linear and angular speed of the robot, the position is found by straightforward integration. The drawback when integrating is that an error in the measurement will also be integrated, which makes the position error grow. The odometry calculation is available in the NAO robot and provides a position estimate relative to some arbitrary starting point. By comparing positions at two times, the NAO movement in between can be found:

\[
u_t = \begin{pmatrix} dx_t \\ dy_t \\ d\theta_t \end{pmatrix}
    = \begin{pmatrix} x_t \\ y_t \\ \theta_t \end{pmatrix}
    - \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ \theta_{t-1} \end{pmatrix}
\]

Following the notation in [1], the state prediction can be written as

\[
\bar{\mu}_t = g(u_t, \mu_{t-1}) =
\begin{pmatrix} x_{t-1} \\ y_{t-1} \\ \theta_{t-1} \end{pmatrix} +
\begin{pmatrix}
dx_t \cos(\theta_{t-1}) - dy_t \sin(\theta_{t-1}) \\
dx_t \sin(\theta_{t-1}) + dy_t \cos(\theta_{t-1}) \\
d\theta_t
\end{pmatrix}
\qquad (6.1)
\]

Also stated in [1] is the equation for the covariance prediction,

\[
\bar{\Sigma}_t = G_t \Sigma_{t-1} G_t^T + R_t
\]

where G_t is the Jacobian of the motion model g(u_t, µ_t) evaluated at µ_{t-1}, and R_t is a 3 × 3 covariance matrix for the motion uncertainty. Calculating G_t from (6.1) gives

\[
G_t =
\begin{pmatrix}
1 & 0 & -dx_t \sin(\theta_{t-1}) - dy_t \cos(\theta_{t-1}) \\
0 & 1 & \phantom{-}dx_t \cos(\theta_{t-1}) - dy_t \sin(\theta_{t-1}) \\
0 & 0 & 1
\end{pmatrix}
\]

where θ_{t-1} denotes the θ value of the previous estimate µ_{t-1}.

6.1.2 Correction

The correction step is responsible for incorporating a measurement and using it to correct the predicted µ̄_t and Σ̄_t into the new estimates µ_t and Σ_t. A measurement z_t is in our case a direct measurement of the state, as provided by the solution to the PnP problem; in standard EKF notation this is

\[
z_t = \begin{pmatrix} x_t \\ y_t \\ \theta_t \end{pmatrix} + \mathcal{N}(0, Q_t)
\]

The choice of the 3 × 3 measurement covariance matrix Q_t adjusts how much the measurements are trusted (in relation to the R_t matrix in the prediction step). Note that there is nothing stopping us from changing the values inside Q_t once the process has started.

Choosing the measurement covariance matrix Q_t

First off, an initial assumption made about the pose measurements is that the errors in x, y and θ are all independent; this means the off-diagonal elements in Q_t are all zero. As the measurement is of a 2D position and an angle, a second assumption is that the noise in x and y is similar, while the noise in θ is of a different magnitude. This gives the structure of Q_t as

\[
Q_t = \begin{pmatrix} c_1 & 0 & 0 \\ 0 & c_1 & 0 \\ 0 & 0 & c_2 \end{pmatrix}
\]

After running some experiments, a trend seemed to emerge: the pose provided by the PnP solution seemed more accurate the more inliers it found using the RANSAC algorithm. A quick reminder might come in handy: the PnP problem is to estimate a pose from a number of 2D to 3D point correspondences. With the 2D to 3D correspondences containing false matches, the problem is solved in a RANSAC fashion, and we can from this process determine how many inliers exist (in other words, how many point correspondences fit the current model).

In order to see if this trend was true, the robot was placed in a fixed position and its camera was sampled. For each image the PnP problem was solved, providing both a pose measurement and the number of inliers from the PnP solution. With the true robot position known, the error in the measured position for each frame is easily calculated. A plot of position error vs. number of inliers is shown in Figure 6.1. Please note that some clear outlier measurements were removed in order to make the plot more easily readable. It is fairly obvious when looking at the plot that the tendency is for the position error to drop off as the number of inliers increases. Included in Figure 6.1 is the function 25/nrInliers, which fits the trend of the example data. Since a larger measurement error gives more uncertainty, the covariance matrix Q_t can be set to

\[
Q_t = \frac{25}{nrInliers}
\begin{pmatrix} c_1 & 0 & 0 \\ 0 & c_1 & 0 \\ 0 & 0 & c_2 \end{pmatrix}.
\]

Figure 6.1. Position measurement error vs. number of inliers from the PnP RANSAC solution, together with the function 25/nrInliers.

6.1.3 Outlier rejection

With the possibility of measurements being outliers, these must be rejected in order not to confuse and possibly divert the filter. An outlier is a measurement which does not fit the currently true value of the estimated variable, even with the noise model applied. To test whether a given measurement is an outlier, one can use the so-called Mahalanobis distance as an indication of how well the measurement fits the current estimate. The Mahalanobis distance is defined as

\[
d_{Mahalanobis}(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
\]

with µ the estimate of x and Σ the covariance matrix of the estimate. Through the inverse of the covariance matrix, the Mahalanobis distance depends on the uncertainty of the current estimate: for a fixed difference (x − µ) the Mahalanobis distance will be larger with a smaller covariance, and vice versa.

In the case of measurement outlier rejection, x is a measurement while µ is the current estimated state with the EKF and Σ the state covariance. Outlier rejection can then be done as follows

if d_Mahalanobis(x) > threshold then
    outlier = true
else
    outlier = false

When an outlier is detected, the measurement is ignored and the EKF uses only odometry data to update the pose estimate.
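The sketch below collects the pieces of this chapter into one small EKF class: the odometry-based prediction of Eq. 6.1, the inlier-scaled measurement covariance, and Mahalanobis-distance gating of the PnP pose measurements. The constants c1, c2 and the gating threshold are placeholders rather than values from the thesis, and angle wrap-around is ignored for brevity.

// Sketch of the pose-tracking EKF from Chapter 6 (Eigen; constants are placeholders).
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>

struct PoseEKF {
    Eigen::Vector3d mu;      // state estimate (x, y, theta)
    Eigen::Matrix3d Sigma;   // state covariance

    // Prediction, Eq. (6.1): integrate the odometry increment u = (dx, dy, dtheta).
    void predict(const Eigen::Vector3d& u, const Eigen::Matrix3d& R) {
        const double th = mu(2), c = std::cos(th), s = std::sin(th);
        mu(0) += u(0) * c - u(1) * s;
        mu(1) += u(0) * s + u(1) * c;
        mu(2) += u(2);

        Eigen::Matrix3d G = Eigen::Matrix3d::Identity();  // Jacobian G_t of the motion model
        G(0, 2) = -u(0) * s - u(1) * c;
        G(1, 2) =  u(0) * c - u(1) * s;
        Sigma = G * Sigma * G.transpose() + R;
    }

    // Correction with a PnP pose measurement z = (x, y, theta).
    // Returns false if the measurement is rejected as an outlier.
    bool correct(const Eigen::Vector3d& z, int nrInliers,
                 double c1, double c2, double gate) {
        // Inlier-scaled measurement covariance Q_t = (25 / nrInliers) * diag(c1, c1, c2).
        Eigen::Matrix3d Q = Eigen::Matrix3d::Zero();
        Q(0, 0) = c1; Q(1, 1) = c1; Q(2, 2) = c2;
        Q *= 25.0 / std::max(nrInliers, 1);

        // Mahalanobis-distance outlier test against the current estimate (angle wrap-around ignored).
        const Eigen::Vector3d innovation = z - mu;
        const double d2 = innovation.dot(Sigma.inverse() * innovation);
        if (std::sqrt(d2) > gate) return false;  // ignore the measurement; keep the odometry-only estimate

        // Standard EKF update; the measurement function is the identity, so H = I.
        const Eigen::Matrix3d K = Sigma * (Sigma + Q).inverse();  // Kalman gain
        mu += K * innovation;
        Sigma = (Eigen::Matrix3d::Identity() - K) * Sigma;
        return true;
    }
};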

6.1.4 Initializing the filter

One aspect of using the Extended Kalman filter is the need for an initial assumption of the state and its covariance. One could of course initialize the state at an arbitrary value (all zeros for example) and with a huge state covariance. However, this makes the estimation very dependent on the first available measurement because outlier rejection is difficult as the filter has poor knowledge of the true state. The covariance is thus large and the Mahalanobis distance in the outlier rejection will be small for almost any measurement. A bad first measurement would make the filter lock in on a wrong state estimate and it might not recover.

In order to avoid this situation, the proposed idea is to, instead of directly initializing the filter, first collect a number of measurements with the robot stationary, and then initialize the EKF based on those measurements. To still handle possible outlier measurements, RANSAC is applied to the gathered measurements and a consensus value is calculated.

Experiments

The approach for global localization with the NAO platform described in this work was tested and evaluated using a real NAO robot in an indoor environment. For performance evaluation, an external camera system was used to provide reference measurements of the NAO position. A sketch of the system setup is shown in Figure 7.1.

7.1 System overview

The global localization with the NAO platform was implemented and tested with a system set up as illustrated in Figure 7.1. Software running on the NAO robot sends data (images, odometry, etc.) over a wireless network to a desktop computer. The robot can also accept high-level commands sent from the desktop computer to the NAO to perform simple tasks such as standing up, sitting down, walking, or moving its arms or head. The USB cameras shown in Figure 7.1 are connected to a computer running software developed in [23], which is able to accurately locate agents in the room. These measurements are used as reference data so that the performance of the localization with the NAO can be evaluated.

Figure 7.1. Physical setup of the system used to run the localization algorithm.

7.2 Implementation

The solution for global localization with the NAO platform was implemented in the software framework Robot Operating System (ROS) [24]. Despite its name, ROS is not an operating system per se but runs on top of, for example, Linux. There already exists a NAO stack in ROS which provides an API for accessing sensor data such as camera images or odometry, but also accepts commands for moving [25]. The stack can either run directly on the NAO hardware or on a desktop to remotely control the NAO, both with the same APIs. This is useful as there is also a software simulator available for the NAO robot, which means one can test code in the simulated environment and then deploy exactly the same code to the real robot. This is beneficial as it both cuts development time and allows for eliminating bugs which could potentially damage the hardware. One other great feature of ROS is how easy it is to communicate between separate processes, even if they are not running on the same machine (but on the same network, of course). Remember that the localization procedure relies on having a model of what the world looks like, built up with SURF features, see Chapter 4. To create this map, RGB and depth image files captured with the Kinect sensor are provided at the start of the program, together with a set of poses, one for each image. This data is read from the file system, and when the map has been built the localization of the NAO can start. The process has two states, sampling and tracking, where the first is the state of collecting measurements while the robot is stationary and the second is for running the EKF. Switching from sampling to tracking initializes the EKF and thus starts tracking the pose.

The implementation is written in C++ and has currently not been tested onboard the robot for technical reasons; instead it runs on a remote desktop computer communicating with the robot.
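For orientation, a skeleton of such a remote node is sketched below using roscpp. The topic names are assumptions, as they depend on the NAO ROS stack version and configuration, and the actual processing inside the callbacks is omitted.

// Skeleton of a remote localization node (roscpp). Topic names are assumptions that depend
// on the NAO ROS stack configuration; the actual processing is left out.
#include <ros/ros.h>
#include <sensor_msgs/Image.h>
#include <nav_msgs/Odometry.h>

void imageCallback(const sensor_msgs::ImageConstPtr& msg) {
    // Extract SURF features, match against the world model, solve PnP, feed the EKF correction step.
}

void odometryCallback(const nav_msgs::Odometry::ConstPtr& msg) {
    // Accumulate the odometry increment and feed the EKF prediction step.
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "nao_global_localization");
    ros::NodeHandle nh;

    // Assumed topic names published by the NAO stack [25]; adjust to the actual setup.
    ros::Subscriber imageSub = nh.subscribe("/nao_camera/image_raw", 1, imageCallback);
    ros::Subscriber odomSub  = nh.subscribe("/odom", 10, odometryCallback);

    ros::spin();  // hand control to ROS; the callbacks run as messages arrive
    return 0;
}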

7.3 Results

We now present experimental data. In the cases illustrated below, data was collected as the robot walked a circular path while the localization solution was running. Measurements from the external reference system were also acquired. We should note that the external reference system performs well in most cases, but for unknown reasons the reported position can at times be off by decimeters.

Figure 7.2. Localization when the NAO walks a circular path. Circles are measurements from the global localization based on the NAO's built-in camera, while squares originate from the overhead USB camera system. The flow of time is color coded: circles go from blue to green with time, squares go from dark grey to orange. Blue stars indicate raw odometry measurements, here aligned to the global data as odometry is only a local measurement. The blue outline indicates the walls of the room, where most of the tracked image features are located.

the circular path) match fairly well with the position reported from the external reference. One should note that the robot does not actually complete the full circle, which can be seen by looking at the external reference measurements. It is important not to be fooled by the path of the raw odometry: one may think that it is almost as correct as the vision-based localization filter. However, one must keep in mind that the raw odometry is a non-global measurement, and is useless without an accurate starting position. In the plots presented here, the raw odometry has been aligned by hand to start at the same position as the global reference. The absolute position error between the localization filter and the external reference is shown in Figure 7.4. Again, keep in mind that the external reference is not always correct, as can be seen in the bottom left of Figure 7.3.

Figure 7.3. Zoomed-in version of the plot in Figure 7.2.

Figure 7.4. Absolute errors calculated from the difference between the NAO localization estimate and the external reference.

Figure 7.5. Localization when the NAO walks a circular path. Circles are ...

Conclusion and Future Work

In this chapter we will first discuss the results presented in the previous chapter, then make suggestions for how performance can be improved and end with a conclusion and proposed future work.

8.1 Discussion of results

In order to analyze whether the overall performance of the system is acceptable, one needs to put it in the context of the robot's task. If the robot is to carefully navigate an environment crowded with obstacles to fetch an item at a given place, the performance is most probably not sufficient. However, the system could be combined with a more local planning and navigation solution to improve performance in this case. One could imagine the robot first performing global localization to find out in which general direction to move, then doing obstacle avoidance while moving and performing a local search when close enough to the searched-for item.

8.2 Possible Improvements and Future Work

The main issue affecting performance in a negative way is the lack of features in parts of the world model. Looking at Figure 4.4 it is evident that the density of features varies in different parts of the room. The bookshelves give rise to many features, while for example white walls lead to fewer features. The consequence in this case is that the robot has a better chance of localizing correctly if a bookshelf is in view than with only a white wall in view. Since feature descriptors rely on texture, it is hard to improve on this. However, having overcrowded regions in the model, that is, regions where there are a lot of features, could also degrade performance as it might lead to confusion between similar features. One way of improving this is to not keep adding features to the model if enough already exist in the vicinity, or to remove features which are close together in feature space in a post-processing step.

Regarding the problem of the robot "locking in" on a wrong initial pose, using a weighted RANSAC approach could be a possible solution. A weighted RANSAC would in this case give samples with a larger number of inliers from the PnP solution a larger weight when calculating the consensus value in the RANSAC algorithm. Either this is done as a weighted mean value, or the weight is used as an extra factor when calculating the error from the consensus value for a given sample.

Regarding performance improvements, most of the time is spent on feature extraction and matching. Computation time could be cut by using a feature descriptor which is quicker to compute than the SURF in the current implementation. BRIEF, see Section 2.3.2, could be tested as it is supposed to give great speed improvements in extraction and matching compared to SURF [13].

8.3 Conclusion

The main part of this work shows how global localization can be performed on the NAO robot. As the solution relies only on the built in camera and robot odometry, the approach should be easily transferable to other hardware platforms. We should also note here how dependent the solution is on the 3D map. If the 3D map is wrong, which happens when the environment has changed since the 3D scan, performance will degrade and in the worst case the system will stop producing any meaningful output.

References

[1] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Intelligent Robotics and Autonomous Agents Series, MIT Press, 2005.

[2] A. Howard, "Multi-robot simultaneous localization and mapping using particle filters," The International Journal of Robotics Research, vol. 25, no. 12, pp. 1243–1256, 2006.

[3] "NAO documentation." http://www.aldebaran-robotics.com/documentation/. Accessed: 2013-05-13.

[4] "RoboCup soccer standard platform league." http://www.robocup.org/robocup-soccer/standard-platform/. Accessed: 2013-05-13.

[5] A. Davison, I. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 1052–1067, June 2007.

[6] E. Wirbel, B. Steux, S. Bonnabel, and A. de La Fortelle, “Humanoid robot navigation: from a visual slam to a visual compass,”

[7] Š. Fojtu, M. Bresler, and D. Pruša, “Nao robot navigation based on a single VGA camera,” in Computer Vision Winter Workshop 2012, 2012.

[8] A. Torii, M. Havlena, and T. Pajdla, "From Google Street View to 3D city models," in 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2188–2195, 2009.

[9] Š. Fojtu, M. Havlena, and T. Pajdla, "Nao robot localization and navigation using fusion of odometry and visual sensor data," in Intelligent Robotics and Applications (C.-Y. Su, S. Rakheja, and H. Liu, eds.), vol. 7507 of Lecture Notes in Computer Science, pp. 427–438, Springer Berlin Heidelberg, 2012.

[10] V. Högman, “Building a 3D map from RGB-D sensors,” Master’s thesis, KTH, Stockholm, Sweden, 2011.

[11] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[12] H. Bay, T. Tuytelaars, and L. Gool, "SURF: Speeded up robust features," in Computer Vision – ECCV 2006 (A. Leonardis, H. Bischof, and A. Pinz, eds.), vol. 3951 of Lecture Notes in Computer Science, pp. 404–417, Springer Berlin Heidelberg, 2006.

[13] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua, "BRIEF: Computing a Local Binary Descriptor Very Fast," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1281–1298, 2012.

[14] J. Shi and C. Tomasi, "Good features to track," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 593–600, 1994.

[15] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, pp. 381–395, June 1981.

[16] H. Kim, "SceneLib2 - MonoSLAM open-source library." https://github.com/hanmekim/SceneLib2. Accessed: 2013-04-29.

[17] J. Ekekrantz, A. Pronobis, J. Folkesson, and P. Jensfelt, "Adaptive closest keypoint," in Submitted to: Proc. of the International Conference on Intelligent Robot Systems (IROS), 2013.

[18] J. Ekekrantz, A. Pronobis, J. Folkesson, and P. Jensfelt, "Adaptive iterative closest keypoint using ORB features and clustering," in Submitted to: Proc. of the European Conference on Mobile Robots (ECMR), 2013.

[19] R. Szeliski, Computer Vision, Algorithms and Applications, pp. 284–287. Springer, 2011.

[20] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Application (VISSAPP'09), pp. 331–340, INSTICC Press, 2009.

[21] F. Moreno-Noguer, V. Lepetit, and P. Fua, "Accurate non-iterative O(n) solution to the PnP problem," in IEEE International Conference on Computer Vision, (Rio de Janeiro, Brazil), October 2007.

[22] "Camera calibration toolbox for Matlab." http://www.vision.caltech.edu/bouguetj/calib_doc/. Accessed: 2013-05-22.

[23] in Research, Development and Education on Unmanned Aerial Systems, (Seville, Spain), 2011.

[24] "Robot Operating System." http://www.ros.org. Accessed: 2013-05-13.

[25] "ROS NAO robot stack." http://www.ros.org/wiki/nao_robot.
