DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Localization for autonomous construction vehicles using monocular camera and AprilTag

Localization for autonomous construction vehicles using monocular camera and AprilTag

XIAODI YU

Abstract

Global Navigation Satellite System (GNSS) is commonly used to provide the location needed to navigate vehicles in outdoor environments. However, GNSS is unreliable in some complex environments, such as open-pit mines. Although localization methods based on vision sensors have been proposed to solve this problem, most of them use on-board cameras, which are easily obscured by mud and dirt in autonomous construction vehicle applications.

Sammanfattning

Global Navigation Satellite System (GNSS) is often used to provide localization and navigation capability for vehicles in outdoor environments. However, GNSS is unreliable in some environments, for example open-pit mines. Although localization methods based on computer vision have been proposed as replacements for GNSS, many such solutions mount the cameras on the vehicles, which can easily result in the cameras being covered by mud and dirt in an application such as an open-pit mine.

Acknowledgment

Firstly, I would like to thank my Volvo CE supervisor Viktor Gustavsson for providing me with the opportunity to work on this thesis. In particular, I thank him for his continuous support and feedback throughout my thesis work.

I also want to thank my university supervisors Valerio Turri and Dirk van Dooren for encouraging me and giving me suggestions to complete the thesis.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Question
  1.3 Organization
2 Background
  2.1 Related Work
    2.1.1 Object detection and tracking
    2.1.2 Motion estimation
    2.1.3 Bundle adjustment
  2.2 Quaternion
  2.3 Camera calibration (perspective projection)
  2.4 Rigid transformation
3 Pose estimation method with AprilTag
  3.1 AprilTag Marker Detection
    3.1.1 Detect the AprilTag
    3.1.2 Homography and extrinsic estimation
  3.2 Coordinate frame transformation
4 Experiments
  4.1 Indoor lab experiments
    4.1.1 Set up
    4.1.2 Coordinate frames
    4.1.3 Static test
    4.1.4 Dynamical test
  4.2 Outdoor test-track experiments
    4.2.1 Set up
    4.2.2 Coordinate frames
    4.2.3 Static test
    4.2.4 Dynamical test
5 Experiments Results
  5.1 Indoor lab result
    5.1.1 Static test
    5.1.2 Dynamical test
  5.2 Outdoor test-track result
    5.2.1 Static test
    5.2.2 Dynamical test
6 Conclusions and Future Work
  6.1 Conclusion
  6.2 Future work
Bibliography

Chapter 1

Introduction

Autonomous vehicles can sense the environment and navigate with little or no human interference. Many techniques, including but not limited to radar, laser, odometry and computer vision, have been proposed to help ground vehicles perceive the environment and plan a path to the desired destination [1]. For autonomous vehicle navigation, the position and orientation of the vehicle are key pieces of information.

The most commonly used pose estimation method is the Global Navigation Satellite System (GNSS) aided by an Inertial Measurement Unit (IMU) [11]. GNSS uses a form of triangulation to locate the vehicle, through calculations involving the signals from a number of satellites. GNSS benefits from cheap on-board hardware, being computationally light, and requiring a relatively short set-up time. However, GNSS is only operational outdoors and its accuracy deteriorates considerably if the receiver loses the signals from the satellites because the line-of-sight to them is occluded.

Figure 1.1: Fully autonomous load carrier designed by Volvo CE, from [2].

However, in the open-pit mining environment, it is common for GNSS antennas to lose the signals from the satellites. Hence, there is an apparent demand for navigation solutions that are less sensitive to the layout of the surrounding environment and allow for indoor functionality.

1.1 Motivation

This thesis focuses on the localization and odometry of autonomous ground vehicles, in particular autonomous construction vehicles. Instead of on-board vision navigation systems, off-board cameras are used: they lighten the payload of the vehicles and are not easily obscured by mud and dirt, since they can be mounted at fixed, elevated points around the work site. The work site of autonomous construction vehicles is confined to a relatively small area, which further adds to the benefit of using off-board cameras to locate and navigate the machine.

1.2 Research Question

The objective of the thesis is to investigate a robust real-time localization system for autonomous construction vehicles using an off-board monocular network camera. The visual-odometry-based localization system should be implemented both on a scaled model hauler prototype and on a real hauler prototype. The accuracy of the proposed localization system should also be compared with the proposed ground truth and with a GNSS-based localization system.

1.3 Organization

Chapter 2

Background

This chapter describes related work on several visual odometry methods. Additionally, it provides basic background on quaternions, camera calibration and rigid transformations.

2.1 Related Work

Visual odometry is the process of estimating the position and orientation of a robot by analyzing sequential images from a camera. It is applied in robotics and is also used for localization of autonomous driving vehicles. Visual odometry typically follows these stages: object detection, object tracking, and motion estimation with bundle adjustment to optimize the estimate. The general pipeline for visual odometry is shown in Figure 2.1.

Figure 2.1: Pipeline of visual odometry.

2.1.1 Object detection and tracking

To detect an object in a video or image sequence, the first step is to choose features that make the object distinguishable. Color, edges, corners and blobs are commonly used features for object detection.

Color is one of the most commonly used features in image processing. Usually, the HSV color space is used to segment an image based on color. However, one problem with using color as a feature for object detection is that the apparent color of an object is directly influenced by the illumination conditions. An edge is the border between an object and the background. In practice, edges usually have a strong gradient magnitude and are less sensitive to illumination changes than, for instance, color. Two representative edge detectors are the Canny edge detector and the Sobel operator [16].

A corner is a commonly used feature characterized by the intersection of two edges, representing a variation in the gradient of the image. In general, corner detection yields more stable features [8]. The Harris detector and Features from Accelerated Segment Test (FAST) are two typical corner detection algorithms. The Harris detector extracts features by looking for points with stability, repeatability and low self-similarity, while FAST is suitable for real-time video processing applications because of its computational efficiency.

Blobs provide a complementary description of image structures in terms of regions, as opposed to corners, which are more point-like. Blob detectors can detect areas in an image that are too smooth to be detected by a corner detector. There are also other feature-based object detection algorithms. For example, the Scale-Invariant Feature Transform (SIFT) [9], proposed by Lowe, detects and describes scale- and rotation-invariant region-based features in the image and is widely used in robotic mapping and navigation.
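As a concrete illustration of the corner features discussed above, the following sketch detects FAST and Harris corners with OpenCV. It is a minimal example, not the thesis implementation; the image file name is a placeholder and the thresholds are arbitrary.

```python
import cv2
import numpy as np

# Load an image and convert it to grayscale (the path is a placeholder).
img = cv2.imread("frame.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# FAST: segment-test corner detector, suitable for real-time processing.
fast = cv2.FastFeatureDetector_create(threshold=25)
fast_keypoints = fast.detect(gray, None)

# Harris: corner response map; strong responses mark corner-like pixels.
harris_response = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
harris_corners = np.argwhere(harris_response > 0.01 * harris_response.max())

print(f"FAST corners: {len(fast_keypoints)}, Harris corners: {len(harris_corners)}")
```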

After detecting the object in the initial image frame, the object must be tracked through the entire video. A common object tracking algorithm is contour-based tracking [10], which tracks the object using the border between the tracked object and the background. It uses the contour from the previous frame as the initial contour in the current frame and evolves a new contour for the current object position. The method uses edge-based features, which are insensitive to illumination changes, making it robust.

2.1.2 Motion estimation

Once the object is detected and tracked in the two-dimensional image plane, the next task is to estimate the three-dimensional structure from the two-dimensional image sequence, which is also called three-dimensional reconstruction.

Given the detected object with coordinates (x, y) in the image frame and the camera intrinsic parameters from camera calibration, we want to find the coordinates (X, Y, Z) in the camera frame. However, it is difficult to compute (X, Y, Z) since the scale factor, denoted by c in Equation (2.14), is unknown. This is the drawback of using a monocular camera for pose estimation: a monocular camera cannot provide the depth information, i.e., the distance Z from the object to the camera projection center, which equals the scale factor c.

2.1.3 Bundle adjustment

Bundle adjustment (BA) is commonly used as the last step of feature-based three-dimensional reconstruction algorithms. Consider the following problem: if three-dimensional points $X_j$, $j = 1, 2, \ldots, n$ are projected to two-dimensional image points $x_j = P X_j$, what is the optimal projection matrix $P$ such that the summed squared re-projection error is minimal? We need to solve

$$\min_P \sum_{j=1}^{n} d(P X_j, x_j)^2 \qquad (2.1)$$

where $d(x, y)$ is the distance between image points $x$ and $y$. BA provides the solution to this problem. It is a nonlinear minimization problem which can be solved using iterative non-linear least squares methods such as Levenberg-Marquardt [12].
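To make the minimization in Equation (2.1) concrete, the sketch below refines a 3 × 4 projection matrix by iterative non-linear least squares with SciPy's Levenberg-Marquardt solver. It is an illustrative toy on synthetic points, not the thesis implementation; the function and variable names are ad hoc.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(p_flat, X_world, x_image):
    """Residuals d(P X_j, x_j) for all points, with P flattened to 12 values."""
    P = p_flat.reshape(3, 4)
    X_h = np.hstack([X_world, np.ones((len(X_world), 1))])  # homogeneous 3D points
    proj = (P @ X_h.T).T                                     # projected homogeneous 2D points
    proj = proj[:, :2] / proj[:, 2:3]                        # divide by the scale factor
    return (proj - x_image).ravel()

# Synthetic example data: n 3D points and their observed 2D projections.
X_world = np.random.rand(20, 3) * 10.0
P_true = np.hstack([np.eye(3), np.array([[0.5], [0.2], [5.0]])])
x_image = (P_true @ np.hstack([X_world, np.ones((20, 1))]).T).T
x_image = x_image[:, :2] / x_image[:, 2:3]

# Start from a perturbed guess and minimize the summed squared re-projection error.
p0 = (P_true + 0.05 * np.random.randn(3, 4)).ravel()
result = least_squares(reprojection_residuals, p0, args=(X_world, x_image), method="lm")
print("final cost:", result.cost)
```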

2.2 Quaternion

A quaternion is a four-element vector that can be used to encode any rotation in a three-dimensional coordinate system. There are several conventions for quaternions, each with its own set of rules and formulas. The Hamilton convention is used in this thesis, as explained by Kuipers [7].

A quaternion is commonly written as
$$q = (w, x, y, z) \qquad (2.2)$$
which follows the form
$$q = w + v = w + xi + yj + zk \qquad (2.3)$$
where $w$ is called the scalar part of the quaternion and $v$ the vector part. $i, j, k$ form the three bases of the imaginary parts, with $i^2 = j^2 = k^2 = ijk = -1$. $w, x, y, z$ are the components of the quaternion.

Properties

• Complex conjugate. The complex conjugate of the quaternion
$$q = w + v = w + xi + yj + zk \qquad (2.4)$$
is denoted
$$q^{*} = w - v = w - xi - yj - zk \qquad (2.5)$$

• Quaternion norm. The norm of a quaternion $q$ is the scalar
$$N(q) = \sqrt{q^{*} q} = \sqrt{w^2 + x^2 + y^2 + z^2} \qquad (2.6)$$

• Unit quaternion. A quaternion $q$ is a unit quaternion if its norm is equal to one, that is,
$$N(q) = 1 \qquad (2.7)$$

Quaternion to rotation matrix

A vector in three-dimensional space can be written as a quaternion with no real part: $q = 0 + xi + yj + zk$. A rotation is expressed by a quaternion $q_R$ with the additional requirement that its norm $N(q_R)$ be equal to one. The rotation from coordinate frame A to another coordinate frame B is given by the conjugation operation
$$q_B = q_{R^B_A} \, q_A \, q_{R^B_A}^{*} \qquad (2.8)$$
where $q_A = 0 + x_A i + y_A j + z_A k$ and $q_{R^B_A} = w + xi + yj + zk$. Expanding the conjugation gives
$$
\begin{aligned}
q_{R^B_A} \, q_A \, q_{R^B_A}^{*} ={} & \big(x_A(w^2 + x^2 - y^2 - z^2) + 2y_A(xy - wz) + 2z_A(wy + xz)\big)\,i \\
& + \big(2x_A(wz + xy) + y_A(w^2 - x^2 + y^2 - z^2) + 2z_A(yz - wx)\big)\,j \\
& + \big(2x_A(xz - wy) + 2y_A(yz + wx) + z_A(w^2 - x^2 - y^2 + z^2)\big)\,k
\end{aligned}
$$
To convert the quaternion to a rotation matrix, $q_{R^B_A} \, q_A \, q_{R^B_A}^{*}$ can be expressed in matrix form as
$$
R = \begin{bmatrix}
w^2 + x^2 - y^2 - z^2 & 2(xy - wz) & 2(xz + wy) \\
2(xy + wz) & w^2 - x^2 + y^2 - z^2 & 2(yz - wx) \\
2(xz - wy) & 2(yz + wx) & w^2 - x^2 - y^2 + z^2
\end{bmatrix} \qquad (2.9)
$$
Therefore, a vector $p = [x_A, y_A, z_A]^T$ expressed in frame A is rotated into frame B as $p_B = R\,p_A$.
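The conversion of Equation (2.9) is straightforward to implement; the following is a minimal NumPy sketch under the Hamilton convention used in the thesis, assuming the quaternion is already normalized.

```python
import numpy as np

def quaternion_to_rotation_matrix(w, x, y, z):
    """Rotation matrix of Equation (2.9) for a unit quaternion q = w + xi + yj + zk."""
    return np.array([
        [w*w + x*x - y*y - z*z, 2*(x*y - w*z),         2*(x*z + w*y)],
        [2*(x*y + w*z),         w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
        [2*(x*z - w*y),         2*(y*z + w*x),         w*w - x*x - y*y + z*z],
    ])

# Example: a 90-degree rotation about the z-axis.
angle = np.pi / 2
q = (np.cos(angle / 2), 0.0, 0.0, np.sin(angle / 2))
R = quaternion_to_rotation_matrix(*q)
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))  # -> [0, 1, 0]
```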

2.3 Camera calibration (perspective projection)

Camera calibration is the process of estimating the intrinsic and extrinsic parameters of a camera. The intrinsic parameters describe the camera's internal characteristics, such as focal length, distortion and principal point. The extrinsic parameters represent the position and orientation of the camera in the world. To obtain the intrinsic and extrinsic parameters of the camera, we need to look at the camera projection model, i.e., the model of perspective projection.

A mapping from a three-dimensional world onto a two-dimensional plane is called a perspective projection. A pinhole camera is the simplest imaging device that captures the geometry of perspective projection.

Figure 2.2 illustrates the perspective projection of a pinhole camera involved in calibration. The camera standard coordinate system is a three-dimensional coordinate system whose origin is at the center of projection and whose z axis is along the optical axis. The image plane is modelled in front of the projection center and is ideally parallel to the XY plane of the camera coordinates. A point M with camera coordinates (X, Y, Z) is imaged at the point m = (x, y) in the image plane. The relationship between the two coordinate systems is
$$x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z} \qquad (2.10)$$

Figure 2.2: Illustration of perspective projection of a pinhole camera involved in calibration.

It can also be written in homogeneous coordinates as
$$
c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.11)
$$
where the scale factor $c \neq 0$ is arbitrary.

The actual pixel coordinates (u, v) are defined with the origin at the top left corner of the image plane and satisfy
$$u = u_c + \frac{x}{\text{pixel width}} \qquad (2.12)$$
$$v = v_c + \frac{y}{\text{pixel height}} \qquad (2.13)$$
where $(u_c, v_c)$ is the principal point, which is often close to the image center.

The transformation from the three-dimensional camera coordinates to the image pixel coordinates can be expressed by a 3 × 4 matrix as follows:
$$
c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \begin{bmatrix} f_u & 0 & u_c & 0 \\ 0 & f_v & v_c & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.14)
$$
where $f_u = \frac{f}{\text{pixel width}}$, $f_v = \frac{f}{\text{pixel height}}$ and the scale factor is $c = Z$.

The 3 × 4 matrix is called the perspective projection matrix P.

Due to the imperfect placement of the camera chip relative to the lens system, there is always a small rotation and shift of the center position. In other words, the image plane is not perfectly parallel to the XY plane of the camera coordinates. A more general projection matrix $P'$ can be introduced to allow image coordinates with an offset origin, non-square pixels and skewed coordinate axes:
$$
P' = \begin{bmatrix} f_u & \gamma & u_c & 0 \\ 0 & f_v & v_c & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \qquad (2.16)
$$
The skew $\gamma$, focal lengths $(f_u, f_v)$ and principal point $(u_c, v_c)$ are the five intrinsic parameters of the camera. Finding the matrix $P'$ is the main part of camera calibration.
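The projection of Equations (2.14) and (2.16) maps a camera-frame point to pixel coordinates. The sketch below illustrates this with assumed intrinsic values; they are placeholders, not the calibration results of the thesis camera.

```python
import numpy as np

# Assumed intrinsic parameters (focal lengths, skew, principal point), in pixels.
fu, fv, gamma, uc, vc = 800.0, 800.0, 0.0, 640.0, 360.0
P_prime = np.array([
    [fu, gamma, uc, 0.0],
    [0.0, fv,   vc, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# A point M = (X, Y, Z) in the camera frame, written homogeneously.
M = np.array([0.5, -0.2, 4.0, 1.0])

# Perspective projection: the scale factor c equals the depth Z.
uvw = P_prime @ M
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
print(f"pixel coordinates: ({u:.1f}, {v:.1f})")  # the depth uvw[2] equals 4.0
```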

2.4 Rigid transformation

To transform between different three-dimensional coordinate frames, three representations of rotation are commonly used: rotation matrices, Euler angles and quaternions. In this section, a rigid transformation (rotation and translation) is used to express the relationship between the world coordinate frame and the camera coordinate frame, shown in Figure 2.3.

Figure 2.3: Illustration of the camera coordinate frame and the world coordinate frame.

A point M with world coordinates $(X_w, Y_w, Z_w)$ is transformed to camera coordinates $(X_c, Y_c, Z_c)$ as
$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
= R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T \qquad (2.17)
$$

It can also be written as
$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
= \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (2.18)
$$
The 4 × 4 matrix is called the extrinsic matrix
$$
E = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
$$
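As a small illustration of Equation (2.18), the sketch below applies the extrinsic matrix E to move a point from world to camera coordinates; the rotation and translation are arbitrary assumed values.

```python
import numpy as np

# Assumed extrinsics: rotate 90 degrees about the z-axis and translate.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
T = np.array([1.0, 0.0, 2.0])

# Build the 4x4 extrinsic matrix E = [[R, T], [0, 1]].
E = np.eye(4)
E[:3, :3] = R
E[:3, 3] = T

# Transform a world point (homogeneous coordinates) into the camera frame.
P_world = np.array([2.0, 0.0, 0.0, 1.0])
P_camera = E @ P_world
print(P_camera[:3])  # -> [1. 2. 2.]
```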

Chapter 3

Pose estimation method with AprilTag

Fiducials are artificial visual features designed for autonomous detection and are useful for pose estimation and object tracking in robotics applications. Fiducials mounted on objects can be used to identify and localize them. AprilTag, shown in Figure 3.1, is a visual fiducial system developed by the APRIL Robotics Laboratory at the University of Michigan [13], [15].

Figure 3.1: AprilTags used in this thesis.

AprilTag is therefore used as the fiducial marker to estimate the pose of the autonomous construction vehicles in this thesis. The system is introduced in the following subsections, based on [13], [15].

3.1 AprilTag Marker Detection

Figure 3.2: The pipeline of AprilTag detection.

The process of AprilTag marker detection is composed of two main parts: the tag detector and the code system. The tag detector finds the AprilTag in the image and estimates the camera-relative pose of the tag. The code system extracts the information encoded in the tag. The pipeline of the AprilTag detection is shown in Figure 3.2. The detection process can be divided into several distinct phases, which are described in Section 3.1.1.

3.1.1 Detect the AprilTag

Figure 3.3(a) shows the original input image. The first step is to threshold the grayscale input image into a black-and-white image. As shown in Figure 3.3(b), an adaptive thresholding method [3] is applied to find the minimum and maximum values in a region around each pixel. The light and dark pixels which form the tag are differentiated, and the gray regions in Figure 3.3(b), which represent parts of the image with insufficient contrast, are excluded.

The next step after binarization is to find the edges that might form the boundary of a tag, shown in Figure 3.3(c). The union-find algorithm [4] is used to segment the edges based on the identities of the black and white components from which they arise; it efficiently clusters pixels which border the same black and white regions.

Finally, quads are fit to each cluster of border pixels, as shown in Figure 3.3(d); poor quad fits and uncodable tags are discarded, and the valid tag detections are output, as shown in Figure 3.3(e).
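The sketch below mimics the first stages of this pipeline (adaptive thresholding, boundary extraction and quad fitting) with standard OpenCV calls. It is a simplified approximation for illustration, not the actual AprilTag detector; the image path and the thresholds are assumptions.

```python
import cv2

# Placeholder input image that contains a tag-like black-and-white square.
gray = cv2.imread("tag_scene.png", cv2.IMREAD_GRAYSCALE)

# Step 1: adaptive thresholding, comparing each pixel against its local neighborhood.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 15, 5)

# Step 2: extract boundaries between dark and light regions (OpenCV 4.x API).
contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

# Step 3: fit quads, keeping only four-sided, sufficiently large, convex polygons.
quads = []
for contour in contours:
    approx = cv2.approxPolyDP(contour, 0.02 * cv2.arcLength(contour, True), True)
    if len(approx) == 4 and cv2.contourArea(approx) > 400 and cv2.isContourConvex(approx):
        quads.append(approx.reshape(4, 2))

print(f"candidate quads: {len(quads)}")
```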

3.1.2 Homography and extrinsic estimation

The 3 × 3 homography matrix H that projects two-dimensional points in homogeneous coordinates from the tag's coordinate system (in which $[0\;0\;1]^T$ is at the center of the tag and the tag extends one unit in the $\hat{x}$ and $\hat{y}$ directions) to the two-dimensional image coordinate system is computed using the Direct Linear Transform (DLT) algorithm [6].

Figure 3.3: The process of the AprilTag detector, from [15].

The homography matrix can be written as the product of the camera's 3 × 4 perspective projection matrix P (obtained from the camera calibration) and a 4 × 3 truncated extrinsic matrix E. In general, the extrinsic matrix is a 4 × 4 matrix, but every position on the tag has z = 0 in the tag's coordinate system. Thus, every tag coordinate can be rewritten as a two-dimensional homogeneous point with z = 0, and the third column of the extrinsic matrix can be removed, forming the truncated matrix. Denoting the rotation components by $R_{ij}$, the translation components by $T_k$ and the elements of the homography matrix by $h_{ij}$, the homography matrix satisfies
$$
H = c\,P\,E, \qquad
\begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix}
= c \begin{bmatrix} f_u & 0 & u_c & 0 \\ 0 & f_v & v_c & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R_{00} & R_{01} & T_x \\ R_{10} & R_{11} & T_y \\ R_{20} & R_{21} & T_z \\ 0 & 0 & 1 \end{bmatrix} \qquad (3.1)
$$
where c is an unknown scale factor.

By expanding the right-hand side of Equation 3.1, the elements of $R_{ij}$ and $T_k$ can easily be solved for, up to the unknown scale factor c. Since the columns of a rotation matrix must all have unit magnitude, we can constrain the magnitude of c: with two columns of the rotation matrix available, c can be calculated as the geometric mean of their magnitudes. The sign of c can be recovered from the fact that tags must appear in front of the camera, that is, $T_z < 0$. The third column of the rotation matrix can then be recovered by computing the cross product of the two known columns, because the columns of a rotation matrix must be orthonormal.
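A minimal sketch of this decomposition: given a homography H and assumed intrinsic parameters, it recovers the two rotation columns, the translation and the scale, and completes the rotation with a cross product. The numerical values are illustrative, not the thesis calibration, and the helper name is hypothetical.

```python
import numpy as np

def pose_from_homography(H, fu, fv, uc, vc):
    """Recover (R, T) from a tag homography, following the idea of Equation (3.1)."""
    K = np.array([[fu, 0.0, uc],
                  [0.0, fv, vc],
                  [0.0, 0.0, 1.0]])
    M = np.linalg.inv(K) @ H            # equals c * [r0 | r1 | T]
    # Scale magnitude: geometric mean of the two rotation-column magnitudes.
    c = np.sqrt(np.linalg.norm(M[:, 0]) * np.linalg.norm(M[:, 1]))
    if M[2, 2] > 0:                     # enforce the sign convention T_z < 0
        c = -c
    r0, r1, T = M[:, 0] / c, M[:, 1] / c, M[:, 2] / c
    r2 = np.cross(r0, r1)               # third column from orthonormality
    R = np.column_stack([r0, r1, r2])
    return R, T

# Illustrative usage with an assumed homography and intrinsics.
H = np.array([[400.0, 10.0, 620.0],
              [5.0, 390.0, 350.0],
              [0.01, 0.02, 1.0]])
R, T = pose_from_homography(H, fu=800.0, fv=800.0, uc=640.0, vc=360.0)
print(np.round(R, 3), np.round(T, 3))
```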

3.2 Coordinate frame transformation

The AprilTag marker detection system provides the pose of the tag relative to the camera, while the vehicle has to be localized in the world coordinate frame. Thus, we need to find the relationship between the camera coordinate frame and the world coordinate frame.

For a point $P \in \mathbb{R}^3$, the vectors $P_A = [X_A, Y_A, Z_A]^T$ and $P_B = [X_B, Y_B, Z_B]^T$ are the coordinates of P in two different coordinate systems {A} and {B}, respectively. In Euclidean geometry they satisfy
$$P_B = R^B_A P_A + T^B_A \qquad (3.2)$$
where $R^B_A \in \mathbb{R}^{3\times 3}$ is the rotation matrix and $T^B_A \in \mathbb{R}^{3\times 1}$ is the translation vector.

To find $R^B_A$ and $T^B_A$, a least squares problem can be formulated as
$$\mathrm{err} = \sum_{i=1}^{N} \left\| R^B_A P^i_A + T^B_A - P^i_B \right\| \qquad (3.3)$$
To minimize the error in Equation 3.3, the Singular Value Decomposition (SVD) algorithm is used [14]. The algorithm steps are as follows:

(1) Randomly select more than five points and record their coordinates in frame {A} and frame {B}, denoted by $P^i_A$ and $P^i_B$ respectively, assuming a total of N points with $i = 1, 2, \ldots, N$.

(2) Find the centroids of the selected points:
$$\mathrm{centroid}_A = \frac{1}{N}\sum_{i=1}^{N} P^i_A, \qquad \mathrm{centroid}_B = \frac{1}{N}\sum_{i=1}^{N} P^i_B$$

(3) Use the SVD algorithm to find the optimal rotation matrix:
$$H = \sum_{i=1}^{N} (P^i_A - \mathrm{centroid}_A)(P^i_B - \mathrm{centroid}_B)^T$$
$$[U, S, V] = \mathrm{SVD}(H), \qquad R^B_A = V U^T$$
where H is the covariance matrix.

(4) Find the optimal translation vector:
$$T^B_A = -R^B_A \, \mathrm{centroid}_A + \mathrm{centroid}_B$$

(5) Calculate the root mean square error with the derived rotation matrix $R^B_A$ and translation $T^B_A$ to evaluate the fit.
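A compact NumPy sketch of steps (1)-(4) above, assuming the point correspondences between frames {A} and {B} are already paired; it is an illustration under those assumptions, not the thesis code.

```python
import numpy as np

def fit_rigid_transform(P_A, P_B):
    """Find R, T minimizing sum ||R @ P_A[i] + T - P_B[i]|| via SVD."""
    centroid_A = P_A.mean(axis=0)
    centroid_B = P_B.mean(axis=0)
    H = (P_A - centroid_A).T @ (P_B - centroid_B)   # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                        # guard against a reflection
        Vt[2, :] *= -1
        R = Vt.T @ U.T
    T = -R @ centroid_A + centroid_B
    return R, T

# Illustrative check with a known rotation and translation.
rng = np.random.default_rng(0)
P_A = rng.random((8, 3))
angle = 0.4
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
P_B = P_A @ R_true.T + np.array([1.0, -2.0, 0.5])
R_est, T_est = fit_rigid_transform(P_A, P_B)
rmse = np.sqrt(np.mean(np.sum((P_A @ R_est.T + T_est - P_B) ** 2, axis=1)))
print("RMSE:", rmse)  # should be close to zero
```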

Chapter 4

Experiments

The navigation system using a monocular camera and AprilTags is tested both indoors in a lab with a vehicle model and outdoors on a test track with a real vehicle.

4.1 Indoor lab experiments

4.1.1 Set up

Figure 4.1: Experimental scene for indoor test.

AprilTags are attached to the vehicle model: the tag with id 1 is attached on the front side and the tag with id 3 is attached on the rear side. All experiments are run on Robot Operating System (ROS) Kinetic with Ubuntu 16.04.

4.1.2 Coordinate frames

Figure 4.2: Different coordinate frames in the system.

The system is composed of four major coordinate frames (see Figure 4.2). It is vitally important to understand these frames and the transformations between them.

• World Frame

The world frame is a static reference frame with respect to which the vehicle motion is estimated. The world frame is defined as +X forward, +Y right, +Z up.

• Camera Frame

The camera frame is a static reference frame with respect to which the tag’s motion is estimated. The camera frame is defined as +X right, +Y down, +Z forward.

• Tag Frame

The tag frame moves along with the vehicle. Its origin is at the center of the tag. The tag frame is defined as +X right, +Y up, +Z forward.

• Vehicle Frame

The vehicle frame is translated from the tag frame. Thus, the vehicle frame has the same orientation as the tag frame, with +X right, +Y up, +Z forward.

The transformation between the world frame and the camera frame is denoted by $R^w_c, T^w_c$. The AprilTag marker detection system estimates the camera-relative pose of the tag, that is, the transformation between the camera frame and the tag frame, denoted by $R^c_t, T^c_t$. Given the geometric parameters of the vehicle, the transformation between the tag frame and the vehicle frame, denoted by $R^t_v, T^t_v$, can also be computed. Hence, we can obtain the vehicle's position in the world frame and compare it with the position provided by the ground truth.
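The chain of transformations described above composes as homogeneous matrices. The sketch below shows that composition with placeholder values standing in for $R^w_c, T^w_c$ (static calibration), $R^c_t, T^c_t$ (tag detection) and $R^t_v, T^t_v$ (vehicle geometry); the numbers are assumptions, not measured values.

```python
import numpy as np

def make_transform(R, T):
    """Build a 4x4 homogeneous transform from a rotation matrix and a translation."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = T
    return M

# Assumed placeholder transforms (identity rotations, arbitrary translations).
T_world_camera = make_transform(np.eye(3), np.array([0.0, 0.0, 2.5]))   # camera in world
T_camera_tag = make_transform(np.eye(3), np.array([0.3, -0.1, 4.0]))    # tag in camera
T_tag_vehicle = make_transform(np.eye(3), np.array([0.0, -0.2, 0.0]))   # vehicle origin in tag

# Compose the chain: vehicle pose expressed in the world frame.
T_world_vehicle = T_world_camera @ T_camera_tag @ T_tag_vehicle
vehicle_position_world = T_world_vehicle[:3, 3]
print(vehicle_position_world)
```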

4.1.3 Static test

Figure 4.3: Camera scene and rviz view for the indoor static test.

The indoor test consists of two parts: the static test and the dynamical test. The static test begins by randomly placing the HX02 model at six different positions and recording their ground truth positions and the positions provided by the AprilTag marker detection system. Then, with the recorded data, $R^w_c, T^w_c$ are computed. The model is then placed at further positions, recording the ground truth and the position provided by the proposed system to validate the reliability.

As can be seen from Figure 4.3, the AprilTags attached on the HX02 model are detected. The border of the tag is presented and the id of the detected tag is also shown. The camera coordinate frame and the detected tag coordinate frame are displayed in rviz with the X-axis in red, the Y-axis in green and the Z-axis in blue.

4.1.4 Dynamical test

Two scenarios are used for the dynamical test. The first scenario is depicted in Figure 5.2. The HX02 model is moved manually towards the camera. The AprilTag attached on the front is detected; the border of the tag is presented and the id of the detected tag is also displayed. The camera coordinate frame and the detected tag coordinate frame are displayed in rviz with the X-axis in red, the Y-axis in green and the Z-axis in blue. The blue point in Figure 5.2(a) is the start point and the blue line in Figure 5.2(b) shows the detected path of the HX02 model in the first scenario.

Figure 4.4: Camera scene and rviz view for indoor dynamical test (scenario 1).

In the second scenario, the camera coordinate frame and the detected tag coordinate frame are displayed in rviz with the X-axis in red, the Y-axis in green and the Z-axis in blue. The blue point in Figure 5.3(a) is the start point and the blue line in Figure 5.3(b) shows the detected path of the HX02 model in the second scenario.

Figure 4.5: Camera scene and rviz view for indoor dynamical test (scenario 2).

4.2 Outdoor test-track experiments

4.2.1 Set up

Figure 4.6: Experimental scene for the outdoor test.

4.2.2 Coordinate frames

The coordinate frames involved at the outdoor test track are the same as those in the indoor lab, depicted in Figure 4.2. The frames are described in Section 4.1.2.

4.2.3 Static test

The outdoor static test begins by finding the optimal translation and rotation between the camera frame and the world frame. Using the remote control handle, the HX02 is stopped at ten different points on the test track. According to the method described in Section 3.2, the data from GPS and the data from the AprilTag marker detection system are recorded to compute $R^w_c, T^w_c$, which remain unchanged unless the camera is moved.

Figure 4.8: Camera scene and rviz view for the outdoor static test.

The HX02 is then stopped at additional points to record their GPS data and the positions provided by the proposed system to validate the reliability.

As can be seen from Figure 4.8, the AprilTags attached on the frame of the HX02 are detected. The border of each tag is presented and the id of the detected tag is also shown. The camera coordinate frame and the detected tag coordinate frame are displayed in rviz with the X-axis in red, the Y-axis in green and the Z-axis in blue.

4.2.4 Dynamical test

Two scenarios are used for the dynamical test. The first scenario (see Figure 4.9) is conducted on a cloudy day. The HX02 is controlled manually by the remote control handle and moves forward in front of the camera with its left side facing the camera. The AprilTag attached on the left side is detected; the border of the tag is presented and the id of the detected tag is also displayed. The camera coordinate frame and the detected tag coordinate frame are displayed in rviz with the X-axis in red, the Y-axis in green and the Z-axis in blue. The blue line in Figure 4.9 demonstrates the detected path of the HX02, formed by the positions detected along the movement.

(a) Cloudy  (b) Rainy  (c) Sunny


Chapter 5

Experiments Results

Here, the method described in Chapter 3 is used to process the images and estimate the position of the HX02. The results are analyzed both for the indoor lab test and for the outdoor test.

5.1 Indoor lab result

5.1.1 Static test

Figure 5.1 shows the estimated positions and the ground truth of the HX02 model. The estimated positions are labeled by '·'. The ground truth is given by manual measurement and labeled by '?'. The colorbar presents the z coordinates. Points 1-6 correspond to the tests in Figure 4.3(a)-(f), respectively.

The mean square error (MSE) and mean absolute error (MAE) for the localization results are defined as
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(X_{ei}-X_{gi})^2 + (Y_{ei}-Y_{gi})^2 + (Z_{ei}-Z_{gi})^2}$$
$$\mathrm{MAE}_x = \frac{1}{n}\sum_{i=1}^{n}\left|X_{ei}-X_{gi}\right|, \qquad
\mathrm{MAE}_y = \frac{1}{n}\sum_{i=1}^{n}\left|Y_{ei}-Y_{gi}\right|, \qquad
\mathrm{MAE}_z = \frac{1}{n}\sum_{i=1}^{n}\left|Z_{ei}-Z_{gi}\right| \qquad (5.1)$$
where $(X_{ei}, Y_{ei}, Z_{ei})$ is the estimated position, $(X_{gi}, Y_{gi}, Z_{gi})$ is the ground truth and n is the total number of static points. By this calculation, the HX02 model position estimate has an MAE of $3.7698 \times 10^{-4}$ m in the X-axis, 0.0029 m in the Y-axis and $1.2483 \times 10^{-4}$ m in the Z-axis, and a total distance MSE of 0.0052 m for the indoor static test.
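The error metrics of Equation (5.1) can be computed directly from the recorded positions; the sketch below uses small made-up arrays purely to show the computation, not the thesis measurements.

```python
import numpy as np

# Made-up estimated and ground-truth positions (columns X, Y, Z), in metres.
estimated = np.array([[0.10, 0.25, 0.02], [0.42, 0.11, 0.01], [0.78, 0.33, 0.03]])
ground_truth = np.array([[0.10, 0.24, 0.02], [0.41, 0.12, 0.01], [0.79, 0.33, 0.02]])

diff = estimated - ground_truth
mse = np.mean(np.linalg.norm(diff, axis=1))   # mean distance error, as defined in Eq. (5.1)
mae = np.mean(np.abs(diff), axis=0)           # per-axis mean absolute errors

print(f"distance error: {mse:.4f} m")
print(f"MAE x/y/z: {mae[0]:.4f} / {mae[1]:.4f} / {mae[2]:.4f} m")
```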


Figure 5.1: The estimated position and the ground truth of HX02 model.

5.1.2 Dynamical test

Figure 5.2 shows the estimated trajectory and the ground truth of the HX02 model in scenario 1 (described in Section 4.1.4). The estimated positions of the HX02 model are marked by '◦'. In the first scenario, 46 detected positions form the estimated trajectory, denoted by the green solid line. However, most of the 46 detected positions are at the start point or the end point; about 10 points are connected to display the path of the HX02 movement. Since the HX02 model is manipulated manually, only the start point (-3 cm, 23 cm) and the end point (-3 cm, 5 cm) of the ground truth are provided, displayed by 'x'. The ground truth is also measured manually. The ground truth path connects the start point and the end point with blue dashes. The colorbar presents the Z coordinates.

The green solid line is similar to the blue dashed line. In general, it demonstrates that the HX02 moves straight from (-3 cm, 23 cm) to (-3 cm, 5 cm), which matches the situation where the HX02 model moves towards the camera. Compared with the ground truth trajectory, the estimated trajectory is accurate enough to show the motion of the HX02 model.

Figure 5.2: The estimated trajectory and the ground truth of the HX02 model (scenario 1).

Figure 5.3 shows the estimated trajectory and the ground truth of the HX02 model in scenario 2. The estimated positions of the HX02 model are marked by '◦'. In the second scenario, 31 detected positions form the estimated trajectory, denoted by the green solid line. However, most of the 31 detected positions are at the start point and the end point; about 8 points are connected to display the path of the HX02 movement. Only the start point (10 cm, 16 cm) and the end point (-8 cm, 16 cm) of the ground truth are provided, because of the manual manipulation; they are denoted by '·'. The ground truth is measured manually. The ground truth path connects the start point and the end point with blue dashes. The Z coordinates are shown by the colorbar.

Figure 5.3: The estimated trajectory and the ground truth of the HX02 model (scenario 2).

5.2 Outdoor test-track result

5.2.1 Static test

The outdoor test-track static test is conducted on cloudy and rainy days. As can be seen from Figure 4.7, both tag 1 and tag 3 are detected in the same image with an appropriate camera-relative pose. The green rectangle is the shape model of the HX02 frame on which the tags are attached. Tag 1's detected position is in the middle of the short edge of the rectangle and tag 3's detected position is in the middle of the long edge of the rectangle. This matches the fact that tag 1 and tag 3 hang in the middle of the front frame and the left side frame, respectively.

Figure 5.4 shows the estimated positions and the ground truth of the HX02. The estimated positions are labeled by '·'. The ground truth is given by GPS and labeled by '?'. The GPS position of the camera is labeled by '◦'. The colorbar presents the z coordinates.

Figure 5.4: The estimated position and the ground truth of HX02.

camera and the nearest point (Point7) is 9.6889m to the camera.

The results demonstrate that the proposed localization system based on AprilTags and a monocular camera can capture the motion of a real autonomous construction vehicle and locate the vehicle with satisfactory accuracy in the XY-plane. However, the accuracy of the Z coordinate is still not good enough to navigate the vehicle. Moreover, if the vehicle is far from the camera, the estimate becomes inaccurate and the system might be unable to locate the vehicle, since the AprilTags cannot be detected.

5.2.2 Dynamical test

Figure 5.5: The trajectory of HX02 in outdoor dynamical test (scenario 1).

The estimated positions lie below the GPS trajectory, with a maximum error of about 0.8 m.

In general, when the HX02 is controlled by the remote control handle, the localization system using AprilTags and an off-board monocular camera can provide estimated positions that capture the motion tendency, but the accuracy is not sufficient to navigate the HX02.

The trajectory and the estimated positions of the HX02 in outdoor scenario 2 are shown in Figure 5.6. The tests are conducted in different weather conditions: cloudy, rainy and sunny. The estimated positions from the sunny test are depicted by purple circles, while the estimated positions from the other weather conditions are depicted by yellow stars and red dots. The blue solid line is the ground truth trajectory given by the GPS data, published at a frequency of 10 Hz. The camera position is given by the green filled circle.

During the path-following task, the proposed localization system first detected only tag 1 while the HX02 was moving towards the camera. When the HX02 was about to turn right, the system detected both tag 1 and tag 3. However, once the HX02 had finished turning right, the system could not detect any AprilTags, because no tags were in the camera's view. As can be seen from Figure 5.6, fewer positions are estimated while the HX02 is turning right. The reason is that the proposed localization system needs a longer computation time when it detects more than one tag in the same image frame. Regardless of the weather in which the tests are conducted, the estimated positions are relatively farther from the ground truth trajectory when the HX02 is far from the camera and closer to it when the HX02 is near the camera, for example when it is turning right.

Weather   Number of estimated positions   Maximum error of distance
Cloudy    20                              ≈ 1.21 m
Rainy     19                              ≈ 1.46 m
Sunny     32                              ≈ 0.95 m

Table 5.1: Result comparison under different weather conditions in the outdoor dynamical test (scenario 2).

As Table 5.1 shows, on sunny days the proposed localization system detects more points than on cloudy or rainy days. The same holds for the maximum distance error: the result on sunny days is the most accurate and the result on rainy days is the worst. This might be because fog forms on the lens more frequently on cloudy and rainy days, and rainwater on the camera lens can even block the line of sight.

Chapter 6

Conclusions and Future Work

6.1 Conclusion

In this thesis, a localization system is proposed based on an off-board monocular camera and the visual fiducial marker AprilTag. The system is developed for autonomous construction equipment in open-pit mining environments where GNSS might not be accessible. The system is tested both indoors in a lab with a vehicle model and outdoors on a test track with a real vehicle. Both experiments are divided into static tests and dynamical tests.

The experiments show that the localization system can accurately locate the vehicle during the static tests, both indoors and outdoors. In the indoor and outdoor dynamical tests, even though the estimated positions demonstrate the motion tendency of the vehicle, the number of estimated positions is not sufficient to navigate the vehicle, because navigation requires positions to be provided at short time intervals.

The limitation of this system is that the localization accuracy depends on the size of the AprilTags used on the vehicle and on the resolution of the off-board monocular camera. The system also suffers from losing track of the vehicle. Visual odometry performs well when the frame rate of the camera is high; however, a high camera frame rate requires high computational speed, which makes the system impractical. If the vehicle suddenly moves at high speed, the camera might not detect the AprilTags attached to the vehicle because of the limited computation speed. This is why fewer positions are estimated during the movement of the vehicle in the outdoor dynamical test.

6.2 Future work

Future work will mainly focus on improving the performance of the localization system for autonomous construction vehicles from both the software and the hardware side.

The AprilTags should be fixed rigidly to the frame of the autonomous construction vehicles. Otherwise, the vibration of the vehicle can distort the visual odometry, especially the orientation of the AprilTags. The AprilTags should be as large as possible so that the camera can detect the vehicles from a greater distance than is possible now.

There is a trade-off between the frame rate of the camera and the computation time of the AprilTag marker detection system. If the image acquisition frequency is high, the computation time of the system is also large. Therefore, the detection algorithm could be optimized for faster computation.

Bibliography

[1] How autonomous vehicles perceive and navigate their surroundings, 2018. [Online]. Available at: https://velodynelidar.com/newsroom/title-how-autonomous-vehicles-perceive-and-navigate-their-surroundings/.

[2] Volvo concept lab, the HX02 prototype, 2019. [Online]. Available at: https://www.volvoce.com/global/en/this-is-volvo-ce/what-we-believe-in/innovation/.

[3] C. K. Chow and T. Kaneko. Automatic boundary detection of the left ventricle from cineangiograms. Computers and Biomedical Research, 5(4):388-410, 1972.

[4] Thomas H. Cormen. Introduction to Algorithms. Third edition, 2009.

[5] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba. A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3):229-241, 2001.

[6] Richard Hartley. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK; New York, second edition, 2003.

[7] Jack B. Kuipers. Quaternions and Rotation Sequences, volume 66. Princeton University Press, Princeton, 1999.

[8] S. Li. A review of feature detection and match algorithms for localization and mapping. volume 231. Institute of Physics Publishing, 2017.

[9] David Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[10] M. Youssef M. Zaki. Evaluation of active contour-based tracking methods. Int. J. of Signal and Imaging Systems Engineering, 2(4), 2009.

[11] J. Meyer-Hilberg and T. Jacob. High accuracy navigation and landing system using gps/imu system integration. In Proceedings of 1994 IEEE Position, Location and Navigation Symposium - PLANS’94, pages 298– 305, 1994.


[13] E. Olson. AprilTag: A robust and flexible visual fiducial system. In 2011 IEEE International Conference on Robotics and Automation, pages 3400-3407, May 2011.

[14] M. Peris, S. Martull, A. Maki, Y. Ohkawa, and K. Fukui. Towards a simulation driven stereo vision system. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pages 1038-1042, Nov 2012.

[15] John Wang and Edwin Olson. AprilTag 2: Efficient and robust fiducial detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4193-4198. IEEE, Oct 2016.

[16] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys, 38(4), 2006.
