
Mälardalen University
School of Innovation, Design and Engineering
Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics, 30.0 credits

EXTENDED TARGET TRACKING USING GAUSSIAN PROCESSES ON STICK-PIXEL DEFINED OBJECTS

Anders Olsson

aon11013@student.mdh.se

Examiner: Martin Ekström

Mälardalen University, Västerås, Sweden

Supervisor: Fredrik Ekstrand

Mälardalen University, Västerås, Sweden

Company supervisor: Patrik Leissner, Autoliv Sverige AB, Linköping


Abstract

In this work, I present the performance of an extended target tracking algorithm that utilizes Gaussian processes. The algorithm is evaluated on objects corresponding to road users, with automotive use in mind. The measurements that define the object to be tracked are derived from stereo image sensor disparity and are called stick-pixels, or Stixels. The process of generating these measurements is also presented in this work. It consists of two separate methods: one relying on stereo image frames and one purely defined by object characteristics and pose. The extended target tracking algorithm has been tested on three types of simulated road users: car, cyclist and pedestrian. To evaluate the performance of the target tracking algorithm, three measures are used: error in position, orientation discrepancy over time, and intersection over union.

Table of Contents

Acronyms
Glossary
1 Introduction
 1.1 Purpose
 1.2 Objective
2 Problem description
 2.1 Overview
 2.2 Data Description
 2.3 Problem Formulation
  2.3.1 Limitations
  2.3.2 Point in question
3 Methodology
 3.1 Evaluation
  3.1.1 Question 1
  3.1.2 Question 2
  3.1.3 Question 3
4 Measurement generation theory
 4.1 Simulated measurements
  4.1.1 Stereo image matching
  4.1.2 Occupancy grid
  4.1.3 Free space computation
  4.1.4 Height segmentation
  4.1.5 Stick-pixel extraction
 4.2 Fabricated measurements
  4.2.1 Prerequisites
  4.2.2 Fabrication process
5 Tracking theory
 5.1 Tracking introduction
 5.2 Filtering theory
  5.2.1 Bayesian reasoning
  5.2.2 The Kalman filter
  5.2.3 Non-linear Kalman filter
  5.2.4 Extended Kalman filter
 5.3 Clustering
 5.4 Gaussian process
  5.4.1 Gaussian process regression
  5.4.2 Recursive Gaussian process regression
6 Extended target tracking
 6.1 Introduction to extended target tracking
 6.2 Existing methods
 6.3 Proposed method
  6.3.1 Gaussian process
7 System modeling
 7.1 Reference model
 7.2 Target state vector
 7.3 Measurement model
 7.4 Augmented state space model
  7.4.1 Measurement equation
  7.4.2 Motion equation
  7.4.3 Model inference
 7.5 Extended Kalman filter recursions
  7.5.1 Time update
  7.5.2 Measurement update
  7.5.3 Measurement gradient
8 Implementation
 8.1 Data generation
  8.1.1 Simulation
  8.1.2 Fabrication
 8.2 Extended target tracker
  8.2.1 Initialization
  8.2.2 Clustering and gating
  8.2.3 Time update/Prediction
  8.2.4 Measurement update
9 Result
 9.1 Car case
  9.1.1 Simulated measurements
  9.1.2 Fabricated measurements
  9.1.3 Extended target tracking
 9.2 Cyclist case
  9.2.1 Simulated measurements
  9.2.2 Fabricated measurements
  9.2.3 Extended target tracking
 9.3 Pedestrian case
  9.3.1 Simulated measurements
  9.3.2 Fabricated measurements
  9.3.3 Extended target tracking
10 Discussion and conclusion
 10.1 Discussion
 10.2 Conclusion
 10.3 Future work
References
Appendix A Car figures
Appendix B Cyclist figures

Acronyms

ASF: active safety function
EKF: extended Kalman filter
ETT: extended target tracking
GP: Gaussian process
GPS: global positioning system
IoU: intersection over union
KF: Kalman filter
PDF: probability density function
Stixel: stick pixel


Glossary

Causality principle: The causality principle states that, of two related events, one produces the other.

Ego car: The car equipped with the camera setup.

Field of view: Also known as angle of view; describes the angular extent of a scene captured by a camera.

Gaussian: A Gaussian function, or simply a Gaussian, describes a symmetric bell-shaped curve and is defined as $f(x) = a\, e^{-\frac{(x-b)^2}{2c^2}}$, given $a, b, c \in \mathbb{R}$.

Ground truth: The information assumed to be true; the real value.

Kronecker product: Denoted $\otimes$; gives the tensor product of two matrices.

LIDAR: Light detection and ranging, or light imaging, detection and ranging.

Markov property: The notion of a memoryless property of a stochastic process.

Moore's law: Describes the phenomenon that the number of transistors one can fit inside a chip increases exponentially over time.

Pose: The pose of an object is, in this work, defined as the object's Cartesian position in $\mathbb{R}^2$ and its orientation.

RADAR: Radio detection and ranging, or radio direction and ranging.

Unscented transformation: A function used to avoid the local linearization problem encountered in, e.g., Kalman filters [1].

V-REP: A robot simulator with an integrated development environment, based on a distributed control architecture.


1 Introduction

Automobiles today come with a vast number of safety features to assist the driver. These are intended as assisting utilities for the driver and are called active safety functions (ASF). ASF are segmented into five categories (levels):

• 0: No automation
• 1: Function-specific automation
• 2: Combined-function automation
• 3: Limited self-driving automation
• 4: Full self-driving automation

Level 0 features no engagement by the vehicle: the driver has sole control of the brake, steering, throttle and motive controls at all times. At level 1, the features consist of a set of specific controls working independently, e.g. stability control or brake distribution control. Level 2 delivers more advanced interaction between at least two controls, for example adaptive cruise control with lane keep. In level 3, the driver can cede control of all safety-critical functions in specific environments, thereby relying heavily on the safety functions of the vehicle to monitor changes in the surroundings. One example of such a system is the Google car, shown in Figure 1.

Fully autonomous vehicles are reached in the fifth level (level 4), where the vehicle can complete full trips without intervention by the driver [2, 3]. As of today, the implementation of autonomous cars is limited to cars involved in different test projects, but forecasts suggest a steady increase in the number of autonomous cars [4]. This increase is likely due to the large set of companies investing in this area: not only technology companies and companies connected to the automotive industry, but also a company like Uber is contributing to this area of research. Uber has indicated that its fleet of taxis will be autonomous by 2030 [5], and NuTonomy will start with driverless taxis in Singapore by 2018 [6]. Most of the major car manufacturers have announced the delivery of autonomous cars in the near future according to [7–14], with years of delivery shown in Table 1. By delivering those types of ASF, especially the higher levels, the intention is to assist inattentive drivers and thereby decrease the number of accidents due to driver faults. Forecasts suggest that other benefits of self-driving cars could be a reduction in congestion and an increased efficiency of the work day for daily commuters, as presented in [4].

For an ASF to make accurate decisions, there is a need for perception. The task of the perception is to present the environment to the ASF; the higher the level of the ASF, the more is required of the perception. The perception must include detection of all kinds of dangers, including defects in the infrastructure, e.g. road work sites, potholes and missing lane markings, but also detect and track other road users, both automobiles and others, e.g. pedestrians and cyclists.


2019: Volkswagen
2020: Audi, General Motors, Toyota
2021: Ford, BMW, Tesla

Table 1: Estimated years for the introduction of fully autonomous vehicles by some of the major automobile manufacturers, according to [7–14].

A majority of the target tracking literature focuses on point targets, that is, targets defined by only one detection or measurement. A target defined by multiple detections is referred to as an extended target; from this, the discipline of extended target tracking (ETT) is defined [15].

To generate the required type of perception, or description of the environment, a diverse setup of sensors is used. Historically, the sensors used for this task are RADAR, LIDAR and camera setups. The diversity is needed because different sensors come with different strengths and weaknesses. In general, LIDAR is very sensitive to weather conditions, such as mud, rain and snow. The camera is good at classifying objects, but bad at estimating the speed of objects [16]. While the LIDAR delivers better angular resolution than the RADAR, the RADAR delivers better robustness in bad weather situations and is by some considered an "all-weather sensor" [17]. The RADAR also performs well when measuring distance.

Given the discrepancies between the mentioned sensors, data of diverse origins can be fused to get a higher level of accuracy. This is done with a process called sensor fusion.

According to [18], extensive resources have historically been invested in the market for hand-held devices. This has significantly decreased the price of smaller image sensors. The change in pricing increases the availability of these types of sensors and, with that, creates an economic incentive to substitute reliable but expensive hardware such as LIDAR with cheaper hardware like image sensors.

The noise level of the image sensors requires computationally heavy algorithms for filtering, but according to Moore’s law, the hardware gets smaller, and with that cheaper, over time.

1.1 Purpose

The intended ETT has previously been evaluated on LIDAR-based data, and the purpose is to validate its performance on distance measurements derived from a stereo image sensor setup. This is interesting due to the physical and economic inconvenience of LIDAR systems compared to image sensors. The goal is to track dynamic objects through stereo image sensor-based data in a robust way.

1.2 Objective

The objective of this thesis is to track dynamic objects located in front of the vehicle. The measurements are derived from a stereo image sensor setup. The focus of the work is to apply the ETT to two types of measurements, simulated and fabricated, with the intention of a scalable method for use in a real-time environment.


2 Problem description

This section provides a description of the problem studied in this thesis. First, a general overview of the whole problem is given, followed by a description of the expected data input to the system. Lastly, the problem of this thesis is formulated, together with its limitations.

2.1 Overview

In this work, a car is assumed to be equipped with a stereo camera setup. No real vehicle setup will be used; instead, a combination of fabricated and simulated data is used. The process of generating simulated data is done in V-REP [19]. V-REP delivers the capability to render 3D environments, given some arbitrarily designed scene. From this scene, one can extract simulated sensor output.

In this work, the sensors of interest are vision sensors, representing the image sensors, and pose sensors, which indicate the pose and change in pose of the object of interest and the ego car. The indicated output from the object of interest is considered the real pose of the object, denoted as its ground truth. The purpose was to create a set of pseudo real-world traffic situations in V-REP and test the ETT in those, with the workflow described in Figure 2.

Even though, as mentioned earlier, image sensors are most potent when used in classification approaches, I want to investigate the capabilities of image sensors in the realm of RADAR- and LIDAR-preferred applications [16]. The setup consists of two identical cameras with some lateral and vertical image sensor resolution. The cameras are also described by a focal length. The cameras are mounted horizontally, some distance apart. By analyzing the difference between points in the images produced by these cameras at the same time, one can estimate the depth of these points [20].

At each time k, a set of measurements describing the target is collected. These measurements are Cartesian coordinates in $\mathbb{R}^2$, where the plane is the ground on which the car is situated.

$$z_k \triangleq \begin{bmatrix} z_{k,1}^T, \cdots, z_{k,l_m}^T \end{bmatrix}, \quad \text{where } z_{k,i} = \begin{bmatrix} x_i \\ y_i \end{bmatrix} \qquad (1)$$

$z_k$ denotes the vector of all measurements of the object of interest at time k, and $(x_i, y_i)$ are the Cartesian coordinates of measurement i.

2.2 Data Description

The simulated data is in raw form: two matrices of size number-of-pixels times three, representing the red, green and blue color channels from each camera, as shown in Figure 3a. The first step in refining the data for use with the tracker is to estimate the disparity of the two images. Calculating the disparity is done using one of two well-known algorithms, semi-global matching or block matching, as described in Section 4.1.1. The resulting data is a matrix with values inversely proportional to the distances, as seen in Figure 3b, of points observed in the left camera image, i.e., Figure 3a. Given these values and the physical characteristics of the camera setup, that is, baseline (b) and focal length (f), one can now go from visual imagery to coordinates in $\mathbb{R}^3$.

In [21], a way to simplify a set of points in $\mathbb{R}^3$ to a set of "points" in $\mathbb{R}^2$ is proposed. This method also gives better precision than single $\mathbb{R}^3$ point measurements. Since all the objects to be tracked are assumed to move along the same plane as the car, the extra dimension is only a cost. The data output from this algorithm was named stick pixel (Stixel), because the structure looks like vertical sticks. These sticks are considered a medium-level representation of 3D objects: less complex than full-scale 3D objects, but more complex than a single point in a point cloud.


Figure 2: Workflow visualizing the overview of the course of action to generate measurements and to test the ETT given the generated measurements.

(a) Raw data from the V-REP simulation, corresponding to the two camera images captured by two vision sensors in V-REP. (b) The corresponding disparity image (gray-scaled linear intensity mapping), describing the pixel differences between the two images shown in Figure 3a, which are inversely proportional to the distances to the pixels.

Figure 3: Images from a stereo camera setup with the corresponding disparity image.


Figure 4: The Stixel measurements projected on the left image sensor, shown in green. Note that the Stixel width equals the pixel width.

2.3 Problem Formulation

To provide the control algorithms of autonomous cars with reliable feedback, one must have a perception that is dynamic and robust. In this thesis I have studied the ability to track dynamic objects using only a stereo camera setup, in a somewhat RADAR-like data tracking approach. The objective is, given a set of measurements defined in (1), to estimate the target state and extension $x_k$ at every time k. The target state is defined as the target pose and change in pose, and the target extent $x_k^f$ describes the physical extension of the tracked target.

$$x_k \triangleq \begin{bmatrix} \bar x_k^T & (x_k^f)^T \end{bmatrix}^T \qquad (2)$$

The ETT is done using only data gathered from visual imagery, where the images are generated in V-REP [19]. I will also use what I call fabricated measurements. This is data purely generated given the left camera's simplified physical characteristics and the position of the camera relative to the contour of the object of interest.

2.3.1 Limitations

Since the objective is to track other potential road users, there is only an interest in tracking objects moving on the same Cartesian $\mathbb{R}^2$ plane as the ego car.

The tracking process will use previously calculated Stixel data, and the process of generating these Stixels will not be taken into consideration when evaluating the performance.

The main focus is on the generation of Stixels and the tracking of the objects defined by these measurements. Tracking multiple objects is not considered in this work; however, the possibility of incorporating this method in a multiple-object tracking algorithm will be considered.

2.3.2 Point in question

• Can V-REP be used as the first step in a process to generate Stixels with characteristics resembling those of the Stixels introduced in [21]?

• How does the proposed ETT perform on Stixel measurements, compared to evaluations previously done on LIDAR measurements?

• To what degree does the ETT performance differ between the resulting simulated measurements and the fabricated measurements?


3 Methodology

This section presents the research and development approach used to implement and test an ETT algorithm fitted for use on a subsection of road users. The goal was to use a tracking algorithm previously applied successfully to LIDAR measurements as sensor input [22], and to show that one could swap a LIDAR measurement unit for a smaller stereo vision camera setup while keeping feasible tracking performance on a subset of road users.

The initial step of the research phase consisted of personal interaction and exchanges with colleagues at Autoliv Sweden, Linköping. This was followed by a step of researching the state of the art in the area of tracking in general. In parallel, a more general self-education process was also under way, focusing on the fundamentals of tracking, filtering and statistics, with the main focus on Bayesian statistics, but also general vector calculus.

3.1 Evaluation

To solve this problem, a quantitative research approach based on deduction was used, since I want to show that a LIDAR can be swapped for a stereo vision camera setup with feasible results. The deduction should answer the research questions formulated in Section 2.3.2, according to Figure 2, as follows.

3.1.1 Question 1

• Identify the result of each step in the process of creating the Stixel measurements.
• Compare the resulting Stixel measurement error to the Stixel error in [21].
• Evaluate the usability of V-REP for setting up a 3D environment, in terms of setup time.

3.1.2 Question 2

For all chosen road user types:

• Identify the accuracy of the object center estimation through multiple test scenarios.
• Identify the accuracy of the orientation estimation of the object of interest through multiple test scenarios.
• Determine the ability to track the object shape by observing the intersection over union (IoU) between the estimated shape and the real shape (a minimal IoU sketch follows this list).
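As a concrete reference for the IoU measure, the following is a minimal sketch using the third-party shapely package; the package choice and the example polygons are assumptions for illustration only (the thesis implementation was in Matlab), not the evaluation code of this work.

```python
# Minimal IoU sketch using shapely (an assumption; the thesis used Matlab).
from shapely.geometry import Polygon

def iou(poly_a: Polygon, poly_b: Polygon) -> float:
    """Intersection over union of two polygonal shapes."""
    union = poly_a.union(poly_b).area
    return poly_a.intersection(poly_b).area / union if union > 0.0 else 0.0

# Hypothetical example: estimated vs. real rectangular car footprint.
estimate = Polygon([(0.0, 0.0), (4.2, 0.0), (4.2, 1.8), (0.0, 1.8)])
truth = Polygon([(0.3, 0.1), (4.5, 0.1), (4.5, 1.9), (0.3, 1.9)])
print(f"IoU = {iou(estimate, truth):.3f}")
```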

3.1.3 Question 3

For all chosen road users:

• How is the object center estimation accuracy affected by changing measurement type?
• How is the object orientation estimation accuracy affected by changing measurement type?
• How is the IoU between the estimated shape and the real shape affected by changing measurement type?


4 Measurement generation theory

This section covers the theory for generating the Stixel measurements by going from stereo images to Cartesian $\mathbb{R}^2$ coordinate measurements. It also defines the method for creating purely fabricated measurements.

4.1 Simulated measurements

The initial step in the process of creating simulated measurements consists of setting up a 3D scene of the desired traffic situation in the software V-REP. All test scenarios will consist of at least two objects: the ego car and the object to apply the algorithm on. Two cameras will be added to the ego car, according to prevalent practice. At each time interval, V-REP outputs two images from the vision sensors, to be run through the semi-global matching algorithm.

The semi-global matching algorithm approximates a disparity between the two images; this disparity is inversely proportional to the distances seen by the cameras.

Given the disparity pixels, one can estimate the probability of occupation of a given physical square on the plane perpendicular to the focal plane of the camera. Given the probabilities of occupation, the amount of free space from the ego car to the closest object can be estimated. Even though the measurements are all in $\mathbb{R}^2$, the Stixel object also contains a height, computed using a height segmentation process.

The distribution of multiple depth points gives rise to individual Stixels at every potential Stixel position. The entire process is represented in Figure 5 and implemented in Matlab 2017a.

4.1.1 Stereo image matching

In this work, two algorithms for stereo image matching are utilized: block matching and semi-global matching. Block matching was first introduced in 1997 by [23], with the idea of real-time approximation of depth data given data from two image sensors with known setup characteristics. Semi-global matching was first proposed by [20] and improved further in [24, 25]. The algorithm assumes two images with known extrinsic and intrinsic parameters and can be applied to both rectified and unrectified images. The basis of semi-global matching is the idea of pixel-wise matching of mutual information in both images and finding the differences in position of those pixels.
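To illustrate how such a disparity estimate can be produced in practice, the sketch below uses OpenCV's semi-global block matching, a widely available implementation of this idea. OpenCV, the parameter values and the file names are assumptions for illustration; this is not the pipeline used in this work.

```python
# Minimal disparity sketch using OpenCV's semi-global block matching.
# OpenCV, parameters and file names are assumptions; the thesis used Matlab.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,   # search range; must be divisible by 16
    blockSize=9,
)
# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(float) / 16.0
```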

4.1.2 Occupancy grid

An occupancy grid M is defined as a matrix in $\mathbb{R}^2$ which models occupancy evidence of some area. An environment defined in $\mathbb{R}^3$ is projected onto a plane P parallel to the ground, assuming a planar surface. P is then discretized into cells (i, j), where every cell of P corresponds to some real-world tetragonal area $A_{ij}$. The intersection of any two cells is always empty, i.e., $A_{ij} \cap A_{kl} = \emptyset$ for $(i, j) \neq (k, l)$. The sub-index ij consists of one lateral component i and a depth component j. The value of every cell of M corresponds to the likelihood of occupancy of that physical area and is defined as D(i, j). Two types of occupancy grids are shown in Figure 6.

The vector $m_k = (u, v, d)^T$ is defined as a measurement, where (u, v) are left-image coordinates, column and row respectively, and d is the disparity of the corresponding pixel. $m_k$ is the projection of some feature $p = (x, y, z)^T$ onto the camera image sensor, such that $m_k = P(p_k)$. The projection is defined as

$$m_k = P(p_k) = \frac{1}{z} \begin{bmatrix} f_u x \\ f_v y \\ f_u B \end{bmatrix} + \begin{bmatrix} u_0 \\ v_0 \\ 0 \end{bmatrix} \qquad (3)$$


Figure 5: The flow of generating measurements using the simulation process (V-REP left/right images, stereo image matching, disparity image, probability distribution over disparity values, occupancy grid, free space computation, road estimation, height segmentation, height estimation, Stixel extraction).

Figure 6: Two types of occupancy grid models. (a) The Cartesian occupancy grid model, where every cell corresponds to some dispersion depending on lateral and longitudinal resolution; black indicates occupation and white non-occupation of the specific cell area. (b) The polar occupancy grid model on a Cartesian coordinate system, where the cells are defined by some longitudinal resolution corresponding to the relation between camera field of view and image sensor resolution.


where $f_u$ and $f_v$ are the horizontal and vertical focal lengths, measured in pixels, and B is the baseline of the cameras. $(u_0, v_0)$ is the principal point of the image. In this work, due to the lack of physical characteristics, $f_u$ and $f_v$ are defined as follows:

$$f_u = \frac{N_w}{2 \tan\left(\frac{\Phi_h}{2}\right)}, \qquad f_v = \frac{N_h}{2 \tan\left(\frac{\Phi_v}{2}\right)} \qquad (4)$$

where $N_w$ and $N_h$ denote the sensor width and height in pixels, respectively, and $\Phi_h$ and $\Phi_v$ denote the horizontal and vertical fields of view, respectively.

The inverse of (3) gives the Cartesian coordinates derived from the disparity and is defined as

$$p_k = P^{-1}(m_k) = \frac{B}{d} \begin{bmatrix} u - u_0 \\ (v - v_0)\frac{f_u}{f_v} \\ f_u \end{bmatrix} \qquad (5)$$

Assume the real measurement vector $\bar m_k$, such that $\bar\xi_k = m_k - \bar m_k$ is the real but unknown error vector, where $m_k$ is noisy. $\bar\xi_k$ is then assumed to be random with zero mean and a probability density function (PDF) defined as $G_{\bar m_k}(\xi_k)$. For generalization purposes, $G_{\bar m_k}(\xi_k)$ is assumed to be multivariate Gaussian:

$$G_{\bar m_k}(\xi_k) = \frac{1}{(2\pi)^{3/2}\sqrt{|\bar\Gamma|}}\, e^{-\frac{1}{2}\xi_k^T \bar\Gamma^{-1} \xi_k} \qquad (6)$$

where $\bar\Gamma$ is the real measurement (disparity estimation) covariance matrix. The goal is to find the function $L_{ij}(m_k)$ that formulates the likelihood of a cell (i, j) being occupied, given that measurement. For every measurement $m_k$, the likelihood at every cell (i, j) is updated according to

$$D(i,j) = \sum_{k=1}^{m} L_{ij}(m_k) \qquad (7)$$

In [26], multiple types of occupancy grids are defined; in this thesis, the focus is on the polar occupancy grid. When defining the occupancy grid as a polar grid, the cells are represented as discretized values of coordinates (u, z), where u corresponds to the column of the image and z to the depth of the grid in real-world coordinates, which handles the problem of resolution decreasing as a function of the distance from the origin. If we assume that the cell (i, j) corresponds to coordinate $(u_{ij}, z_{ij})$, then the likelihood function is defined as

$$L_{ij}(m_k) = G_{\bar m_k}(\xi_k), \qquad \xi_k = \begin{bmatrix} u_{ij} - u & 0 & \frac{f_u B}{z_{ij}} - d \end{bmatrix}^T \qquad (8)$$
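To make (7)–(8) concrete, the sketch below accumulates Gaussian likelihoods from (u, d) measurements into a polar grid. This is a minimal sketch, not the thesis implementation; the grid sizes, camera constants and noise covariance are hypothetical placeholders, and the normalization constant of (6) is dropped.

```python
# Sketch of the polar occupancy grid update in (7)-(8); all constants are
# hypothetical, and the Gaussian is left unnormalized for brevity.
import numpy as np

N_U, N_Z = 640, 128                   # image columns and depth cells
fu, B = 500.0, 0.2                    # focal length [px] and baseline [m]
z_cells = np.linspace(2.0, 50.0, N_Z)
Gamma_inv = np.linalg.inv(np.diag([1.0, 4.0]))  # assumed (u, d) covariance; v ignored
D = np.zeros((N_U, N_Z))

def update_grid(measurements):
    """measurements: iterable of (u, d) pixel/disparity pairs; O(N_U*N_Z) each."""
    for u, d in measurements:
        for i in range(N_U):
            for j in range(N_Z):
                # the non-zero components of xi_k in (8)
                xi = np.array([i - u, fu * B / z_cells[j] - d])
                D[i, j] += np.exp(-0.5 * xi @ Gamma_inv @ xi)
```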

4.1.3 Free space computation

The free space is defined as the integrated space located between the origin of a polar grid and a set of position vectors, defined as the closest objects in some range of angles. As defined in Section 4.1.2, every column correlates to a ray originating at the origin. So the problem consists of finding the closest occupied cell in an occupancy grid defined in polar coordinates, as depicted in Figure 7. In [26], a method is suggested that reformulates the problem from a row-independent problem to a global optimization problem. It also introduces spatial and temporal smoothness, to penalize jumps in depth and deviations between prediction and current solution, respectively.

Segmenting the free space from the closest object is done by defining a graph G(V, E) and finding the minimum-cost path through it. V is a set of vertices, one vertex for every cell of the grid. E is a set of edges, connecting a vertex of every cell in one column to every cell in the previous column. Every edge is associated with a cost, defined by the cost of segmenting the grid through that specific vertex.

Figure 7: The free space computation result as the green colored cells in the polar occupancy grid, with the black cells being the first occupied cells relative to the origin of the grid.

The cost of connecting vertices $V_{ij}$ and $V_{kl}$ is defined as

$$c_{ij,kl} = E_d(i,j) + E_s(i,j,k,l) \qquad (9)$$

where $E_d$ and $E_s$ are the data and smoothness terms, respectively:

$$E_d(i,j) = \frac{1}{D(i,j)} \qquad (10)$$

$$E_s(i,j,k,l) = S(j,l) + T(i,j), \qquad S(j,l) = \begin{cases} C_s\, d(j,l) & \text{if } d(j,l) < T_s \\ C_s T_s & \text{if } d(j,l) \ge T_s \end{cases}, \qquad T(i,j) = \begin{cases} C_t\, d(j,j') & \text{if } d(j,j') < T_t \\ C_t T_t & \text{if } d(j,j') \ge T_t \end{cases} \qquad (11)$$

The term S(j, l) delivers spatial smoothness, while T(i, j) delivers temporal smoothness. The function d(j, l) returns the distance, in meters, between cell rows j and l on the grid. $C_s$ is a parameter that penalizes jumps in depth, while $T_s$ is a threshold that saturates the cost function, to ensure preservation of depth discontinuities. $C_t$ is another cost parameter, while $T_t$ is the maximum distance of saturation.
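A minimal dynamic-programming sketch of this minimum-cost path search over the grid columns follows; the temporal term T(i, j) is omitted for brevity, and the parameter values are hypothetical.

```python
# Dynamic programming sketch of the free-space path search in (9)-(11),
# with the temporal smoothness term omitted. Parameters are hypothetical.
import numpy as np

def free_space_path(D, z_cells, Cs=2.0, Ts=1.0):
    """D: occupancy grid (n_cols x n_depth); returns one depth index per column."""
    n_cols, n_depth = D.shape
    Ed = 1.0 / (D + 1e-9)                       # data term (10)
    # spatial term S(j, l): saturated depth-jump penalty, (11)
    jump = np.abs(z_cells[:, None] - z_cells[None, :])
    S = Cs * np.minimum(jump, Ts)
    cost = Ed[0].copy()
    back = np.zeros((n_cols, n_depth), dtype=int)
    for i in range(1, n_cols):
        total = cost[None, :] + S               # candidates from every previous cell
        back[i] = np.argmin(total, axis=1)
        cost = Ed[i] + np.min(total, axis=1)
    # backtrack the minimum-cost path
    path = [int(np.argmin(cost))]
    for i in range(n_cols - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```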

4.1.4 Height segmentation

Approximating the height of the closest object is done by finding the optimal segmentation between foreground and background, i.e., the optimal path through a cost image. Given a point set $(u, d_u)$ and the corresponding world coordinates $(x_u, z_u)$, obtained from the free space computation in Section 4.1.3, we want to find the row position $v_t$ representing the upper boundary of the object located at $(x_u, z_u)$.

Every disparity point votes for its membership in the foreground; this could be done with a simple Boolean test, allowing some deviation from the expected disparity. In this thesis, I instead use an approximation of the Boolean membership function of the form

$$M_{u,v}(d) = 2^{\left(1 - \left(\frac{d - \hat d_u}{\Delta D_u}\right)^2\right)} - 1 \qquad (12)$$

where $\hat d_u$ is the expected disparity, derived from the free space calculation in Section 4.1.3, and $\Delta D_u$ is a parameter, computed for every column, defined as

$$\Delta D_u = \hat d_u - f_d(z_u + \Delta Z_u), \quad \text{where } f_d(z) = \frac{B f_x}{z} \qquad (13)$$

where $f_d(z)$ is the disparity at depth z, B is the baseline, $f_x$ the focal length and $\Delta Z_u$ an empirical parameter.

The cost image is computed according to

$$C(u,v) = \sum_{i=0}^{v-1} M_{u,v}(d(u,i)) - \sum_{i=v}^{v_f} M_{u,v}(d(u,i)) \qquad (14)$$

where $v_f$ is the row corresponding to the base point of the object, calculated in the free space computation. To find the optimal path, a graph $G_{hs}(V_{hs}, E_{hs})$ is generated, where $V_{hs}$ is a set of vertices with one vertex for every pixel in the image, and $E_{hs}$ is the set of edges connecting every vertex in one column to every vertex in the previous column. The cost function to be minimized is formulated as

$$c_{u,v_0,v_1} = C(u, v_0) + S(u, v_0, v_1), \qquad S(u, v_0, v_1) = C_s |v_0 - v_1| \times \max\left(0,\ 1 - \frac{|z_u - z_{u+1}|}{N_Z}\right) \qquad (15)$$

where Cs is the cost of a jump in height.

4.1.5 Stick-pixel extraction

Given the results from the free space and height segmentation, calculating the Stixels is straightforward. Given a Stixel width wider than one image column, the heights are fused. The discretization of the grid limits the depth resolution of the Stixels, but spatial integration of the disparities leads to higher accuracy.

A parabolic fit is done around the maximum of a PDF of a set of values. These values are defined by the disparity values found inside the area defined by the free space and height segmentation. This method is intended to suppress outliers and thereby noise from the disparity estimation. According to [21], the uncertainty of these Stixel measurements is below 0.1 meters at a measurement distance of 28 meters.
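The sketch below illustrates one way such a parabolic refinement can be done: a histogram over the disparities inside the Stixel's area is built, and a parabola through the peak bin and its two neighbors gives a sub-bin disparity estimate. The bin count is a hypothetical choice, and this is an illustration rather than the exact procedure of [21].

```python
# Sketch of a parabolic fit around the disparity histogram maximum for one Stixel.
# The bin count is a hypothetical choice.
import numpy as np

def stixel_disparity(disparities, bins=64):
    """Refine the dominant disparity by a parabolic fit at the histogram peak."""
    hist, edges = np.histogram(disparities, bins=bins)
    p = int(np.argmax(hist))
    if p == 0 or p == bins - 1:                 # peak at border: nothing to fit
        return 0.5 * (edges[p] + edges[p + 1])
    y0, y1, y2 = hist[p - 1], hist[p], hist[p + 1]
    # vertex of the parabola through the three bins, in units of one bin
    denom = y0 - 2 * y1 + y2
    offset = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom
    width = edges[1] - edges[0]
    return edges[p] + (0.5 + offset) * width    # peak-bin center plus sub-bin offset
```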

4.2 Fabricated measurements

The measurements from the fabrication process are supposed to behave like the Stixels from the real world and from the simulation process. The advantage of fabrication is the ability to generate data with various error levels and to produce test data at a higher rate.

4.2.1 Prerequisites

The process assumes that the intended object to be tracked can be described as a line, or succession of lines, L, constructed from a set of $\mathbb{R}^2$ Cartesian coordinates on the same plane as the ego car.

4.2.2 Fabrication process

A set of position vectors $V \in \mathbb{R}^2$ is defined in the same reference frame, with equal radius. They represent the rays, parallel with the ground, that hit the camera sensor. These rays are evenly distributed over the field of view of the camera, with a resolution of, at most, the field of view divided by the pixel width of the image sensor. This definition of the vectors gives a structure resembling that of the polar occupancy grid described in Section 4.1.2, with the distinction that measurements are obtained instantly, without the free space and height segmentation computations.


At every time interval k, for every ray $v_{k,l}$, the solution of the linear system indicating intersections between vector $v_{k,l}$ and L is sought. Given more than one intersecting point, the point with the shortest Euclidean distance is stored as a measurement.

An error is also added to the measurement, as indicated by the uncertainty mentioned in Section 4.1.5. The vectors representing the light rays are defined as

$$V_k \triangleq \begin{bmatrix} v_{k,1}, \ldots, v_{k,n_m} \end{bmatrix}, \qquad v_{k,l} \triangleq \begin{bmatrix} O_k & \theta_{k,l} & \infty \end{bmatrix}^T \qquad (16)$$

where $O_k$ is the Cartesian coordinate of the convergence point of all vectors, and $\infty$ defines an infinitely large radius. $\theta_{k,l}$ defines the angle of every measurement l according to (17):

$$\begin{bmatrix} \theta_{k,1}, \ldots, \theta_{k,n_a} \end{bmatrix} = \tau - \frac{f_u}{2} + \begin{bmatrix} \frac{f_u}{n_a - 1} \times 1, \ldots, \frac{f_u}{n_a - 1} \times (n_a - 1) \end{bmatrix} \qquad (17)$$

where $f_u$ is the camera's approximated horizontal field of view, defined in (4), and $\tau \in [0, 2\pi]$ is considered the orientation of the camera.
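A minimal sketch of this ray-casting step follows: for each angle from (17), the ray is intersected with every line segment of the contour L, the closest hit is kept, and noise is added as indicated in Section 4.1.5. All parameter values are hypothetical.

```python
# Sketch of the fabrication ray casting in (16)-(17): intersect each ray with the
# contour segments and keep the closest hit. Parameter values are hypothetical.
import numpy as np

def cast_rays(origin, tau, fov, n_rays, segments, sigma=0.1):
    """segments: list of ((x1, y1), (x2, y2)) contour lines L; returns hit points."""
    origin = np.asarray(origin, float)
    hits = []
    angles = tau - fov / 2 + fov * np.arange(n_rays) / (n_rays - 1)
    for th in angles:
        d = np.array([np.cos(th), np.sin(th)])   # ray direction
        best = None
        for (p1, p2) in segments:
            p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
            # solve origin + t*d = p1 + s*(p2 - p1) for t >= 0, 0 <= s <= 1
            A = np.column_stack([d, p1 - p2])
            if abs(np.linalg.det(A)) < 1e-12:
                continue                          # ray parallel to segment
            t, s = np.linalg.solve(A, p1 - origin)
            if t >= 0 and 0 <= s <= 1 and (best is None or t < best):
                best = t
        if best is not None:
            noise = np.random.normal(0.0, sigma, size=2)  # Stixel-like error
            hits.append(origin + best * d + noise)
    return hits
```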


5 Tracking theory

This section covers an introduction to filtering, which is fundamental to tracking. The role of Bayesian statistics in filtering is introduced, together with its relation to nonlinear filtering, and the last part of this section introduces the use of Gaussian processes (GP).

5.1 Tracking introduction

The area of target tracking describes methods for approximating the pose, path and characteristics of interesting objects using some sensor setup. A sensor can be any kind of measuring device; the only requirement is that it can gather information about its surroundings, e.g. light intensity, temperature or ultrasound.

Typical objectives in target tracking consist of identifying the identity and state of unique objects, where the state describes, e.g., position, velocity and other features.

The large set of sources of high uncertainty makes the solution of a target tracking problem highly non-trivial. These uncertainties come from, for example, unpredicted disturbances in the motion of the target, noise in the sensors, and the change in target quantity from one time to the next [27].

5.2 Filtering theory

Filtering is the recursive task of estimating desired parameters, using statistical models, given data from some sensor. In the case of target tracking, the task is to describe the state of the target at every given time, using sensor data describing that specific target [28].

5.2.1 Bayesian reasoning

One well-documented approach for modeling problems and solutions encountered in object tracking is Bayesian reasoning. The filtering methods utilizing this are called recursive Bayesian estimation. The posterior distribution of a state $x_k$ at time k, given a set of measurements $z_{1:k} = [z_1, \ldots, z_k]$, is of interest in this case and can be formulated as

$$p(x_k|z_{1:k}) = p(x_k|z_k, z_{1:k-1}) \qquad (18)$$

Bayes' theorem (19),

$$p(A|B) = \frac{p(A,B)}{p(B)} = \frac{p(B|A)\,p(A)}{p(B)} \qquad (19)$$

formulates the probability of event A, here the states of some object to be tracked, given B, some sensor output, where p(A) and p(B) are the probabilities of the respective events. One can apply Bayes' theorem (19) as the probabilistic solution to the tracking problem [28].

When analyzing object tracking, the complete probabilistic knowledge of the object is described by the joint PDF $p(x^k) = p(x_k, x_{k-1}, \ldots, x_0)$, where $x_k$ represents the state at time k. Define $z^k = (z_1, z_2, \ldots, z_k)$, where $z_i$ is the measurement at time index i, which is the output of the system; then (19) gives $p(x^k|z^k)$ as

$$p(x^k|z^k) = \frac{p(z^k|x^k)\,p(x^k)}{p(z^k)} \qquad (20)$$

$$p(z^k) = \int_{x_k} \cdots \int_{x_0} p(z^k|x_k, \ldots, x_0)\, p(x_k, \ldots, x_0)\, dx_k \cdots dx_0 \qquad (21)$$

$$z^k = (z_1, z_2, \ldots, z_k) = (z_k, z^{k-1}) \Rightarrow \qquad (22)$$


$$p(z^k) = p(z_k, z^{k-1}) = p(z_k|z^{k-1})\, p(z^{k-1})$$

$$p(z^k|x^k) = p(z_k, z^{k-1}|x^k) = p(z_k|z^{k-1}, x^k)\, p(z^{k-1}|x^k) \qquad (23)$$

Causality principle $\Rightarrow$

$$p(z^k|x^k) = p(z_k|z^{k-1}, x^k)\, p(z^{k-1}|x^{k-1}) \qquad (24)$$

Substituting into Bayes' rule (19) gives

$$p(x^k|z^k) = \frac{p(z_k|z^{k-1}, x^k)\, p(x_k|x^{k-1})}{p(z_k|z^{k-1})}\, p(x^{k-1}|z^{k-1}) \qquad (25)$$

Assuming that the measurements at time k depend only on the object state at the corresponding time and are conditionally independent, the measurements at time k are also independent of the measurements at time k − 1. These assumptions, together with the assumption that the system obeys the Markov property, let one write the recursive Bayesian solution as

$$p(x_k|z^k) = \frac{p(z_k|x_k)}{p(z_k|z^{k-1})}\, p(x_k|x_{k-1})\, p(x_{k-1}|z^{k-1}) \qquad (26)$$

and the state conditional density as

$$p(x_k|z^k) = \frac{p(z_k|x_k)}{p(z_k|z^{k-1})} \int p(x_k|x_{k-1})\, p(x_{k-1}|z^{k-1})\, dx_{k-1} \qquad (27)$$

These two equations are fundamental in the design of multiple popular filtering methods, including the Kalman filter (KF). [28]

5.2.2 The Kalman filter

Using the equations derived in Section 5.2.1 and the assumption that the dynamics and measurements of some object are linear and Gaussian, one can consider a time-discrete linear Gaussian system as follows:

$$x_k = F x_{k-1} + v_k, \qquad z_k = H x_k + w_k \qquad (28)$$

with $x_k \in \mathbb{R}^n$ being the state vector and $z_k \in \mathbb{R}^p$ the system output, i.e., its sensor values. The matrix $F \in \mathbb{R}^{n \times n}$ describes the dynamics of the system, hence the evolution of the system over time. The matrix $H \in \mathbb{R}^{p \times n}$ describes how the measurements relate to the state vector. The Gaussian random variable $v \in \mathbb{R}^n$, named the process noise, is introduced to account for errors in the model, in this case F.

The projection matrix H also introduces error, hence the Gaussian random variable w:

$$v_k \sim \mathcal{N}(0, Q), \quad Q \in \mathbb{R}^{n \times n}, \qquad w_k \sim \mathcal{N}(0, R), \quad R \in \mathbb{R}^{p \times p} \qquad (29)$$

Equation (29) states that $v_k$ and $w_k$ are normally distributed with zero mean and covariances Q and R, respectively. The KF is an optimal estimator for a system whose process and measurement noise are Gaussian. The KF uses two steps; the first is the prediction of the state, based on the previous state:

$$\hat x_{k|k-1} = F \hat x_{k-1|k-1}, \qquad P_{k|k-1} = F P_{k-1|k-1} F^T + Q_k \qquad (30)$$

where $\hat x$ is the estimate of the state, $P \in \mathbb{R}^{n \times n}$ is the covariance of the state estimation error, and $Q_k$ is the covariance of the process noise. The notation $k|k-1$ denotes an estimate at time k based on information from time k − 1. The second step is the update step, which includes

$$V_k = z_k - H \hat x_{k|k-1}, \qquad S_k = H P_{k|k-1} H^T + R_k, \qquad K_k = P_{k|k-1} H^T S_k^{-1} \qquad (31)$$

where $V_k$ is known as the innovation, the difference between the real measurements and the predicted measurements, $K_k$ is the Kalman gain, which maps the innovation onto a correction of the predicted state, and $S_k$ is an estimate of the covariance of the innovation. The final step maps the innovation onto the predicted state:

$$\hat x_{k|k} = \hat x_{k|k-1} + K_k V_k, \qquad P_{k|k} = P_{k|k-1} - K_k H P_{k|k-1} \qquad (32)$$

Equations (30), (31) and (32) together depict the classic KF.
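The recursions (30)–(32) translate directly into code; the following is a minimal sketch of one predict-update cycle, not the thesis implementation.

```python
# Minimal transcription of the KF recursions (30)-(32); names follow the equations.
import numpy as np

def kf_step(x, P, z, F, H, Q, R):
    # Prediction, (30)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update quantities, (31)
    V = z - H @ x_pred                   # innovation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    # Correction, (32)
    return x_pred + K @ V, P_pred - K @ H @ P_pred
```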

5.2.3 Non-linear Kalman filter

As depicted above, the KF is applicable when the system is linear and Gaussian. In multiple cases this does not hold; non-linearities can arise if either the motion or the measurement model is nonlinear. A nonlinear system can be modeled as

$$x_k = f(x_{k-1}) + v_k, \qquad z_k = h(x_k) + w_k \qquad (33)$$

where f and h are now functions instead of constant matrices: f describes the new state given the previous state, and h maps the state vector to the measurements. $x_k$ represents the state, while $v_k$ and $w_k$ represent the process and measurement noise, respectively.

The standard KF can no longer be applied if the system is nonlinear. To be able to use the KF with a nonlinear system, a local linear approximation is required. This can be solved analytically, resulting in the extended Kalman filter (EKF) [29]. Utilizing the unscented transformation in the linearization process instead gives rise to the unscented Kalman filter [30]. In this thesis, only the EKF is utilized, and it is therefore derived in Section 5.2.4.

5.2.4 Extended Kalman filter

Assume that the function f in equation (33) is nonlinear. To propagate the covariance through the prediction, one must find the Jacobian of the prediction function at every time k, defined as

$$F_k = \left.\frac{\partial f(x)}{\partial x}\right|_{x = \hat x_{k-1|k-1}} \qquad (34)$$

Given the linearization in (34), the prediction step of the EKF is

$$\hat x_{k|k-1} = f(\hat x_{k-1|k-1}), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k \qquad (35)$$

The nonlinearity of h requires a linearized sensor model, derived in the same way as for the prediction function, i.e.,

$$H_k = \left.\frac{\partial h(x)}{\partial x}\right|_{x = \hat x_{k|k-1}} \qquad (36)$$

The update step then is

$$\begin{aligned} V_k &= z_k - h(\hat x_{k|k-1}) \\ S_k &= H_k P_{k|k-1} H_k^T + R_k \\ K_k &= P_{k|k-1} H_k^T S_k^{-1} \\ \hat x_{k|k} &= \hat x_{k|k-1} + K_k V_k \\ P_{k|k} &= P_{k|k-1} - K_k H_k P_{k|k-1} \end{aligned} \qquad (37)$$

where $V_k$ is the innovation and $S_k$ the approximation of the covariance of the innovation. The matrix $K_k$ is the Kalman gain which, due to the linearization, is no longer the optimal gain as in the linear case.

5.3 Clustering

Assuming noisy data, a method to separate measurement outliers from desired measurements is required. In this work, the data is generated with a high level of discrepancy between object and non-object measurements, which simplifies the gating of measurements from the environment significantly. The measurements are gated given their "height", and then the PDF of a multivariate normal distribution, shown in (38), is applied to cluster the data based on their Cartesian measurements.

$$\mathcal{N}(x) = \frac{1}{\sqrt{(2\pi)^d |P|}}\, e^{-\frac{(x-\mu)^T P^{-1} (x-\mu)}{2}} \qquad (38)$$
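The sketch below illustrates such a two-stage gating: measurements are first kept only if their height lies in a plausible interval, then weighted by the multivariate normal PDF (38) around a predicted target center. The threshold values are hypothetical, and this is an illustration rather than the thesis code.

```python
# Sketch of height gating plus normal-distribution clustering per (38).
# Threshold values are hypothetical.
import numpy as np

def gate(measurements, heights, mu, P, h_min=0.3, h_max=2.5, pdf_min=1e-4):
    """measurements: (n, 2) Cartesian points; heights: (n,) Stixel heights."""
    d = mu.shape[0]
    P_inv = np.linalg.inv(P)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(P))
    kept = []
    for z, h in zip(measurements, heights):
        if not (h_min <= h <= h_max):               # height gate
            continue
        r = z - mu
        pdf = norm * np.exp(-0.5 * r @ P_inv @ r)   # multivariate normal PDF (38)
        if pdf >= pdf_min:
            kept.append(z)
    return np.array(kept)
```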

5.4 Gaussian process

A GP is a set of stochastic variables where any subset has a joint Gaussian distribution. The GP is very suitable for modeling spatially derived measurements. One can consider a GP a distribution over functions. The fundamental characteristics of a Gaussian process are the mean function $\mu(x)$ and covariance function $k(x, x')$ of the function $f(x)$:

$$\mu(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))] \qquad (39)$$

written as a GP

$$f(x) \sim \mathcal{GP}(\mu(x), k(x, x')) \qquad (40)$$

where x is the function input. The GP can be seen as a generalization of the multivariate Gaussian, i.e., an n-dimensional multivariate normal distribution, since the function values evaluated at a finite number of inputs $x_1, \ldots, x_N$ are normally distributed [31]:

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{bmatrix} \sim \mathcal{N}(\mu, K), \quad \text{where } \mu = \begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_N) \end{bmatrix} \qquad (41)$$

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) \end{bmatrix} \qquad (42)$$

The assumption of a zero-valued mean function is the basis for this work; however, the work can be generalized to a non-zero-valued mean function.


5.4.1 Gaussian process regression

The primary usage of the GP is to learn the output of a function given some incorporated training data. Consider a standard linear model

$$y = f(x) + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, R) \qquad (43)$$

where y is a noisy output of the function f at the input vector x, and $\epsilon$ is assumed to be independently distributed Gaussian output noise. The objective is to use the inputs and corresponding outputs of the function to learn the function outputs at other, test inputs $x'$. The joint distribution can be described as before:

$$\begin{bmatrix} y \\ f \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(x, x) + I\sigma^2 & K(x, x') \\ K(x', x) & K(x', x') \end{bmatrix}\right) \qquad (44)$$

Given the joint distribution in (44), one can easily formulate the conditional distribution $p(f|y)$ as

$$p(f|y) = \mathcal{N}(Ay, P) \qquad (45)$$

where

$$A = K(x', x) K_y^{-1}, \qquad P = K(x', x') - K(x', x) K_y^{-1} K(x, x'), \qquad K_y = K(x, x) + I\sigma^2 \qquad (46)$$

5.4.2 Recursive Gaussian process regression

To be able to use a GP without running it off-line in batch mode, one must formulate a recursive solution that allows sequential input of training data at every time interval k. One suggestion is to transform the regression into a KF, with the regression function represented by a finite set of basis vectors, as described in [32].

The test inputs $x'$ are interpreted as basis points and f as the corresponding outputs. By applying the Bayesian reasoning of Section 5.2.1 to the posterior $p(f|y_{1:N})$, one obtains

$$\begin{aligned} p(f|y_{1:N}) &\propto p(y_N|f, y_{1:N-1})\, p(f|y_{1:N-1}) \\ &\propto \ldots\, \underbrace{p(y_k|f, y_{1:k-1}) \ldots p(f)}_{\propto\, p(f|y_{1:k})} \end{aligned} \qquad (47)$$

resulting in the following recursion:

$$p(f|y_{1:k}) \propto p(y_k|f, y_{1:k-1}) \times p(f|y_{1:k-1}) \qquad (48)$$

Given that f is independent of past outputs $y_{1:k-1}$, one can simplify

$$p(y_k|f, y_{1:k-1}) \approx p(y_k|f) \qquad (49)$$

If one could guarantee that these input values are a subset of the training data, the approximation would be exact. How the recursive GP regression can be formulated as a filter to be utilized in this work is further explored in Section 7.3.


6 Extended target tracking

The field of research related to the tracking of objects is well documented, even though most of the documentation is related to single-point tracking [15]. The research field of extended target tracking presents several approaches to the extended target tracking problem. These approaches have different characteristics that make them more or less applicable to the problem presented in this thesis. This section discusses some of these methods and introduces the method used in this thesis.

6.1 Introduction to extended target tracking

The problem of target tracking, as described in Section 5.1, consists of approximating the pose and change in pose of some object given some kind of measurements, most often with high levels of noise in both measurement and estimation. When tracking physical objects in the real world, the objects of interest have a spatial extent, true for both big and small objects. The use of extended target tracking rather than single-point tracking is not related to object characteristics, but to the number of detections a sensor system can expect to acquire. When the resolution of the sensor, relative to the distance to the object and the object's size, is high enough, one object can occupy multiple sensor detections; this leads to the possibility of implementing an ETT.

The state of an extended object models the object's pose and spatial extension. A typical design could include the following in the state vector [33]:

• Position, Cartesian in $\mathbb{R}^2$ or $\mathbb{R}^3$.
• Kinematic state, such as velocity, acceleration and orientation.
• Extent state: parameters describing the spatial extension of the object.

6.2 Existing methods

Extended target tracking is a research area that receives interest from a large set of researchers, but the resulting methods have not necessarily been tested and applied in an automotive environment. This is one important factor to keep in mind when evaluating the method used to solve the problem presented in this work.

A historically common way to model an extended target is the random matrix approach, first presented in [34]. The idea is to model the measurements as a Poisson point process, assuming spatial measurements randomly distributed around some object, as described in [35, 36]. The target model is defined by a state vector consisting of a kinematic state vector $x_k$ and an extent matrix $X_k$: $x_k$ defines the target's position and motion, while $X_k$ is a d × d matrix describing the extent of the target. $X_k$ is modeled as symmetric and positive definite, which implies the shape of an ellipse. Applications and more detailed explanations are exemplified in [37–41]. One could also formulate the object of interest as a rectangular shape and track the target extent as its width and length, as described in [42].

6.3 Proposed method

Firstly, when tracking dynamic objects, in this case objects that can be expected to reside in a traffic environment, one must utilize a method that is dynamic over this range of shapes. The method must also be computationally efficient, due to the limitations of the hardware carried by automobiles. During this work, the method presented in [22] is applied and evaluated in a new measurement environment. The Stixel measurements are gated based on physical characteristics and clustered assuming a normal distribution.


(a) Star-convex target shape. (b) Radial function.

Figure 8: Depiction of the star-convex shape with a radial function r = f(θ).

The model is initialized with the target's extent and updated with the filtered data. The full derivation of this method, applied to LIDAR-based data, is presented in [22], and the targets tracked in [22] share the same physical characteristics as the ones tracked in this work.

6.3.1 Gaussian process

The idea is to use a GP to model the unknown boundary of an object of interest. The GP is flexible enough to represent the types of objects described in the problem formulation. This kind of dynamic model should deliver more precise predictions of measurements than both elliptical and rectangular models. The method also has the ability to learn and track the shape of unknown objects, which provides better accuracy.

The extent of the object will be defined as a star-convex shape, otherwise known as a radially convex shape. A set $S \subseteq \mathbb{R}^n$ is radially convex if there exists an $x_0$ such that, for all x in S, all points on the segment $\overrightarrow{x_0 x}$ are confined within S. One can illustrate this by letting S be a physical region surrounded by a wall: S is a star-convex shape if one can find a position in S from which every point of the wall can be observed. One example is shown in Figure 8.

Given the star-convex shape, the measurement model used in this thesis assumes detections on the contour of the object. The measurement model is defined as

$$z_{k,l} = x_k^c + p(\theta_{k,l})\, f(\theta_{k,l}) + e_{k,l} \qquad (50)$$

where $x_k^c$ is the center of the target at time k and $z_{k,l}$ are the $n_k$ measurements at time index k. $\theta_{k,l}$ represents the angles on the object describing the shape of the target, as shown in Figure 8. $e_{k,l} \sim \mathcal{N}(0, R)$ represents the measurement noise, assumed Gaussian with covariance R, and $p(\theta_{k,l})$ is the orientation vector defined as

$$p(\theta_{k,l}) \triangleq \begin{bmatrix} \cos(\theta_{k,l}) & \sin(\theta_{k,l}) \end{bmatrix}^T \qquad (51)$$

As mentioned earlier, the aim is to describe the contour of some object of interest with a GP. The input to this model will be the angles θ of a star-convex shape, as presented in Figure 8, and the output the radii r. As mentioned in Section 5.4, the GP is defined by its mean and covariance functions.

In this thesis, the mean function µ(θ) = r is considered constant but unknown, and can be considered the mean radius of the target contour:

$$f(\theta) \sim \mathcal{GP}(r, k(\theta, \theta')), \quad \text{where } r \sim \mathcal{N}(0, \sigma_r^2) \qquad (52)$$

The other crucial part of the GP is the covariance function, which encodes the assumptions made on the function to be learned. The most common choice is the squared exponential [31],

$$k(\theta, \theta') = \sigma_f^2\, e^{-\frac{(\theta - \theta')^2}{2l^2}} \qquad (53)$$

where l is the length scale of the function sought to be learned and $\sigma_f^2$ is the signal amplitude variance. One can note that the correlation decreases as the angles move far apart and increases when the angles are close. The covariance function needs to take into account that $k(\theta, \theta') = k(\theta, \theta' + 2\pi)$; this is achieved by slightly modifying the initial covariance function in (53). Also, the contribution of the mean function is added [43] to the squared exponential, resulting in the following covariance function:

$$k(\theta, \theta') = \sigma_f^2\, e^{-\frac{2 \sin^2\left(\frac{|\theta - \theta'|}{2}\right)}{l^2}} + \sigma_r^2 \qquad (54)$$
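A small sketch of the covariance function (54) follows: it wraps at 2π, so the angles θ and θ + 2π receive identical correlations. The hyperparameter values are hypothetical.

```python
# Sketch of the periodic covariance function (54); hyperparameters are hypothetical.
import numpy as np

def k_periodic(theta_a, theta_b, sigma_f=1.0, length=np.pi / 4, sigma_r=0.5):
    """Covariance matrix between two vectors of contour angles, periodic in 2*pi."""
    diff = np.abs(theta_a[:, None] - theta_b[None, :])
    return sigma_f**2 * np.exp(-2 * np.sin(diff / 2)**2 / length**2) + sigma_r**2

# The wrap-around property k(theta, theta') = k(theta, theta' + 2*pi):
t = np.array([0.3])
assert np.allclose(k_periodic(t, t + 1.0), k_periodic(t, t + 1.0 + 2 * np.pi))
```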


7 System modeling

The proposed algorithm requires a sensor model and a model that compensates for the position of the ego car. For the measurements to be used for tracking, they need to be clustered and gated accordingly, as mentioned in Section 5.3.

7.1 Reference model

Given that the objective is to track objects relative to the ego car, the reference point is chosen as the Cartesian world coordinate in $\mathbb{R}^3$ of the center of the left camera. The left camera is used because the stereo matching algorithms use the left image as their projection surface. The size of the image sensor is considered negligible; hence the area of the sensor surface is set to zero, and all incoming light rays converge at the same point, as shown in Figure 9a.

In the tracking algorithm, the $\mathbb{R}^3$ coordinate system is projected onto the plane on which the ego car and the object of interest are located, rendering a coordinate system in $\mathbb{R}^2$, as shown in Figure 9b.

The center and extension of the tracked object are defined inside the simulation environment V-REP during measurement generation and approximated with the suggested method.

7.2 Target state vector

Since we want to estimate the extent of the target $x_k^f$ jointly with the target state $\bar x_k$, we consider the augmented target state $x_k$ as

$$x_k \triangleq \begin{bmatrix} \bar x_k^T & (x_k^f)^T \end{bmatrix}^T, \quad \text{where} \qquad (55a)$$

$$\bar x_k \triangleq \begin{bmatrix} (x_k^c)^T & \psi_k & (x_k^*)^T \end{bmatrix}^T \qquad (55b)$$

$x_k^c$ is defined as the center of the target and $\psi_k$ is the target orientation, together forming the pose, and $x_k^*$ are some optional states; in this work, these correspond to the rate of change of the pose. Note that $x_k^f$ is the set of radii r describing the star-convex shape, that is, the object, as depicted in Figure 8.

7.3 Measurement model

Given the measurement generation processes, multiple detections of the object of interest are gathered in $\mathbb{R}^2$. Measurements are detected throughout the specified horizontal field of view of the camera, as illustrated in Figure 9b.

Figure 9: (a) Description of the coordinate reference system. (b) Illustration of the field of view of the ego car and of how the image sensor in the camera perceives the object of interest; the red coloring indicates continuous detections of the object.


Figure 10: $y_G/x_G$ and $y_L/x_L$ are the basis vectors of the global and local coordinate systems, respectively. $x_k^c$ is the position vector of the target center. $z_{k,l}$ are the measurements on the contour of the target; each measurement is at some angle $\theta_{k,l}$ relative to $x_k^c$, which can be expressed in global coordinates $\theta_{k,l}^G$ or local coordinates $\theta_{k,l}^L$. The angle $\psi_k$ is the orientation of the object.

To process these detections inside a tracking framework, the definition of a measurement model is desirable. The measurement model describes the relationship between the state of the object and the detections, or measurements, on the object. In this thesis I define the measurement model as

$$z_k = \begin{bmatrix} z_{k,1}^T, \ldots, z_{k,N_k}^T \end{bmatrix} \qquad (56)$$

for $N_k$ measurements at time k.

As presented in Section 5.4.1, one can formulate the likelihood $p(z_k|f)$ as

$$\begin{bmatrix} z_k \\ f \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(u_k, u_k) + R & K(u_k, u^f) \\ K(u^f, u_k) & K(u^f, u^f) \end{bmatrix}\right) \qquad (57a)$$

$$p(z_k|f) = \mathcal{N}(z_k \mid H_k^f f, R_k^f) \qquad (57b)$$

$$p(f) = \mathcal{N}(0, P_0^f) \qquad (57c)$$

where $u_k$ are the angles associated with the measurement points, in local coordinates $\mathbb{R}^2$ relative to $x_k^c$, and $u^f$ are the angles of the training data, hence the basis points, as shown in Figure 10. The measurement model and covariance matrix are defined as

$$H^f(u_k) = K(u_k, u^f)\left[K(u^f, u^f)\right]^{-1} \qquad (58a)$$

$$R^f(u_k) = k(u_k, u_k) + R - H^f K(u^f, u_k), \qquad P_0^f = K(u^f, u^f) \qquad (58b)$$

This likelihood is exploited to formulate the following recursion for the measurement model:

$$z_k = H^f(u_k)\, x_k^f + e_k^f \qquad (59)$$

With this measurement model, one can formulate a KF for recursive regression accordingly:

$$\begin{aligned} x_{k+1}^f &= x_k^f \\ z_k &= H^f(u_k)\, x_k^f + e_k^f, \quad e_k^f \sim \mathcal{N}(0, R^f(u_k)) \\ x_0^f &\sim \mathcal{N}(0, P_0^f) \end{aligned} \qquad (60)$$

and with added dynamical behavior

$$x_{k+1}^f = F^f x_k^f + w_k^f, \quad w_k^f \sim \mathcal{N}(0, Q^f) \qquad (61)$$

where

$$F^f = e^{-\alpha T} I, \qquad Q^f = \left(1 - e^{-2\alpha T}\right) K(u^f, u^f) \qquad (62)$$

Here, α determines the speed of the dynamics and is used as a forgetting factor: the higher α, the less weight is given to older measurements.
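The extent dynamics (62) reduce to a few lines of code; the sketch below is an illustration in which α, T, the number of basis angles and the kernel hyperparameters are all hypothetical choices.

```python
# Sketch of the extent dynamics (62); alpha, T, basis angles and kernel
# hyperparameters are hypothetical.
import numpy as np

alpha, T = 1e-4, 0.1                                  # forgetting factor, sampling time
u_f = np.linspace(0, 2 * np.pi, 20, endpoint=False)   # GP basis angles u^f
diff = np.abs(u_f[:, None] - u_f[None, :])
K_ff = 1.0 * np.exp(-2 * np.sin(diff / 2) ** 2 / (np.pi / 4) ** 2) + 0.5**2  # kernel (54)
Ff = np.exp(-alpha * T) * np.eye(len(u_f))            # F^f in (62)
Qf = (1 - np.exp(-2 * alpha * T)) * K_ff              # Q^f in (62)
```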

7.4 Augmented state space model

In this section, the complete augmented state space model is presented. It consists of the dynamic model (61), the motion model (69) and the measurement model (50), in the form

$$\begin{aligned} x_{k+1} &= F x_k + w_k, \quad w_k \sim \mathcal{N}(0, Q_k) \\ z_{k,l} &= h_{k,l}(x_k) + e_{k,l}, \quad e_{k,l} \sim \mathcal{N}(0, R_{k,l}) \\ x_0 &\sim \mathcal{N}(\mu_0, P_0) \end{aligned} \qquad (63)$$

where the state vector $x_k$ is presented in Section 7.2.

7.4.1 Measurement equation

For every measurement $z_{k,l}$ at time k, there is an associated angle $\theta_{k,l}^G$, given by the angular position relative to $x_k^c$, illustrated in Figure 10, where

$$\theta_{k,l}^G = \angle(z_{k,l} - x_k^c) \qquad (64a)$$

The angle can also be expressed in local coordinates ($x_L/y_L$) by

$$\theta_{k,l}^L = \theta_{k,l}^G - \psi_k \qquad (64b)$$

The angles can now be used to describe the relation between state and measurements:

$$z_{k,l} = x_k^c + p_{k,l}(x_k^c)\, f(\theta_{k,l}^L) + \bar e_{k,l}, \quad \bar e_{k,l} \sim \mathcal{N}(0, R) \qquad (65)$$

Combining (51) and (64a) gives

$$p_{k,l}(x_k^c) = p_{k,l}\!\left(\theta_{k,l}^G(x_k^c)\right) = \frac{z_{k,l} - x_k^c}{\|z_{k,l} - x_k^c\|} \qquad (66)$$

By using the model derived in Section 7.3, the measurement equation can be formulated as

$$\begin{aligned} z_{k,l} &= x_k^c + p_{k,l}(x_k^c)\left[ H^f\!\left(\theta_{k,l}^L(x_k^c, \psi_k)\right) x_k^f + e_{k,l}^f \right] + \bar e_{k,l}^f \\ &= \underbrace{x_k^c + \tilde H_l(x_k^c, \psi_k)\, x_k^f}_{h_{k,l}(x_k)} + \underbrace{p_{k,l}(x_k^c)\, e_{k,l}^f + \bar e_{k,l}^f}_{e_{k,l}} \\ &= h_{k,l}(x_k) + e_{k,l}, \quad e_{k,l} \sim \mathcal{N}(0, R_{k,l}) \end{aligned} \qquad (67)$$


where the new measurement model and noise covariance are given by

$$\tilde H_l(x_k^c, \psi_k) = p_{k,l}(x_k^c)\, H^f\!\left(\theta_{k,l}^L(x_k^c, \psi_k)\right) \qquad (68a)$$

$$h_{k,l}(x_k) = x_k^c + \tilde H_l(x_k^c, \psi_k)\, x_k^f \qquad (68b)$$

$$R_{k,l} = p_{k,l} R_{k,l}^f p_{k,l}^T + R \qquad (68c)$$

$$p_{k,l} = p_{k,l}(x_k^c), \qquad R_{k,l}^f = R^f\!\left(\theta_{k,l}^L(x_k^c, \psi_k)\right) \qquad (68d)$$

7.4.2 Motion equation

The target state described in (55b) is modeled using the same linear state space form, as discussed earlier:

$$\bar x_{k+1} = \bar F \bar x_k + \bar w_k, \quad \bar w_k \sim \mathcal{N}(0, \bar Q_k), \qquad \bar x_0 \sim \mathcal{N}(\bar\mu_0, \bar P_0) \qquad (69)$$

which is a common formulation of a motion model for a target tracker. Combining this equation with the dynamical description of the target extent given in (61), one can formulate the augmented description as

$$\begin{aligned} x_{k+1} &= F x_k + w_k, \quad w_k \sim \mathcal{N}(0, Q) \\ x_0 &\sim \mathcal{N}(\mu_0, P_0) \end{aligned} \tag{70}$$

given the following

$$x_k = \begin{bmatrix} \bar{x}_k \\ x_k^f \end{bmatrix}, \quad F = \begin{bmatrix} \bar{F} & 0 \\ 0 & F^f \end{bmatrix}, \quad Q = \begin{bmatrix} \bar{Q} & 0 \\ 0 & Q^f \end{bmatrix} \tag{71a}$$

$$\mu_0 = \begin{bmatrix} \bar{\mu}_0 \\ 0 \end{bmatrix}, \qquad P_0 = \begin{bmatrix} \bar{P}_0 & 0 \\ 0 & P_0^f \end{bmatrix} \tag{71b}$$

where $F^f$ and $Q^f$ are given by (62) and $P_0^f$ by (58). $\bar{F}$ and $\bar{Q}$ are defined by implementing a time-discrete version of the well-known constant velocity model, where $\sigma_q^2$ and $\sigma_{q_\psi}^2$ are the process noise variances of position and orientation, respectively, and $T$ is the sampling time.

$$\bar{F} = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix} \otimes I_3, \qquad \bar{Q} = \begin{bmatrix} \frac{T^3}{3} & \frac{T^2}{2} \\ \frac{T^2}{2} & T \end{bmatrix} \otimes \begin{bmatrix} \sigma_q^2 & 0 & 0 \\ 0 & \sigma_q^2 & 0 \\ 0 & 0 & \sigma_{q_\psi}^2 \end{bmatrix} \tag{72}$$

where ⊗ denotes the Kronecker product.
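Since (72) is built from Kronecker products, its construction is mechanical; a minimal NumPy sketch, with parameter names assumed:

```python
import numpy as np

def cv_model(T, sigma_q, sigma_q_psi):
    """Discrete-time constant velocity model (72); the state ordering is
    [x, y, psi, x_dot, y_dot, psi_dot]."""
    A = np.array([[1.0, T],
                  [0.0, 1.0]])
    B = np.array([[T**3 / 3.0, T**2 / 2.0],
                  [T**2 / 2.0, T]])
    S = np.diag([sigma_q**2, sigma_q**2, sigma_q_psi**2])
    F_bar = np.kron(A, np.eye(3))
    Q_bar = np.kron(B, S)
    return F_bar, Q_bar
```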

7.4.3 Model inference

The state space model presented in the earlier sections makes it possible to utilize standard, well-documented methods for computing the posterior distribution of the state vector. In this thesis, as mentioned earlier, the method chosen for this task is the EKF. The EKF is applied to the augmented state, where all measurements are collected at time $k$ before the filter is applied:

$$z_k = \begin{bmatrix} z_{k,1}^T, \ldots, z_{k,n_k}^T \end{bmatrix}^T \tag{73a}$$

$$R_k = \operatorname{diag}\left(R_{k,1}, \ldots, R_{k,n_k}\right) \tag{73b}$$

$$h_k(x_k) = \begin{bmatrix} h_{k,1}(x_k)^T, \ldots, h_{k,n_k}(x_k)^T \end{bmatrix}^T \tag{73c}$$
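The stacking in (73) amounts to concatenation and a block-diagonal noise covariance; a minimal sketch:

```python
import numpy as np
from scipy.linalg import block_diag

def stack_measurements(z_list, R_list, h_list):
    """Stack per-measurement quantities into z_k, R_k and h_k(x_k), (73a)-(73c)."""
    z_k = np.concatenate(z_list)   # (73a)
    R_k = block_diag(*R_list)      # (73b)
    h_k = np.concatenate(h_list)   # (73c)
    return z_k, R_k, h_k
```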


Given the measurement equation and the motion equation, one can formulate the corresponding augmented state space description as follows:

$$\begin{aligned} x_{k+1} &= F x_k + w_k, \quad w_k \sim \mathcal{N}(0, Q) \\ z_k &= h_k(x_k) + e_k, \quad e_k \sim \mathcal{N}(0, R_k) \\ x_0 &\sim \mathcal{N}(\mu_0, P_0) \end{aligned} \tag{74}$$

Given a nonlinear filtering technique, this description can be used to find an estimate $\hat{x}_k$.

7.5 Extended Kalman filter recursions

After initialization of the estimate $\hat{x}_{0|-1} = \mu_0$ and the corresponding covariance $P_{0|-1} = P_0$, the EKF is updated for every time $k$: the time update is followed by the measurement update, sequentially for every $k$.

7.5.1 Time update

The time update step of the EKF is defined as

$$\begin{aligned} \hat{x}_{k|k-1} &= F \hat{x}_{k-1|k-1} \\ P_{k|k-1} &= F P_{k-1|k-1} F^T + Q \end{aligned} \tag{75}$$

Note that $F$ is a constant matrix due to the linear dynamic model; hence there is no need to compute a Jacobian.

7.5.2 Measurement update

The update step of the EKF, for the specified measurement model, is defined as

$$\begin{aligned} \hat{z}_k &= h_k(\hat{x}_{k|k-1}) \\ V_k &= z_k - \hat{z}_k \\ S_k &= H_k P_{k|k-1} H_k^T + R_k \\ K_k &= P_{k|k-1} H_k^T S_k^{-1} \\ \hat{x}_{k|k} &= \hat{x}_{k|k-1} + K_k V_k \\ P_{k|k} &= P_{k|k-1} - K_k H_k P_{k|k-1} \end{aligned} \tag{76}$$

Note that the EKF requires the gradient of the measurement function, $\frac{dh_k(x)}{dx}$; this gradient is derived in the following section.
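One full EKF recursion over (75)-(76) could look as follows; a minimal sketch, where h_fun and H_fun evaluate the measurement function and its Jacobian (Section 7.5.3), and all names are illustrative.

```python
import numpy as np

def ekf_step(x, P, z, F, Q, h_fun, H_fun, R):
    """One EKF cycle: time update (75) followed by measurement update (76)."""
    # Time update -- F is constant, so no Jacobian is needed here
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Measurement update, linearized around the prediction
    H = H_fun(x_pred)
    V = z - h_fun(x_pred)                  # innovation
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_upd = x_pred + K @ V
    P_upd = P_pred - K @ H @ P_pred
    return x_upd, P_upd
```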


7.5.3 Measurement gradient

The derivative of $h_k(x_k)$ is as follows, and is computed for every time $k$:

$$H_k = \frac{dh_k(x_k)}{dx_k} \tag{77a}$$

$$= \begin{bmatrix} \frac{d}{dx_k} h_{k,1}(x_k)^T, \ldots, \frac{d}{dx_k} h_{k,n_k}(x_k)^T \end{bmatrix}^T \tag{77b}$$

$$\frac{d}{dx_k} h_{k,l}(x_k) = \begin{bmatrix} \frac{dh_{k,l}(x_k)}{dx_k^c} & \frac{dh_{k,l}(x_k)}{d\psi_k} & 0 & \frac{dh_{k,l}(x_k)}{dx_k^f} \end{bmatrix} \tag{77c}$$

where the zero block corresponds to the velocity states, on which $h_{k,l}$ does not depend, and

$$\frac{dh_{k,l}(x_k)}{dx_k^c} = I + \frac{\partial p_{k,l}(w)}{\partial w} H^f(u)\, x^f + p_{k,l}(w) \frac{\partial H^f(u)}{\partial u}\, x^f \frac{\partial \theta_{k,l}^G(w)}{\partial w} \tag{77d}$$

$$\frac{dh_{k,l}(x_k)}{d\psi_k} = -p_{k,l}(w) \frac{\partial H^f(u)}{\partial u}\, x^f \tag{77e}$$

$$\frac{dh_{k,l}(x_k)}{dx_k^f} = p_{k,l}(w)\, H^f(u) \tag{77f}$$

where $w = x_k^c$ and $u = \theta_{k,l}^L(x_k^c, \psi_k)$, with

$$\frac{\partial \theta_{k,l}^G(w)}{\partial w} = \frac{1}{\left\|z_{k,l} - w\right\|^2} \begin{bmatrix} z_{k,l}^y - w^y, & w^x - z_{k,l}^x \end{bmatrix} \tag{77g}$$

$$\frac{\partial p_{k,l}(w)}{\partial w} = \frac{(z_{k,l} - w)(z_{k,l} - w)^T}{\left\|z_{k,l} - w\right\|^3} - \frac{1}{\left\|z_{k,l} - w\right\|} I \tag{77h}$$

$$\frac{\partial H^f(u)}{\partial u} = \frac{\partial K(u, u^f)}{\partial u} \left[K(u^f, u^f)\right]^{-1} \tag{77i}$$

$$\frac{\partial K(u, u^f)}{\partial u} = \begin{bmatrix} k'(u, u_1^f), \ldots, k'(u, u_{N_f}^f) \end{bmatrix} \tag{77j}$$

$$k'(u, u_i^f) = -\frac{1}{l^2} \sin(u - u_i^f)\, k(u, u_i^f) \tag{77k}$$
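The kernel derivative (77k) is easy to verify numerically; the sketch below assumes the periodic squared-exponential kernel implied by (77k) and checks the analytic derivative against a central finite difference.

```python
import numpy as np

def k_per(u, v, sigma_f=1.0, l=0.5):
    """Periodic squared-exponential kernel over angles."""
    return sigma_f**2 * np.exp(-2.0 * np.sin((u - v) / 2.0)**2 / l**2)

def dk_du(u, v, sigma_f=1.0, l=0.5):
    """Analytic derivative (77k): dk/du = -(1/l^2) sin(u - v) k(u, v)."""
    return -np.sin(u - v) / l**2 * k_per(u, v, sigma_f, l)

# Central finite-difference check at an arbitrary angle pair
u, v, eps = 0.7, -0.3, 1e-6
numeric = (k_per(u + eps, v) - k_per(u - eps, v)) / (2.0 * eps)
assert np.isclose(numeric, dk_du(u, v))
```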


8 Implementation

In this section, the implementation of this work is presented. The implementation covers tracking of the position as well as the extent of an object of interest. The tracker is evaluated in a set of environments, on measurements derived with two separate methods, and the processes for generating these measurements are also presented: first, the simulation process that, given a 3D-simulation environment with vision sensors, captures images rendered in that environment; second, the method for creating measurements defined purely by expected object characteristics.

8.1 Data generation

V-REP simulates the movement of a target and its extent by utilizing a 3D-object. The 3D-object comes in three different configurations, representing a car, a person and a cyclist, as seen in Figure 11. These shapes are animated along a predetermined path, which has all its point segments on the same plane as the ego car but below the reference frame of the camera. At a defined sample rate $T$, the position and orientation of the object are forwarded from V-REP. Figure 12 shows an example of this setup.
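Polling the target pose from V-REP can be done through its legacy Python remote API; a hedged sketch, assuming the `vrep.py` bindings are available and that the target object is named 'Car' in the scene (both assumptions, not confirmed by the thesis):

```python
import vrep  # V-REP legacy remote API bindings

# Connect to a V-REP instance with a remote API server on port 19999
client = vrep.simxStart('127.0.0.1', 19999, True, True, 5000, 5)
_, target = vrep.simxGetObjectHandle(client, 'Car', vrep.simx_opmode_blocking)

# At every sample time T, read the pose in the world frame (handle -1)
_, pos = vrep.simxGetObjectPosition(client, target, -1, vrep.simx_opmode_blocking)
_, ori = vrep.simxGetObjectOrientation(client, target, -1, vrep.simx_opmode_blocking)
```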

The ego car is equipped with two cameras, mounted in a stereo camera rig with a given baseline. Both cameras capture an image at the same sample rate as the position of the target is advanced.

The images are forwarded to the simulation process, while the position is forwarded to the fabrication process and the two identical evaluation steps.

The difference in output between the simulation process and the fabrication process is that the $\mathbb{R}^2$ measurements from the simulation process have an approximated physical height. This property is not utilized by the tracking method used in this thesis, but is required during the simulation process.

In the tracking block, the pose and change in pose, together with the extent of the target, are estimated.

In the evaluation block, the ground truth from the target descriptor is compared to the target pose estimate from the tracking algorithm. An overview of the entire data generation process is visualized in Figure 13.

8.1.1 Simulation

As implemented in this thesis, the computation of the simulated data is highly demanding, which made it necessary to segment the simulation runs into smaller parts so that each step could be verified separately. The final design of the simulation process is therefore separated into two stages, as seen in Figure 14a.

The first stage consists of running a stereo matching algorithm on the stereo images, as described in Section 4.1.1. The resulting disparity image, together with the two images,

Figure 11: Rendition of the 3D-models used to represent road users: (a) Car, (b) Person, (c) Cyclist.


Figure 12: A car on a predetermined path (cyan) of movement in the 3D-environment of V-REP. One can also see the cameras' perspective cones (blue), describing their field-of-view.

Figure 13: The method for generating measurements and evaluating the proposed tracker is divided into two separate cases, simulated and fabricated measurements.


(a) The simulation process is separated into two stages with intermittent storage. The simulation process is a sequential set of optimization operations that delivers measurements with real-world characteristics.

(b) The fabrication process is defined as the linear equation describing the relation between the optical rays hitting the camera sensor and the linear equations describing the contour of the object of interest.

Figure 14: The two different methods for generating measurements.

are stored locally. The two stereo images are stored so that the resulting disparity can be verified or re-tested after parameter tweaking. An example of the left image with the corresponding disparity is shown in Figure 15a and Figure 15b, respectively.

Stage two loads the disparity and calculates the occupancy grid according to Section 4.1.2; an example occupancy grid is shown in Figure 15c.

Calculating the occupancy grid has, in its worst case, a time complexity of $\sim O(n^4)$.

Given the occupancy grid, the basis of the Stixels, the free space, is approximated, followed by the approximation of the height, as shown in Figure 15d. The last part is to define each Stixel in world coordinates, as shown projected onto the image in Figure 15e and from a top view in Figure 15f.
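Defining a Stixel in world coordinates from its image column and disparity follows standard pinhole stereo geometry; a minimal sketch, where the focal length f (pixels), baseline b (meters) and principal point cx are assumed camera parameters:

```python
import numpy as np

def stixel_to_world(u, d, f, b, cx):
    """Back-project a Stixel at image column u with disparity d (pixels)
    to ground-plane coordinates (lateral X, depth Z)."""
    Z = f * b / d              # depth from disparity
    X = (u - cx) * Z / f       # lateral offset from the optical axis
    return np.array([X, Z])
```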

8.1.2 Fabrication

The fabrication of measurements was introduced to provide a faster, more reliable alternative to the computationally heavy simulation process; the fabrication process is more than 60 times faster.

The first step in the fabrication process is to define the position vectors representing the light rays hitting the camera optics. As defined earlier, the sensor area is assumed to be zero, so all rays converge at exactly the same point. This allows the vectors to be defined by only the convergence point and the relative angle. Since the measurements are assumed to be in $\mathbb{R}^2$, where the plane is the road, the only rays of interest are those parallel to the ground.

The vectors are assumed to be evenly distributed over the camera's specified field of view. The radius is assumed to be $\infty$; in practice, some value large enough to guarantee complete overlap with the object of interest over the entire simulation period. These vectors are illustrated in Figure 16a.

The second step is to define the contour of the object of interest as a linear function. This is done with the target descriptor, which contains position, orientation and extent.


(a) Left image, produced by the camera that constitutes the reference point for the measurements.

(b) The disparity image, calculated with the semi-global matching algorithm. The color bar indicates distance in meters.

(c) The occupancy grid; the probability of occupancy in each pixel/cell increases from blue to yellow. Note that the occupancy grid is visualized in polar coordinates.

(d) The approximated free space (green) together with the approximated height (red). Note the unfiltered data with a high level of noise.

(e) The Stixels projected onto the image.

(f) The calculated measurements corresponding to the green Stixels in Figure 15e, as seen from a bird's-eye view, perpendicular to the ground. Note the scale on the y-axis, which represents measurements in depth.

Figure 15: The multiple stages of the process of generating simulated measurements.


(a) The optical rays hitting the sensor, given some field of view, discretized at most to the sensor resolution (axes: lateral distance (m) versus depth distance (m)).

(b) The solution (red circles) of the relation between the linear function describing the contour of an object (green) and the vectors (multiple colors) representing the optical rays hitting the camera sensor; that is, the fabricated measurements with zero measurement noise.

Figure 16: Description of the light-ray vector representation, at a low level of discretization and with short radii for illustrative reasons. In this case the vectors are separated with a resolution of $\pi/120$ radians.

Figure 17: $z_k$ are the measurements at time $k$, and $z'_k$ is the subset of $z_k$ after gating and clustering have been applied. $x_{k|k-1}$ and $x_{k|k}$ are the state vectors at time $k$, given the prediction and measurement at time $k-1$ and $k$, respectively.

The last step consists of solving the linear equation describing the relation between every vector and the function describing the contour of the object. A representation of this solution is exemplified in Figure 16, assuming a semi-transparent target.
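One way this solution step could be realized is shown below: each ray from the camera (placed at the origin) is intersected with a contour segment by solving a 2-by-2 linear system. A minimal sketch, assuming zero measurement noise and a contour given as line segments; it mirrors the idea of Figure 16b rather than the thesis code.

```python
import numpy as np

def ray_segment_hits(angles, p0, p1):
    """Intersect rays from the origin at the given angles with the contour
    segment p0 -> p1; solves t*[cos a, sin a] = p0 + s*(p1 - p0)."""
    hits = []
    for a in angles:
        d = np.array([np.cos(a), np.sin(a)])
        A = np.column_stack((d, p0 - p1))   # unknowns [t, s]
        try:
            t, s = np.linalg.solve(A, p0)
        except np.linalg.LinAlgError:       # ray parallel to the segment
            continue
        if t > 0.0 and 0.0 <= s <= 1.0:     # in front of camera, on the segment
            hits.append(t * d)
    return np.array(hits)
```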

8.2 Extended target tracker

As mentioned before, the idea of the ETT is to implement a GP on the extent of the target of interest, where the target is described as a function of a set of angles and corresponding radii, relative to the center and orientation of the object. The extended target tracker assumes that the measurements delivered to it have been gated and clustered as described in Figure 17.
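For orientation, a crude Euclidean gate around the predicted target center could look as follows; the gating and clustering actually used are specified in Section 8.2.2, so this sketch is only illustrative.

```python
import numpy as np

def gate(z_k, x_pred, radius):
    """Keep the measurements within `radius` of the predicted center
    (the first two entries of the predicted state)."""
    keep = np.linalg.norm(z_k - x_pred[:2], axis=1) <= radius
    return z_k[keep]
```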

8.2.1 Initialization

The initialization is approximated using the true target position and orientation with added Gaussian noise; this corresponds to some unknown external input to the tracker. The initial GP distribution is described as a set of angles in the range $[0, 2\pi]$ with corresponding radii. For

