Tracking Vehicles using Multiple Detections from a Monocular Camera


Degree project, 30 credits, June 2015


Viktor Bäck


Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address: Box 536, 751 21 Uppsala

Telephone: 018 – 471 30 03

Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract


This thesis concerns image based tracking of vehicles using a monocular camera. A classifier is used to detect and classify objects in the images from the camera. For each detected object the classifier outputs several classifications, each including a confidence value. The objective of this thesis is to investigate how these classifications and confidence values can be used in a single target tracking framework in the best possible way. This is achieved by evaluating several tracking methods that utilize the classifications and confidence values in different ways. The relationship between the confidence values and the accuracy of the corresponding classifications is also investigated.

The methods are evaluated using data from real-world scenarios. It is found that classifications with high confidence values are more accurate on average than those with low confidence values. The differences in the average performance for the considered methods are found to be small.

Image based tracking of vehicles is a key component in active safety systems in vehicles. Such systems can warn the driver or automatically brake the vehicle if a collision is about to happen, thereby preventing accidents.

ISSN: 1401-5757, UPTEC F15 055

Examiner: Tomas Nyberg

Subject reviewer: Thomas Schön

Supervisors: Daniel Ankelhed, Niklas Ollesson


Summary

Driving a car is among the most dangerous things an average person does. The driver's safety depends both on her own and on others' ability to stay attentive in traffic.

This causes problems, since people's ability to detect and react correctly in dangerous situations is by nature limited. As long as a human is driving the car, these problems will remain. Passive safety systems, such as seat belts and airbags, can mitigate the injuries resulting from a collision. Preventing the accident from happening, however, requires so-called active safety systems, which is what this work is about.

In the same way that a human uses her senses, a car can use sensors to obtain information about its surroundings. Such sensors can be, for example, one or more cameras, or radar. The information from these sensors is then processed in stages. First, information is extracted about where objects of interest, such as cars and pedestrians, are located and at what speed they are moving. Based on this information, one can then assess whether a collision is about to occur. If so, different strategies for avoiding the collision can be applied. For example, the driver can be warned with a sound signal, or the car can automatically apply the brakes.

In this work, different methods for determining the position and velocity of cars are investigated. The only sensor used is a camera. A classifier is used to detect and classify objects in the images from the camera. For each detected object the classifier outputs a confidence value. The goal of this work is to investigate how these classifications and confidence values can best be used for target tracking of cars. This is achieved by evaluating tracking methods that use the classifications and confidence values in different ways. The relationship between the confidence values and the accuracy of the corresponding classifications is also investigated.

The methods are evaluated using data from real-world scenarios. The evaluation shows that classifications with high confidence are on average more accurate than classifications with low confidence. The difference in the average performance of the methods is shown to be small.


Acknowledgments

First of all I would like to thank my supervisors Daniel Ankelhed and Niklas Ollesson at Autoliv Electronics AB, as well as my subject reviewer at Uppsala University, Thomas Schön. You always pointed me in the right direction when I got off track.

Special thanks to Karl Granström and Gustaf Hendeby for taking your time and answering my questions. Your ideas and advice have been truly helpful.

Also, thank you Peter Hall and Jacob Roll for the opportunity to do my thesis at Autoliv. It has been really great to get to know Autoliv as a company, and all the wonderful people working there.

I would also like to thank David Molin, who was doing his thesis at Autoliv at the same time as me, for all the interesting conversations and for accompanying me on the climbing walls of Hangaren.

Last but definitely not least, I thank my family for their unfaltering support.

Linköping, June 2015 Viktor Bäck


Notation

Classification, measurement and region of interest (ROI) are words that will be used somewhat interchangeably in this thesis. Strictly speaking, however, a classification is transformed into a measurement, which corresponds to an ROI that can be drawn as a rectangle on the image. Also, image and frame both refer to a single image from the camera mounted on the ego vehicle.

All variables that represent vectors, matrices or sets are written in bold font.

Vectors are by default column vectors.

A table of the most frequently used abbreviations in this thesis is presented below.

Abbreviations

Abbreviation   Explanation
5-LTE          5-Largest Tracking Error
AC             Average Cluster
CE             Clustering Error
EKF            Extended Kalman Filter
GNN            Global Nearest Neighbour
KF             Kalman Filter
NN             Nearest Neighbour
PDA            Probabilistic Data Association
PDF            Probability Density Function
RC             Regression Cluster
ROI            Region of Interest
SC             Simple Cluster
SO             Spatial Overlap
SSM            State Space Model
TO             Temporal Overlap
TE             Tracking Error
WSC            Weighted Sum Cluster


1 Introduction

Active safety systems in vehicles is a field that has gained more and more attention in recent years as a result of embedded electronic systems becoming cheaper and more powerful. It is estimated that more than 90% of all vehicle accidents are caused by human error [1]. The potential of such systems to prevent injuries and save lives is therefore immense.

Systems are currently in use that can warn the driver if a threatening situation occurs, or brake the car using autonomous emergency braking (AEB) in order to prevent accidents. The possibility of fully autonomous cars sharing the road with human drivers has also gained a lot of attention in the media lately, not least due to the Google Self-Driving Car project [2].

The European New Car Assessment Programme (Euro NCAP) is a programme that assesses the safety of vehicle models. A rating is assigned to a vehicle model based on how it performs in tests that simulate real-world accident scenarios.

In 2014 AEB was included in the Euro NCAP rating system [3]. This acts as a catalyst which increases the demand for active safety systems.

1.1 Previous Work

Target tracking is a challenging problem that has occupied the research community for several decades. It is also a highly relevant problem, with applications such as active safety systems for vehicles [4], air traffic control [5] and ballistic missile surveillance [6]. The challenges come from the fact that sensors, as well as models, are always inaccurate to some degree. In target tracking applications this inaccuracy typically results in (i) measurements not originating from a target, known as clutter, (ii) measurements originating from another target, (iii) lack of measurements due to occlusion, and (iv) a probability of detection less than unity.

Several methods that handle these problems have been considered. A category of commonly used non-probabilistic methods is based on data association, i.e. explicit association of measurements to tracks. One simple but common method is the Nearest Neighbour (NN) method, which associates the closest gated measurement to the target [7]. The Global Nearest Neighbour (GNN) method is a generalization of NN which handles association of measurements to multiple targets by solving an optimal assignment problem [7]. The Probabilistic Data Association (PDA) filter, first presented in [8], handles single-target tracking in a cluttered environment by considering all events where at most one of the measurements in the target gate is target originated. A multi-target extension of PDA called the Joint Probabilistic Data Association (JPDA) filter is presented in [9]. Track initiation and deletion is added to this framework in the Integrated Probabilistic Data Association (IPDA) filter [10], and its multi-target version the Joint Integrated Probabilistic Data Association (JIPDA) filter [11]. In [12] the Multi-Hypothesis Tracking (MHT) method is presented, which considers the data association problem over time by considering all measurement-to-track association events. The MHT method suffers from combinatorial growth of the number of association events, and is known to be more computationally demanding and more complicated to implement than e.g. JPDA, which limits its use in real-time systems.

Multi-detection versions of PDA, JPDA and IPDA have recently been presented in [13], [14], [15], respectively. A common application for these methods is the over-the-horizon radar (OTHR) problem [16].

Another category of methods uses random finite sets (RFS) in order to formulate and solve the problem in a Bayesian setting. Explicit data association is thereby avoided. Examples of such methods are the Probability Hypothesis Density (PHD) filter [17] and the Bernoulli filter [18]. Various extensions of these methods exist that handle multi-target tracking, initialization/deletion of tracks, etc.

1.2 Problem Formulation

The system setup is as follows. A monocular camera is mounted in the front of the ego-vehicle and is directed in the vehicle’s forward direction. Images from the camera are processed in real-time by a detection algorithm, which determines class and pixel location of objects in the image, as well as a confidence value for each detected object. The confidence value gives a measure of the certainty of the classification. The output from the detection stage is then used as measurements in the tracking stage, according to the tracking-by-detection paradigm. The detection is thus done independently of the tracking. The images from the camera have a resolution of 1024x592 pixels.

The objective of this thesis is to investigate how the classifications and the corresponding confidence values can be used in a single target tracking framework in the best possible way. This is achieved by investigating different clustering methods (Section 3.2) and a modified version of the Probabilistic Data Association (PDA) method (Section 3.3). The relationship between the confidence values and the accuracy of the corresponding classifications is also investigated (Section 4.4).

The tracking is done using a model which describes the target motion in 3-dimensional world coordinates. The tracking is evaluated in world coordinates (see Chapter 4) using manually marked reference tracks in the image plane. A simple state transition model is used in order to make it easier to investigate the best use of the classifications. Indeed, using a simple model allows one to isolate the contribution from the classifications, without having to consider contributions from other sources such as optical flow. This is done in the hope that the analysis is still valid for more advanced models.

1.3 Limitations

To limit the scope of this thesis, the following assumptions are made.

1. Single target tracking

2. Preceding traffic on highway roads

3. Same model tuning can be used for all implemented methods

Only single target tracking methods are considered. Measurement association problems that arise due to the presence of multiple targets are handled in a way that is independent of the tracking method used, as described in Section 3.1. This is done so that the focus can be on investigating how the single target tracking can be improved. In order to limit the effect of measurement association problems due to multiple targets, difficult scenarios such as traffic in cities are avoided. Hence only data from highway traffic is considered.

It is also assumed that the same model (as given in Section 1.4) can be used for all considered methods.

1.4 Model

For each image frame from the camera, the detection algorithm outputs several classifications for each detected object in the image. Each classification contains information about the class of the object, a region of interest (ROI) which contains information about the pixel location of its corners, as well as an estimate of the certainty of the classification,

\[
P(\text{object is of class } T \mid \text{true detection of object}). \tag{1.1}
\]

The height of the ROI is not detected. Instead it is derived from a constant width-to-height ratio which is dependent on the object class. An example of a set of classifications for a car is shown in Figure 1.1. The confidence value is shown in the upper left corner of each ROI. Note that the ROI of each classification is assumed to be right-angled.


Figure 1.1: Classifications with corresponding confidence values (shown in the upper left corner of each ROI) for a car.

The classifications are then transformed and used as measurements $y$ in the tracking stage according to

\[
y = \begin{pmatrix}
x_{HCP} \ [\text{pixels}] \\
\text{width} \ [\text{pixels}] \\
y_{bottom} \ [\text{pixels}] \\
\text{size} \ [\text{meters}]
\end{pmatrix}, \tag{1.2}
\]

where $x_{HCP}$ is the horizontal center pixel, width is the width in pixels, and $y_{bottom}$ is the y-coordinate of the bottom edge of the ROI. The component size is an assumed width of the vehicle in world coordinates and is constant for each vehicle class. This prior knowledge about objects based on their class allows one to estimate the distance to the objects. Since size does not contain any new information, it can be considered an artificial measurement.

As a model for the target dynamics we consider a constant velocity model, where the acceleration is modeled as normally distributed noise. With the measurement signal as in (1.2), the model can be formulated as a state space model (SSM),

\[
\begin{aligned}
x_{k+1} &= f(x_k, u_k) + w_k, && \text{(1.3a)} \\
y_k &= h(x_k) + v_k, && \text{(1.3b)}
\end{aligned}
\]

where $w_k \sim \mathcal{N}(0, Q_k)$ and $v_k \sim \mathcal{N}(0, R_k)$ are process noise and measurement noise, respectively, which are assumed to be normally distributed with covariance matrices $Q_k$ and $R_k$. The input vector $u_k$ contains information about the ego-vehicle motion:

\[
u_k = \begin{pmatrix} v_{ego} \\ \dot{\theta}_{ego} \end{pmatrix}, \tag{1.4}
\]

where $v_{ego}$ and $\dot{\theta}_{ego}$ are the velocity and yaw rate of the ego-vehicle, respectively.

Figure 1.2: Sketch of the ego-vehicle and the xyz-coordinate system. The origin of the coordinate system coincides with the location of the camera, which is placed on the windscreen. The forward direction of the vehicle is indicated by an arrow.

The state vector, which contains information about a target, has the following components

\[
x_k = \begin{pmatrix} w \\ x \\ v_x \\ y \\ v_y \\ z \end{pmatrix}, \tag{1.5}
\]

where $w$ is the width of the vehicle in world coordinates, and $v_x$ and $v_y$ are the velocities in the x- and y-direction, respectively, of the xyz-coordinate system that is attached to the camera in the ego-vehicle, as given in Figure 1.2. The point $(x, y, z)$, which is also measured in world coordinates, is located on the lower center part of the ROI that frames the backside of the target vehicle (see Figure 1.3).

The measurement model (1.3b) models the fisheye effect of the camera and transforms world coordinates $(x, y, z)$ to image coordinates $(x_p, y_p)$ according to the pinhole camera model, which is given by

\[
\begin{pmatrix} x_p \\ y_p \end{pmatrix} = \frac{1}{x} \begin{pmatrix} f_x y \\ f_y z \end{pmatrix}, \tag{1.6}
\]

where $f_x$ and $f_y$ are the focal lengths in the x- and y-direction, respectively.
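As a rough illustration of (1.6), the sketch below projects a world point onto the image plane and shows how an assumed metric width turns a pixel width into a forward distance. The function names, the focal length value and the example point are hypothetical and chosen only for illustration; lens distortion is ignored. The width relation is not stated in the text: it follows from applying (1.6) to the two lower corners of the ROI, under the assumption that both lie at the same forward distance.

```python
def project(x, y, z, f_x, f_y):
    """Pinhole projection (1.6): world point (x, y, z) -> image point (x_p, y_p).

    x is the forward distance from the camera and must be positive.
    """
    if x <= 0:
        raise ValueError("point must be in front of the camera")
    return f_x * y / x, f_y * z / x


def distance_from_width(width_px, size_m, f_x):
    """Forward distance implied by a pixel width and an assumed metric width.

    Assumes both lower ROI corners share the same forward distance x,
    so that width_px = f_x * size_m / x.
    """
    return f_x * size_m / width_px


# Hypothetical numbers, for illustration only.
F_X = F_Y = 1000.0
x_p, y_p = project(20.0, 1.0, -1.2, F_X, F_Y)
distance = distance_from_width(90.0, 1.8, F_X)
```

A vehicle that is assumed to be 1.8 m wide and appears 90 pixels wide is then placed 20 m ahead, which is the sense in which the size component of (1.2) carries distance information.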


1.5 Dataset

All considered tracking methods are implemented and evaluated in Chapter 5 using data that have been recorded in real-world scenarios. The data includes classifications, motion data of the ego-vehicle and images from the camera. The methods are then evaluated by comparing estimated tracks with reference tracks, known as markings, that represent the true position of objects in the image plane.

Each marking contains the class of the object and an ROI in pixel coordinates (see Figure 1.4),

\[
y^M = \begin{pmatrix}
x_{HCP} \ [\text{pixels}] \\
\text{width} \ [\text{pixels}] \\
y_{bottom} \ [\text{pixels}]
\end{pmatrix}, \tag{1.7}
\]

with notation as in (1.2). The markings have been obtained by manually drawing ROIs for objects in the recorded image frames, which is done with almost pixel precision.

1.6 Outline

An outline of the chapters is given below.

• Chapter 2 gives an introduction to single target tracking.

• Chapter 3 presents all methods that are considered in this thesis.

• Chapter 4 gives a detailed description of how the methods are evaluated.

• Chapter 5 presents method evaluation results.

• Chapter 6 presents comments on the results from the method evaluation.

• Chapter 7 presents thoughts on future work.


Figure 1.3: ROI corresponding to the state vector x_k shown in blue, where the point (x, y, z) is marked with a white dot.

Figure 1.4: Car with a marking shown in red.


2 Target Tracking

Target tracking amounts to estimating the state of one or many targets over time given measurements. In a typical scenario measurements arrive sequentially over time, and whenever a new measurement is available the state estimate is updated accordingly.

The different stages of a typical target tracking algorithm are described in detail in the following sections.

2.1 Filtering

In the filtering step the state estimate is updated based on the previous state estimate and new measurements. This is achieved by using a model of the system, such as the state space model (SSM)

\[
\begin{aligned}
x_{k+1} &= f(x_k, u_k) + w_k, && \text{(2.1a)} \\
y_k &= h(x_k) + v_k, && \text{(2.1b)}
\end{aligned}
\]

where $w_k \sim \mathcal{N}(0, Q_k)$ and $v_k \sim \mathcal{N}(0, R_k)$ are process noise and measurement noise, respectively, which we assume are normally distributed. The state transition function $f$ is in general a nonlinear function of the previous state $x_k$ and, possibly, an input signal $u_k$. It should capture the dynamics of the system, and makes it possible to predict future states based on a previous state. The observation function $h$ is also in general a nonlinear function of the state estimate $x_k$ and contains information about how the state is measured.

2.1.1 Kalman Filter

The filtering problem amounts to estimating $p(x_k \mid y_{1:k})$, i.e. the probability density function of the state at time $k$ given all measurements up to time $k$. In the case when $f$ and $h$ are linear, the SSM (2.1) can be written as

\[
\begin{aligned}
x_{k+1} &= F_k x_k + B_k u_k + w_k, && \text{(2.2a)} \\
y_k &= H_k x_k + D_k u_k + v_k, && \text{(2.2b)}
\end{aligned}
\]

where $F_k$, $B_k$, $H_k$ and $D_k$ are constant matrices. Unlike (2.1), an exact solution to the filtering problem exists for (2.2) and is given by the normal distribution $\mathcal{N}(\hat{x}_{k|k}, P_{k|k})$, with parameters given by the recursive equations

\[
\begin{aligned}
\hat{x}_{k|k} &= \hat{x}_{k|k-1} + K_k \tilde{y}_k, && \text{(2.3a)} \\
P_{k|k} &= (I - K_k H_k) P_{k|k-1}, && \text{(2.3b)} \\
\tilde{y}_k &= y_k - H_k \hat{x}_{k|k-1}, && \text{(2.3c)} \\
K_k &= P_{k|k-1} H_k^T S_k^{-1}, && \text{(2.3d)} \\
S_k &= H_k P_{k|k-1} H_k^T + R_k. && \text{(2.3e)}
\end{aligned}
\]

The equations (2.3) are collectively known as the Kalman filter (KF). State predictions can then be calculated from the filtered state estimate using the dynamics of the model according to

\[
\begin{aligned}
\hat{x}_{k+1|k} &= F_{k+1} \hat{x}_{k|k} + B_{k+1} u_{k+1}, && \text{(2.4a)} \\
P_{k+1|k} &= F_{k+1} P_{k|k} F_{k+1}^T + Q_{k+1}. && \text{(2.4b)}
\end{aligned}
\]
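To make the recursion concrete, the following minimal sketch writes out (2.3) and (2.4) for the special case of a scalar state and a scalar measurement, where every matrix reduces to a number. The function names and the numerical values are illustrative assumptions and are not taken from the thesis implementation; the direct term $D_k u_k$ of (2.2b) is left out, in line with (2.3c).

```python
def kf_update(x_pred, p_pred, y, h, r):
    """Measurement update (2.3) for a scalar state and measurement."""
    innovation = y - h * x_pred          # (2.3c)
    s = h * p_pred * h + r               # (2.3e)
    k = p_pred * h / s                   # (2.3d)
    x_filt = x_pred + k * innovation     # (2.3a)
    p_filt = (1.0 - k * h) * p_pred      # (2.3b)
    return x_filt, p_filt


def kf_predict(x_filt, p_filt, f, q, b=0.0, u=0.0):
    """Time update (2.4) for a scalar state."""
    x_pred = f * x_filt + b * u          # (2.4a)
    p_pred = f * p_filt * f + q          # (2.4b)
    return x_pred, p_pred


# Illustrative values: equal prior and measurement variance.
x, p = kf_update(x_pred=0.0, p_pred=1.0, y=2.0, h=1.0, r=1.0)
x_next, p_next = kf_predict(x, p, f=1.0, q=0.5)
```

With equal prior and measurement variance the filtered estimate lands halfway between the prediction and the measurement, and the variance is halved before the process noise is added again in the prediction step.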

2.1.2 Extended Kalman Filter

If the state transition model $f$ or the observation model $h$ in (2.1) is nonlinear, then there exists no closed form solution to the filtering problem. Hence one has to rely on methods that solve the problem approximately. Several such methods have been considered, e.g. the unscented Kalman filter (UKF) and particle filters (PF). In this thesis one of the most common approaches is considered: the so-called extended Kalman filter (EKF). The EKF utilizes a quite intuitive approach; instead of using $H_k$ in the update equations (2.3), use the Jacobian

\[
J_{H,k} = \left. \frac{\partial h}{\partial x} \right|_{\hat{x}_{k|k-1}}. \tag{2.5}
\]

Similarly, $F_{k+1}$ is replaced by the Jacobian of $f$ in the prediction equation (2.4b),

\[
J_{F,k+1} = \left. \frac{\partial f}{\partial x} \right|_{\hat{x}_{k|k},\, u_k}. \tag{2.6}
\]

When calculating the innovation $\tilde{y}_k$ and the state prediction $\hat{x}_{k+1|k}$, however, the nonlinear functions $h$ and $f$ should be used:

\[
\begin{aligned}
\tilde{y}_k &= y_k - h(\hat{x}_{k|k-1}, u_k), && \text{(2.7a)} \\
\hat{x}_{k+1|k} &= f(\hat{x}_{k|k}, u_{k+1}). && \text{(2.7b)}
\end{aligned}
\]

The EKF is ad hoc in the sense that there is no guarantee that it converges to the correct solution. That being said, it has been empirically shown to work well for many systems with mild nonlinearities.


2.2 Data Association

If more than one target is being tracked or clutter is present, one faces the problem of how the measurements should be assigned to the different targets, and how they should be used to update the state of each target. Data association methods present a solution to this problem.

2.2.1 Gating

The first step in data association methods often consists of gating the measurements, which means that for each target we only consider measurements that are sufficiently 'close' based on the current estimate of the target's state and some distance norm. This reduces computational cost, and is motivated by the observation that measurements 'far away' from a target are unlikely to have originated from that target. Several different gating strategies exist which consider different distance norms. One of the most commonly used is ellipsoidal gating, where a measurement $y_k$ is considered to be inside the gate if

\[
d_k^2 := (y_k - \hat{y}_k)^T S_k^{-1} (y_k - \hat{y}_k) \leq G \tag{2.8}
\]

is fulfilled, where $G > 0$ is a parameter that determines the size of the gate, $\hat{y}_k$ is the predicted measurement and $S_k$ is the covariance of the innovation, as calculated in (2.3) (see Figure 2.1a). The volume of the ellipsoidal gating region is

\[
V_G = C_M \sqrt{|S_k|}\, G^{M/2}, \tag{2.9}
\]

where $M$ is the number of elements in $y_k$, and $C_M$ is given in terms of the Gamma function as

\[
C_M = \frac{\pi^{M/2}}{\Gamma\left(\frac{M}{2} + 1\right)} =
\begin{cases}
\dfrac{\pi^{M/2}}{\left(\frac{M}{2}\right)!}, & M \text{ even}, \\[2ex]
\dfrac{2^{M+1} \left(\frac{M+1}{2}\right)!\, \pi^{(M-1)/2}}{(M+1)!}, & M \text{ odd}.
\end{cases} \tag{2.10}
\]

The distance norm $d_k^2$, known as the Mahalanobis distance, of the innovation takes into account the uncertainty of the predicted measurement, resulting in a larger gating region if the uncertainty is large. If the state and measurement models are accurate, then $d_k^2$ is approximately $\chi^2$-distributed with $M$ degrees of freedom. Hence it is common to choose a value of $G$ that corresponds to the, say, 99th percentile of the $\chi^2$ distribution, which means that approximately 99% of the measurements are expected to fall into the gate. This often gives a reasonable trade-off between a gate that is too large, resulting in a high computational burden, and a gate that is too small, which excludes measurements that originate from the target.
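The gating quantities can be sketched as follows for a 2-dimensional measurement, chosen here only because the inverse of a 2x2 covariance has a simple closed form (the measurement vector in (1.2) has more components). The function names are illustrative assumptions. For $M = 2$ the $\chi^2$ distribution function is $1 - e^{-d^2/2}$, so the 99th percentile can be written down without a statistics library.

```python
import math


def mahalanobis_sq_2d(y, y_pred, s):
    """Squared Mahalanobis distance (2.8) for a 2-D measurement.

    s is the 2x2 innovation covariance given as ((s11, s12), (s21, s22)).
    """
    (s11, s12), (s21, s22) = s
    det = s11 * s22 - s12 * s21
    e1, e2 = y[0] - y_pred[0], y[1] - y_pred[1]
    # e^T S^{-1} e, using the closed-form inverse of a 2x2 matrix.
    return (s22 * e1 * e1 - (s12 + s21) * e1 * e2 + s11 * e2 * e2) / det


def unit_ball_volume(m):
    """C_M in (2.10), evaluated with the Gamma function."""
    return math.pi ** (m / 2) / math.gamma(m / 2 + 1)


def gate_volume(det_s, g, m):
    """V_G in (2.9), given the determinant of S_k."""
    return unit_ball_volume(m) * math.sqrt(det_s) * g ** (m / 2)


# 99th percentile of the chi-square distribution with 2 degrees of freedom.
G_99 = -2.0 * math.log(1.0 - 0.99)

d2 = mahalanobis_sq_2d((2.0, 1.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
inside_gate = d2 <= G_99
```

For other values of $M$ the percentile has no such simple form and would typically be read from a table or a statistics package.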

Gating alone does not necessarily solve all data association problems. Consider the following scenarios. (i) Suppose that more than one measurement falls into a target gate; it could be a measurement from another target or it could be clutter. In this case it is not clear how the target state should be updated using the gated measurements. (ii) Another problematic situation is shown in Figure 2.1b, where a measurement is inside the gates of two different targets.

Figure 2.1: (a) Predicted measurement $\hat{y}_k$ and an ellipsoidal gating region given by $d_k^2 \leq G$. Measurements $y_k^1$ and $y_k^2$ are inside the gate, while $y_k^3$ is not. (b) Two overlapping gating regions corresponding to two different tracks, with predicted measurements $\hat{y}_k^1$ and $\hat{y}_k^2$, respectively. Measurement $y_k^2$ is inside both gates.

In the following sections some common data association methods that address problem (i) are presented. However, since the emphasis in this thesis is on single- target tracking, rather than multi-target tracking, methods that handle problem (ii) are not considered. Instead, measurement association problems due to multiple targets are handled as described in Section 3.1.

2.2.2 Nearest Neighbor Association

A simple data association method that is common in radar applications is Nearest Neighbour (NN) data association, where the distance to each measurement inside the gate is calculated according to (2.8). The measurement with the smallest distance value, i.e. which is closest to the predicted measurement, is then used to update the state of the target. The other measurements inside the gate are simply neglected. This resolves the association problem in Figure 2.1a.

There exist global nearest neighbour (GNN) association methods that deal with the problem in Figure 2.1b by associating measurements with tracks in such a way that the total cost of all associations is minimized, thus obtaining an optimal solution to the association problem.
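The single-target nearest neighbour rule amounts to a gated minimum search. A minimal sketch, assuming the squared Mahalanobis distances (2.8) have already been computed for one track (the function name is hypothetical):

```python
def nearest_neighbour(distances_sq, gate):
    """Index of the gated measurement with the smallest d^2, or None.

    distances_sq holds the squared Mahalanobis distance (2.8) from each
    measurement to the predicted measurement of a single track.
    """
    gated = [(d2, j) for j, d2 in enumerate(distances_sq) if d2 <= gate]
    if not gated:
        return None          # no measurement inside the gate
    return min(gated)[1]     # smallest distance wins; the rest are ignored


chosen = nearest_neighbour([5.0, 1.5, 12.0], gate=9.21)
```

Here the third measurement falls outside the gate and the second is selected, since it is closer to the prediction than the first.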


2.2.3 Probabilistic Data Association

The probabilistic data association (PDA) method estimates the posterior state PDF by considering all possible measurement association events. In each association event it is assumed that at most one measurement is target originated and the other measurements are clutter. A derivation (which is inspired by [15]) of the PDA method now follows.

Suppose that $N$ measurements $\{y\}_k := \{y_k^j\}_{j=1}^N$ are inside the gate of a track at time $k$, and let $Y_k := \{\{y\}_t \mid t \leq k\}$ be the set of all measurements up to time $k$. Furthermore, let $A_j$ be the event that measurement $y^j$ is target originated, and $A_0$ be the event where all measurements are clutter. Let $\mathcal{A}$ be the set of all mutually exclusive and exhaustive events $A_j$.

The Probabilistic Data Association (PDA) filter, first presented in [8], solves the measurement association problem by marginalizing the posterior state PDF with respect to all possible association events,

\[
\begin{aligned}
p(x_k \mid Y_k) &= \sum_{A_j \in \mathcal{A}} p(x_k, A_j \mid Y_k) \\
&= \sum_{A_j \in \mathcal{A}} p(x_k \mid A_j, Y_k)\, P(A_j \mid Y_k),
\end{aligned} \tag{2.11}
\]

where $P(A_j \mid Y_k)$ is the probability of the association event $A_j$ given measurements up to time $k$.

2.2.3.1 Association Probability

The association probability $P(A_j \mid Y_k)$ can be expressed in terms of a likelihood and a prior conditioned on the previous measurements according to

\[
P(A_j \mid Y_k) = \eta\, p(\{y\}_k \mid A_j, Y_{k-1})\, P(A_j \mid Y_{k-1}), \tag{2.12}
\]

where all terms that do not depend on $A_j$ are included in the normalization constant $\eta$. The association $A_j$ at time $k$ is assumed to be independent of previous measurements, i.e. $P(A_j \mid Y_{k-1}) = P(A_j)$. Also, the prior $P(A_j)$ is assumed to be uniformly distributed, which means that it can be included in the normalization constant.

For $j \geq 1$ the likelihood in (2.12) is proportional to

\[
p(\{y\}_k \mid A_j, Y_{k-1}) \propto P_D P_G \lambda^{N-1} \Lambda_k^j, \tag{2.13}
\]

where $P_D$ is the probability of detection, and $P_G$ is the probability that a detected measurement is inside the track gate. The clutter density is assumed to be Poisson distributed with parameter (see [7])

\[
\lambda = \frac{N}{V_G}, \tag{2.14}
\]

where $V_G$ is the volume of the track gate (2.9). The spatial likelihood of the true measurement $\Lambda_k^j$ can be obtained by calculating the corresponding innovation and evaluating the innovation PDF $\mathcal{N}(y_k^j - h(x_k) \mid 0, S_k)$, which is given by

\[
\Lambda_k^j = \frac{e^{-d_j^2/2}}{P_G (2\pi)^{M/2} \sqrt{|S_k|}}, \tag{2.15}
\]

where $M$ is the dimension of the measurement vector $y_k$ and $d_j$ is the Mahalanobis distance (2.8). The innovation PDF is restricted to the gating region by including the gating probability $P_G$ in (2.15).

For $j = 0$ the likelihood is proportional to

\[
p(\{y\}_k \mid A_j, Y_{k-1}) \propto (1 - P_D P_G) \lambda^N. \tag{2.16}
\]

The normalization constant $\eta$ can now be obtained by marginalizing (2.12) with respect to the association events $A_j$, using (2.13) and (2.16). After the common factor $\lambda^{N-1}$ cancels, this yields the following expressions for the association probability,

\[
P(A_j \mid Y_k) =
\begin{cases}
\dfrac{(1 - P_D P_G)\lambda}{(1 - P_D P_G)\lambda + P_D P_G \sum_{i=1}^{N} \Lambda_k^i}, & j = 0, \\[2ex]
\dfrac{P_D P_G \Lambda_k^j}{(1 - P_D P_G)\lambda + P_D P_G \sum_{i=1}^{N} \Lambda_k^i}, & j \geq 1.
\end{cases} \tag{2.17}
\]
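The association probabilities can be computed by normalizing the unnormalized weights of (2.13) and (2.16). The sketch below does so after dividing out the common clutter factor, writing `lam` for the Poisson clutter parameter of (2.14). The function name, the argument layout and the handling of an empty gate are assumptions made for illustration.

```python
import math


def association_probabilities(d2, det_s, m, p_d, p_g, v_g):
    """Association probabilities; index 0 is the all-clutter event A_0.

    d2    : squared Mahalanobis distances (2.8) of the N gated measurements
    det_s : determinant of the innovation covariance S_k
    m     : dimension of the measurement vector
    v_g   : gate volume (2.9)
    """
    n = len(d2)
    if n == 0:
        return [1.0]                     # assumption: empty gate means A_0
    lam = n / v_g                        # clutter parameter (2.14)
    norm = p_g * (2.0 * math.pi) ** (m / 2) * math.sqrt(det_s)
    likelihoods = [math.exp(-0.5 * d) / norm for d in d2]   # (2.15)
    weights = [(1.0 - p_d * p_g) * lam]                     # j = 0
    weights += [p_d * p_g * lik for lik in likelihoods]     # j >= 1
    total = sum(weights)
    return [w / total for w in weights]


probs = association_probabilities(
    d2=[0.5, 4.0], det_s=1.0, m=2, p_d=0.9, p_g=0.99, v_g=30.0
)
```

The measurement with the smaller Mahalanobis distance receives the larger probability, and the probabilities sum to one by construction.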

2.2.3.2 State Update

A weighted sum of the innovations is calculated as

\[
\tilde{y}_k = \sum_{j=1}^{N} P(A_j \mid Y_k)\, \tilde{y}_k^j, \tag{2.18}
\]

where $\tilde{y}_k^j$ is the innovation corresponding to measurement $y_k^j$. The sum of PDFs (2.11) is now approximated using a single Gaussian distribution $\mathcal{N}(x_k \mid \hat{x}_{k|k}, P_{k|k})$, where

\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k, \tag{2.19}
\]

and

\[
P_{k|k} = P^0_{k|k} + dP_k, \tag{2.20}
\]

where

\[
\begin{aligned}
P^0_{k|k} &= P(A_0 \mid Y_k)\, P_{k|k-1} + \left(1 - P(A_0 \mid Y_k)\right) P^*_{k|k}, && \text{(2.21)} \\
dP_k &= K_k \left[ \sum_{j=1}^{N} P(A_j \mid Y_k)\, \tilde{y}_k^j (\tilde{y}_k^j)^T - \tilde{y}_k \tilde{y}_k^T \right] K_k^T, && \text{(2.22)} \\
P^*_{k|k} &= (I - K_k H_k) P_{k|k-1}. && \text{(2.23)}
\end{aligned}
\]

The Kalman gain $K_k$ and the state prediction are calculated as in the standard EKF presented in Section 2.1.
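For a scalar state and scalar measurements, the PDA update (2.18) to (2.23) reduces to a few lines. The sketch below is meant only to show how the pieces fit together; the function name and the argument layout are assumptions.

```python
def pda_update(x_pred, p_pred, innovations, probs, h, r):
    """Scalar PDA state update.

    probs[0] is P(A_0 | Y_k); probs[j] belongs to innovations[j - 1].
    """
    s = h * p_pred * h + r
    k = p_pred * h / s                                       # Kalman gain
    pairs = list(zip(probs[1:], innovations))
    combined = sum(b * nu for b, nu in pairs)                # (2.18)
    x_filt = x_pred + k * combined                           # (2.19)
    p_star = (1.0 - k * h) * p_pred                          # (2.23)
    p_zero = probs[0] * p_pred + (1.0 - probs[0]) * p_star   # (2.21)
    spread = sum(b * nu * nu for b, nu in pairs) - combined ** 2
    d_p = k * spread * k                                     # (2.22)
    return x_filt, p_zero + d_p                              # (2.20)


# One measurement that is certainly target originated: plain Kalman update.
x_kf, p_kf = pda_update(0.0, 1.0, [2.0], [0.0, 1.0], h=1.0, r=1.0)
# All measurements are certainly clutter: the prediction is kept.
x_no, p_no = pda_update(0.0, 1.0, [2.0], [1.0, 0.0], h=1.0, r=1.0)
```

The two limiting cases act as a sanity check: with a single certain measurement the update coincides with (2.3), and with certain clutter the predicted state and covariance are left unchanged.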

2.2.4 M/N Initiation and Deletion of Tracks

If a measurement is not associated to an existing track, it could either be clutter or be a measurement from a new target. For this reason a tentative track is created which is evaluated to make sure that it really is a new target. If the tentative track receives sufficiently many measurements over a period of time, it will be considered a confirmed track, otherwise it will be deleted.

A common track initiation procedure is M/N initiation, which is now explained. In order for a tentative track to be confirmed, it first needs to receive measurements on $N_1$ consecutive time steps. Then, the track must receive measurements on $M_2$ out of the next $N_2$ time steps. If these conditions are fulfilled, the track is considered confirmed.

When a confirmed track stops receiving new measurements, a strategy which deletes the track is employed. One simple strategy is to delete the track if it has not received a new measurement for $N_D$ consecutive time steps.
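The initiation and deletion logic can be sketched as a function of the detection history of a track. The status names, the function names and the choice to report a track as "tentative" while the outcome is still open are assumptions made for this example; the text only states when a track is confirmed and that it is deleted otherwise.

```python
def track_status(hits, n1, m2, n2):
    """Classify a tentative track from its detection history.

    hits[i] is True if the track received a measurement at time step i,
    counted from the step where the tentative track was created.
    Returns "confirmed", "deleted" or "tentative" (outcome still open).
    """
    first = hits[:n1]
    if not all(first):
        return "deleted"                 # a miss during the first n1 steps
    if len(first) < n1:
        return "tentative"
    window = hits[n1:n1 + n2]
    got = sum(window)
    if got >= m2:
        return "confirmed"
    remaining = n2 - len(window)
    if got + remaining < m2:
        return "deleted"                 # m2 hits can no longer be reached
    return "tentative"


def should_delete(steps_since_last_measurement, n_d):
    """Deletion rule for a confirmed track."""
    return steps_since_last_measurement >= n_d


status = track_status([True, True, True, False, True], n1=2, m2=2, n2=3)
```

In the example the track is hit on its first two steps and on two of the following three, so it is confirmed.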


3 Methods

This chapter presents a detailed description of all considered methods.

3.1 Measurement Group Partitioning and Association

As the objective of this thesis is not to evaluate the multi-target tracking performance of different methods, but rather their single target tracking performance, all implemented methods utilize an algorithm which forms groups of measurements, and associates at most one such group to each target. Each group of measurements should then correspond to an object in the image. While this is not necessarily always true, it resolves the association problem due to adjacent objects in the image in a consistent manner, which is independent of the tracking method used. The measurement group partitioning and association is described in detail below.

In order to successfully use a single target tracking algorithm in situations where vehicles are located close to each other in the image, groups of measurements are formed and assigned to tracks as explained in Algorithm 1. The overlap in Algorithm 1 is calculated using the spatial overlap norm

\[
SO = \frac{\text{area}(A_1 \cap A_2)}{\text{area}(A_1 \cup A_2)}, \tag{3.1}
\]

where $A_1$ and $A_2$ are the two measurement ROIs (see Figure 3.1). An example of a measurement group partitioning is shown in Figure 3.2.
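For axis-aligned rectangular ROIs, the spatial overlap (3.1) is the familiar intersection-over-union ratio. A minimal sketch, assuming each ROI is given by its corner coordinates as a tuple `(left, top, right, bottom)` in pixels (this layout and the function name are assumptions):

```python
def spatial_overlap(a, b):
    """Spatial overlap (3.1) of two axis-aligned ROIs (left, top, right, bottom)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0                       # the ROIs do not intersect
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # area(A1 u A2) = area(A1) + area(A2) - area(A1 n A2)
    return inter / (area_a + area_b - inter)


so = spatial_overlap((0, 0, 2, 2), (1, 0, 3, 2))
```

Two squares of side 2 that are shifted by half their width share an area of 2 out of a union of 6, giving a spatial overlap of one third.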

Each track is then associated with at most one measurement group by choos- ing the closest group inside the track gate. The distance to the measurement group is defined as the Mahalanobis distance (2.8) to the measurement in the group with the highest confidence value. This association is done so that a measurement group can be associated to at most one track, thus reducing the risk

17

(28)

18 3 Methods

Algorithm 1 Measurement Group Partitioning

1: measurements initially have no id
2: count ← 0    ▷ Number of measurement groups
3: while ∃ measurements with no id do
4:     count ← count + 1
5:     find measurement m1 with highest confidence and no id
6:     assign id count to m1 and all measurements that overlap with m1
7: end while
8: for each measurement m2 with more than one id do
9:     for each id i of m2 do
10:        find measurement m3 with highest confidence and id i
11:        calculate overlap of m2 and m3
12:    end for
13:    assign m2 the id i that corresponds to the largest overlap on line 11
14: end for

Figure 3.1: Intersection A₁ ∩ A₂ of two ROIs (left), and union A₁ ∪ A₂ (right).

that two tracks are tracking the same target by sharing the same measurement group. It should be noted, though, that when several tracks compete for the same measurement group, the group is associated with one of the tracks more or less at random. A better approach could be used, e.g. one where the group is associated with the closest track. However, this has not been considered, since empirical results suggest that this situation is rare and should not affect the evaluation of the considered methods.
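The partitioning in Algorithm 1 and the overlap norm (3.1) can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the thesis. The names `Measurement`, `spatial_overlap` and `partition` are hypothetical, and ROIs are assumed to be axis-aligned boxes given as (x_min, y_min, x_max, y_max). The sketch also assumes that two measurements "overlap" whenever their spatial overlap is positive, and that the reference measurement m3 on line 10 is the measurement that started group i.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    # Assumed ROI layout: (x_min, y_min, x_max, y_max) in pixels.
    roi: tuple
    confidence: float
    ids: set = field(default_factory=set)


def spatial_overlap(a, b):
    """Spatial overlap norm (3.1): intersection area over union area."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def partition(measurements):
    """Assign exactly one group id to every measurement; return the group count."""
    seeds = {}  # group id -> measurement that started the group
    count = 0
    # Lines 3-7: start a new group from the most confident unassigned measurement.
    while any(not m.ids for m in measurements):
        count += 1
        m1 = max((m for m in measurements if not m.ids),
                 key=lambda m: m.confidence)
        seeds[count] = m1
        for m in measurements:
            if m is m1 or spatial_overlap(m.roi, m1.roi) > 0:
                m.ids.add(count)
    # Lines 8-14: resolve measurements that ended up with more than one id.
    for m2 in measurements:
        if len(m2.ids) > 1:
            best = max(m2.ids,
                       key=lambda i: spatial_overlap(m2.roi, seeds[i].roi))
            m2.ids = {best}
    return count


# Four measurements arranged roughly as in Figure 3.2 (coordinates are made up).
group = [
    Measurement((0, 0, 10, 10), 0.95),
    Measurement((2, 2, 12, 12), 0.8),
    Measurement((6, 0, 16, 10), 0.7),
    Measurement((14, 0, 24, 10), 0.9),
]
n_groups = partition(group)
```

In this made-up example the measurement with confidence 0.7 overlaps both group seeds and is kept in the group whose seed it overlaps most, mirroring step (d) of Figure 3.2.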

3.2 Clustering Methods

After the measurements have been partitioned into measurement groups, one approach is to cluster each measurement group into one single measurement, which is then filtered using a standard EKF. These methods are referred to as clustering methods in this thesis. All considered clustering methods are presented below.


Figure 3.2: (a) 4 measurements with corresponding confidence values (0.95, 0.8, 0.7 and 0.9). (b) The id 1 is assigned to the measurement with confidence value 0.95 and to all measurements that it overlaps. (c) The id 2 is assigned to the measurement with confidence value 0.9 and to all measurements that it overlaps. (d) The measurement with confidence value 0.7 that was assigned two ids is now assigned id 1, since the spatial overlap with the measurement with confidence value 0.95 is greater than that with confidence value 0.9.


3.2.1 Average Cluster

In Average Cluster (AC) the mean value of the measurements is calculated and used as output. Hence the confidence values are not used.

3.2.2 Simple Cluster

In Simple Cluster (SC) the measurement with the highest confidence value is used as output, thereby discarding the other measurements. This method is sensitive to outliers and tends to produce a noisy output. Also, useful information might be lost when discarding measurements.

3.2.3 Weighted Sum Cluster

In Weighted Sum Cluster (WSC) a weighted sum of the measurements is calculated for each measurement group according to

\[
y = \frac{1}{\sum_j p^j} \sum_j p^j y^j, \tag{3.2}
\]

where y^j are the measurements in the measurement group, with corresponding confidence values p^j. The WSC is less sensitive to outliers than SC since it uses more measurements.

In effect, it assumes that the confidence values give an indication of the accuracy of the measurements. Note that a large group of low confidence measurements can give the same contribution to the output as a smaller group of high confidence measurements.

Variations of this method are also considered, where only the m measurements with the highest confidence values are used in (3.2). Note that if m = 1, then WSC is reduced to SC.
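The three clustering rules described so far (AC, SC and WSC with an optional limit m) can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, each measurement is assumed to be a sequence of numbers of equal length, and the confidence values are assumed to be positive so that the normalization in (3.2) is well defined.

```python
def average_cluster(ys):
    """AC: component-wise mean of the measurements; confidences are ignored."""
    n = len(ys)
    return [sum(column) / n for column in zip(*ys)]


def simple_cluster(ys, ps):
    """SC: return the measurement with the highest confidence value."""
    best = max(range(len(ys)), key=lambda j: ps[j])
    return list(ys[best])


def weighted_sum_cluster(ys, ps, m=None):
    """WSC (3.2); if m is given, only the m most confident measurements are used."""
    order = sorted(range(len(ys)), key=lambda j: ps[j], reverse=True)
    if m is not None:
        order = order[:m]
    total = sum(ps[j] for j in order)
    dim = len(ys[0])
    return [sum(ps[j] * ys[j][d] for j in order) / total for d in range(dim)]


# Made-up measurement group with two measurements.
ys = [[0.0, 0.0], [10.0, 20.0]]
ps = [0.25, 0.75]
wsc = weighted_sum_cluster(ys, ps)
sc_via_wsc = weighted_sum_cluster(ys, ps, m=1)
```

With m = 1 the weighted sum contains a single term, so `weighted_sum_cluster` returns the same measurement as `simple_cluster`, in line with the remark above.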

3.2.4 Nonlinear Weighted Sum Cluster

In Nonlinear Weighted Sum Cluster (NWSC) a weighted sum of the measurements is calculated for each measurement group according to

\[
y = \frac{1}{\sum_j \tilde{p}^j} \sum_j \tilde{p}^j y^j, \tag{3.3}
\]

where y^j are the measurements in the measurement group and p̃^j are transformed confidence values given by

\[
\tilde{p}^j =
\begin{cases}
\dfrac{0.5\, p^j}{0.85}, & p^j < 0.85 \\[1.5ex]
c_1 E(p^j) + c_2, & p^j \geq 0.85.
\end{cases}
\tag{3.4}
\]


Figure 3.3: Transformed confidence values p̃ (blue line) plotted against confidence values p. The black dashed line is a guide for the eye.

An explanation of (3.4) now follows. In Chapter 5 the mean error, E(p), for a large number of measurements is calculated and plotted against their corresponding confidence values p (see Figure 5.1b). By doing the transformation (3.4), p^j should contribute to the sum (3.3) in a way that corresponds to the shape of E(p).

This can be seen by noting that E(p) is approximately linear for p < 0.85. It also holds that

\[
E(0.85) \approx \frac{\max E(p) - \min E(p)}{2}. \tag{3.5}
\]

This results in (3.4), where c₁ and c₂ are constants such that p̃^j attains the values 0.5 and 1 at p^j = 0.85 and p^j = 1, respectively. The transformation (3.4) is visualized in Figure 3.3.
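The two conditions on c₁ and c₂ form a linear system, c₁E(0.85) + c₂ = 0.5 and c₁E(1) + c₂ = 1, which gives c₁ = 0.5 / (E(1) − E(0.85)) and c₂ = 0.5 − c₁E(0.85). A minimal sketch of the transformation (3.4) built on this is shown below. The function `make_transform` is hypothetical, and the mean error curve used in the example is a made-up stand-in for the empirical E(p) of Chapter 5, which is not reproduced here.

```python
def make_transform(mean_error, threshold=0.85):
    """Build the confidence transformation (3.4) from a mean error function E(p).

    mean_error is assumed to be a callable returning E(p), with
    E(1) different from E(threshold).
    """
    e_t = mean_error(threshold)
    e_1 = mean_error(1.0)
    # Solve c1 * E(threshold) + c2 = 0.5 and c1 * E(1) + c2 = 1.
    c1 = 0.5 / (e_1 - e_t)
    c2 = 0.5 - c1 * e_t

    def transform(p):
        if p < threshold:
            return 0.5 * p / threshold  # linear branch of (3.4)
        return c1 * mean_error(p) + c2  # branch shaped like E(p)

    return transform


# Made-up error curve, used only to exercise the code.
def example_error(p):
    return 1.0 - p ** 2


transform = make_transform(example_error)
low, mid, high = transform(0.425), transform(0.85), transform(1.0)
```

The transformed values can then be used in place of p^j in the weighted sum (3.3).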

3.2.5 Regression Cluster

In Regression Cluster (RC) a function g(y, θ) is fitted to the measurements in a least squares sense by solving the optimization problem

\[
\min_{\theta} \sum_j \left( p^j - g(\hat{y}^j, \theta) \right)^2 + \sum_i r_i(\theta)^2, \tag{3.6}
\]

where θ = (θ₁, θ₂, ..., θ_{n_p})^T is a vector with n_p parameters and r_i(θ) are quadratic penalty terms which impose constraints on θ. The measurement vector in (3.6) is given in the alternative form

\[
\hat{y}^j =
\begin{pmatrix}
x_{\mathrm{left}}^j \\
x_{\mathrm{right}}^j \\
y_{\mathrm{bottom}}^j
\end{pmatrix}, \tag{3.7}
\]

where x_left^j and x_right^j are the x-coordinates of the left and right edges of the ROI of measurement j, respectively. We consider a Gaussian shaped function as regression function,

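\[
g(y, \theta) = \theta_1 \exp\left( -(y - \mu(\theta))^T \Sigma(\theta)^{-1} (y - \mu(\theta)) \right), \tag{3.8}
\]

where

\[
\mu(\theta) =
\begin{pmatrix}
\theta_2 \\
\theta_3 \\
\theta_4
\end{pmatrix} \tag{3.9}
\]

\[
\Sigma(\theta) =
\begin{pmatrix}
\theta_5 & 0 & 0 \\
0 & \theta_6 & 0 \\
0 & 0 & \theta_7
\end{pmatrix}, \tag{3.10}
\]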

with constraints

0 ✓1 1 (3.11)

minj x_left^j ✓2 max

j x_left^j (3.12)

minj x_right^j ✓3 max

j x_right^j (3.13)

minj y_bottom^j ✓4 max

j y_bottom^j (3.14)

0  ✓5, ✓₆, ✓₇ 10⁴. (3.15) Hence the number of parameters n_p= 7.

Note that the confidence values of the measurements cannot be seen as realizations of a PDF. For this reason the normalization constant of the Gaussian distribution is replaced by the parameter θ₁ in (3.8). The Levenberg-Marquardt algorithm [19], which is a nonlinear least squares method, is used to find a local minimum of (3.6). When a solution is found the vector (3.9) is used as output from RC.

A requirement for the RC method to work is that sufficiently many measurements are available when solving (3.6), otherwise the Levenberg-Marquardt algorithm will result in a system of equations that is badly conditioned. For this reason the WSC method is used if the number of measurements is less than 8.
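To make the structure of (3.6) concrete, the sketch below builds the residual vector whose sum of squares is the objective: one data residual per measurement and one penalty residual per parameter. It is an illustration only. The function names and the penalty weight are hypothetical, and the exact form of the penalty terms r_i(θ) is not given in the text, so a scaled distance to the violated bound from (3.11) to (3.15) is assumed here. The sketch further assumes θ₅, θ₆, θ₇ > 0 so that the diagonal matrix (3.10) can be inverted.

```python
import math


def g(y, theta):
    """Gaussian shaped regression function (3.8) with diagonal Sigma (3.10)."""
    theta1, mu, diag = theta[0], theta[1:4], theta[4:7]
    # (y - mu)^T Sigma^{-1} (y - mu) for a diagonal Sigma; assumes diag > 0.
    quad = sum((yi - mi) ** 2 / si for yi, mi, si in zip(y, mu, diag))
    return theta1 * math.exp(-quad)


def bounds(measurements):
    """Lower and upper bounds (3.11)-(3.15) for the seven parameters."""
    columns = list(zip(*measurements))  # x_left, x_right, y_bottom
    return ([(0.0, 1.0)]
            + [(min(c), max(c)) for c in columns]
            + [(0.0, 1e4)] * 3)


def residuals(theta, measurements, confidences, weight=1e3):
    """Residual vector whose squared sum is the objective (3.6).

    The penalty form weight * (distance outside the bound) is an assumption.
    """
    data = [p - g(y, theta) for y, p in zip(measurements, confidences)]
    penalty = [weight * max(0.0, lo - t, t - hi)
               for t, (lo, hi) in zip(theta, bounds(measurements))]
    return data + penalty


# Made-up measurement group of two measurements, only to exercise the code.
ys = [(1.0, 2.0, 3.0), (3.0, 2.0, 3.0)]
ps = [0.9, 0.5]
theta0 = [0.9, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0]
objective = sum(r ** 2 for r in residuals(theta0, ys, ps))
```

A residual function of this kind is the usual input to a Levenberg-Marquardt routine, which adjusts θ to reduce the sum of squared residuals.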


3.3 Probabilistic Data Association

A modified version of the PDA method in Section 2.2.3 is presented here. As it is suspected that measurements with high confidence are more accurate, the confidence values are included in the modified method. This is done by multiplying the likelihood (2.13), where y^j is the target originated measurement, with the corresponding confidence value p^j,

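\[
p(\{y\}_k, \{p\}_k \mid A_j, Y_{k-1}) \propto P_D P_G^{N-1} \Lambda_k^j p^j, \tag{3.16}
\]

where {p}_k is the set of all confidence values at time k. This results in a slightly different expression for the association probability

\[
P(A_j \mid Y_k, \{p\}_k) =
\begin{cases}
\dfrac{1 - P_D P_G}{(1 - P_D P_G) + P_D P_G \sum_{j=1}^{N} \Lambda_k^j p^j}, & j = 0 \\[2.5ex]
\dfrac{P_D P_G \Lambda_k^j p^j}{(1 - P_D P_G) + P_D P_G \sum_{j=1}^{N} \Lambda_k^j p^j}, & j \geq 1.
\end{cases}
\tag{3.17}
\]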

The state update and prediction are identical to those of the standard PDA given in Section 2.2.3.
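The confidence-weighted association probabilities (3.17) can be computed as in the following sketch. It is an illustration under assumptions: the function name is hypothetical, `likelihoods` is assumed to hold the values Λ_k^j for the N gated measurements as defined in Section 2.2.3, and `p_d` and `p_g` stand for the probabilities P_D and P_G from that section.

```python
def association_probabilities(likelihoods, confidences, p_d, p_g):
    """Association probabilities (3.17).

    Returns a list of length N + 1: index 0 is the event that no measurement
    originates from the target, index j >= 1 belongs to measurement j.
    """
    weighted = [lam * p for lam, p in zip(likelihoods, confidences)]
    miss = 1.0 - p_d * p_g
    denom = miss + p_d * p_g * sum(weighted)
    return [miss / denom] + [p_d * p_g * w / denom for w in weighted]


# Two measurements with equal likelihood but different confidence (made-up values).
probs = association_probabilities([2.0, 2.0], [0.9, 0.3], p_d=0.9, p_g=0.99)
```

Because every term shares the same denominator, the probabilities sum to one, and two measurements with equal likelihood are weighted in proportion to their confidence values.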


4 Error Analysis

This chapter presents the error norms that are used in the method evaluation in Chapter 5. The errors are calculated for estimated tracks using reference tracks, known as markings (see Section 1.5). However, not all estimated tracks have a corresponding marking, so tracks must be paired up with markings before the errors can be calculated. This procedure is presented in detail below.

4.1 Track-Marking Pairs

In order to calculate the error for an estimated track, there is a need to determine if a marking exists that corresponds to the tracked object. It is also necessary to determine the time interval that they both have in common. This is achieved by introducing the concepts of spatial overlap and temporal overlap (see [20]). The spatial overlap of a confirmed track and a marking for a single image frame is given by

\[
SO = \frac{\operatorname{area}(T \cap M)}{\operatorname{area}(T \cup M)}, \tag{4.1}
\]

where T and M are the ROIs of a track and a marking, respectively (as defined in (3.1)). The temporal overlap is defined as

\[
TO = \text{overlap in frame span}. \tag{4.2}
\]

A track and a marking are then paired up if the following criteria are fulfilled:

\[
\overline{SO} \geq 0.2, \qquad TO \geq 18 \text{ frames}, \tag{4.3}
\]

where \overline{SO} is the average spatial overlap during the time interval given by the temporal overlap. The criteria (4.3) have empirically been shown to result in good track-marking pairs.
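The pairing test (4.3) can be sketched as follows. This is a minimal illustration with hypothetical names. A track and a marking are each assumed to be a dictionary that maps a frame index to an axis-aligned ROI (x_min, y_min, x_max, y_max), and the temporal overlap is assumed to be the number of frames in which both are present.

```python
def roi_overlap(a, b):
    """Spatial overlap (4.1) of two ROIs given as (x_min, y_min, x_max, y_max)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def is_pair(track, marking, min_so=0.2, min_to=18):
    """Check the criteria (4.3) for one track and one marking."""
    common = sorted(track.keys() & marking.keys())
    if not common or len(common) < min_to:
        return False
    mean_so = sum(roi_overlap(track[f], marking[f]) for f in common) / len(common)
    return mean_so >= min_so


# Made-up example: 20 shared frames with identical ROIs.
track = {f: (0, 0, 10, 10) for f in range(0, 30)}
marking = {f: (0, 0, 10, 10) for f in range(10, 40)}
paired = is_pair(track, marking)
```

The frames in `common` are also the frames over which the errors of Section 4.2 would be evaluated for an accepted pair.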


4.2 Error Norms

The following sections present error norms that are used in the method evaluation in Chapter 5. Suppose that a marking y^M, as given in (1.7), is available for a given track-marking pair at a given frame. The difference y^M − y in image coordinates is then transformed to a difference P(y^M − y) in world coordinates using the transformation (1.6). This is done under the assumption that the x component of the marking is the same as that of the estimated track. The vector y can either be a measurement, the output from the clustering, or obtained using the measurement model (1.3b). The error is then calculated in world coordinates as

\[
E(y) := \lVert P(y^M - y) \rVert_2, \tag{4.4}
\]

where ‖·‖₂ denotes the 3-dimensional Euclidean norm. Note that the artificial measurement size, as given in (1.2), is not included when calculating the error (4.4).

4.2.1 Tracking Error

The tracking error (TE) is presented in Algorithm 2.

Algorithm 2 Tracking Error

1: a ← empty array
2: for each track-marking pair p do
3:     for each frame in the temporal overlap of p do
4:         calculate E(y), where y is obtained from the measurement model (1.3b)
5:         concatenate the error calculated on row 4 to a
6:     end for
7: end for
8: result ← mean value of a    ▷ tracking error

4.2.2 5-Largest Tracking Error

The 5-largest tracking error (5-LTE) is presented in Algorithm 3. The 5-LTE gives a measure of the worst performance of a method.

4.2.3 Clustering Error

The clustering error (CE) is presented in Algorithm 4. The CE can only be calculated for clustering methods, and not for e.g. PDA.
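The aggregation performed by the error algorithms can be sketched as below, assuming the per-frame errors E(y) have already been computed with (4.4). The function names are hypothetical, and the input is assumed to be a list with one list of per-frame errors for each track-marking pair. The CE uses the same aggregation as the TE, only with errors computed from the clustering output instead of the measurement model.

```python
def mean_error(errors_per_pair):
    """Mean over all frames of all pairs, as used for the TE and the CE."""
    a = [e for errors in errors_per_pair for e in errors]
    return sum(a) / len(a)


def k_largest_mean_error(errors_per_pair, k=5):
    """Mean of the k largest errors of every pair; k = 5 gives the 5-LTE."""
    a = []
    for errors in errors_per_pair:
        a.extend(sorted(errors, reverse=True)[:k])
    return sum(a) / len(a)


# Made-up per-frame errors for two track-marking pairs.
errors_per_pair = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [2.0, 4.0],
]
te = mean_error(errors_per_pair)
lte5 = k_largest_mean_error(errors_per_pair)
```

Since the 5-LTE keeps only the largest errors of each pair, it can never be smaller than the TE on the same data.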

4.3 Overtaking Scenarios

In Section 5.2 all methods are evaluated in overtaking scenarios where the target vehicle is seen from an angle (see Figure 4.1). This case is especially difficult since the classifications tend to be non-symmetrically distributed around the target.


Algorithm 3 5-Largest Tracking Error

1: a ← empty array
2: for each track-marking pair p do
3:     for each frame in the temporal overlap of p do
4:         calculate E(y), where y is obtained from the measurement model (1.3b)
5:     end for
6:     concatenate the 5 largest errors calculated on row 4 to a
7: end for
8: result ← mean value of a    ▷ 5-largest tracking error

Algorithm 4 Clustering Error

1: a ← empty array
2: for each track-marking pair p do
3:     for each frame in the temporal overlap of p do
4:         calculate E(y), where y is the output from the clustering
5:         concatenate the error calculated on row 4 to a
6:     end for
7: end for
8: result ← mean value of a    ▷ clustering error

The overtaking scenarios are selected using the following conditions,

\[
x < 30\ \text{m} \tag{4.5a}
\]

\[
-10\ \text{m} < y < -1.7\ \text{m} \quad \text{or} \quad 1.7\ \text{m} < y < 10\ \text{m}, \tag{4.5b}
\]

where x and y are elements in the state vector x_k (see Section 1.4).
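The selection (4.5) amounts to a simple predicate on the longitudinal and lateral position of the target. The sketch below uses a hypothetical function name and assumes both positions are given in meters.

```python
def is_overtaking(x, y):
    """Conditions (4.5): target closer than 30 m ahead and 1.7 m to 10 m to the side."""
    return x < 30.0 and 1.7 < abs(y) < 10.0


# Position with the magnitudes quoted in the caption of Figure 4.1.
truck_selected = is_overtaking(23.0, 3.1)
```

Using the absolute value of y covers the left and right intervals of (4.5b) in a single comparison.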

4.4 Confidence Value Error Analysis

Most of the tracking methods in this thesis rely on the assumption that the confidence values give an indication of the accuracy of the measurements. It is therefore of interest to investigate whether or not this assumption is true, and if so, to what extent.

The dependence of the measurement accuracy on the confidence values is evaluated by computing the error according to the norm (4.4) for measurements and then plotting the error against the corresponding confidence values. In order to compute the error for a measurement of an object, a reference marking of the object must be available. For each frame in the temporal overlap of each track-marking pair, as given in Section 3.1, the error is calculated for all measurements in the measurement group that was assigned to the track at that frame. This allows one to plot, for example, the mean value and the standard deviation of the error for different confidence value intervals, as is done in Section 5.3.
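The statistics per confidence interval can be gathered as in the following sketch, which uses only the Python standard library. The function name and the choice of interval edges are hypothetical, and the population standard deviation is used here as an assumption, since the text does not specify which estimator was applied.

```python
import statistics


def error_by_confidence(confidences, errors, edges):
    """Mean and standard deviation of the error for each confidence interval.

    Intervals are [edges[i], edges[i + 1]), with the last interval closed at
    its upper edge. Empty intervals yield None.
    """
    n_bins = len(edges) - 1
    bins = [[] for _ in range(n_bins)]
    for p, e in zip(confidences, errors):
        for i in range(n_bins):
            in_last_edge = i == n_bins - 1 and p == edges[-1]
            if edges[i] <= p < edges[i + 1] or in_last_edge:
                bins[i].append(e)
                break
    return [(statistics.mean(b), statistics.pstdev(b)) if b else None
            for b in bins]


# Made-up confidence values and errors, split into two intervals.
stats = error_by_confidence([0.1, 0.2, 0.9, 1.0], [4.0, 2.0, 1.0, 1.0],
                            edges=[0.0, 0.5, 1.0])
```

Plotting the first element of each tuple against the interval midpoints gives a curve of the kind denoted E(p) in Section 3.2.4.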


Figure 4.1: Example of an overtaking scenario. The truck is located at (x, y, z) = (23, 3.1, 1.4).


5 Results

In this chapter the methods presented in Chapter 3 are evaluated. All input signals to the tracking stage have been recorded in real-world scenarios as described in Section 1.5. The error of each method is calculated in world coordinates using the error norms presented in Chapter 4. An error analysis of the confidence values is also given.

All methods are implemented using M/N initiation (see Section 2.2.4) of tracks, with N1 = 3, M2 = 2 and N2 = 3. Tracks are deleted after N_D = 4 consecutive misses.

5.1 Method Evaluation

The methods are evaluated using data that consists of about 2 · 10⁵ frames. For each method about 686 track-marking pairs were created, with an average length of about 305 frames.

The TE, 5-LTE and CE are shown in Table 5.1 for each method.

5.2 Method Evaluation in Overtaking Scenarios

The methods are evaluated using data that consists of about 2 · 10⁵ frames. Only track-marking pairs where the track satisfies (4.5) are considered. For each method about 253 track-marking pairs were created, with an average length of about 75 frames.

The TE, 5-LTE and CE for overtaking scenarios are shown in Table 5.2 for each method.


Table 5.1: Estimated error in centimeters for all methods. The smallest and largest mean values are shown in green and red, respectively.

                    TE              5-LTE              CE
Method         Mean    Std      Mean    Std      Mean    Std
SC            21.84  11.02     38.36  14.46     24.43  12.05
WSC, m = 2    21.27  10.74     37.05  14.39     22.55  11.25
WSC, m = 3    21.04  10.62     36.58  14.26     21.84  10.94
WSC, m = 4    20.93  10.55     36.39  14.19     21.52  10.81
WSC, m = 5    20.89  10.56     36.35  14.22     21.37  10.79
WSC, m = 6    20.91  10.58     36.42  14.22     21.32  10.79
NWSC          21.06  10.59     36.47  14.21     21.19  10.77
WSC           21.20  10.65     36.72  14.25     21.46  10.90
AC            21.38  10.78     37.05  14.50     21.78  11.11
PDA           22.58  11.30     37.91  14.80         -      -
RC            20.91  10.67     38.09  15.69     22.73  11.83

Table 5.2: Estimated error in centimeters for all methods in overtaking scenarios. The smallest and largest mean values are shown in green and red, respectively.

                    TE              5-LTE              CE
Method         Mean    Std      Mean    Std      Mean    Std
SC            17.43  10.69     28.78  14.17     21.10  12.57
WSC, m = 2    16.96  10.22     26.98  13.02     19.10  11.55
WSC, m = 3    16.94  10.09     26.68  12.45     18.46  11.11
WSC, m = 4    16.96  10.15     26.67  12.61     18.19  10.96
WSC, m = 5    17.19  10.27     26.85  12.59     18.25  10.99
WSC, m = 6    17.37  10.37     27.11  12.64     18.32  11.00
NWSC          17.81  10.38     27.24  12.35     18.32  10.90
WSC           18.22  10.55     27.83  12.33     18.85  11.13
AC            18.63  10.73     28.41  12.36     19.41  11.43
PDA           19.34  11.76     28.91  14.00         -      -
RC            17.65  10.28     28.14  12.35     20.08  11.86
