
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2020

Improved Data Association for Multi-Pedestrian Tracking Using Image Information


Master of Science Thesis in Electrical Engineering

Improved Data Association for Multi-Pedestrian Tracking Using Image Information

Frida Flodin

LiTH-ISY-EX--20/5329--SE

Supervisor: Mikael Persson

ISY, Linköping University

Patrik Leissner

Veoneer Sweden AB

Examiner: Mårten Wadenbäck

ISY, Linköping University

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden


Abstract

Multi-pedestrian tracking (MPT) is the task of localizing and following the trajectory of pedestrians in a sequence. An MPT algorithm is an important part of preventing pedestrian-vehicle collisions in Automated Driving (AD) and Advanced Driving Assistance Systems (ADAS), and it has benefited greatly from the advances in computer vision and machine learning in the last decades. Given the output of a pedestrian detector, the tracking consists of associating the detections between frames and maintaining pedestrian identities throughout the sequence. This can be a challenging task due to occlusions, missed detections and complex scenes. The number of pedestrians is unknown, and it varies with time. Finding new methods for improving MPT is an active research field and there are many approaches found in the literature. This work focuses on improving the detection-to-track association, the data association, with the help of extracted color features for each pedestrian. Utilizing the recent improvements in object detection, this work shows that classical color features are still relevant in pedestrian tracking for real-time applications with limited computational resources. The appearance is not only used in the data association but also integrated in a newly proposed method to avoid tracking errors due to missed detections. The results show that even with simple models the color appearance can be used to improve the tracking results. Evaluation on the commonly used Multi-Object Tracking benchmark shows an improvement in Multi-Object Tracking Accuracy and in the number of identity switches, while keeping other measures essentially unchanged.


Acknowledgments

First of all, I want to thank Veoneer in Linköping for giving me the opportunity to perform this master's thesis. Especially, I want to give my appreciation to my onsite supervisor Patrik Leissner for taking his time and giving me the support and insights I needed.

I would also like to thank my examiner Mårten Wadenbäck and my university supervisor Mikael Persson. Mikael, thank you for sharing your knowledge on the subject and your honest feedback on the writing and methods. Mårten, thank you for taking your time to read and give valuable comments on the report.

Last but not least, I want to thank the people I have met at Veoneer for making the experience of finishing my studies a memorable and fun one. Karin, Christelle and Sara, thank you for your support, enthusiasm and for not accepting any excuses to skip running on Tuesdays. The "Trackingfika" group, thank you for keeping my blood sugar high with your tasty pastries. Finally, to my fellow thesis worker Jakob Jerrelind, thank you for great discussions and for keeping me company during this journey.

Linköping, June 2020 Frida Flodin


Contents

List of Figures
List of Tables
Notation

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Problem Formulation
  1.4 Limitations
  1.5 Thesis Outline

2 Multi-Object Tracking: Basic Principles
  2.1 Object Detection
  2.2 Gating
  2.3 Data Association
  2.4 Single-Object Tracking
  2.5 Creation and Deletion of Tracks

3 Related Work
  3.1 Region of Interest for Feature Extraction
    3.1.1 Bounding Box Division
    3.1.2 Foreground Segmentation
  3.2 Appearance Features for Tracking
    3.2.1 Color Features
    3.2.2 Deep Features and Siamese Networks
  3.3 Combining Multiple Cues for Data Association
  3.4 Feature Updating Strategies
  3.5 Some Notes on Other Strategies for Multi-Pedestrian Tracking and Evaluation

4 Method
  4.1 Feature Extraction
    4.1.1 Region of Interest
    4.1.2 Selected Features
    4.1.3 Evaluating Discriminatory Power
  4.2 Baseline Tracker
  4.3 Data Association Improvement Strategies
    4.3.1 Updating the Track Appearance
    4.3.2 Data Association Using Color Information
    4.3.3 Missed Detection Logic
  4.4 Multi-Object Tracking Evaluation
    4.4.1 Metrics
    4.4.2 Evaluation Environment

5 Results
  5.1 Evaluation of Features
  5.2 General Tracking Results
    5.2.1 Improved Data Association
    5.2.2 Improvement with Missed Detection Logic
  5.3 Multi-Object Tracking Evaluation
    5.3.1 Improved Data Association
    5.3.2 Improvement with Missed Detection Logic

6 Discussion
  6.1 Results
  6.2 Method
  6.3 The Work in a Wider Context

7 Conclusions
  7.1 Answered Research Questions
  7.2 Future Work


List of Figures

1.1 Conceptual overview of multi-pedestrian tracking.
2.1 An overview of a general multi-object tracking system.
2.2 Examples of pedestrian detections from a Faster R-CNN.
2.3 Example of a two-stage M/N-logic.
4.1 Samples of bounding boxes where the chosen ROI is plotted.
4.2 Examples from the pedestrian database used for feature evaluation.
4.3 An overview of the tracking algorithm used for experiments.
4.4 A flowchart showing the missed detection logic.
5.1 The results from the feature evaluation.
5.2 Scenario in sequence MOT16-09, where the data association is improved.
5.3 Scenario in sequence MOT16-05, where the data association is improved.
5.4 Scenario in sequence MOT16-05, where the updated data association makes the wrong association.
5.5 Scenario showing the results of adding the missed detection logic.

List of Tables

4.1 A summary of the color features used in the experiments.
4.2 The values of the parameters used in the baseline tracker.
4.3 The values for the M/N-logic used in the baseline tracker.
4.4 Notations and their meaning for the missed detection logic.
4.5 Example of a CSV-file used to represent the tracking results.
5.1 MOT benchmark results with different color features used in data association.
5.2 MOT benchmark results with missed detection logic added.


Notation

ABBREVIATIONS

CNN – Convolutional neural network
MOT – Multi-object tracking, a more general term than MPT. The focus of this thesis is tracking of pedestrians, thus the terms will be used interchangeably.
MPT – Multi-pedestrian tracking
RGB – Red, green and blue color channels in a digital image.
ROI – Region of interest.

TRACKING VOCABULARY

Tracking – The task of localizing and following an object in a video sequence.
Online tracking – The tracking is performed in real time and no future frames can be used, as opposed to offline tracking, where future frames are used to globally optimize the solution.
Track – The trajectory of the object of interest. Here a track is always a trajectory for a pedestrian.
Bounding box – Usually an axis-aligned rectangle in the image, often represented by the top-left xy-coordinates together with the width and height in image coordinates.
Detection – A bounding box in the image where an object detector has found a pedestrian.
Data association – The task of matching a number of new detections to already existing tracks.
Ego car – When the tracking is performed from a car, the ego car is the car where the camera is mounted.


1 Introduction

Multi-pedestrian tracking (MPT) is the computer vision task of localizing and following the trajectory of pedestrians in a video sequence. There are many applications for MPT such as robotics, video surveillance and autonomous driving. This thesis will focus on improving an MPT-algorithm, utilizing the image information, for the application of automated driving.

To perform MPT, a pedestrian detector detects all pedestrians in each frame of the sequence. The tracking algorithm connects these detections between frames so that each pedestrian has a consistent identity throughout the sequence. A conceptual overview of MPT is found in Figure 1.1. Connecting new detections to already tracked pedestrians is called data association. This is a challenging task: for example, the pedestrians can be occluded and thus missed by the detector. Also, the movement patterns of pedestrians are usually non-linear, and the number of pedestrians is unknown and varies over time.

Finding ways to improve MPT-algorithms is an active research field and there are many approaches found in the literature. This thesis will focus on improving the data association with the aid of image information. Finding a robust way to represent pedestrian appearance under different circumstances is investigated in this work. During a sequence the image information varies due to illumination changes, shadows and specularities. Despite this, the appearance representation should remain relatively constant.

1.1 Motivation

This master’s thesis was performed at Veoneer AB in Linköping. Veoneer develops solutions for Automated Driving (AD) and Advanced Driving Assistance Systems (ADAS). To make a car aware of its surroundings it is equipped with cameras, and the vehicle where the camera is mounted is called the ego car. These cameras make it possible for the ego car to detect and track pedestrians and other traffic participants, and to use this information to avoid traffic accidents. Creating reliability and trust in these systems is of key importance for the automotive industry.

Figure 1.1: Conceptual overview of multi-pedestrian tracking using the tracking-by-detection method. In each frame a pedestrian detector provides bounding boxes for all pedestrians. The tracking consists of assigning the correct identity to each pedestrian throughout the sequence. The data association step connects new detections with the identities from the previous frame.

Multi-pedestrian detection and tracking is an important part of preventing pedestrian-vehicle collisions. It has benefited greatly from the advances in computer vision and machine learning in the last decades. The goal for these systems is to correctly estimate all traffic participants' positions and velocities so that the ego car can avoid dangerous situations. If the system misinterprets situations and, for example, believes that a person is moving towards the ego car, it can lead to unnecessary braking, which in turn can lead to accidents. One cause of this is when two or more pedestrians are close to each other and their detections are confused. If these mismatches are avoided, it would lead to increased robustness of the tracking system.

1.2 Aim

The aim is to investigate how the image information can be taken into consideration to improve the data association step in MPT. This thesis will investigate how to use the image information to avoid associations that lead to tracking errors. The goal is to avoid situations where inaccurate matches lead to faulty decisions by the ego car. In some situations, it is more desirable for the system to skip the detection and possibly lose the pedestrian for a few frames rather than causing a false velocity towards the ego car.


1.3 Problem Formulation

The objective of this thesis is to find out what image features can be used in a multi-pedestrian tracker to avoid incorrect data association. Following this, the question of how to integrate these features in an existing tracker must be investigated. Thus, the following research questions will be answered:

• How can image information be used to improve the data association in a multi-pedestrian tracker?

• What features in the images are useful for distinguishing one pedestrian from another?

• How should the image information be integrated in an existing multi-pedestrian tracker to achieve better results?

1.4 Limitations

To complete this thesis in the given time frame of 20 weeks, some limitations have been introduced.

A simple baseline tracker is used as a starting point for this work and the goal is to modify it using image information in the data association step. Therefore, other general tracking methods may be overlooked since they do not fit the application. Entirely different tracking approaches will not be tested in this work. The goal is not to find a tracker that outperforms state-of-the-art tracking algorithms, and a survey of state-of-the-art tracking algorithms is out of the scope of this thesis.

In the current state of research, attention is often given to the problem of re-identification of pedestrians after longer occlusions. The aim of this thesis is not to solve this kind of problem, since in the automotive application this is not the greatest concern. These re-identification tasks are commonly solved in surveillance applications, for example when following a suspect between different cameras. Those methods are often time consuming and a large quantity of historical appearance data needs to be saved between frames. They also require tracks that have been unseen for a longer duration of time to be saved. This increases the complexity of the data association considerably and is not suitable for an application where tracking is in real time and with limited computational resources.

1.5 Thesis Outline

The rest of the thesis is structured as follows. In Chapter 2, background information on multi-object tracking is provided. Following that, a summary of related work regarding feature extraction and integration in the tracking framework is found in Chapter 3. In Chapter 4, details of the evaluation and the implementation of a baseline tracker and the improvement strategies are presented. The results are found in Chapter 5. A discussion


of the results and the methods is presented in Chapter 6. Finally, the conclusions are found in Chapter 7.


2 Multi-Object Tracking: Basic Principles

This chapter provides a short summary of the basic principles needed for multi-object tracking (MOT)¹. Presenting a survey of available MOT methods lies outside the scope of this thesis, and therefore this chapter will focus on the tracking framework that is used in the experiments of this work.

The goal for an MOT algorithm is, given a video sequence, to detect and follow each object in the scene [22]. Each object should maintain its unique identity throughout the sequence, and the goal is to estimate their trajectories. This is a complex task. Not all objects are visible in all frames, and they might change appearance and motion pattern during the sequence. In some applications, such as surveillance footage, this can be done in a batch manner, that is, given the entire sequence at once and using future frames to optimize globally. In a real-time application, such as an automotive application, this is not the case. It instead requires methods that construct the tracks sequentially with every new frame, also known as online tracking.

One common method for MOT is tracking-by-detection [4, 5, 39]. The idea is to separate the object detection part and the tracking part. This means that an object detector, usually based on deep learning systems, provides anonymous detections for all objects in each frame. The tracking is performed by associating detections between frames and giving each track a unique identification. However, no detection algorithm is perfect. This means that the tracker needs to ignore spurious detections and fill the gaps for missed detections.

An overview of a general MOT system is shown in Figure 2.1. This is a simple and common way of performing MOT. Each step will be explained in the following sections.

¹ In this thesis "objects" are the same as pedestrians. This chapter uses the general term "object" to be consistent with other theory sources. There is generally no difference in the tracking algorithm depending on what object is tracked; it mainly depends on what the detector is trained to detect.

[Figure 2.1 flowchart: object detection in the current frame → gating (remove unlikely matches) → data association (match new detections to tracks) → single-object tracking (update each track with its new detection) → track logic (delete bad tracks, keep promising and confirmed ones) → predict tracks for the next frame.]

Figure 2.1: An overview of a general multi-object tracking system.

2.1 Object Detection

Object detection is the problem of locating and classifying objects such as pedestrians or cars in the field of view of a sensor. For example, the sensor could be range based (RADAR, LIDAR) or image based (IR images, visible-light images). The object detection method available in this thesis uses a visible-light camera as sensor, and therefore this section will focus on object detection in visible-light images. The object detector should, given an image, output the position and size of each object from the sought class. Usually the detector outputs this as a bounding box in the image.

In recent years, due to the rapid development of machine learning and convolutional neural networks (CNNs), the ability to accurately detect objects in images has increased. One state-of-the-art method for object detection is the so-called Faster R-CNN [27]. It is built on the same idea as the original R-CNN [11] proposed by Girshick et al. R-CNN is short for "Regions with CNN features", and the idea is to select region proposals in the image, use the CNN to extract features, and use a support vector machine (SVM) to classify the regions. The Faster R-CNN provides a faster solution as a result of introducing Region Proposal Networks (RPNs). Examples with detections produced by a Faster R-CNN are found in Figure 2.2. The examples show the problem when pedestrians are close to each other and some pedestrians are undetected.

2.2 Gating

Gating is a method to reduce the complexity of the data association. Gating serves as a first step to remove detections that are unlikely to belong to a known track. The detection algorithm is likely to produce false positives (i.e., detections where no object is present), and new objects can enter the scene and therefore generate detections far from known tracks. If these detections are ignored in the association step, the number of possible associations is reduced, thus reducing the overall complexity.

One way to perform gating is to use an elliptical model around the position of a known track. The ellipse is created using the uncertainty of the track position in different directions. If the detection is within this ellipse, then there is a possibility that it belongs to the track. This elliptical distance is also known as the Mahalanobis norm [23] and the gating criterion is given as


Figure 2.2: Two example frames with detections from a Faster R-CNN. (a) All fully visible pedestrians are correctly detected. (b) Some of the pedestrians are missed by the detector even when they are visible to the camera. The images and detections are taken from the public dataset in the MOT17 challenge [24].

(y_t − ŷ_{t|t−1})^T S_t^{−1} (y_t − ŷ_{t|t−1}) < γ_G,    (2.1)

where γ_G is the gating threshold, S_t is the innovation uncertainty at time t, y_t is the position of a detection, and ŷ_{t|t−1} is the predicted position of a track given the previous frame.
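As a concrete illustration, the gating test in (2.1) can be sketched in a few lines of numpy. This is a sketch, not the thesis implementation: the position is assumed two-dimensional, and the threshold 9.21 corresponds to the 99th percentile of the χ² distribution with two degrees of freedom.

```python
import numpy as np

def gate(detection, predicted, S, gamma_g):
    """Return True if the detection passes the elliptical (Mahalanobis) gate.

    detection : measured position y_t, shape (2,)
    predicted : predicted track position y_hat_{t|t-1}, shape (2,)
    S         : innovation covariance S_t, shape (2, 2)
    gamma_g   : gating threshold
    """
    innovation = detection - predicted
    d2 = innovation @ np.linalg.inv(S) @ innovation  # squared Mahalanobis distance
    return d2 < gamma_g

# Example: a detection close to the prediction passes, a far one does not.
S = np.diag([4.0, 4.0])          # position uncertainty (pixels^2)
pred = np.array([100.0, 50.0])
print(gate(np.array([102.0, 51.0]), pred, S, gamma_g=9.21))  # True
print(gate(np.array([140.0, 90.0]), pred, S, gamma_g=9.21))  # False
```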

2.3 Data Association

Data association is the task of connecting new detections to existing tracks. This can be seen as solving an assignment problem. One common algorithm that solves it optimally is the Hungarian method [18]. The Hungarian method uses a cost matrix where each row represents a detection d_j and each column a known track t_i. The cost matrix is initialized with a large positive value. Each element where the detection d_j passed the gating step for track t_i is assigned the cost of associating detection d_j with track t_i. The result of the Hungarian method is the detection-to-track assignment that minimizes the total cost.
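The cost-matrix construction can be illustrated with a small example. The brute-force solver below is only for illustration on a tiny matrix; in practice the Hungarian method (for example scipy.optimize.linear_sum_assignment) solves the same problem in polynomial time.

```python
from itertools import permutations

BIG = 1e6  # large initial cost, kept for pairs that failed the gating step

def assign_min_cost(cost):
    """Exhaustively solve a tiny square assignment problem: return the
    track index assigned to each detection (row) minimising total cost.
    Only for illustration; use the Hungarian method for real matrices."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[d][perm[d]] for d in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm)

# Rows = detections d_j, columns = tracks t_i; BIG marks gated-out pairs.
cost = [
    [0.2, BIG, 0.9],
    [BIG, 0.1, 0.8],
    [0.7, BIG, 0.3],
]
print(assign_min_cost(cost))  # [0, 1, 2]
```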

The solution only gives one-to-one matches, therefore the cost or similarity measure must be well chosen for optimal performance. One cost used in Simple Online and Realtime Tracking (SORT) [4] is the negative Jaccard index, or intersection-over-union (IoU), given by

IoU = |A ∩ B| / |A ∪ B| = Area of overlap / Area of union,    (2.2)

where A and B are two bounding boxes. The IoU-based association is a simple, yet accurate, method that performs well in most cases. However, there will be problems when two or more objects are close to each other and almost of the same size. To make


this association more robust, more information must be taken into consideration. The aim of this thesis is to improve such situations with image information, and therefore, methods for doing this are discussed in detail in Chapter 3.
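For reference, the IoU of (2.2) for axis-aligned boxes in the (x, y, w, h) format is straightforward to compute. This is a sketch; SORT uses the negative of this value as the association cost.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned bounding boxes
    given as (x, y, w, h) with (x, y) the top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.333... (half-overlapping boxes)
print(iou((0, 0, 10, 10), (20, 20, 5, 5)))  # 0.0 (disjoint boxes)
```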

Even if a detection passed the gating step, the solution can yield an association with a high cost. A solution to this is to extend the assignment problem so that a detection can be associated with an "external source" instead of one of the existing tracks. This is done by adding as many columns to the assignment matrix as there are detections. For each detection, the corresponding external-source element is given a fixed cost c_ext. The external source is there to cover for spurious detections and for new tracks entering the scene. If the cost for every existing track is larger than c_ext, the Hungarian method will associate the detection with an external source instead.

2.4 Single-Object Tracking

Once each track is associated with a detection, the problem is treated as single-object tracking. This means that each track is processed individually. A filtering method, such as the Kalman filter [16], is used to estimate the state of each object together with an uncertainty. The single-object tracker predicts the positions and velocities. This prediction is used in the data association in the next frame.

For bounding box estimation in image coordinates, a reasonable state, used in the work by Wojke et al. [39], is given by

X = [x_c  y_c  a  h  ẋ_c  ẏ_c  ȧ  ḣ]^T,    (2.3)

where (x_c, y_c) are the center coordinates of the bounding box, h is the height and a is the aspect ratio. The corresponding velocities are given by (ẋ_c, ẏ_c, ȧ, ḣ). Since the detector outputs the position and size of the object in the image, (x_c, y_c, a, h) is the part of the state that is measured directly. The velocities must be estimated by a motion model. For the purpose of this work a constant-velocity motion model is assumed sufficient for the Kalman filtering.
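A minimal numpy sketch of a constant-velocity Kalman filter for this state is given below. The noise covariances Q and R are illustrative placeholders, not tuned values from the thesis.

```python
import numpy as np

dt = 1.0  # one frame between predictions

# State X = [x_c, y_c, a, h, vx, vy, va, vh]^T with constant-velocity dynamics.
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)                    # positions advance by velocity * dt
H = np.hstack([np.eye(4), np.zeros((4, 4))])  # detector measures (x_c, y_c, a, h)

def predict(x, P, Q):
    """Kalman prediction step with the constant-velocity model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Kalman update step with a detection z = (x_c, y_c, a, h)."""
    S = H @ P @ H.T + R                # innovation covariance (also used for gating)
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(8) - K @ H) @ P
    return x_new, P_new

# Illustrative track: center (100, 50) moving 2 pixels/frame to the right.
x = np.array([100.0, 50.0, 0.5, 80.0, 2.0, 0.0, 0.0, 0.0])
P, Q, R = np.eye(8), 0.01 * np.eye(8), np.eye(4)

x, P = predict(x, P, Q)
print(x[:2])  # predicted center moved by the velocity: [102.  50.]
x, P = update(x, P, np.array([103.0, 50.0, 0.5, 80.0]), R)
```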

2.5 Creation and Deletion of Tracks

Since the number of tracks is unknown and varies over time, there is a need for a system that handles the creation and deletion of tracks. Tracks that have not been detected for a couple of frames might have left the scene and should therefore be deleted. There is also a need to avoid creating new tracks from spurious detections.

One method for doing this is the M/N-logic [6]. This method considers the track history and how well the track has been detected in past frames. A track is considered a true track if it has been successfully detected M times in the last N frames. A track that has not yet been confirmed is saved as a tentative track. If the track has no chance of meeting the M/N-criterion, it is deleted from the tracking system. Each track is in one of the following stages:


• Tentative – Possible new track; needs more successful detections to be considered a true track.

• Confirmed – A track that passed the M/N-criterion and thus is considered a true track.

• Deleted – When the track has been undetected for too many frames it is marked as deleted.

The M/N-logic can be separated into different stages, where each stage has different M and N values. The track is in different stages depending on the number of frames since the track was first detected. This is used to be stricter in the early stages, for example requiring that a track is detected two frames in a row to move on to the next stage. On the other hand, an old track can use a more forgiving M/N-logic, for example accepting a track that is detected in 6 out of the 10 last frames as a true track. An example of the M/N-logic with two stages is shown in Figure 2.3.


Figure 2.3: Example of a two-stage M/N-logic. The track must be detected two frames in a row to move from stage 1 to stage 2. In stage 2 the track can be undetected one frame and still become confirmed. The dashed arrows indicate missed detections; the solid arrows indicate successful associations with a detection.
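The two-stage logic of Figure 2.3 can be sketched as follows. This is a simplified sketch; the exact stage transitions in the thesis tracker may differ.

```python
from collections import deque

class TrackLogic:
    """Simplified two-stage M/N confirmation logic in the spirit of
    Figure 2.3: stage 1 (2/2) deletes a tentative track on any miss in
    its first two frames; stage 2 (2/3) confirms it once 2 of its last
    3 frames contain a successful detection."""

    def __init__(self):
        self.history = deque(maxlen=3)  # True = associated with a detection
        self.status = "tentative"

    def step(self, detected):
        self.history.append(detected)
        hits = sum(self.history)
        if self.status == "tentative":
            if len(self.history) <= 2 and not detected:
                self.status = "deleted"    # failed stage 1 (2/2)
            elif len(self.history) == 3:
                self.status = "confirmed" if hits >= 2 else "deleted"
        return self.status

# A hit, a hit, one allowed miss, then a hit: the track gets confirmed.
track = TrackLogic()
print([track.step(d) for d in [True, True, False, True]])
# ['tentative', 'tentative', 'confirmed', 'confirmed']
```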


3 Related Work

Since the recent improvements in machine learning, object detection algorithms have become more accurate. This has led research to shift focus towards the data association step in the tracking framework. For this, a more accurate affinity measure needs to be extracted, and one idea is to use the appearance to distinguish different pedestrians. This chapter summarizes relevant methods for extracting image features and integrating them in the tracking framework.

3.1 Region of Interest for Feature Extraction

When using image features for tracking, a region of interest in the image must be chosen. In most tracking frameworks the detections are given as rectangular, axis-aligned bounding boxes in image coordinates. These bounding boxes should cover the entire pedestrian, and thus a part of the background will occupy the bounding box as well. The background will typically vary during a sequence, but the image features need to be relatively constant to be useful for identifying different pedestrians. The appearance of the same pedestrian can also change over time, for example when the arms move or the pedestrian turns around. It can also be interesting to separate the upper and lower body, since people can wear different-looking clothes on those parts. This can give more information for each person. Several methods for finding a suitable region of interest are found in the literature.

3.1.1 Bounding Box Division

Choosing a static smaller rectangle within the bounding box is a simple way to avoid the background problem. This follows the assumption that the pedestrian usually is centered in the bounding box. Trying to find a part of the bounding box that always


covers the shirt the pedestrian is wearing can in most cases prove to be successful. This approach was used by Imanuddin et al. [14]. It is a straightforward method that can work well in most frames. However, the pedestrian is not always centered in the bounding box and is sometimes semi-occluded. The method will fail in those cases.

Kuo et al. [19] randomly select 654 different rectangles within the bounding box, with constraints on the aspect ratio. Using an AdaBoost framework, they determine which boxes give the most discriminating features. Once the best set of rectangles is selected, it is used for extracting features. In their work they found that, depending on the feature, different sizes and positions of the boxes were selected.

3.1.2 Foreground Segmentation

Using the Cityscapes dataset [10], a Mask R-CNN [13] can be trained to perform pedestrian segmentation on a pixel level. This kind of network can be used for both detection and segmentation of pedestrians.

A more detailed segmentation is to use human parsing, which segments out different body parts. A method for doing this was proposed by Gong et al. [12]. These different body parts could be used to let the appearance of each part represent the pedestrian. In this way, people with different clothes on the upper and lower body could be distinguished.

3.2 Appearance Features for Tracking

Once a region of interest is selected, it is used to extract features from the image. There are several classical methods for extracting color and texture features. In recent years the use of deep features has become increasingly popular. In this section a selection of popular appearance features is summarized.

3.2.1 Color Features

Color is frequently used in the field of visual tracking, and many models for representing color are found in the tracking literature. The idea of using color features is to represent the clothes a pedestrian is wearing and achieve a compact model of its appearance. To provide useful information about the appearance, the color features need to be discriminative and invariant to scale and illumination changes.

RGB Histograms

Most digital images today are represented with three color channels: red, green and blue (RGB). Therefore a straightforward approach is to use features in RGB space directly. The RGB histogram [33] has been applied to represent object appearance in several tracking applications [3, 9, 19].

The RGB representation is beneficial as it does not require extra conversions to other color spaces. However, the RGB color space does not provide a way to measure color


difference in a way that is close to how humans perceive color [21]. Therefore it is hard to get a compact representation of the color information.

HSV Histograms

The HSV (hue, saturation, value) color space is a popular color representation since it separates the color information from the intensity. This makes it relatively robust to illumination variations. Several histogramming techniques using HSV space are found in the literature.

Pérez et al. [26] introduced a histogram of N = N_h N_s + N_v bins, where they first populated a 2D hue-saturation histogram. This histogram is created from the pixels with saturation and value larger than a threshold. The remaining pixels do not provide reliable color information, but are important when the area is mainly black or white. Therefore, N_v value-only bins are populated with these pixels. They used this histogram technique together with a particle filter to detect the target in consecutive frames.

A more compact way of histogramming in HSV space was proposed by Wang et al. [38]. They created a 12-bin histogram of the hue component. Using the saturation component, they weighted each entry in the histogram by the saturation value of the pixel. This makes pixels with purer colors affect the histogram more.

Regardless of which histogramming method is used, there is a need for quantization. Shao et al. [30] used a non-uniform quantization of the HSV color space. The hue part was divided into 8 quantization levels, since hue carries the color information. Saturation and value only occupy 3 quantization levels each, since this is what the human eye can distinguish [15].
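As an illustration of the saturation-weighted hue histogram in the spirit of Wang et al., the stdlib-only sketch below converts pixels one at a time via colorsys; a real implementation would operate on whole image arrays.

```python
import colorsys

def hue_histogram(pixels, n_bins=12):
    """Saturation-weighted hue histogram: each pixel votes for its hue
    bin with a weight equal to its saturation, so purer colors count more.

    pixels: iterable of (r, g, b) tuples with values in [0, 255].
    Returns a histogram normalised to sum to 1 (all zeros for gray input).
    """
    hist = [0.0] * n_bins
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * n_bins), n_bins - 1)] += s
    total = sum(hist)
    return [w / total for w in hist] if total > 0 else hist

# A saturated red patch dominates a washed-out blue one.
patch = [(255, 0, 0)] * 10 + [(120, 120, 140)] * 10
hist = hue_histogram(patch)
print(hist.index(max(hist)))  # 0, the red bin
```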

Color Names

Color names is a method for representing color in a way inspired by how color is used in natural language. In a study by Berlin and Kay [2] it was proposed that English has eleven basic color terms: black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow. When using this in image processing applications, the RGB values must be mapped to one of these color names. Van de Weijer et al. [36] provided an automatic method for learning color names from real-world images from Google Images with satisfactory results. Using this learned mapping from RGB to color names, an image is described by the 11-bin histogram of color names. Color names are, to some extent, invariant to illumination changes since a light or dark variant of a color is mapped to the same color name. It also allows for mapping of achromatic pixels such as white, gray and black pixels.

3.2.2 Deep Features and Siamese Networks

A modern approach to feature extraction is to use deep learning based features, more particularly convolutional neural networks (CNNs). Ciaparrone et al. [8] performed a survey on how deep learning is used in multi-object tracking applications. They found many works that used CNNs for feature extraction in the data association problem, with promising results.

One way is to use the CNN directly to extract image features and use a distance measure to decide if two feature vectors are from the same person. Yu et al. [41] used a network similar to GoogLeNet [34] to extract a 128-dimensional feature vector representing the appearance of each detection. They trained the network on a custom re-identification dataset.

Another way to utilize CNNs is to use so-called Siamese networks [7]. The input to a Siamese network is two images and the output is usually the probability that the images originate from the same person. The network consists of two CNNs that share weights, and training is performed by feeding the network pairs of images, both positive and negative samples. The network learns which features are best at separating different pedestrians. There are several works on using Siamese networks for data association in multi-object tracking [1, 20, 29].

3.3 Combining Multiple Cues for Data Association

There are many previous works where more than one measure is used in the data association, and there are different ways of combining them. The measures can be based on motion, shape, appearance and so on.

Yoon et al. [40] use three different affinity measures where all scores have the same importance. Since they interpret the affinity measures as likelihoods, they calculate each measure separately and combine them into one joint likelihood by multiplying them together.

Wang et al. [38], on the other hand, weight the different distance measures together with weighting coefficients that sum to 1. The three distance measures are all formed from image information.

Kuo et al. [19] have a number of features and regions and combine them using a linear combination. The weights are learned offline by the AdaBoost algorithm. Only the features and regions with the largest weights are used in the online tracking framework. In their work they found that the color histogram features were selected most; other features for texture and shape were selected less often.

3.4 Feature Updating Strategies

An important question when using image information in a tracking application is when to update the image information of a track. A track changes appearance during a sequence due to illumination or viewpoint changes. This requires a well thought-out strategy for updating the track features. Should new information be saved every time the track is successfully matched with a detection? The problem arises when the track is detected while partly occluded or viewed from another viewpoint than earlier. In that case the saved information of the track should preferably not be updated.

Nummiaro et al. [25] only save one feature vector per target and use a forgetting factor between frames. This means saving a linear combination of the old and the new feature vectors. The update is only made if the observation probability of the target is larger than a given threshold; otherwise it is ignored.

Takala et al. [35] update the histogram by weighting the N last histograms with Gaussian weights and adding them together in a linear combination. The latest histogram receives the highest weight, and the older the histogram, the smaller the weight. The Gaussian weights are normalized such that they sum to one.

Yoon et al. [40] use a historical appearance method where a series of appearances are saved. The appearance in one frame is only saved when the confidence of the track is sufficiently high. They also avoid saving very similar appearances by having an update interval depending on the frame rate of the sequence.

3.5 Some Notes on Other Strategies for Multi-Pedestrian Tracking and Evaluation

This thesis uses the MOT benchmark created by Milan et al. [24] to evaluate the tracking performance. This benchmark follows the classical tracking approach of tracking objects by their bounding boxes in the image. The ground truth consists of bounding boxes in each frame even when the person is not visible or partly occluded. This creates a stagnation in the state-of-the-art tracking results: there is no way of telling whether a tracking algorithm performs poorly on visible or on non-visible pedestrians. One work that aims to solve this problem is MOTS: Multi-Object Tracking and Segmentation by Voigtlaender et al. [37]. They extend the MOT task to include pixel-level segmentation of the objects. Together with new datasets they also present a new baseline algorithm for MOTS, in which the tasks of detection, segmentation and tracking are combined into one unified problem. The new approach leads to a more concrete evaluation of complex scenes. For example, during occlusions it is clear which pedestrian is closer to the camera. Training systems on this pixel-level ground truth will alleviate the problem when pedestrians are close to each other.


4 Method

The data association is improved by modifying the affinity measure used for the assignment problem. Each track and detection is represented by a feature vector, and the similarity between a track and a detection is determined by a similarity measure. It is interpreted as a likelihood that the detection and the track originate from the same pedestrian. First, the method for extracting image features is described in Section 4.1. To obtain comparable results, a baseline tracker is implemented as a reference. An overview of the tracking system is found in Figure 4.3 and the details concerning the baseline tracker are found in Section 4.2. In this thesis the important step of the system is the data association step; changes made to this module are described in Section 4.3. The influence of the changes is evaluated using the popular multi-object tracking benchmark described in Section 4.4.

4.1 Feature Extraction

In this section the method for extracting features from a bounding box is presented. First a region of interest is selected as described in Section 4.1.1. Following the assumption from Section 3.3 that the color of the clothes the pedestrian is wearing is a suitable cue for appearance, four different color features are chosen for further investigation. The chosen features are presented in Section 4.1.2. The method for evaluating their discriminatory power is found in Section 4.1.3.

4.1.1 Region of Interest

The detections and tracks are given as bounding boxes in the image. If the entire box is used when extracting image information the background will interfere. Since the pedestrians and the camera move, the background changes from frame to frame.

Figure 4.1: Examples of bounding boxes from a MOT-challenge sequence where the region of interest is marked as a white rectangle. This simple region successfully selects an area of the pedestrian's upper body clothes. The last example to the right, on the other hand, is half occluded and shows a case where this simple approach fails.

As discussed in Section 3.1, this segmentation can be done with varying accuracy. Many of the discussed approaches are applied to re-identify pedestrians after long occlusions or absences from the scene. Since this is not a main priority in this thesis, a simple and coarse ROI extraction is chosen instead. The ROI is inspired by the one used by Imanuddin et al. [14], but elongated to capture more pixels of the pedestrian. Assuming that the pedestrian is centered in the bounding box, it captures the color of the pedestrian's upper body clothes without requiring any heavy computations.

If the bounding box is given as the top-left xy-coordinates together with the width and height, [x_left, y_top, w, h], then the region of interest is given as

ROI = [x_left + w/3, y_top + h/6, w/3, h/3].    (4.1)

Examples of this ROI on different pedestrians are shown in Figure 4.1.
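A minimal sketch of this ROI computation (the function name is illustrative; the coordinate convention follows (4.1)):

```python
def upper_body_roi(box):
    """Region of interest from equation (4.1).

    box: (x_left, y_top, w, h) in image coordinates.
    Returns the centered upper-body region as (x, y, w, h).
    """
    x_left, y_top, w, h = box
    return (x_left + w / 3.0, y_top + h / 6.0, w / 3.0, h / 3.0)
```

For a 90 × 180 pixel box at the origin, this gives a 30 × 60 region starting at (30, 30).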

4.1.2 Selected Features

Four different color features are selected for evaluation in a multi-pedestrian tracker. These are chosen for their different compactness and strategies for representing color. The features investigated in this thesis are shown in Table 4.1, and there are different motivations for choosing them. The RGB histogram is interesting since it is the most trivial color descriptor given a digital image. The HSV color space is traditionally chosen for its color separation properties. The S-weighted histogram is chosen for its compactness. Color Names has been proposed to provide good discriminating ability despite its small dimensionality.


Table 4.1: A summary of the color features used in the experiments.

    Feature                                  Abbreviation   Number of bins
    Separate Hue-Saturation and Value [26]   HS_V           10 × 10 + 10 = 110
    S-weight histogram of H-channel [38]     SWH_H          12
    Color Names [36]                         CN             10
    RGB histogram                            RGB            6 × 6 × 6 = 216

Following the results by Kuo et al. [19], where color features were found to be the most discriminative, texture and shape features are not pursued from this point forward. Deep features and Siamese networks are left for future investigation, as their computational and memory requirements lie outside the scope of this thesis.

4.1.3 Evaluating Discriminatory Power

To investigate which of the features are more promising for this application, their ability to uniquely separate different pedestrians is evaluated. A database of pedestrians is extracted from sequence MOT16-09 from the MOT benchmark [24] using the ground truth tracks. In Figure 4.2 samples from the database are shown. From the ground truth tracks, a set of 10 tracks where the pedestrian is visible in more than 20 frames is selected. For each track, a random set of 10 frames is drawn and the bounding box image is saved. This creates a database of annotated images of different pedestrians with 10 samples for each pedestrian. It is used to evaluate how well each feature achieves high similarity within a class and low similarity with samples from other classes.

A uniqueness ratio inspired by Khan et al. [17] is used for the evaluation. Using the set of bounding boxes P_m belonging to pedestrian m and samples j from other classes, the uniqueness ratio is defined as

ratio = ( Σ_{k ∈ P_m} max_{i ∈ P_m, i ≠ k} sim(H_i, H_k) ) / ( Σ_{k ∈ P_m} max_{j ∉ P_m} sim(H_j, H_k) ),    (4.2)

where

sim(H_i, H_j) = Λ(H_i, H_j).    (4.3)

The similarity measure Λ is given in (4.5), where a higher value means more similar. The samples j from other classes are three times the number of samples in P_m, chosen at random from the database.

This ratio gives an indication of how well the feature can separate samples of the wanted pedestrian from other pedestrians. A large ratio means that the maximum similarity within the class is larger than the maximum similarity with samples from other classes. If the ratio is close to one it means that the feature results in equal similarities

Figure 4.2: Examples from the pedestrian database used for feature evaluation. The ground truth tracks are taken from the MOT-challenge sequence MOT16-09.

within the class as with samples from other classes. In other words, the feature is not good at telling different pedestrians apart.
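The ratio in (4.2) can be sketched in code as follows, assuming `sim` is any similarity function where higher means more similar (the function and argument names are illustrative, not from the thesis implementation):

```python
def uniqueness_ratio(class_feats, other_feats, sim):
    """Uniqueness ratio from equation (4.2).

    class_feats: list of feature vectors for pedestrian m (the set P_m).
    other_feats: list of feature vectors sampled from other pedestrians.
    sim: similarity function, higher means more similar.
    """
    num, den = 0.0, 0.0
    for k, h_k in enumerate(class_feats):
        # Best match within the class, excluding the sample itself.
        num += max(sim(h_i, h_k)
                   for i, h_i in enumerate(class_feats) if i != k)
        # Best match among samples from other classes.
        den += max(sim(h_j, h_k) for h_j in other_feats)
    return num / den
```

A ratio well above one indicates that the feature separates the pedestrian from the sampled distractors.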

4.2 Baseline Tracker

The baseline tracker is inspired by the Simple Online and Realtime Tracking (SORT) method [4] and follows the framework described in Chapter 2. An overview of the baseline tracker is shown in Figure 4.3. The goal is a simple tracker where extensions and improvements are easily integrated; the time complexity is also of concern. Following the popular tracking-by-detection method, the baseline tracker uses given detections in each frame and connects them into tracks. The detections are given as bounding boxes in image coordinates.

[Figure: block diagram with modules Detection → Data association (gating, matching) → Kalman filter update and M/N-logic (tentative, confirmed, deleted tracks) → save result → Kalman filter prediction for the next frame.]

Figure 4.3: An overview of the tracking system used. The gray box named “Data association” is the part where experiments with improvement strategies are added.

Table 4.2: The values of the parameters used in the baseline tracker.

    Parameter   Value   Explanation
    γ_G         37      The gating threshold, see Section 2.2.
    c_ext       0.3     The external source cost, see Section 2.3.

The values for the gating threshold and the external source cost are given in Table 4.2. These values were chosen empirically.

To predict the future position of a track, a Kalman filtering framework is implemented as described in Section 2.4. For new tracks the filter is initialized with a high uncertainty since the motion is unknown from the start. If a track is missed by the detector in a frame, the Kalman filter skips the measurement update and only performs the prediction step.
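A minimal constant-velocity sketch of this behavior, for a 1D state (the state layout and the noise values are illustrative assumptions, not the thesis parameters):

```python
import numpy as np

def kalman_step(x, P, z=None):
    """One Kalman filter step for a constant-velocity state [pos, vel].

    If the measurement z is None (missed detection), only the
    prediction step is performed, as in the baseline tracker.
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity motion model
    Q = 0.01 * np.eye(2)                     # process noise (assumed value)
    H = np.array([[1.0, 0.0]])               # only the position is observed
    R = np.array([[0.1]])                    # measurement noise (assumed value)

    # Prediction step.
    x = F @ x
    P = F @ P @ F.T + Q
    if z is None:
        return x, P  # missed detection: skip the measurement update

    # Measurement update.
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P
```

Running the prediction alone repeatedly makes the covariance P grow, which is exactly why an undetected track becomes uncertain and starts to drift.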

The baseline tracker uses a three-stage M/N-logic. The first stage ensures that only tracks that are successfully detected two frames in a row are allowed to become confirmed tracks. The second stage is where the track is confirmed or deleted; this is a more forgiving stage since a few misses will not delete the track. The last stage is used for confirmed tracks: here the track is deleted if it is not seen in enough of the last couple of frames. The M/N-values used in the baseline tracker are given in Table 4.3.

Table 4.3: The values for the M/N-logic used in the baseline tracker.

              M   N
    Stage 1   2   2
    Stage 2   4   8


4.3 Data Association Improvement Strategies

In this section the improvement strategies for the baseline tracker are presented. First, the method for updating the track appearance is presented in Section 4.3.1. In Section 4.3.2 the modification of the affinity matrix in the data association step is presented. Section 4.3.3 proposes a new logic for tracks when there are frames with missed detections. This new logic uses the appearance cues to decide how long the track stays confirmed if it is missed in consecutive frames. If the self appearance similarity is sufficiently high the track can stay confirmed for longer; if it is low the pedestrian is considered lost and the track is deleted.

4.3.1 Updating the Track Appearance

Each track keeps its own feature vector representing the historical appearance of the track. This historical appearance is updated every time the track is successfully associated with a detection, using a forgetting factor β ∈ [0, 1] as described in Section 3.4. The updated histogram for the current frame t is given by

H_t = β · H_{t−1} + (1 − β) · H_new,    (4.4)

where H_new is the feature vector of the matched detection. If the track is not detected, the feature vector is not updated.

The forgetting factor filters out fast changes in the appearance since some of the old histogram is kept. If the appearance changes over time, for example when the lighting conditions change, the stored appearance slowly adapts. In the experiments the forgetting factor β is set to 0.5.
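As a sketch, the update in (4.4) for normalized histograms (the function name is illustrative):

```python
import numpy as np

def update_track_appearance(h_track, h_detection, beta=0.5):
    """Exponential forgetting update from equation (4.4).

    Both inputs are normalized histograms of equal length; the
    result stays normalized since beta + (1 - beta) = 1.
    """
    return beta * h_track + (1.0 - beta) * h_detection
```

Applied once per matched frame, this is an exponential moving average over the track's detection histograms.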

4.3.2 Data Association Using Color Information

The data association is extended by extracting the feature vector for all new detections and comparing them to the appearance of all confirmed and tentative tracks. The method for extracting features is described in Section 4.1, and the choice of feature is based on the feature evaluation described in Section 4.1.3.

Using the feature vectors, a new affinity matrix is constructed. The similarity between the feature vector F_{t_i} for track t_i and the feature vector F_{d_j} for detection d_j is given by

Λ(F_{t_i}, F_{d_j}) = 1 − D(F_{t_i}, F_{d_j}),    (4.5)

where D(F_{t_i}, F_{d_j}) is the Hellinger distance given by

D(F_{t_i}, F_{d_j}) = sqrt( 1 − (1 / sqrt( F̄_{t_i} F̄_{d_j} N² )) · Σ_u sqrt( F_{t_i}(u) · F_{d_j}(u) ) ).    (4.6)

In (4.6) the number of bins is denoted N, and the mean of a histogram F_k is denoted with a bar as F̄_k.
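A sketch of (4.5) and (4.6) for histograms stored as NumPy arrays (the same normalized Hellinger/Bhattacharyya form is used by e.g. OpenCV's histogram comparison):

```python
import numpy as np

def hellinger_distance(f1, f2):
    """Hellinger distance from equation (4.6); f1, f2 have equal length N."""
    n = f1.size
    bc = np.sum(np.sqrt(f1 * f2))              # Bhattacharyya coefficient term
    norm = np.sqrt(f1.mean() * f2.mean()) * n  # = sqrt(mean1 * mean2 * N^2)
    return np.sqrt(max(0.0, 1.0 - bc / norm))  # clamp against rounding noise

def appearance_affinity(f1, f2):
    """Similarity from equation (4.5): high for similar appearance."""
    return 1.0 - hellinger_distance(f1, f2)
```

Identical histograms give distance 0 and affinity 1.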


Table 4.4: The notations and their meanings used in the missed detection logic shown in Figure 4.4.

    Notation     Meaning
    N_occluded   Number of consecutive frames where the track has been occluded
    N_lost       Number of consecutive frames where the track is missed but not occluded
    Λ_self       Self similarity
    λ_good       Good self similarity threshold
    λ_bad        Bad self similarity threshold

A small Hellinger distance means high similarity. In the affinity matrix the affinity score should be small if the features are dissimilar; that is the reason for subtracting the Hellinger distance from one in (4.5).

The new appearance affinity matrix is combined with the IoU matrix described in Section 2.3 by multiplying them element-wise. This combined matrix is then used for solving the assignment problem. With this new affinity matrix the external cost is changed to c_ext = 0.22. This value was chosen empirically.
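The element-wise combination can be sketched as below. Zeroing entries under c_ext is a simplified stand-in for the external-source handling in the assignment problem, used here only for illustration:

```python
import numpy as np

def combine_affinities(iou, appearance, c_ext=0.22):
    """Element-wise combination of IoU and appearance affinities.

    iou, appearance: (num_tracks, num_detections) matrices in [0, 1].
    Entries below the external source cost c_ext are zeroed so they
    cannot win the assignment.
    """
    combined = iou * appearance
    combined[combined < c_ext] = 0.0
    return combined
```

Because the two affinities are multiplied, a pair must score well on both overlap and appearance to survive the external source cost.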

4.3.3 Missed Detection Logic

No detector is perfect and there will be frames where the detector fails to detect all pedestrians, for example due to occlusion or people being too close to each other. The M/N-logic handles this to some extent, since it allows confirmed tracks to be undetected for a few frames. The problem is that when a track is undetected for consecutive frames, the Kalman filter becomes uncertain and starts to drift.

The idea for improving these situations, where the detector gives false negatives, is to use the image information. This is done by checking if the Kalman-predicted box is similar to the historical appearance of the track. If the appearance is too dissimilar it is reasonable to assume that the track is not visible to the camera. Since the system is aware of other people in the scene, this is used to check whether one target possibly occludes another. To check if a track is occluded by another track, the IoU between the bounding boxes of all tracks is calculated. If the IoU with another track is above a certain threshold, both tracks are considered occluded. This is used in the missed detection logic shown in Figure 4.4.

[Figure 4.4 flowchart: on a missed detection, an occluded track increments N_occluded and is deleted once N_occluded > 7. A non-occluded track is judged by its self similarity Λ_self: if Λ_self > λ_good, a measurement update is made with the predicted box; if Λ_self < λ_bad, N_lost is incremented and the track is deleted once N_lost > 2; otherwise the track is kept for now. A successful detection resets N_lost and N_occluded to 0.]

Figure 4.4: A flowchart showing the missed detection logic for a track.
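The per-track logic can be sketched in code. The counter limits N_occluded > 7 and N_lost > 2 follow the flowchart, while the threshold values for λ_good and λ_bad are not specified in the text, so the defaults below are placeholders:

```python
def handle_missed_detection(track, occluded_by_other, self_similarity,
                            lam_good=0.8, lam_bad=0.5):
    """Missed detection logic for one track (cf. Figure 4.4).

    track: object with integer counters n_lost and n_occluded.
    Returns "delete" or "keep"; a "keep" with high self similarity
    would also feed the predicted box back as a measurement update.
    """
    if occluded_by_other:
        track.n_occluded += 1
        return "delete" if track.n_occluded > 7 else "keep"
    if self_similarity > lam_good:
        # Predicted box still looks like the track: usable as a
        # pseudo-measurement in the Kalman filter update.
        return "keep"
    if self_similarity < lam_bad:
        track.n_lost += 1
        return "delete" if track.n_lost > 2 else "keep"
    return "keep"  # ambiguous similarity: wait for more evidence
```

The intermediate band between λ_bad and λ_good neither updates nor penalizes the track.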


4.4 Multi-Object Tracking Evaluation

To obtain quantitative results of the overall tracking performance, the commonly used MOT benchmark [24] is used. On their webpage a number of publicly available sequences with given detections are provided. There are several different challenges available, named after the year they were introduced. The recent MOT20 sequences are more crowded with pedestrians, and the viewpoint of the camera is high. The earlier MOT17 is the same as MOT16 but with three different sets of public detections. The sequences in MOT17 are more similar to sequences captured by a car, since the view angle is lower. Therefore the MOT17 challenge is used for the evaluation. The detection quality is critical for the tracking results [8]; therefore the Faster R-CNN detections from MOT17 were used in the experiments, in order to achieve as good a tracking result as possible with modern detections.

The data from the MOT challenges is divided into training and testing sequences. For the testing sequences only the detections are available to the public; the training sequences have public ground truth. For parameter tuning the training data was divided into training and validation sets.

4.4.1 Metrics

When evaluating the performance of the tracker in the MOT challenge, a number of metrics are employed. Below follows a list of the most important metrics for multi-object tracking evaluation:

• Multiple Object Tracking Accuracy (MOTA) [32]:

MOTA = 1 − ( Σ_t (FN_t + FP_t + IDsw_t) ) / ( Σ_t GT_t )    (4.7)

• Mostly Tracked (MT) [24]: Percentage of ground truth tracks which keep their identity 80% or more of their respective life spans.

• Mostly Lost (ML) [24]: Percentage of ground truth tracks which keep their identity 20% or less of their respective life spans.

• Partly Tracked (PT) [24]: Percentage of ground truth tracks which are neither mostly tracked nor mostly lost.

PT = 1 − ML − MT    (4.8)

• Track Fragmentations (FM) [24]: Counts the number of times a track is interrupted (untracked). Each time a track is interrupted and later resumed, a fragmentation is counted.

• Identity Switches (IDsw) [24]: The number of times a track changes its identity. An identity switch is counted every time a ground truth track i is associated with track j while the last assignment was k ≠ j.


• IDF1-score (IDF1) [28]: Introduced to compensate for the limitation of the IDsw metric. IDsw only quantifies the number of identity switches, while IDF1 quantifies the identity precision and recall over the entire sequence.

• False Positive (FP): Number of bounding boxes in the tracking result where no ground truth exists.

• False Negative (FN): Number of missed ground truth bounding boxes in the tracking result.
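Equation (4.7) can be computed from per-frame counts, as in this sketch:

```python
def mota(fn, fp, idsw, gt):
    """Multiple Object Tracking Accuracy, equation (4.7).

    fn, fp, idsw, gt: per-frame counts of false negatives, false
    positives, identity switches, and ground truth boxes.
    """
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)
```

Note that MOTA can be negative when the total error count exceeds the number of ground truth boxes.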

4.4.2 Evaluation Environment

All evaluations are performed using the publicly available scripts created by A. Milan and E. Ristani [24]. The data format for tracking results, ground truth and detections used in the MOT challenge is simple comma-separated value (CSV) files, where each row represents one bounding box in one frame. An example of such a CSV file is found in Table 4.5. The bounding box consists of its top-left xy-coordinates together with its width and height, and each box has its unique identification number. This is the format the tracker needs to output in order to be evaluated on the MOT benchmark.

Table 4.5: Example of a CSV file for the tracker results. The three rightmost columns are −1 since these values are ignored in the tracker evaluation.

Frame ID xLeft yTop Width Height Conf. Class Vis.

6, 1, 29.81, 135.66, 71.49, 218.46, -1, -1, -1

6, 0, 145.91, 154.04, 60.60, 185.81, -1, -1, -1

7, 1, 31.09, 134.61, 72.50, 221.49, -1, -1, -1

7, 0, 155.60, 155.75, 59.90, 183.69, -1, -1, -1
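Writing results in this format can be sketched as follows (field order per Table 4.5; the function name is illustrative and the csv module is standard library):

```python
import csv
import io

def write_mot_rows(tracks, out):
    """Write tracker output rows in the MOT CSV format.

    tracks: iterable of (frame, track_id, x_left, y_top, w, h) tuples.
    The last three columns are -1 since they are ignored in evaluation.
    """
    writer = csv.writer(out)
    for frame, tid, x, y, w, h in tracks:
        writer.writerow([frame, tid, x, y, w, h, -1, -1, -1])

# Example: two boxes in frame 6, written to an in-memory buffer.
buf = io.StringIO()
write_mot_rows([(6, 1, 29.81, 135.66, 71.49, 218.46),
                (6, 0, 145.91, 154.04, 60.60, 185.81)], buf)
```

In practice `out` would be a file opened with `newline=""`, one result file per sequence.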


5 Results

In this chapter the results are presented. Section 5.1 presents the results from the feature evaluation; following these results, the HS_V feature is used to improve the baseline tracker. In Section 5.2 examples of the tracking results in specific scenarios are shown. Finally, the quantitative results using the MOT benchmark are presented in Section 5.3.

5.1 Evaluation of Features

In Figure 5.1 the uniqueness ratio described in Section 4.1.3 is shown for each of the five pedestrians in Figure 4.2. Figures 5.1a and 5.1b show the difference between using the entire bounding box and using the smaller ROI described in Section 4.1.1. The results show that the smaller bounding box provides higher discriminatory power overall. The best feature considering the discriminatory power seems to be the HS_V feature.

5.2 General Tracking Results

To show the general tracking results, specific scenarios are selected from some of the MOT-challenge sequences. The scenarios are chosen to show the difference between the baseline tracker and the improved tracker. This section is divided into two parts. First, the results with the improved data association are found in Section 5.2.1. In Section 5.2.2, tracking results showing the influence of the missed detection logic are presented. In these experiments the appearance feature used is HS_V, following the results in Section 5.1.

Figure 5.1: Comparison of four different color features (HS_V, SWH_H, COLOR_NAMES, RGB) on five pedestrians from the pedestrian database. The ratio is given by (4.2). Example images from each of the pedestrians are shown in Figure 4.2. (a) shows the uniqueness ratio using the entire bounding box; (b) shows the uniqueness ratio using the region of interest given by (4.1). It is clear that the smaller ROI, used in (b), produces more unique features. The ratio for the higher dimensional features, HS_V and RGB, is consistently higher. For the more compact features CN seems promising.

5.2.1 Improved Data Association

This section presents three different scenarios showing the impact of using the new affinity matrix for data association. In general, the improved data association helps when two or more pedestrians are close to each other and there are missed detections. In these situations the baseline tracker cannot tell which pedestrian the detector missed, which leads to tracks drifting from one pedestrian to another. Examples of such scenarios are shown in Figure 5.2 and Figure 5.3. In both scenarios the improved tracker is capable of associating the detection with the correct track.

A scenario that shows the drawbacks of the new data association is shown in Figure 5.4. Here the IoU affinity is just above the external source cost c_ext. When multiplied with the appearance affinity, the resulting total affinity is lower than c_ext. This leads to a new track being created instead.

5.2.2 Improvement with Missed Detection Logic

An example of the tracking result when the missed detection logic is added is shown in Figure 5.5. In this example, the pedestrian in focus is occluded by a passing car and therefore missed by the detector. This results in a track that drifts off the pedestrian in question, both in the baseline tracker and in the improved data association tracker. With the missed detection logic the drifting track is successfully removed after two frames. Without this extra logic the track is allowed to drift in all four frames, resulting in many false positives.


Figure 5.2: Top row: baseline. Middle row: improved data association. Bottom row: detections. In frame 3 the baseline matches the blue track to the detection with an IoU of 0.33, which is larger than c_ext. When integrating the appearance, the detection has an appearance affinity of 0.42 to the blue track. This results in a combined measure of 0.14, which is not enough for a match. Instead a new track, the green track, is created. (The scenario is taken from the MOT16-09 sequence. The frames are indexed from 1 in the figure, but the tracker has run 170 frames before the scenario.)


Figure 5.3: Top row: baseline. Middle row: improved data association. Bottom row: detections. In this scenario the detections are from the girl represented by the red track. The man behind her, the green track, is missed in all frames shown here. In frame 2, for the baseline tracker, the green box matches the detection: the green box has IoU = 0.78 with the detection while the red box only has IoU = 0.76. Using the appearance, on the other hand, the girl with the red box keeps her identity. In frame 2 the red box has appearance score 0.87 and the green box 0.45, so the red box wins the association with the detection. That the undetected green track starts to drift is a separate problem. (The scenario is taken from the MOT16-05 sequence. The frames are indexed from 1 in the figure, but the tracker has run 318 frames before the scenario.)


Figure 5.4: Top row: baseline. Middle row: improved data association. Bottom row: detections. This scenario shows an example of the updated data association making the wrong association. The man in the black suit is detected in all frames. In frame 3 the IoU with the detection is 0.32, which is enough for the baseline. Adding the color affinity, in this case only 0.65, the combined affinity measure is 0.2. This is smaller than c_ext in the improved tracker, which makes the tracker lose the man in black and start a new track for him in frame 4. (The scenario is taken from the MOT16-05 sequence. The frames are indexed from 1 in the figure, but the tracker has run 282 frames before the scenario.)


Figure 5.5: Top row: baseline. Upper middle row: improved data association. Lower middle row: with missed detection logic. Bottom row: detections. In this scenario the pedestrian is occluded by a passing car and thus missed by the detector from frame 2 onward. For the baseline and the improved data association the tracks start to drift. With the missed detection logic added, the drifting track is successfully removed already in frame 3, and thus the number of false positives is lower. (The scenario is taken from the MOT16-05 sequence. The frames are indexed from 1 in the figure, but the tracker has run 40 frames before the scenario.)


Table 5.1: Results on the MOT benchmark for the different color features used in data association. The best result for each measure is indicated with bold green font. The arrow beside the measure indicates whether higher (↑) or lower (↓) is better.

Method MOTA ↑ MT ↑ ML ↓ IDF1 ↑ IDsw ↓ FM ↓ FP ↓ FN ↓

Baseline 52.91 16.24 32.48 54.92 158 152 817 9234

HS_V 52.93 15.81 31.62 55.07 144 146 836 9224

SWH_H 52.80 15.81 31.62 54.97 155 151 837 9240

CN 52.96 16.24 32.05 55.14 146 149 823 9229

RGB 52.96 15.81 31.62 55.12 145 148 828 9225

Table 5.2: Results on the MOT benchmark for the baseline tracker and the additions. First only the data association is changed (+ HS_V), then the missed detection logic is added to that (+ Miss logic). The best result for each measure is indicated with bold green font. The arrow beside the measure indicates whether higher (↑) or lower (↓) is better.

Method MOTA ↑ MT ↑ ML ↓ IDF1 ↑ IDsw ↓ FM ↓ FP ↓ FN ↓

Baseline 52.9 16.2 32.5 54.9 158 152 817 9234

+ HS_V 52.9 15.8 31.6 55.1 144 146 836 9224

+ Miss logic 54.0 15.8 33.3 55.2 139 140 538 9303
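The MOTA column in the tables aggregates the error counts shown in the same rows. For reference, a minimal implementation of the standard CLEAR-MOT definition follows; the total number of ground-truth annotations for the benchmark sequences is not listed in the tables, so it is left as a required parameter rather than filled in.

```python
def mota(fp, fn, idsw, num_gt):
    """CLEAR-MOT Multi-Object Tracking Accuracy.

    fp, fn, idsw: false positives, false negatives and identity switches
    summed over the sequence. num_gt: total number of ground-truth object
    annotations over the sequence (must be supplied by the benchmark).
    """
    return 1.0 - (fp + fn + idsw) / num_gt
```

Note that lowering FP (as the missed detection logic does) raises MOTA even if FN increases slightly, which matches the pattern in Table 5.2.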

5.3 Multi-Object Tracking Evaluation

In this section the results on the MOT benchmark are presented. The results are divided into the data association improvements in Section 5.3.1 and the addition of the missed detection logic in Section 5.3.2.

5.3.1 Improved Data Association

A comparison of the four different color features and the baseline tracker is shown in Table 5.1. These results again show that HS_V achieves the best results in most metrics; in particular, the largest improvement in the number of IDsw is achieved by the HS_V feature.
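The exact definitions of the HS_V, SWH_H, CN and RGB features are given in the method chapter; as an illustration of how such a color feature can enter the association cost, the sketch below computes a normalized hue-saturation histogram from a pedestrian crop and compares it to a track's stored histogram with the Bhattacharyya distance. Bin counts, value ranges (OpenCV-style HSV) and the distance choice are illustrative assumptions, not the thesis's exact parameters.

```python
import numpy as np

def hs_histogram(patch_hsv, bins=(8, 8)):
    """Normalized 2-D hue-saturation histogram of a pedestrian crop.

    patch_hsv: H x W x 3 array with hue in [0, 180) and saturation
    in [0, 256) (OpenCV-style ranges, an assumption for this sketch).
    """
    h = patch_hsv[..., 0].ravel()
    s = patch_hsv[..., 1].ravel()
    hist, _, _ = np.histogram2d(h, s, bins=bins,
                                range=[[0, 180], [0, 256]])
    hist /= max(hist.sum(), 1e-12)  # normalize to a probability mass
    return hist.ravel()

def bhattacharyya(p, q):
    """Bhattacharyya distance between normalized histograms.

    0 means identical distributions, 1 means no overlapping bins.
    """
    bc = np.sum(np.sqrt(p * q))
    return np.sqrt(max(1.0 - bc, 0.0))
```

In a tracker, this distance would be blended with a motion-based cost (e.g. gated Mahalanobis distance) before solving the assignment problem between tracks and detections.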

5.3.2 Improvement with Missed Detection Logic

The results of adding the missed detection logic to the tracker using HS_V for data association are shown in Table 5.2. The most noteworthy result is the large decrease in false positives while keeping competitive values on the other metrics. This results in a higher MOTA score than any other method evaluated in this work.
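The behavior seen in Figure 5.5 and Table 5.2 can be sketched as a simple per-frame rule: when a track has no matched detection, the image content at the track's predicted position is compared against the stored appearance model, and the track is dropped after a few consecutive mismatches instead of being allowed to drift. The threshold, the miss budget, and the function names below are illustrative assumptions, not the thesis's exact logic.

```python
def update_coasting_track(miss_count, appearance_dist,
                          max_misses=2, appearance_thresh=0.5):
    """One step of a simple missed-detection rule (illustrative sketch).

    miss_count: consecutive frames in which the predicted region did not
    match the track's appearance model.
    appearance_dist: distance in [0, 1] between the stored appearance and
    the image content at the predicted position (e.g. a histogram distance).

    Returns (keep, new_miss_count): keep is False when the track should be
    removed, which stops drifting tracks from generating false positives.
    """
    if appearance_dist > appearance_thresh:
        miss_count += 1   # predicted region no longer looks like the target
    else:
        miss_count = 0    # appearance still matches: keep coasting
    return miss_count <= max_misses, miss_count
```

With `max_misses=2`, a track occluded by a passing car (as in Figure 5.5) would be removed on the third consecutive appearance mismatch, at the cost of a few extra false negatives, consistent with the FP/FN trade-off in Table 5.2.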


6 Discussion

This chapter contains a discussion of the results in Section 6.1 and the chosen methods in Section 6.2.

6.1 Results

The improvements made to the baseline tracker give promising results. Looking at the MOT benchmark results, some metrics are improved more than others. In the specific scenarios found, the improved tracker handles difficult situations in the desired way. Looking through the sequences, only one situation was found where the updated tracker makes a mistake that the baseline tracker handles correctly. The MOT benchmark is used to obtain quantitative results, and the metrics show that the updates indeed improved the tracker to some extent.

This work did not focus on the re-identification task after longer occlusions. This is an important part of achieving higher results on the MOT challenge, since identity consistency is valued highly. The decision not to focus on this was made at the beginning of this work, because it did not seem very important for the application and problem formulation. For the tracking system in this work, keeping the correct identity of all pedestrians throughout the whole sequence is not the main priority. It is more important, for the application, to correctly track the pedestrians that pose a possible collision risk, i.e. those that are visible to the camera, so that the ego car can make decisions on what actions need to be taken. Whether the person that reappears from behind an obstacle is the same one that disappeared behind it a few minutes earlier does not matter from the point of view of the ego car. On the other hand, keeping tracks alive longer and introducing an “occluded” state could result in a faster confirmed track again after the occlusion. This is because the tracker was expecting the occluded pedestrian to reappear and thus not
