
Evaluation of Multiple Object Tracking in Surveillance Video


Evaluation of Multiple Object Tracking in Surveillance Video

Axel Nyström

Axel Nyström
LiTH-ISY-EX--19/5245--SE

Supervisor: Anderson Tavares, ISY, Linköping University
            Niclas Appleby, National Forensic Centre
Examiner: Michael Felsberg, ISY, Linköping University

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Axel Nyström

Multiple object tracking is a process in which several objects are assigned unique and consistent identities throughout a video sequence. A popular approach to object tracking is a technique called tracking-by-detection. Tracking-by-detection is a two-stage process: an object detection algorithm first finds objects in every frame of a video sequence; the detected objects are then associated with already tracked objects by a tracking algorithm. One of the main aims of this thesis is to investigate how different object detection algorithms perform on the surveillance video on which the National Forensic Centre wants to use multiple object tracking. The thesis also examines the correlation between the performance of the object detection algorithm and the performance of the complete tracking-by-detection system. Finally, it investigates how the use of visual descriptors in the tracking algorithm can affect the accuracy of a tracking-by-detection system. Results presented in this work show that the capacity of the object detection algorithm is a strong indicator of how the whole tracking-by-detection system performs. The work also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase the accuracy of the whole system.


Multiple object tracking is the process of assigning unique and consistent identities to objects throughout a video sequence. A popular approach to multiple object tracking, and object tracking in general, is to use a method called tracking-by-detection. Tracking-by-detection is a two-stage procedure: an object detection algorithm first detects objects in a frame; these objects are then associated with already tracked objects by a tracking algorithm. One of the main concerns of this thesis is to investigate how different object detection algorithms perform on surveillance video supplied by the National Forensic Centre. The thesis then goes on to explore how the stand-alone performance of the object detection algorithm correlates with the overall performance of a tracking-by-detection system. Finally, the thesis investigates how the use of visual descriptors in the tracking stage of a tracking-by-detection system affects performance.

Results presented in this thesis suggest that the capacity of the object detection algorithm is highly indicative of the overall performance of the tracking-by-detection system. Further, this thesis also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase the performance of the whole system.


I would like to thank NFC, and specifically Niclas Appleby, for giving me the opportunity to work on this thesis as well as supplying me with adequate hardware. I would also like to thank my supervisor Anderson Tavares and my examiner Michael Felsberg for providing me with great feedback. Finally, a shout-out to my coffee break mates — you made writing this thesis enjoyable.

Linköping, 2019
Axel Nyström


Notation

1 Introduction
  1.1 Background
    1.1.1 Object Detection
    1.1.2 Multiple Object Tracking
  1.2 Problem formulation
  1.3 Limitations

2 Theory and Related Work
  2.1 Tracking-by-Detection
  2.2 Image Classification Networks
  2.3 Object Detection Algorithms
    2.3.1 R-CNN
    2.3.2 Fast R-CNN
    2.3.3 Faster R-CNN
    2.3.4 Region Proposal Networks
    2.3.5 Mask R-CNN
    2.3.6 YOLO
    2.3.7 YOLOv2
    2.3.8 YOLOv3
    2.3.9 Feature Pyramid Network
    2.3.10 Single Shot Detector
    2.3.11 RetinaNet
  2.4 Tracking Algorithms
    2.4.1 SORT
    2.4.2 Deep SORT

3 Method
  3.1 Data Annotation
  3.2 Evaluation
    3.2.1 Classification of Predicted Bounding Boxes
    3.2.2 Object Detection Evaluation
    3.2.3 Object Tracking Evaluation
  3.3 Test Environment
    3.3.1 Testing Object Detection Algorithms
    3.3.2 Testing Object Tracking Algorithms
    3.3.3 Hardware
  3.4 Algorithms and Implementations
    3.4.1 Deep Learning Libraries
    3.4.2 Implementations

4 Results
  4.1 Object Detection Results
  4.2 Object Tracking Results

5 Discussion
  5.1 Object Detection
  5.2 Object Tracking

6 Conclusions
  6.1 Future Work
  6.2 Ethics

Bibliography


Abbreviations

Abbreviation   Meaning
CNN            Convolutional Neural Network
SVM            Support Vector Machine
NFC            National Forensic Centre
RoI            Region of Interest
IoU            Intersection over Union
YOLO           You Only Look Once
SSD            Single Shot Detector
SORT           Simple Online and Realtime Tracking
RPN            Region Proposal Network
FPN            Feature Pyramid Network
CPU            Central Processing Unit
GPU            Graphics Processing Unit
MOT            Multiple Object Tracking
CSV            Comma Separated Value
TP             True Positive
FP             False Positive
TN             True Negative
FN             False Negative


1 Introduction

Video tracking is an area of computer vision that deals with the localization of moving objects in video. There are many applications of video tracking in fields such as robotics, sports analysis, and video surveillance. These applications often require multiple objects to be tracked at the same time, which is referred to as multiple object tracking.

A popular approach to object tracking is to use a method called tracking-by-detection. Tracking-by-detection uses an object detection algorithm to detect objects present in a frame. These objects are then tracked by associating objects in the current frame with objects from previous frames using a tracking algorithm. Having a reliable method for object detection is crucial since the tracking algorithm is dependent on objects being detected in each frame. Lately, object detection algorithms based on convolutional neural networks have been able to achieve greater accuracy than traditional object detection methods. This improvement in object detection accuracy has facilitated the use of tracking-by-detection methods for multiple object tracking.

1.1 Background

National Forensic Centre (NFC) is an organization within the Swedish police authority that is responsible for forensics for the Swedish police. The section for information technology at NFC handles, among other things, forensic image analysis. NFC wants to investigate how video tracking in surveillance cameras can be used to support criminal investigations and to ease the work of surveillance camera operators.

NFC has previously hosted a student project related to video tracking; that project was part of the course Image and Graphics, Project Course CDIO at Linköping University (see Appendix A). The result of that project was a tracking-by-detection system, which uses a detection algorithm called YOLO [33] to detect objects and a tracking algorithm called Deep SORT [44] to track detected objects. The system is also able to perform person re-identification using an algorithm called AlignedReID [45]. Person re-identification is the practice of identifying the same person in multiple different cameras. This system is relevant since it is used as the test environment throughout this thesis.

1.1.1 Object Detection

Object detection is the process of localizing and classifying objects present in an image. Today's state-of-the-art object detection algorithms typically utilize convolutional neural networks in some way. There are two main categories of such object detection algorithms: single-stage detectors and two-stage detectors. The major difference between the two categories is that two-stage object detectors first find regions of interest in the image and then classify the regions separately, whereas single-stage detectors predict bounding boxes and classify objects simultaneously [5]. This generally gives single-stage detectors increased speed at the cost of accuracy when compared to region-based object detectors.

1.1.2 Multiple Object Tracking

The tracking stage of a tracking-by-detection method can be seen as solving two different tasks. First, future positions of tracked objects are predicted; this is commonly done using methods such as the Kalman filter [19]. Next, objects detected in a new frame are associated with already tracked objects based on the predicted future positions of the already tracked objects. If there are as many detections as there are already tracked objects, this association can be seen as an assignment problem, which can be solved using the Hungarian method [21]. Deep SORT [44], which is used in the system already implemented at NFC, predicts future positions using a Kalman filter and then solves the assignment problem with the Hungarian method. In addition to this, Deep SORT also uses a visual descriptor to improve the accuracy of the tracking. This visual descriptor is a 128-dimensional vector obtained by feeding an object's bounding box into a convolutional neural network. The convolutional neural network used has been trained to distinguish pedestrians from each other, which means that it is especially suited to tracking people.
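To make the association step concrete, the following sketch builds an IoU-based cost matrix between predicted track boxes and new detections and solves the assignment problem with scipy.optimize.linear_sum_assignment. It is illustrative only (function and variable names are hypothetical), not the code used in the NFC system.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # solves the assignment problem

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_tracks, detections, iou_threshold=0.3):
    """Match Kalman-predicted track boxes to new detections.

    Returns (track_index, detection_index) pairs whose IoU exceeds the
    threshold; unmatched tracks and detections are handled elsewhere
    (e.g. track deletion or creation of new tracks).
    """
    cost = np.zeros((len(predicted_tracks), len(detections)))
    for t, track_box in enumerate(predicted_tracks):
        for d, det_box in enumerate(detections):
            cost[t, d] = -iou(track_box, det_box)  # minimize negative IoU
    rows, cols = linear_sum_assignment(cost)
    return [(t, d) for t, d in zip(rows, cols) if -cost[t, d] >= iou_threshold]
```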

1.2 Problem formulation

NFC is interested in investigating how the accuracy of their tracking-by-detection system can be improved. Since tracking-by-detection systems are limited by the performance of the detection algorithm, it is relevant to examine how the choice of detection algorithm affects the accuracy and speed of the system. This trade-off between speed and accuracy has been studied in papers such as [17], where several modern object detection algorithms are compared.


Another aspect of the tracking-by-detection method that could affect the overall accuracy is the use of visual descriptors in the tracking stage. Deep SORT [44], which is used in the existing system at NFC, is an extension of the algorithm SORT [4], which does not use any appearance information in the tracking stage. Comparing SORT to Deep SORT would give insights into how tracking performance and, by extension, the whole system's performance is affected by the use of visual descriptors.

All quantitative analyses are done on the same kind of surveillance video data on which NFC plans to use the system. The research questions should also be seen in this context: they do not seek to answer the general case but rather the specific case where NFC's data is used. The research questions this thesis aims to answer are the following:

• How does the choice of detection algorithm affect a tracking-by-detection method that is used to track people?

• How does the use of visual descriptors in the tracking stage of a tracking-by-detection method affect the accuracy?

1.3 Limitations

The major limitation of this thesis is that NFC does not have any annotated data. Some data is annotated as a part of this thesis in order to be able to quantitatively measure performance. This annotated data is, however, not nearly enough to train object detection algorithms with and is therefore only used as test data. Instead, pre-trained weights supplied with the algorithm implementations are used throughout this thesis.

This thesis also limits the number of tested tracking algorithms to two: SORT and Deep SORT. Other algorithms could have been tested, but SORT and Deep SORT were chosen because they are so similar apart from the use of visual descriptors in Deep SORT. Testing more algorithms would also have further broadened the scope of the thesis and made it more time-consuming.


2 Theory and Related Work

This chapter presents the theory, related work, and key concepts relevant to this thesis. Theoretical concepts are purposefully presented at a rather abstract level so that the chapter does not become overly lengthy. This means that the thesis does not delve into the basics of computer vision or deep learning, and it is therefore beneficial for the reader to have a basic understanding of both. If the reader feels the need to brush up on these subjects, [12] is a good start for deep learning and [41] is recommended for an overview of computer vision. Further, the reader may also need to read up on some general machine learning concepts, in which case [13] is suggested.

The chapter begins with an introduction to tracking-by-detection and a short presentation of some image classification networks. It then continues with descriptions of different object detection algorithms. The ambition is to present the object detection algorithms in such a way that it is clear how they differ from each other. Thus, details about the training procedure have mostly been omitted unless the training procedure is a central characteristic of the algorithm. Algorithms are presented in sequence if they are related to each other; otherwise the algorithms are presented in the chronological order in which they were published. The chapter finishes with descriptions of the tracking algorithms used in this thesis.

2.1 Tracking-by-Detection

Multiple object tracking is the task of assigning consistent and unique identities to multiple objects in a video sequence. This thesis examines an object tracking technique called tracking-by-detection. Tracking-by-detection is a two-stage process: an object detection algorithm first detects objects present in a frame; these objects are then associated with already tracked objects by a tracking algorithm [27]. Normally, the object detection algorithm and the tracking algorithm are completely separated from each other and can therefore be analyzed individually.

Object detection is the process of detecting particular classes of objects in an image; examples of classes are things such as people or bags. The aim of an object detection algorithm is to both localize and classify objects belonging to any of the sought-after classes [5]. Thus, for each detected object, an object detection algorithm produces estimates of the position, size, and class of the object. The position and size of detected objects are often represented by a bounding box, which is a rectangular box encompassing the object. The extent of a detected object can also be defined by a segmentation mask, which is a pixel-level mask of the object [41]. Object detection has developed significantly in the last decade due to advances in the closely related field of image classification. This progress is owed to breakthroughs [20] in how convolutional neural networks (CNNs) can be utilized to classify images. The object detection algorithms considered in this thesis usually consist of a CNN designed for image classification with additional algorithm-specific structure around the CNN. The CNN is referred to as the backbone of the algorithm and the algorithm-specific structure is called the meta architecture. This thesis will, as is custom, identify object detection algorithms by their meta architecture. CNN-based object detection algorithms can be split into two different groups: single-stage and two-stage detectors [5]. Two-stage detectors first generate possible bounding boxes by segmenting an image into regions of interest; these regions are then separately classified by a CNN in a second stage. Single-stage detectors produce estimates of both bounding boxes and classes in a single forward pass of an image through a CNN. Traditionally, two-stage detectors have achieved higher accuracy at the cost of speed compared to single-stage detectors. However, the recently introduced loss function Focal loss [24] has made single-stage detectors able to near two-stage detectors in terms of accuracy. The trade-off between speed and accuracy is a major design choice and has been studied in papers such as [17].

The tracking algorithm in a tracking-by-detection framework is responsible for assigning unique identities to tracked objects and for making object associations between frames. This thesis's main focus is on object detection algorithms and only two different tracking algorithms will be considered: SORT [4] and Deep SORT [44]. SORT stands for Simple Online and Realtime Tracking; it is a deliberately simple tracker that predicts future positions of objects and makes frame-to-frame associations using the Hungarian method [21]. Deep SORT is an extension of SORT that incorporates appearance information when doing object associations between frames.

2.2 Image Classification Networks

As described in section 2.1, the object detection algorithms studied in this thesis consist of an algorithm-specific meta architecture and a backbone, with backbones being CNNs originally constructed for image classification. This thesis focuses on the meta architectures, and thus only a brief explanation of the different backbones is provided. The following list is a short introduction to the relevant backbones and how they compare to each other in terms of accuracy and complexity. Complexity is measured in floating-point operations (FLOPs) per forward pass and is used as an indication of how fast a network is.

• VGG-16: A convolutional neural network with 16 layers that performed well in the 2014 ILSVRC challenge [38]. Forward passing an image with resolution 224 × 224 pixels requires roughly 15 × 10^9 FLOPs [15]. It achieves 71.93% accuracy on the ImageNet validation dataset [15].

• ResNet-50: A residual network with 50 layers; it achieves 77.15% accuracy on the ImageNet validation dataset [15]. Forward passing a 224 × 224 image requires 3.8 × 10^9 FLOPs [15].

• ResNet-101: A residual network with 101 layers achieving 78.25% accuracy on ImageNet [15]. Forward passing a 224 × 224 image requires 7.6 × 10^9 FLOPs [15].

• Darknet-53: A 53-layer network with residual layers, designed specifically for use in YOLOv3 [33]. On ImageNet its accuracy of 77.2% is similar to ResNet-101's 77.15%. It also requires roughly the same number of operations to perform a single forward pass; forward passing an image with resolution 256 × 256 pixels requires roughly 18.7 × 10^9 FLOPs. Darknet-53 is however significantly faster since it utilizes the GPU more effectively [33].

2.3 Object Detection Algorithms

This section provides a theoretical introduction to the object detection algorithms tested in the experimental part of this thesis. The tested algorithms' predecessors are also explained since this makes it easier to understand the tested algorithms.

2.3.1 R-CNN

R-CNN, short for Regions with CNN features, is a method for object detection introduced in [10]. The method consists of three main parts: a region proposal method, a convolutional neural network, and a set of support vector machines (SVMs). Figure 2.3.1 below shows the interaction of the main parts of R-CNN. First, the region proposal method segments an image into category-independent regions. This generates approximately 2000 regions per image. After segmenting the image, each region is warped to a fixed size to fit the required input size of the CNN. Next, the 2000 warped regions are separately fed through the CNN and a feature vector is extracted for each region. The feature vector is then classified by a set of linear SVMs, where each SVM is trained to classify one specific class. Finally, given the class predicted by the SVMs, ridge regression is used to improve the predicted shape of the bounding box. When all regions are scored, non-maximum suppression is applied to remove predicted bounding boxes that overlap with higher-scoring predictions. R-CNN is not restricted to any specific segmentation method or any specific CNN architecture. In [10], a segmentation method called selective search [43] is used and results are demonstrated for the CNN architectures presented in [20] and [38].

Figure 2.3.1: Schematic of the R-CNN pipeline.

The authors of R-CNN also showed that supervised pre-training on a similar problem is an effective way to initialize the weights of a CNN. In [10], the initial weights of the CNN were obtained by pre-training the CNN to perform image classification on data from ILSVRC2013 [37]. The CNN was then fine-tuned to perform object detection by training it on the Pascal VOC 2012 [8] dataset, which is an object detection dataset. This is a form of transfer learning that has been shown to be an effective approach when adapting CNNs to domains where training data is sparse [42].
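As a concrete illustration of the non-maximum suppression step mentioned above, the following sketch (hypothetical function names, not R-CNN's actual code) keeps the highest-scoring boxes and discards lower-scoring boxes that overlap them too much.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of (x1, y1, x2, y2) corners
    scores: (N,) array of detection confidences
    Returns the indices of the boxes that are kept.
    """
    order = np.argsort(scores)[::-1]        # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        # Drop boxes that overlap the kept box more than the threshold
        order = rest[iou < iou_threshold]
    return keep
```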

2.3.2 Fast R-CNN

A major drawback of the R-CNN method is that it is slow. This is primarily because each region proposal is passed through the CNN separately, which is time-consuming. In order to increase the speed, an object detection method called Fast R-CNN was introduced by Girshick in [9]. Fast R-CNN increases the speed of object detection mainly by passing the image forward through the CNN only once, instead of once for every region as R-CNN does.


As in R-CNN, a region proposal method first segments the image into category-independent regions, creating regions of interest (RoIs). The whole image is then processed by a CNN that produces convolutional feature maps of the image. Next, for each region proposal, an RoI pooling layer that uses spatial pyramid pooling [14] is applied to the feature maps. This converts each RoI into a fixed-size vector. The feature vector is then processed by fully connected layers that branch out into two different output layers. One of the output layers is a softmax layer that produces probability estimates for the object classes. The other layer is a bounding box regressor that outputs refined estimates of the bounding boxes for each of the object classes. Figure 2.3.2 shows how an image is processed by Fast R-CNN.

Figure 2.3.2: Schematic of the Fast R-CNN pipeline.

Another advantage of Fast R-CNN is that it can be trained in a single stage, as opposed to R-CNN, which requires its modules to be trained separately. Fast R-CNN accomplishes this by using a single loss function that accounts for both classification and bounding box regression simultaneously. This loss function enables Fast R-CNN to jointly train classification and bounding box regression, and thus the whole network can, except for the region proposals, be trained end-to-end.

2.3.3 Faster R-CNN

Fast R-CNN improved the speed of object detection by forward passing an image through a CNN only once, instead of forward passing every region of interest in an image. For Fast R-CNN, the bottleneck instead lies in the image segmentation methods, which are usually implemented on the CPU. To resolve this, Ren et al. proposed a method called Faster R-CNN in [35]. Faster R-CNN removes the need for CPU computations by introducing the idea of Region Proposal Networks (RPNs). The RPN is a region proposal method that is described further in section 2.3.4. For each position, the RPN makes several predictions relative to a fixed number of reference boxes; these reference boxes are called anchors. Anchors can be thought of as suggested bounding boxes for each sliding-window location. In [35], the anchors are created at 3 scales and with 3 different aspect ratios, giving a total of 9 different anchors for each location. This means that the RPN produces 9 bounding boxes at every sliding-window location, one for each anchor.
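The sketch below illustrates how 3 scales × 3 aspect ratios yield 9 anchor boxes centered on one sliding-window position. The specific scale and ratio values are hypothetical examples, not necessarily those used in [35].

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate 9 anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and a width/height ratio given by `ratios`.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # width/height = r while keeping area s * s
            h = s / np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(320, 240).shape)  # (9, 4): one box per scale/ratio combination
```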

Figure 2.3.3: The Faster R-CNN network.

The regions generated by the RPN are then used as region proposals in Fast R-CNN, which was described in section 2.3.2. By using an RPN, Faster R-CNN removes the time-consuming image segmentation that was needed in Fast R-CNN. The speed is further increased by using a single CNN for both the RPN and Fast R-CNN. This also means that Faster R-CNN can be trained end-to-end by first training the RPN to propose regions and then using the region proposals to train Fast R-CNN.

2.3.4 Region Proposal Networks

Region Proposal Networks generate region proposals by sliding a small network over convolutional feature maps. At each location, the small network takes a window of the convolutional feature maps and converts it to a feature vector. This feature vector is then fed into two different fully connected layers: one layer performs bounding box regression and the other is a classification layer that predicts an objectness score. The objectness score is a prediction of how likely it is that the predicted bounding box contains an object compared to just being background.

2.3.5 Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN that was presented by He et al. in [16]. In addition to object detection, Mask R-CNN is also able to perform object instance segmentation. Segmentation is implemented by adding a third branch to Faster R-CNN; this branch outputs an object mask for each detected object. To improve segmentation, a method called RoIAlign is introduced in order to extract a more precise feature map for each RoI. RoIAlign computes exact values of the feature map using bilinear interpolation instead of quantizing the feature map. The authors of [16] found that Mask R-CNN achieves a higher average precision than Faster R-CNN in object detection. This was shown to be partially due to the use of RoIAlign and partially due to the multi-task loss used to train Mask R-CNN. Mask R-CNN is trained with a multi-task loss function that simultaneously accounts for classification, bounding box regression, and object segmentation.

2.3.6 YOLO

Redmon et al. introduced a novel approach to object detection in [34] called YOLO, You Only Look Once. Unlike R-CNN and its successors, YOLO does not use any region proposal method and instead uses a single CNN to predict both bounding boxes and classes.

In YOLO, an input image is first split into an S × S grid. Each grid cell is then responsible for predicting B bounding boxes as well as a confidence score for every bounding box. The confidence score is calculated as Pr(Object) · IoU^gt_pred, where Pr(Object) is the predicted probability that the box contains an object and IoU^gt_pred is the estimated intersection over union (IoU) between the predicted box and a ground truth box. For each grid cell, C object class probabilities are also predicted; these probabilities are conditioned on the cell containing an object. The predicted boxes and class probabilities are then combined into a single score for each class and box. Equation 2.3.1 is taken from the introduction of YOLO in [34] and shows how the class predictions and box predictions are combined. As in the original paper [34], Pr(Class_i) is used as a simplified notation for Pr(Class_i, Object).

    Pr(Class_i | Object) · Pr(Object) · IoU^gt_pred = Pr(Class_i) · IoU^gt_pred    (2.3.1)

The score accounts both for the probability that the box contains class i, Pr(Class_i), and for how well the predicted box is estimated to fit a ground truth box, IoU^gt_pred. Figure 2.3.4 shows how an image is split into a grid and how the cell with the red dot centered in it predicts two different bounding boxes. The predicted bounding boxes are then combined with class probabilities, which are also obtained from the image grid, to produce the final object detections. The illustration is kept simple so that it is easier to understand; in reality there would be many more objects predicted by the grid.

The above-described procedure is realized by a custom CNN architecture inspired by GoogLeNet [40]. The custom CNN consists of 24 convolutional layers with 2 fully connected layers at the end. Each predicted bounding box is defined by 5 values: (x, y, width, height) and a confidence score. This means that the predictions output from the CNN are represented by an S × S × (B · 5 + C) tensor, where S is the grid size, B is the number of boxes predicted per cell, and C is the number of object classes. In [34] they use S = 7, B = 2 and C = 20, yielding a final prediction with the shape 7 × 7 × 30.

Figure 2.3.4: Illustration of the grid used in YOLO.
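To make the output shape concrete, the following sketch (hypothetical variable names, not YOLO's actual implementation) splits a raw S × S × (B·5 + C) prediction tensor into boxes, confidences, and class probabilities, and combines them as in equation 2.3.1.

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
raw = np.random.rand(S, S, B * 5 + C)   # stand-in for a network output

boxes       = raw[..., : B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence)
confidence  = boxes[..., 4]                          # Pr(Object) * IoU per box
class_probs = raw[..., B * 5 :]                      # Pr(Class_i | Object) per cell

# Class-specific confidence for every box (equation 2.3.1):
# Pr(Class_i | Object) * Pr(Object) * IoU = Pr(Class_i) * IoU
class_scores = confidence[..., None] * class_probs[:, :, None, :]
print(class_scores.shape)  # (7, 7, 2, 20)
```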

2.3.7 YOLOv2

With the intention of improving YOLO, Redmon et al. proposed a method called YOLOv2 in [32]. YOLOv2 is a modified version of YOLO intended to increase both speed and accuracy.

Similarly to Faster R-CNN, YOLOv2 utilizes anchors when predicting bounding boxes. For each grid cell, YOLOv2 produces bounding boxes by predicting offsets to 5 anchors. Classes are now also predicted for each anchor instead of for each grid cell, and each anchor is also given an objectness score. As in YOLO, classes are predicted on the condition that there is an object, Pr(Class_i | Object). Objectness is calculated as the estimated IoU between the predicted box and an estimated ground truth box, IoU^gt_pred. YOLOv2 also employs a new method to determine anchor sizes: instead of hand-picking the anchors as in Faster R-CNN, YOLOv2 uses k-means clustering on the training data to produce anchors that are better fitted to the data.

In order to increase speed, a CNN architecture called Darknet-19 is introduced in YOLOv2. Darknet-19 is able to achieve higher image classification accuracy than both the widely used VGG-16 [38] and the custom network previously used in YOLO [34]. It manages to do this while only using 5.58 × 10^9 floating-point operations per forward pass, compared to 30.69 × 10^9 operations for VGG-16 and 8.52 × 10^9 operations for the network previously used in YOLO.

2.3.8 YOLOv3

YOLOv3 comprises further improvements of YOLOv2, presented by Redmon et al. in [33]. Similarly to the feature pyramid networks described in section 2.3.9, boxes are predicted at 3 different scales in YOLOv3. This increases YOLOv3's ability to detect small objects, something which previous versions of YOLO struggled with.


Inspired by the residual networks presented in [15], Darknet-19 is expanded to include residual layers. This new CNN architecture is called Darknet-53 since it has 53 convolutional layers in total. Compared to Darknet-19, Darknet-53 has higher accuracy but is a bit slower.

2.3.9 Feature Pyramid Network

Classic object detection techniques based on hand-crafted features such as SIFT [26] and HOG [7] often use feature pyramids to detect objects at different scales. However, due to the large amount of memory needed to train a CNN with feature pyramids, methods such as R-CNN, Fast R-CNN, and YOLO do not use feature pyramids. This was until Lin et al. proposed a method in [23] that enables the use of feature pyramids by utilizing the pyramidal feature hierarchies created by CNNs. The proposed method is called Feature Pyramid Network (FPN) and can be applied to any CNN architecture. Examples of algorithms that use FPNs are Faster R-CNN, Mask R-CNN, and RetinaNet.

To generate feature pyramids, FPN initially creates two different feature pyramids and then merges them by adding them to each other. To do this, the feature maps generated by a CNN are first grouped into stages, where each stage contains all the feature maps that are of the same size. For each stage, the feature maps generated by the deepest layer in that stage are taken to represent a level in one of the feature pyramids. The other feature pyramid is constructed by continuously upsampling the final level of the first pyramid to create a feature pyramid with the same dimensions as the first one. These two feature pyramids are then merged to create the final feature pyramid. The merging is done to combine feature maps with finer detail from the first pyramid with coarser, but semantically stronger, feature maps from the second pyramid.
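A minimal PyTorch sketch of one such merge step, with hypothetical tensor sizes and layer names: a coarse, semantically strong map is upsampled and added to a finer map that has first been projected to the same number of channels by a 1 × 1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature maps from two consecutive stages of a backbone:
fine   = torch.randn(1, 512, 52, 52)   # finer spatial detail, 512 channels
coarse = torch.randn(1, 256, 26, 26)   # deeper stage, already reduced to 256 channels

lateral = nn.Conv2d(512, 256, kernel_size=1)   # project the fine map to 256 channels

# Top-down pathway: upsample the coarse map to the fine map's resolution and add.
merged = lateral(fine) + F.interpolate(coarse, scale_factor=2, mode="nearest")
print(merged.shape)  # torch.Size([1, 256, 52, 52])
```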

2.3.10 Single Shot Detector

The Single Shot Detector (SSD) is a single-stage detector presented by Liu et al. in [25]. SSD adds convolutional layers to an existing CNN in order to produce layers with feature maps of smaller size. By then creating predictions at several different layers, SSD is able to detect objects at multiple scales. At each layer, object detections are produced by applying a number of 3 × 3 kernels to every position of the feature maps. This procedure is illustrated in figure 2.3.6, which shows how SSD produces predictions from multiple different feature maps. All predictions are made relative to reference bounding boxes, called default boxes in [25]; default boxes are analogous to the anchor boxes used in Faster R-CNN. For each position and default box, a specific 3 × 3 kernel predicts a single output value denoting either a class score or an offset for the bounding box. This means that the total number of filters applied to a position of a feature map is (C + 4)B, where C is the number of classes and B is the number of default boxes. The total number of outputs for a feature map of size M × N is then (C + 4)BMN.


Figure 2.3.5: Merging of feature pyramids.

2.3.11 RetinaNet

The single-stage approach utilized in detectors such as YOLO and SSD enabled faster object detection. However, these single-stage detectors were not able to achieve the accuracy that two-stage detectors such as Faster R-CNN could offer. Lin et al. found that class imbalance in the training data was the principal cause of this. This class imbalance is a result of the large number of bounding boxes that a single-stage detector processes. The vast majority of these bounding boxes will be easy negatives, i.e. bounding boxes that can easily be classified as not containing an object. This imbalance makes training inefficient and can create models that do not work as intended. To address this, Lin et al. introduced a loss function called Focal Loss [24] and the single-stage object detector RetinaNet, which utilizes Focal Loss.

Focal Loss

Focal Loss stems from the cross entropy (CE) loss for binary classification. The cross entropy is described in 2.3.2, where p is the predicted probability that an observation belongs to a certain class and y ∈ {0, 1} is the ground truth, which is 1 if the observation belongs to the class and 0 otherwise.

    CE(p, y) = −log(p) if y = 1, and −log(1 − p) otherwise    (2.3.2)


Figure 2.3.6: SORT

By defining p_t as in 2.3.3, 2.3.2 can be rewritten as CE(p, y) = CE(p_t) = −log(p_t).

    p_t = p if y = 1, and 1 − p otherwise    (2.3.3)

In order to down-weight the effect easy negatives have on the training procedure, a modulating factor (1 − p_t)^γ is introduced. Easy negatives will have p_t ≈ 1 and thus (1 − p_t) ≈ 0. The focusing parameter γ is used to tune the down-weighting effect of the modulating factor; increasing γ reduces the impact easy negatives have during training.

In focal loss, a weighting factor α_t ∈ [0, 1] is also used to balance the importance of negative and positive samples; 2.3.4 describes how α_t is defined. Positive samples are samples containing an object and negative samples are samples that do not contain objects.

    α_t = α for positive samples, and 1 − α for negative samples    (2.3.4)

Focal loss combines both the modulating factor and the weighting factor described above, which means that the focal loss function is defined as in 2.3.5 below.

    FL(p_t) = −α_t (1 − p_t)^γ log(p_t)    (2.3.5)
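A minimal NumPy sketch of equation 2.3.5 for binary labels follows; the values α = 0.25 and γ = 2 are example settings, and the function name is hypothetical.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss (equation 2.3.5) for binary labels.

    p: predicted probability of the positive class, values in (0, 1)
    y: ground truth labels, 0 or 1
    """
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    p_t     = np.where(y == 1, p, 1 - p)          # equation 2.3.3
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # equation 2.3.4
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy negative (p = 0.01, y = 0) contributes almost nothing to the loss,
# while a hard positive (p = 0.1, y = 1) keeps a large loss value.
print(focal_loss([0.01, 0.1], [0, 1]))
```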

RetinaNet

To make use of focal loss, Lin et al. developed an object detector called RetinaNet. RetinaNet is a single-stage detector consisting of a backbone network and two smaller subnetworks. The backbone network first generates feature maps at different scales; this is done using an FPN, which was described in section 2.3.9.


RetinaNet also utilizes anchors; at every level of the feature pyramid each spatial position has 9 anchors. A classification subnetwork then predicts object existence probabilities for each class in each anchor. This classification subnetwork uses focal loss as its loss function. Parallel to the classification subnetwork is a bounding box regression subnetwork that produces bounding box offsets for each anchor. The bounding box regression network is wholly separated from the classification network and does not make use of class probabilities when predicting bounding box offsets.

2.4 Tracking Algorithms

The following section presents the theory behind the two tracking algorithms considered in this thesis: SORT and Deep SORT. As described in the limitations, these two are especially suited to answering the second research question without broadening the scope too much.

2.4.1 SORT

Simple Online and Realtime Tracking, SORT, is a tracking algorithm that was introduced by Bewley et al. in [4]. SORT is designed to perform multiple object tracking (MOT) in a tracking-by-detection system. In order to achieve real-time processing, SORT is intentionally kept simple and avoids complex and time-consuming tasks. To compensate for its lack of complexity, SORT instead relies on the more accurate object detections that CNN-based object detectors provide.

For each new frame, SORT first propagates objects that are already tracked into the current frame. The new positions of these already tracked objects are predicted using a Kalman filter [19] with a linear constant velocity model. Next, an object detection algorithm detects objects present in the current frame. These detected objects are then compared to already tracked objects and a cost matrix is created. This cost matrix is calculated as the IoU between each detection and each of the already tracked objects. Detections are then assigned to already tracked objects using the Hungarian method [21]. A new track is created when an object is detected in several consecutive frames while not overlapping with any of the already tracked objects. Figure 2.4.1 below shows how predicted positions are compared to object detections to assign identities in new frames. SORT, as it is used in this thesis, does not have any memory, and a tracked object is lost if SORT fails to detect it in a frame.
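A minimal sketch of the prediction step under a linear constant velocity model follows. It uses a simplified 2-D position/velocity state rather than SORT's actual state vector, and the matrix names are hypothetical.

```python
import numpy as np

dt = 1.0  # one frame

# State: [x, y, vx, vy] -- object center position and velocity.
F = np.array([[1, 0, dt, 0],   # constant-velocity state transition
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)

x = np.array([640.0, 360.0, 3.0, -1.0])   # current state estimate
P = np.eye(4) * 10.0                      # state covariance
Q = np.eye(4) * 0.1                       # process noise

# Kalman prediction step: propagate the track into the next frame.
x_pred = F @ x            # predicted position (velocity unchanged)
P_pred = F @ P @ F.T + Q  # predicted uncertainty

print(x_pred[:2])  # predicted (x, y) used when building the IoU cost matrix
```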

2.4.2 Deep SORT

Built with the intention of reducing the number of identity switches, Deep SORT incorporates appearance information into the tracking procedure presented in SORT [44]. Similarly to SORT, Deep SORT handles state estimation with a Kalman filter. Deep SORT differs from SORT in that it makes use of additional techniques when assigning detections to already tracked objects.


Figure 2.4.1: Object association between frames in SORT.

Deep SORT utilizes two different distance metrics when comparing detections to already tracked objects: Mahalanobis distance [13] and cosine distance between appearance descriptors. The Mahalanobis distance measures how the position of a new detection differs from the positions of already tracked objects in terms of standard deviations from the mean of the tracked objects. This metric allows Deep SORT to avoid assigning a new detection to an already existing track where the frame-to-frame motion would be unreasonable. Appearance descriptors are computed by forwarding each bounding box through a CNN that has been pre-trained on a person re-identification dataset. The appearance descriptor of each new detection is then compared to the appearance descriptors of already tracked objects by calculating the cosine distance between descriptors. Tracked objects and their appearance descriptors are also saved for 30 frames after they are lost so that Deep SORT has the ability to resume tracking identities that have been lost for a number of frames. Using appearance descriptors in this way gives Deep SORT the ability to find a previously tracked object even if it has been occluded for a number of frames.
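The cosine-distance comparison can be sketched as follows (illustrative NumPy code; the descriptor values are random stand-ins for the 128-dimensional CNN embeddings):

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two appearance descriptors (1 - cosine similarity)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

track_descriptor     = np.random.rand(128)  # stored descriptor of a tracked person
detection_descriptor = np.random.rand(128)  # descriptor of a new detection

# Small distances indicate the same person; a threshold (together with the
# Mahalanobis gate) decides whether the detection may be assigned to the track.
print(cosine_distance(track_descriptor, detection_descriptor))
```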


3 Method

This chapter covers the methodology used to perform the tests and evaluations for the thesis. The chapter begins with a section describing the data annotation method employed to annotate the test data. The second section covers the metrics used to evaluate performance, and the chapter then continues with an overview of the test environment, where different practical aspects of the tracking-by-detection system are explained. Finally, the last section of this chapter presents the algorithms and implementations tested in this thesis.


3.1 Data Annotation

The test data consists of video from two different surveillance cameras. One camera overlooks a platform of an underground train station; the other camera overlooks a stair leading down to the platform of that same station. The video sequences are 10.6 and 12.8 seconds long respectively, and both videos have 10 frames per second, giving a total of 234 frames. Frames in both videos have a resolution of 1280 × 720 pixels.

Ground truth annotations are made following the protocol presented in MOT16 [30], using Microsoft's application Visual Object Tagging Tool [29]. Due to the nature of the test data, only a single class, person, is annotated. There are other objects present in the videos, such as trains and bags, but these objects are not relevant in the context of this thesis. Objects are annotated if it is clear from the current frame alone that the object exists, which means that occluded objects are not annotated. If an object is partly occluded, its full extent is estimated and the object is annotated accordingly. Bounding boxes are always fitted as tightly as possible to the annotated object while still containing all of the object. There is a total of 2319 ground truth bounding boxes annotated over the two test videos, distributed over 36 unique identities.

Each person is given a single identity throughout the whole sequence, even if the person is occluded in parts of the sequence. Occurring occlusions are temporary and no person disappears for more than a couple of seconds. Handling such short-term occlusions is seen as a part of the tracking problem rather than as person re-identification. Therefore, these occlusions were considered relevant for this thesis and accounted for when annotating data. Figure 3.1.1 below shows an example of what annotated frames look like for both sequences.


Figure 3.1.1: Annotated frames for both sequences.

3.2 Evaluation

Performance is measured according to the framework presented in MOT16 [30] and in the same manner as performance is measured in the MOTChallenge¹. The authors of [30] provide publicly available code² for evaluation; this code is used to calculate the different performance metrics. MOT16 was chosen since it is a compilation of many other metrics, developed in an attempt to standardize evaluation of multiple object tracking. MOT16 contains a wide array of metrics for evaluation of multiple object tracking, some of which are quite similar to each other. Hence, metrics that were considered too similar to others have not been included in this thesis.

¹ https://motchallenge.net/

3.2.1 Classification of Predicted Bounding Boxes

The fundamental performance metric is the classification of bounding boxes. Table 3.2.1 below shows the different classes that a bounding box can be assigned. A predicted bounding box is considered a true positive (TP) if its intersection over union (IoU), or Jaccard index [18], with a ground truth box is larger than 0.5. Equation 3.2.1 shows how the IoU between a predicted box P and a ground truth box G is calculated. False positives (FP) are predicted bounding boxes without a corresponding ground truth box, and false negatives (FN) are ground truth boxes that the algorithm fails to detect. True negatives (TN) are irrelevant in this context since object detection algorithms do not produce any predictions on whether objects are absent. For all performance metrics defined in subsequent sections, TP, FP, and FN are used as shorthand notations for the number of bounding boxes that have been labeled as belonging to each class. GT is also used as notation for the total number of ground truth bounding boxes.

    IoU(P, G) = |P ∩ G| / |P ∪ G| = |P ∩ G| / (|P| + |G| − |P ∩ G|)    (3.2.1)

Ground truth: Object (positive)
    Predicted as object:      TP (True Positive)  — correctly labeled as object
    Predicted as background:  FN (False Negative) — incorrectly labeled as background
Ground truth: Background (negative)
    Predicted as object:      FP (False Positive) — incorrectly labeled as object
    Predicted as background:  TN (True Negative)  — correctly labeled as background

Table 3.2.1: Classification of bounding boxes.
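Equation 3.2.1 can be implemented directly for boxes given in the (xmin, ymin, width, height) format used later in the CSV files. The sketch below is illustrative, with hypothetical names, and is not the MOT16 evaluation code.

```python
def iou_xywh(p, g):
    """Intersection over union (equation 3.2.1) of two boxes.

    Boxes are given as (xmin, ymin, width, height).
    """
    px1, py1, pw, ph = p
    gx1, gy1, gw, gh = g
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px1 + pw, gx1 + gw), min(py1 + ph, gy1 + gh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive if its IoU with a ground truth box > 0.5:
print(iou_xywh((100, 100, 50, 80), (110, 105, 50, 80)) > 0.5)  # True (IoU = 0.6)
```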

3.2.2 Object Detection Evaluation

The MOT16 framework includes several metrics that can be used to evaluate object detection algorithms. Below are the metrics that this thesis uses to measure performance:

• Recall [12]: Recall is measured as the ratio between correctly detected objects and the total number of ground truth objects. Thus, the algorithm's recall reflects its ability to find ground truth objects.

    Recall = TP / (TP + FN) · 100    (3.2.2)

• Precision (Prcn) [12]: Precision describes the accuracy of the predicted bounding boxes. It is calculated as the ratio between correctly predicted bounding boxes and the total number of predicted bounding boxes.

    Precision = TP / (TP + FP) · 100    (3.2.3)

• F1 score (F1) [12]: The F1 score combines recall and precision into a single score by calculating the harmonic mean of precision and recall.

    F1 = 2TP / (2TP + FP + FN) · 100    (3.2.4)

• Average Precision (AP) [28]: Average precision is calculated as the area under the precision-recall curve. This precision-recall curve is created by first sorting all predictions in descending order according to their confidence. Starting with the most confident prediction, precision can then be plotted against recall by iteratively calculating cumulative precision and recall at different ranks in the now ordered set of predictions. Figure 3.2.1 and table 3.2.2 show an example of a precision-recall curve for the case where there are a total of 3 ground truth objects and 5 objects are predicted; AP is calculated as the area under the curve (a code sketch after this list illustrates the calculation).

Rank   Conf    Label   Precision   Recall
1      0.987   TP      1.0         0.33
2      0.934   FP      0.5         0.33
3      0.887   TP      0.67        0.67
4      0.764   FP      0.5         0.67
5      0.564   TP      0.6         1.0

Table 3.2.2: Example predictions.

Figure 3.2.1: Precision-recall curve for the example predictions (precision on the y-axis, recall on the x-axis).

• Multiple Object Detection Precision (MODP) [39]: A metric that measures the overlap between predicted bounding boxes and ground truth data. MODP is calculated from the IoU between predicted bounding boxes and ground truth bounding boxes.

    MODP = ( Σ_{k ∈ frames} Σ_{i ∈ objects} IoU(P_k^i, G_k^i) ) / GT · 100    (3.2.5)

• Multiple Object Detection Accuracy (MODA) [39]: MODA measures the accuracy of predictions by looking at missed ground truth boxes and false positives.

    MODA = (1 − (FN + FP) / GT) · 100    (3.2.6)

• Frames Per Second (FPS): FPS is a metric for comparing the speed of object detection algorithms. It is calculated as the ratio between the number of frames processed and the time it takes to run the algorithm.

    FPS = #frames / total runtime    (3.2.7)
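The following sketch (illustrative only, not the MOT16 evaluation code) expresses recall, precision, F1, and MODA as functions of the TP/FP/FN counts, and computes an average precision for the example in table 3.2.2 by accumulating precision and recall over the ranked predictions. It uses simple step-wise integration; the exact interpolation scheme in the evaluation code may differ.

```python
def recall(tp, fn):    return 100.0 * tp / (tp + fn)
def precision(tp, fp): return 100.0 * tp / (tp + fp)
def f1(tp, fp, fn):    return 100.0 * 2 * tp / (2 * tp + fp + fn)
def moda(fn, fp, gt):  return 100.0 * (1 - (fn + fp) / gt)

def average_precision(labels, num_ground_truth):
    """AP as the area under the precision-recall curve.

    labels: detection labels (True for TP, False for FP), already sorted
            by descending confidence.
    """
    tp = fp = 0
    prev_recall, area = 0.0, 0.0
    for is_tp in labels:
        tp, fp = tp + is_tp, fp + (not is_tp)
        r = tp / num_ground_truth
        p = tp / (tp + fp)
        area += (r - prev_recall) * p   # area under the curve for this recall step
        prev_recall = r
    return area

# The ranked example from table 3.2.2: TP, FP, TP, FP, TP with 3 ground truth objects.
print(average_precision([True, False, True, False, True], 3))
```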

3.2.3 Object Tracking Evaluation

Tracking is also evaluated in accordance with the MOT16 guidelines. The list below describes the tracking-specific metrics that this thesis considers. Many object detection metrics can also be used to evaluate tracking performance. Those metrics are assumed to have the same definition as described in section 3.2.2 unless stated otherwise.

• Identification Recall (IDR) and Identification Precision (IDP) [36]: IDR and IDP are similar to the metrics Recall and Precision for object detection. The metrics will however differ since objects are considered tracked only if they can be assigned an identity, which will not be the case for all detected objects. Another difference is that inconsistencies in identity assignments will lower the IDTP score. For each ground truth identity, the predicted identity most similar to it is found. Any other identity assigned to the ground truth identity is then considered a mismatch (IDFP) and counted as a false positive instead of a true positive.

    IDR = IDTP / (IDTP + IDFN) · 100    (3.2.8)
    IDP = IDTP / (IDTP + IDFP) · 100    (3.2.9)

• IDF1 score (IDF1) [36]: Similar to the F1 score for object detection, IDF1 combines both IDR and IDP into a single score to facilitate comparisons of different trackers.

    IDF1 = 2 IDTP / (2 IDTP + IDFP + IDFN) · 100    (3.2.10)

• Mostly Tracked (MT) [30]: The number of ground truth identities that are tracked for 80% or more of their existence.

• Partly Tracked (PT) [30]: The number of ground truth identities that are tracked between 20% and 80% of their existence.

• Mostly Lost (ML) [30]: The number of ground truth identities that are tracked for less than 20% of their existence.

• Identity Switches (IDs) [30]: The number of identity switches. An identity switch is counted every time an already tracked ground truth identity is assigned a new tracking identity.

• Track Fragmentations (FM) [30]: The number of track fragmentations. A track fragmentation is counted every time a tracked ground truth identity is lost and then found again in a later frame.

• Multiple Object Tracking Accuracy (MOTA) [3]: MOTA combines false negatives, false positives, and identity switches into a single score in order to express overall performance with a single value.

    MOTA = (1 − (FN + FP + IDs) / GT) · 100    (3.2.11)

• Multiple Object Tracking Precision (MOTP) [3]: MOTP measures how well correctly predicted bounding boxes (TP_i) fit their respective ground truth boxes (GT_i). This is done by calculating the average overlap between true positives and their corresponding ground truth objects.

    MOTP = Σ_i IoU(TP_i, GT_i) / GT · 100    (3.2.12)

3.3 Test Environment

A tracking-by-detection system developed in a previous project at NFC, as a part of the course Image and Graphics, Project Course CDIO, is used as the test environment for all tests in this thesis. The system is able to perform tracking-by-detection, person re-identification, and object instance segmentation, though only the tracking-by-detection functionality is relevant for this thesis.

As described in section 2.1, the tracking-by-detection module consists of two parts: an object detection algorithm and a tracking algorithm. The object detection algorithm first takes a video sequence as input and outputs a CSV file that describes the objects detected in each frame. The CSV file is constructed following the MOT16 guidelines so that the results can be evaluated using the MOT16 protocol; table 3.3.1 shows examples of values that rows in the CSV file can contain. The property id is always set to -1 since object detection algorithms do not assign identities to detected objects. xmin, ymin, width, and height define the predicted bounding box, and the confidence of each prediction made is described by the property conf. x, y, and z are values used to evaluate 3-dimensional object detection within the MOT16 framework; these are not relevant for this thesis and are always set to -1. Only objects of the class person are considered since that is the only class annotated in the test data.

frame   id   xmin     ymin     width   height   conf    x    y    z
1       -1   699.66   174.56   88.64   253.76   0.978   -1   -1   -1
1       -1   587.2    65.93    49.26   122.81   0.672   -1   -1   -1
2       -1   704.04   175.93   88.33   256.92   0.956   -1   -1   -1

Table 3.3.1: Examples of rows in the CSV file output from the object detection algorithm.
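A sketch of how detections might be written in this format is shown below. It is illustrative only; the actual scripts used in the thesis are not reproduced here, and the detection values are taken from the example rows above.

```python
import csv

# Each detection: (frame, xmin, ymin, width, height, conf); id, x, y, z are fixed.
detections = [
    (1, 699.66, 174.56, 88.64, 253.76, 0.978),
    (1, 587.20,  65.93, 49.26, 122.81, 0.672),
    (2, 704.04, 175.93, 88.33, 256.92, 0.956),
]

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for frame, xmin, ymin, w, h, conf in detections:
        # MOT16-style row: frame, id, xmin, ymin, width, height, conf, x, y, z
        writer.writerow([frame, -1, xmin, ymin, w, h, conf, -1, -1, -1])
```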

The tracking algorithm then takes as input both the CSV file from the object detection algorithm and the video sequence. It performs multiple object tracking and produces its own CSV file as output. This CSV file has the same format as the CSV file produced by the object detection algorithm, except that each object has now been assigned an identity. Table 3.3.2 shows an example of a few rows of this CSV file; x, y, and z are static as before.

frame   id   xmin     ymin     width   height   conf    x    y    z
1       1    699.66   174.56   88.64   253.76   0.978   -1   -1   -1
1       2    587.2    65.93    49.26   122.81   0.672   -1   -1   -1
2       1    704.04   175.93   88.33   256.92   0.956   -1   -1   -1

Table 3.3.2: Examples of rows in the CSV file output from the tracking algorithm.

The system in place at NFC currently uses YOLOv3 [33] as the object detection algorithm and Deep SORT [44] as the tracking algorithm. All tests for this thesis are done by replacing either the object detection algorithm or the tracking algorithm; this is possible since the object detection algorithm and the tracking algorithm are completely separated from each other. The CSV files that the system outputs are then used to measure the performance of different object detection and tracking algorithms. Figure 3.3.1 shows the main parts of the tracking-by-detection system and how each part outputs a CSV file.

3.3.1 Testing Object Detection Algorithms

Object detection algorithms are evaluated in two different ways: as stand-alone object detection algorithms and on how they perform in a tracking-by-detection system. Therefore, performance metrics for both object detection and object tracking are used to evaluate object detection algorithms. An algorithm's performance in object detection is likely highly correlated with how it performs in a tracking-by-detection system. It is however possible that certain characteristics of an object detection algorithm interact well with a specific type of tracker, which is why object detection algorithms are also evaluated in the tracking-by-detection system.

Figure 3.3.1: Scheme of the tracking-by-detection pipeline with its outputs (video sequence → object detection algorithm → CSV → object tracking algorithm → CSV).

3.3.2 Testing Object Tracking Algorithms

One of the objectives of this thesis is to investigate how the use of visual descriptors in the tracking algorithm affects the performance of a tracking-by-detection system. There is a myriad of different tracking algorithms available, some of which use visual descriptors and some of which do not. This thesis is however mainly concerned with object detection algorithms, and for that reason only two different tracking algorithms are tested: SORT and Deep SORT. These two are especially suited to studying how the use of visual descriptors affects performance, since Deep SORT is an extension of SORT in which the usage of visual descriptors has been incorporated. Both algorithms are evaluated using the performance metrics described in section 3.2. The tracking algorithms are also tested with ground truth object detections as input. The reason for doing this is that it gives an insight into how much error the tracking algorithm introduces and thus provides an upper limit on how much a better object detection algorithm can improve performance in a tracking-by-detection system.

3.3.3 Hardware

All tests are performed on the same machine in order to be able to make valid speed comparisons of different algorithms. The computer's CPU is an Intel Xeon Silver 4108³ processor with a clock speed of 1.8 GHz. An NVIDIA Quadro P4000⁴ is used as the computer's GPU.

3.4 Algorithms and Implementations

Implementations of tested algorithms would preferably come from the author who initially presented the algorithm, so that the implementation stays as true to the cited paper as possible. When possible, such implementations are used for the tests in this thesis. It is however not always feasible to integrate those implementations into the test environment, and therefore some non-original implementations have also been tested.

The system in place at NFC uses pre-trained weights for both YOLOv3 and Deep SORT; these weights are supplied with the implementations. Similarly, this thesis utilizes the pre-trained weights supplied with the implementations for the different object detection algorithms and for Deep SORT. To make comparisons fair, only pre-trained weights trained on the Microsoft COCO [22] dataset are used. Microsoft COCO was chosen since it is an extensive dataset which is often used as a benchmark when comparing algorithms [33][35][16]. Part of the reason why only pre-trained weights are used is the lack of annotated data. The small amount of data that was annotated in this thesis was deemed to be more useful as test data than as training data.

A short script is created for each object detection algorithm in order to integrate it into the tracking-by-detection system. The script feeds the test videos into the object detection algorithm and then converts its output to a CSV file in the format specified in section 3.3. The different scripts have a similar overall structure but differ in the details since they have to be tailored to fit each implementation.

3.4.1 Deep Learning Libraries

Implementations of the different algorithms are built using software libraries for deep learning. The choice of deep learning library can affect both the speed and performance of an algorithm and is therefore an important aspect when comparing different implementations. The deep learning libraries used for the algorithms tested in this thesis are:

– TensorFlow [1]: A library developed by Google that includes functionality that can be used to create deep learning algorithms such as CNNs.

– PyTorch [31]: Deep learning library developed by Facebook’s AI research group.

– Caffe2: A deep learning framework that has recently been integrated into PyTorch.

– Keras [6]: A high-level deep learning API that can run on top of other libraries such as TensorFlow.

3.4.2 Implementations

The following list describes the different object detection algorithms and implementations that are evaluated in this thesis:

• Facebook's Detectron [11]: Detectron is an object detection library developed by Facebook AI Research; it includes implementations of several object detection algorithms with different backbones. The library is written in Python, and all tests for this thesis are done with the Caffe2 framework built into PyTorch 1.0. Detectron's implementations of Faster R-CNN, Mask R-CNN and RetinaNet are tested in this thesis. All three algorithms are tested with both ResNet-50 and ResNet-101 as backbone.

• Matterport's Mask R-CNN [2]: A Keras implementation of Mask R-CNN that runs on TensorFlow 1.13.1 and Keras 2.2.4. The implementation uses a slightly lower learning rate than the 0.02 used in the original paper; it also zero-pads images to resolution 1024 × 1024 instead of dynamically resizing the image as in the original paper [16]. The implementation is written in Python and uses ResNet-101 as backbone.

• Fizyr's RetinaNet: A Keras implementation of RetinaNet running on TensorFlow 1.13.1 and Keras 2.2.4, where ResNet-50 is used as backbone.

• Ayoosh Kathuria's YOLOv3: A PyTorch implementation of YOLOv3 with Darknet-53 as backbone; tests are performed using PyTorch 1.0.

• Pierluigi Ferrari's Single Shot Detector (SSD): A Keras implementation running on top of TensorFlow 1.13.1 with Keras 2.2.4; it uses VGG-16 as backbone for SSD.

Table 3.4.1 shows the configurations of the different object detection algorithms tested as part of this thesis. Some algorithms are tested with multiple image resolutions; when an algorithm is tested with different resolutions it is denoted algorithm:resolution. YOLOv3 tested with images of size 320 × 320 pixels is, for example, called YOLOv3:320.

5 https://caffe2.ai/
6 https://github.com/facebookresearch/Detectron
7 https://github.com/matterport/Mask_RCNN
8 https://github.com/fizyr/keras-retinanet
9 https://github.com/ayooshkathuria/pytorch-yolo-v3
10 https://github.com/pierluigiferrari/ssd_keras


Algorithm Implementation Backbone

SSD:300 Pierluigi Ferrari VGG-16

SSD:512 Pierluigi Ferrari VGG-16

YOLOv3:320 Ayoosh Kathuria Darknet-53

YOLOv3:416 Ayoosh Kathuria Darknet-53

YOLOv3:512 Ayoosh Kathuria Darknet-53

RetinaNet Fizyr ResNet-50

Mask R-CNN Matterport ResNet-101

Faster R-CNN Detectron ResNet-50

Faster R-CNN Detectron ResNet-101

Mask R-CNN Detectron ResNet-50

Mask R-CNN Detectron ResNet-101

RetinaNet Detectron ResNet-50

RetinaNet Detectron ResNet-101

Table 3.4.1: The different object detection algorithms and configurations tested.

Two different implementations of tracking algorithms are used in this thesis:

• Alex Bewley's SORT [4]: A Python implementation of SORT written by the author of the SORT paper [4].

• Nicolai Wojke's Deep SORT [44]: A Python implementation of Deep SORT written by one of the authors of the original paper [44].

When Deep SORT is used in testing, an object has to be detected in two consecutive frames before it is given an identity. Also, if an identity leaves the sequence, Deep SORT saves its position and appearance for 30 frames before it is disregarded. SORT does not have this functionality and thus gives detected objects an identity directly and disregards them as soon as they are lost. To remedy this inconsistency, the first two frames in which a previously unseen object is visible are removed from the ground truth CSV file, so that SORT and Deep SORT have the possibility to track the same number of ground truth objects.
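A small sketch of this ground-truth adjustment is shown below, assuming each CSV row starts with the frame number and the object identity (the exact column order follows section 3.3 and is an assumption here).

```python
import csv
from collections import defaultdict

def drop_first_two_frames(gt_rows):
    """Remove every identity's rows from the first two frames in which it appears,
    so that SORT and Deep SORT can be scored against the same ground-truth objects.
    gt_rows: list of rows where row[0] is the frame number and row[1] the identity."""
    rows_by_id = defaultdict(list)
    for row in gt_rows:
        rows_by_id[int(row[1])].append(row)
    kept = []
    for rows in rows_by_id.values():
        rows.sort(key=lambda r: int(r[0]))         # order each identity by frame
        first_two = {int(r[0]) for r in rows[:2]}  # the identity's first two visible frames
        kept.extend(r for r in rows if int(r[0]) not in first_two)
    return sorted(kept, key=lambda r: (int(r[0]), int(r[1])))

# Usage sketch with a hypothetical file name:
# with open("ground_truth.csv") as f:
#     filtered = drop_first_two_frames(list(csv.reader(f)))
```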

11 https://github.com/abewley/sort
12 https://github.com/nwojke/deep_sort


4 Results

The following chapter presents results for the different object detection and tracking algorithms tested in this thesis. All testing is done on the two video sequences annotated as described in section 3.1, and performance is evaluated using the metrics presented in section 3.2. In order to make the plots easier to read, each configuration of an object detection algorithm is given a unique color: every meta-architecture is first given a general color (YOLOv3 is, for example, yellow), and the brightness of the color is then used to indicate either the complexity of the backbone or the image resolution, with brighter colors denoting a less complex backbone or a lower image resolution. Both RetinaNet and Mask R-CNN are tested with two different implementations; to avoid confusion, Detectron's implementations are given a black edge in all 2-dimensional plots so that it is easier to distinguish the different implementations from each other.


4.1 Object Detection Results

This section presents results for the different object detection algorithms considered in this thesis. Figure 4.1.1 first shows the average precision achieved by the different object detection algorithms. Average precision is a common metric for comparing object detection algorithms and is therefore first displayed in a sorted graph to give a general overview of how the algorithms compare to each other [33][24][16].

Figure 4.1.3 plots the average precision against the number of frames the algorithm can process per second. This plot is interesting since processing time is a limiting factor for how useful an algorithm is in a surveillance system. Precision and recall are then plotted against each other in figure 4.1.4; the balance between precision and recall is often a design choice and demonstrates central characteristics of an algorithm. Next, figure 4.1.2 displays the F1-score of the different algorithms, a metric which also expresses the overall performance of the algorithms. Last, full results for the object detection evaluation using the MOT16 [30] framework are presented in table 4.1.1. This table includes many of the metrics that can be calculated with the publicly available evaluation code for MOT16. The best score for each metric is written in bold font in order to make comparisons easier, and the red bars in the cells are used to make it easier to compare algorithms; a larger bar indicates a better score.

Figure 4.1.1: Average precision for different object detection algorithms.


Figure 4.1.2: F1 score for different object detection algorithms.

Figure 4.1.3: Average precision (AP) plotted against processing speed in frames per second (FPS) for the different object detection algorithms.


Figure 4.1.4: Precision and recall plot.

Object Detection Algorithm AP Recall Prcn F1 TP FP FN MODA MODP FPS
SSD:300 (VGG-16) 0.2706 24.2 97.7 38.8 561 13 1758 23.6 76.7 5.52
SSD:512 (VGG-16) 0.3578 31.5 94.4 47.2 730 43 1589 29.6 78.3 4.73
YOLOv3:320 (Darknet-53) 0.6182 63.2 81.2 71.0 1465 339 854 48.6 75.6 22.81
YOLOv3:416 (Darknet-53) 0.7141 72.1 86.8 78.7 1671 251 648 61.2 78.7 20.82
YOLOv3:512 (Darknet-53) 0.7185 78.3 89.6 83.5 1815 211 504 69.2 78.7 17.2
RetinaNet (ResNet-50) 0.6321 66.1 95.2 78.0 1534 78 785 62.8 82.1 4.02
Mask R-CNN (ResNet-101) 0.7118 79.3 73.6 76.3 1838 658 481 50.9 78.5 1.84
Detectron’s Faster R-CNN (ResNet-50) 0.7881 81.1 77.3 79.2 1881 552 438 57.3 78.5 5.43
Detectron’s Faster R-CNN (ResNet-101) 0.7919 81.2 80.3 80.8 1884 462 435 61.3 79.7 4.32
Detectron’s Mask R-CNN (ResNet-50) 0.7970 82.2 77.5 79.8 1907 555 412 58.3 79.1 2.95
Detectron’s Mask R-CNN (ResNet-101) 0.7936 82.8 80.0 81.3 1919 481 400 62.0 79.8 2.62
Detectron’s RetinaNet (ResNet-50) 0.6327 68.8 95.1 79.8 1595 83 724 65.2 79.8 4.69
Detectron’s RetinaNet (ResNet-101) 0.6315 69.5 94.4 80.1 1612 96 707 65.4 80.7 3.91

Table 4.1.1: Full object detection results using the MOT16 evaluation framework.
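The derived columns in table 4.1.1 follow directly from the detection counts. As a quick sanity check, the SSD:300 row can be reproduced from its TP, FP and FN values using the standard definitions (assumed here to match those in section 3.2):

```python
tp, fp, fn = 561, 13, 1758            # SSD:300 (VGG-16) counts from table 4.1.1

recall = tp / (tp + fn)               # 0.242 -> 24.2
precision = tp / (tp + fp)            # 0.977 -> 97.7
f1 = 2 * tp / (2 * tp + fp + fn)      # 0.388 -> 38.8
moda = 1 - (fn + fp) / (tp + fn)      # 0.236 -> 23.6
```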


4.2 Object Tracking Results

Tracking results for SORT and Deep SORT combined with the different object detection algorithms are presented in this section. As described in section 3.3, the tests are performed in a tracking-by-detection system where the tracking and detection algorithms are completely separated, so that it is possible to test different combinations of tracking and detection algorithms. Further, ground truth detections are also tested with SORT and Deep SORT; this gives an insight into how much of the error is due to the object detection algorithm and how much is due to the tracking algorithm. Tracking performance with ground truth detections should only contain errors introduced by the tracking algorithm and can thereby give an upper limit for how much better the tracking-by-detection system can become by changing the object detection algorithm. This upper limit is represented by a brown dashed line in all plots in this section.

Figures 4.2.1 and 4.2.2 first show the IDF1 score for the different object detection algorithms with SORT and Deep SORT respectively. MOTA results are then displayed in figures 4.2.3 and 4.2.4. These plots aim to display the overall performance of each object detector and tracking algorithm. Next, figures 4.2.5 and 4.2.6 plot IDR against IDP for SORT and Deep SORT, which shows how the tracking algorithms balance accuracy and precision with different object detection algorithms. Finally, the full results with many of the metrics obtained using the MOT16 evaluation code are presented in tables 4.2.1 and 4.2.2.
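For reference, this kind of evaluation can be reproduced with the py-motmetrics package, which implements the MOT16 metrics. The sketch below illustrates that package's API rather than the exact evaluation script used in this thesis, and it assumes the ground truth and tracker output have already been loaded elsewhere.

```python
import numpy as np
import motmetrics as mm  # py-motmetrics

def evaluate_sequence(gt, hyp):
    """gt / hyp: dicts mapping frame number -> (list of ids, (N, 4) array of
    [x, y, w, h] boxes) for ground truth and tracker output respectively."""
    acc = mm.MOTAccumulator(auto_id=True)
    for frame in sorted(gt):
        gt_ids, gt_boxes = gt[frame]
        hyp_ids, hyp_boxes = hyp.get(frame, ([], np.empty((0, 4))))
        # Pairwise 1 - IoU distances; pairs with IoU below 0.5 are left unmatched.
        dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
        acc.update(gt_ids, hyp_ids, dists)
    mh = mm.metrics.create()
    return mh.compute(acc, metrics=["mota", "motp", "idf1", "idp", "idr",
                                    "num_switches"], name="tracker")
```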


Figure 4.2.1: IDF1 score for object detection algorithms with SORT.

Figure 4.2.2: IDF1 score for object detection algorithms with Deep SORT.
