Deep Active Learning of Object Detection for Smart City


JULES FLABEAU

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Systems, Control and Robotics

Date: July 20, 2020

Supervisor: Hossein Azizpour

Examiner: Patric Jensfelt

School of Electrical Engineering and Computer Science

Host company: Univrses AB


Abstract

Deep learning networks are nowadays a major asset for smart city applications and other emerging technologies. It is well known that deep learning methods require a large amount of data to perform well, especially for safety-critical applications such as autonomous driving. Reducing the expensive and time-consuming labelling task done by human annotators is therefore an active research topic. As one of the most promising candidates to solve this problem, active learning aims to drastically reduce the number of samples that must be annotated for the learning process.

In this work, we focus on the design of an active learning strategy in the specific context of object detection in videos. Besides traditional sampling criteria, the queries are evaluated based on the temporal coherence of the network’s predictions. Introduced very recently, this characteristic has proven to be efficient for evaluating the informativeness of data points.

We introduce Temporal Flow, a sampling strategy that we tested against state-of-the-art methods and that outperformed them on a benchmark dataset. Indeed, our active learning strategy showed better average performance per labelled sample after each training cycle. These promising results encourage further work on active learning for object detection in videos. A real implementation of this work is feasible, and more research can follow, as we acknowledge that further improvements are possible.


Summary (Sammanfattning)

Deep learning is today an important asset for smart city applications and other new technologies. It is well known that deep learning methods require large amounts of data to achieve good performance, especially for safety-critical applications such as autonomous driving. Reducing the amount of expensive and time-consuming annotation performed by humans is a hot topic. One of the most promising candidates for solving this problem is active learning.

In this work, we focus on the design of an active learning strategy in a specific context: object detection in video. In addition to traditional sampling criteria, the temporal coherence of the network’s predictions is evaluated. This recently introduced property has proven effective for evaluating the information content of data points.

This work introduces our method, Temporal Flow. We tested our sampling strategy against the most recent methods and outperformed them on a benchmark dataset. The results encourage continued efforts in active learning for object detection in videos.


Summary (Résumé)

Neural networks are today a major asset for smart city applications and other new technologies. It is well known that these methods require a large amount of data to perform well, particularly for safety-critical applications such as autonomous driving. Consequently, reducing the long and costly annotation task carried out by human annotators is a popular research topic. As one of the most promising candidates to address this, active learning aims to considerably reduce the number of samples to annotate for the learning process.

In this work, we focus on the design of an active learning strategy in the specific context of object detection in videos. Besides the traditional sampling criteria, the queries evaluate the temporal coherence of the predictions. Introduced very recently, this characteristic has proven effective for evaluating the informativeness of data points.

By introducing Temporal Flow, we tested our sampling strategy against state-of-the-art methods and outperformed them on a reference dataset. The promising results encourage pursuing the effort undertaken in active learning for object detection in videos. A real implementation of this work is feasible, but more advanced research can also follow, as we acknowledge that improvements can be made.


Acknowledgement

First of all, I would like to express my gratitude to Univrses for allowing me to carry out my master’s thesis with them. My special thanks go to Alessandro Pieropan and Miquel Martí Rabadan for being two amazing supervisors.

Secondly, I would also like to warmly thank Hossein Azizpour for his supervision from KTH, and for all the good advice and preparation during the thesis.

Finally, I give special thanks to my family and my friends for being supportive in every situation and for giving me freedom and encouragement in all my projects.


Contents

1 Introduction
1.1 Background
1.1.1 Deep learning and object detection
1.1.2 The active learning principle
1.2 Problem definition
1.3 Research questions
1.4 Scope
1.5 Challenges
1.6 Contributions
1.7 Societal impact
1.8 Sustainability
1.9 Ethical considerations
1.10 Overview

2 Related work
2.1 Deep object detection
2.1.1 2D object detection: the principles
2.1.2 Two-stage architecture
2.1.3 Residual networks
2.1.4 Recent Architectures
2.2 Cost reduction methods
2.3 Active learning
2.3.1 Core principle
2.3.2 Query strategies
2.3.3 Active Learning meeting object detection
2.4 Temporal coherence and deep object detection

3 Methods
3.1 Notations and metrics
3.2 Object detector
3.3 Object tracking
3.4 Graphical model
3.4.1 Mapping
3.4.2 Energy function
3.4.3 Construction of the graph
3.5 Acquisition function
3.6 Implementation details

4 Results and Discussion
4.1 Dataset
4.2 Evaluation metrics
4.2.1 Object detector performance
4.2.2 Sampling strategy performance
4.3 Protocols
4.3.1 Overall comparison of performances
4.3.2 Tracker’s influence
4.3.3 Class imbalance and decoupled performances
4.4 Experiments
4.4.1 Overall methods
4.4.2 Tracker’s influence
4.4.3 Class imbalance correction
4.5 Analysis and discussion

5 Conclusions
5.1 Summary
5.2 Future work

Bibliography

A Figures


Chapter 1 Introduction

This thesis focuses on deep active learning for computer vision. After a short introduction to the subject, this chapter presents the scope, the limitations and the objectives of the project.

1.1 Background

1.1.1 Deep learning and object detection

Traditional computer vision methods rely on hand-crafted features for many tasks such as classification and detection. Such features, however, cannot cope with a high degree of variation in appearance, lighting changes, etc. More sophisticated algorithms based on graphs, decision trees and support vector machines (SVM) therefore appeared and formed the core of machine learning. These learning processes made it possible to detect complex shapes, but with limited performance. Finally, after showing great improvements in the field of computer vision, deep learning became the state-of-the-art technique for tackling highly complex problems. This high-dimensional learning approach makes problems that were deemed impossible for basic algorithms much easier to solve today.

For computer vision, problems like object detection, semantic segmentation or depth estimation are today tackled by deep learning.

The specific problem of deep object detection has been well studied in recent years, and the recent achievements are well documented in the work of Huang et al. [1]. Nowadays, Faster R-CNN (region proposal on CNN features) [2] and SSD (Single-shot detector) [3, 4] are the standards used in practice, and they provide impressive results in the computer vision field.

However, although the performance of such learning techniques has risen remarkably, a major drawback has never disappeared: learning is extremely data-hungry.

Computation power is no longer the biggest limitation, given the huge resources that are publicly available¹. The data, on the contrary, needs to be collected and annotated, which involves a substantial cost (cf. figures A.1 and A.2 for a cost comparison). Moreover, each slight improvement requires many new data points in the learning process. Even if images can easily be collected from public resources or with basic implementations, the labelling process remains very laborious and depends on the task at hand (i.e. labels for classification, detection, and segmentation require different amounts of work and time).

1.1.2 The active learning principle

Reducing the labelling cost is at the core of many interesting techniques. Some, like active learning, approach the problem by reducing the number of annotated samples required: semi-supervised learning [5], weakly supervised learning [6], unsupervised learning [7, 8], and transfer learning [9]. In other words, all of these methods rely on only a small proportion of labelled samples, while also exploiting the many unlabelled ones. Others approach the problem by artificially augmenting the number of labelled samples [10]. The intensity of this research shows that the problem is taken very seriously, and promising techniques are emerging from the literature at a fast pace.

Active learning has been studied since soon after machine learning techniques emerged, with the aim of reducing the amount of labelled data required. This technique benefits from the huge amount of unlabelled data available, which is often easy to acquire in computer vision problems [11]. The key idea is that a machine learning model can achieve a reasonable result with only a subset of the training data, provided that this subset contains highly informative samples. Therefore, the core of the problem is to evaluate the potential benefit of each sample for training and to label the best subset before performing supervised learning. Active learning has proved very useful in medical image analysis [12, 13], where the labelling process requires an expert and therefore has a higher cost. In his survey, Settles [14] carefully describes the major studies carried out in this field so far. Although the core idea has remained the same since the classical machine learning setting, the criteria used to select samples can become more and more sophisticated when working with deep learning problems. Various works have thus designed techniques that work well under particular conditions. However, there is no consensus on the best strategy to apply to a given deep learning problem.

1. Google Colab, for example, is a free Jupyter notebook environment provided by Google where you can use free GPUs and TPUs.
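To make the active learning principle concrete, the following minimal sketch shows one cycle of pool-based sampling that uses predictive entropy as the informativeness measure. Entropy is only one classical uncertainty criterion and is not the strategy proposed in this thesis; the function names, the toy pool and the budget are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(unlabelled_predictions, budget):
    """Return the ids of the `budget` samples with the highest predictive entropy.

    `unlabelled_predictions` maps a sample id to the class probabilities
    predicted by the current model (hypothetical data layout).
    """
    ranked = sorted(
        unlabelled_predictions,
        key=lambda sample_id: entropy(unlabelled_predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Toy unlabelled pool: the model is confident on frame_a and unsure on frame_b.
pool = {
    "frame_a": [0.98, 0.01, 0.01],
    "frame_b": [0.40, 0.35, 0.25],
    "frame_c": [0.70, 0.20, 0.10],
}
queries = select_queries(pool, budget=2)
```

In a full cycle, the selected samples would be sent to the annotator, moved from the unlabelled pool to the labelled set, and the model would be retrained before the next selection.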

1.2 Problem definition

This thesis addresses the problem of the huge quantity of data required by learning procedures. More precisely, it aims to implement an active learning strategy for object detection using video sequences. This setting has not been investigated much compared to the classic active learning problems, where the data is sparse and has no inherent coherence.

As more deep learning based solutions are deployed in the real world, the need to annotate large amounts of data increases, which makes this topic well suited to emerging technologies (e.g. smart-city technologies aiming to optimize traffic, or autonomous cars with mounted cameras). Data is then abundant and accessible, but the labelling task becomes more and more laborious.

Object detection can take many forms. Although some techniques were designed specifically for videos [15, 16], single-image detectors [2, 3] are still the baseline in this field as they offer fast and accurate results; we therefore rely on this type of framework for the thesis. More precisely, we use a Faster R-CNN [2, 1]. This architecture provides good results for object detection and is easily implemented with any type of backbone. We use PyTorch as our main development library. Many backbones are provided with pre-trained weights, which allows models to be fine-tuned on new datasets. This lets us study our active learning strategy relatively quickly compared to a long full training from scratch.

The temporal aspect is not part of the network and is not used to improve the detections, but it is used to implement an active learning strategy. The works [17, 18] are among the first to tackle this problem. Temporal coherence is used as a heuristic to evaluate the informativeness of the samples, and queries are formed based on this factor. Shown to be an effective sampling strategy, this criterion can easily be implemented in an online manner and is well suited to autonomous cars that use a video feed as the main tool for understanding their environment.

Estimating the coherence between two inferences requires extra tools to analyze this property (besides the network’s output). Template matching and tracking systems can be employed to compare or link bounding boxes at different times. A similar approach was proposed in [18], which uses optical flow to track objects within a small time window in a video sequence. By doing so, a mapping between detections in several frames can be established, and the network’s errors can be detected from the coherence between them. The active learning queries are made from the frames showing no temporal coherence. A similar idea was presented in [19], showing the advantage of having multiple views of the same scene, where the redundancy can be used to score the detections of a single object among all the views.
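As an illustration of temporal coherence used as a sampling heuristic, here is a minimal sketch in Python. It is neither the optical-flow method of [18] nor the tracker used in this thesis: it assumes, for simplicity, that objects move little between consecutive frames, so that boxes can be linked by plain intersection over union (IoU). The function names, the box format (x1, y1, x2, y2) and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def has_match(box, boxes, threshold):
    """True if `box` overlaps at least one box of `boxes` above the threshold."""
    return any(iou(box, other) >= threshold for other in boxes)


def incoherent_frames(sequence, threshold=0.5):
    """Indices of inner frames whose detections disagree with both neighbours."""
    flagged = []
    for t in range(1, len(sequence) - 1):
        previous, current, following = sequence[t - 1], sequence[t], sequence[t + 1]
        # Possible false positive: a box that appears at time t only.
        isolated = any(
            not has_match(box, previous, threshold)
            and not has_match(box, following, threshold)
            for box in current
        )
        # Possible missed detection: an object seen before and after t, but not at t.
        missed = any(
            has_match(box, following, threshold)
            and not has_match(box, current, threshold)
            for box in previous
        )
        if isolated or missed:
            flagged.append(t)
    return flagged


# A car drifting to the right; frame 2 misses it, frame 3 has a spurious box.
sequence = [
    [(10, 10, 50, 50)],
    [(12, 10, 52, 50)],
    [],
    [(16, 10, 56, 50), (200, 200, 240, 240)],
    [(18, 10, 58, 50)],
]
queries = incoherent_frames(sequence)
```

The frames returned by such a check are the ones that would be sent for annotation, since the detector contradicts itself there.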

Also in the spirit of upcoming autonomous cars, we employ the ImageNet VID dataset [20]. This set contains real-world sequences in many different conditions. Its large number of sequences (and frames) makes it a good candidate for applying active learning techniques. We select only a subset containing road scenes to limit the total size and to stay closer to the smart-city theme mentioned above.

1.3 Research questions

Based on [18, 19], we want to investigate a possible improvement by combining the advances made with tracking and with multi-view techniques, creating continuous scores that discriminate each sample from the others. This would provide a sampling policy specific to our case of object detection on a video feed. By doing so, we expect to obtain new state-of-the-art results in active learning for object detection. The points that we address in this thesis are as follows:

— Designing a probabilistic estimation for the graphical model that represents the temporal information contained in the sequences. The first graphical model, introduced by [18], is designed only to obtain a partition of the boxes into two groups (true or false). Building on this idea, we modify the graph to incorporate more information about the sequence, including uncertainty and group voting. Hence we search for the best possible way to describe a sequence in terms of a probabilistic model.

— Introducing a continuous scoring system based on an energy function that discriminates all samples in a pool-based active learning setting. Although the partition of all boxes into true and false makes it possible to sample a group of frames, it does not differentiate between images that have the same number of false boxes. Therefore, we want to experiment with a new scoring system that provides an individual score for every frame in a sequence.

— Introducing a real tracker to obtain better temporal information about the variation of the boxes’ shape and to detect object occlusion when it occurs. This would make it possible not only to link the detections in time but also to incorporate knowledge about the boxes’ localization and their retrieval scores thanks to the tracking algorithm. This differs from the methods of [18] and [17], which respectively transpose the boxes from frame to frame using optical flow and perform template matching in the neighbouring frames. We therefore study the influence of such an implementation and its benefits or drawbacks.
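The limitation of a binary partition mentioned in the second point can be illustrated with a minimal sketch. The per-box probabilities, the 0.5 threshold and both scoring functions below are hypothetical and are not the energy function of this thesis; they only show why counting false boxes produces ties that a continuous score can break.

```python
def binary_score(false_probs, threshold=0.5):
    """Number of boxes labelled 'false' after thresholding (binary partition)."""
    return sum(1 for p in false_probs if p > threshold)


def continuous_score(false_probs):
    """Sum of the per-box probabilities of being false (hypothetical score)."""
    return sum(false_probs)


# Hypothetical probabilities that each detected box is a false detection.
frames = {
    "frame_a": [0.55, 0.10],
    "frame_b": [0.95, 0.45],
}

binary = {name: binary_score(p) for name, p in frames.items()}
continuous = {name: continuous_score(p) for name, p in frames.items()}
```

Both frames contain exactly one box above the threshold, so the binary count cannot rank them, whereas the continuous score ranks frame_b above frame_a.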

Bringing these three points together, we implement a new active learning strategy able to discriminate all samples thanks to a new graphical formulation and a better understanding of the context in the sequences. This new strategy is designed to combine the classical sampling methods with temporal coherence into a single overall measure of energy.

1.4 Scope

It is important to note the particular scenario in which we place ourselves for this thesis. Although autonomous cars or any smart-city application could be studied via tasks such as instance segmentation, depth estimation, and other complex objectives, we restrict ourselves here to object detection alone. This task already provides box classification and localization, which is enough to design a temporal criterion on the inference. Object classification has been widely studied for active learning and does not seem appropriate for the video scenario. On the contrary, object detection has not been studied much in the active learning literature, and even less so in videos. Although the ultimate goal is to generalize the sampling strategy to any task (or to multi-task learning), we leave this part for future research.

Moreover, with autonomous cars on the road, predictions must be made in real time, fast enough for the embedded computer to take decisions quickly. If an active learning strategy were implemented in an online manner, this would imply speed requirements (often expressed in frames per second). In our design, we do not explicitly formalize this parameter. However, we keep this aspect in mind for a possible future implementation, and we therefore exclude all ensembles made of multiple networks, which would add too much complexity (in both memory and computation). Ensembles are nevertheless well known for estimating uncertainty well, and possible implementations may be explored in future research.


1.5 Challenges

The design of new active learning strategies is very challenging [21], as there is no consensus on which parameter is the most important to consider during sampling. The work of [18] showed a very promising path by including temporal coherence as a heuristic for informativeness, but it left aside the classical measures of uncertainty and diversity during sampling. In this work, we try to consider them all by creating a graphical model able to embed all these measures at once. It seems legitimate to hope for improvements in the learning process by doing so. However, as the behaviour of active learning varies considerably with the type of task or dataset used, experimentation will reveal whether our strategy is worth considering for further work.

Furthermore, only a few datasets are available to test active learning for object detection in videos. Such a dataset has to be large enough to enable a pool-based active learning process, and thus contain several thousand samples to recreate the context correctly (large unlabelled pool versus small labelled pool). Since we opt for road scenes with autonomous cars in mind, the ImageNet VID dataset seems well suited, with its numerous real-world sequences. However, few other datasets have the same characteristics, which makes it hard to generalize our experiments.

Finally, in active learning experiments, networks have to be retrained after each sampling step, which can be time-consuming as well. Hence, to get results in limited time, we use pre-trained networks and fine-tune them. This leads to faster convergence, but at the risk of falling into local minima or creating a dependency between the dataset used for pre-training and the one we use for experimentation.

In the previous section, we presented the scope of active learning in the field of deep learning. We restrict ourselves to a very specific part of the many possible applications in order to tackle one problem only. Active learning for object detection is very challenging in itself and promising for future research. Going from classification to object detection offers many options for the design of the sampling policy, as presented in [22] and [23].

1.6 Contributions

In this thesis, we design a new active learning sampling strategy that takes into account the classic parameters of uncertainty and diversity in the sample prediction, as well as the temporal coherence between consecutive frames showing the same object multiple times. Inspired by [18], we improve the selection criterion by enhancing the graphical model, moving from a binary distribution to a continuous one, in order to discriminate all samples and make the selection more precise. We show that our method outperforms the baseline active learning techniques tested on the ImageNet VID dataset.

1.7 Societal impact

Deep learning and autonomous systems are nowadays closely linked, and progress in one strongly affects the other. Research that makes deep learning methods more efficient and easier to train is therefore very promising for the future of such systems. Autonomous cars have benefited a lot from deep learning breakthroughs in recent years. Computer vision has made it possible to drive, up to a certain limit, completely autonomously. One can imagine that this limit will be overcome as more and more systems move autonomously. On a bigger scale, active learning promises an easier training phase for neural networks. We can thus imagine the biggest deep learning projects becoming reality in a matter of years. The societal impact of deep learning may keep growing as ever more ambitious projects are imagined and developed. A well-known example is Neuralink, developed by Elon Musk, which aims to create a machine-brain interface. Although the technology is not ready yet, we are getting closer and closer, and active learning might be an accelerating factor in the process.

1.8 Sustainability

Cars are nowadays a major source of pollution through their use of fossil fuels. Autonomous driving leaves room for optimization of the traffic. It is estimated that the introduction of autonomous vehicles will drastically reduce the number of privately owned cars, thus reducing the need for roads and parking space. The introduction of safe autonomous systems will reduce the number of deaths on the road as well. Green areas can ideally take up more space and make cities less polluted. For example, Univrses’ Smart city project, which is the origin of this work, is developed to make cities more efficient and less polluted. In this thesis, a strong emphasis is placed on the driving scene. But deep learning can also be used in the very promising field of waste recycling. Building robots able to recognize the type of waste and the category to sort it into is becoming affordable. Such processing would allow for cleaner sorting and reuse of the waste, a great hope for sustainability.

It is also important to note that we used multiple GPUs and many hours of training for this thesis. The energy used for deep learning is not negligible in general, and the energy used for this thesis alone could power a few houses for months. However, active learning offers a much more efficient way of developing deep learning that saves resources. Both labelling work and training time can be optimized thanks to this research.

1.9 Ethical considerations

Autonomous cars raise major ethical questions. In terms of responsibility, important questions arise if an accident occurs. The current laws on road responsibility need to be adjusted for such systems, which operate without a human in control. This is an important ethical issue that society will face more and more as new systems appear and change human decision-making and responsibility. It is also still very hard to understand what really happens inside a deep neural network and to justify its use in such safety-critical scenarios. The ethical issue can also be seen from an industrial perspective. To what extent do we want an autonomous car on the road? Companies must assess the risks of cars before putting them on the market. The question is then which level of risk companies are ready to accept while still commercializing the product.

Deep learning, in general, brings new questions with each new network capability. Face recognition, voice synthesis, facial deepfakes, and many other specific networks challenge our fundamental conceptions of identity. It is now possible to create a video of a well-known person saying complete nonsense, with a realistic voice and facial deformation. All that is needed is a voice sample of a few seconds and a picture of that person. In terms of ethics, this raises many questions about how far somebody’s identity can be used. If active learning makes such technologies easier to train, it certainly has to face the same questions. However, it is important to note that, in parallel with these developments, new networks are also being designed to detect deepfakes.

1.10 Overview

The recent works and existing solutions that relate to our research are presented in Chapter 2. The implementation details and the technical and conceptual aspects are then described in Chapter 3. Afterwards, the experimental results are presented in Chapter 4. Finally, in Chapter 5 we summarize this work, draw its conclusions, and give possible future directions.


Chapter 2 Related work

In this chapter, we present the state of the art of active learning for object detection. To do so, we first present the baselines of object detection in deep learning and then describe the scope of active learning, from its first principles to the specific case of object detection.

2.1 Deep object detection

Object detection is one of the main tasks performed with deep learning. It differs from object classification by localizing an object in image space rather than only assessing its presence, thus providing a 2D axis-aligned bounding box and a label. However, it is limited to this box positioning and does not provide a precise contour as instance segmentation does by labelling objects pixel by pixel. Novel network structures provide tremendous abilities to solve those tasks. The key elements are deep Convolutional Neural Networks (CNNs), trained in a supervised learning context. On well-known benchmark datasets, such as CIFAR [24] or ImageNet [20], their performance even exceeds human classification accuracy. For our research we limit ourselves to object detection, of which this section gives a detailed overview.

2.1.1 2D object detection: the principles

Since CNNs appeared, object detection has changed greatly. The structures of the networks evolved from R-CNN (Region CNN) [25], to SSD (Single-shot detector) [3], and finally to Faster R-CNN [2] or YOLO (You Only Look Once) [26]. The last two can be considered the baselines of object detection. Although all of the aforementioned structures use feature extraction from CNNs, they differ in the way they process the bounding boxes. SSD and YOLO are single-stage structures, whereas Faster R-CNN belongs to the two-stage category. This last technique is the one we are interested in, as it is our core element for the rest of the thesis.

2.1.2 Two-stage architecture

In this type of architecture, the extracted features are first passed to a region proposal network (RPN), which outputs several box proposals (forming the first stage). Afterwards, for each proposal, the cropped features are reused in the second stage to perform class prediction and refine the box coordinates. To avoid considering repeated boxes, non-maximum suppression (NMS) is performed after both stages. This post-processing step keeps only a few interesting boxes by filtering them according to their overlap with each other. The entire procedure is the core of the Faster R-CNN [2] structure, which makes it a baseline for object detection that is both fast and accurate.
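The NMS step can be sketched as follows. This is a generic greedy variant written in plain Python for readability, not the implementation used in the detector of this thesis; the box format (x1, y1, x2, y2), the 0.5 overlap threshold and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first one strongly and has a lower score, so it is suppressed, while the third box covers a different region and is kept.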

Many changes and improvements were made between R-CNN and this final procedure, each of them to gain accuracy or speed. To perform object detection, R-CNN first runs a selective search on the input image to select many regions of interest, and passes them one by one to the CNN feature extractor for label assignment. The main steps of this method are presented in figure 2.1. The drawback of the method is the large number of regions to process, which makes it very slow. Besides, the regions produced by the selective search algorithm cannot be learned from the data directly, hence they lack precision and lead to poor proposals.

Figure 2.1 – R-CNN structure overview [2].


Fast R-CNN [27] was proposed to decrease the inference time by modifying the first stage to perform Region of Interest (RoI) pooling directly on the feature map, i.e. the output of the CNN applied to the entire image. In this way, the features are computed once and for all, and the labelling task does not need to recompute them for each region. This process is shown in figure 2.2. This modification greatly reduced the prediction time and made the technique very attractive for further research.
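The idea of pooling each region from a shared feature map can be sketched in a few lines. The example below is a simplified, single-channel RoI max pooling in plain Python; the region format (row0, col0, row1, col1) in feature-map cells with exclusive end indices, the bin boundaries and the function names are assumptions and do not reproduce the exact quantization of [27].

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one region of a 2D feature map to a fixed output size.

    feature_map: list of rows (H x W), computed once for the whole image.
    roi: (row0, col0, row1, col1) in feature-map cells, end indices exclusive.
    output_size: (out_h, out_w) of the pooled grid.
    """
    row0, col0, row1, col1 = roi
    out_h, out_w = output_size
    height, width = row1 - row0, col1 - col0
    pooled = []
    for i in range(out_h):
        # Rows covered by output bin i (bins may overlap by one cell).
        r_start = row0 + (i * height) // out_h
        r_end = row0 + ceil_div((i + 1) * height, out_h)
        pooled_row = []
        for j in range(out_w):
            c_start = col0 + (j * width) // out_w
            c_end = col0 + ceil_div((j + 1) * width, out_w)
            pooled_row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(pooled_row)
    return pooled


# Shared 4 x 4 feature map with values 0..15, reused for two different regions.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
full = roi_max_pool(feature_map, (0, 0, 4, 4), (2, 2))
crop = roi_max_pool(feature_map, (1, 1, 4, 4), (2, 2))
```

Both regions are reduced to the same 2 x 2 size although they differ in extent, which is what allows the second stage to process every proposal with the same layers without recomputing the features.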

Figure 2.2 – Fast R-CNN structure [27].

Finally, the Faster R-CNN [2] structure replaced the first stage by an RPN, which learns and adapts the region proposals from the data. With this learning process, the network is both fast and accurate. For a more in-depth overview of different object detection methods, please refer to Huang et al. [1], where a study in terms of speed and accuracy is conducted.

2.1.3 Residual networks

The classic approach to building deep networks is to stack layers end to end (e.g. convolutional layers) to extract features at every level; such networks are also called plain networks. However, these systems are not easy to optimize and training may face issues such as falling into local minima or converging very slowly. The typical vanishing gradient effect is visible in such architectures. During back-propagation, the gradients tend to zero as they are propagated further back, making the weight updates very small.

Residual networks were introduced by He et al. [28] to overcome this problem. With the introduction of shortcut connections (or skip connections), the mapping done by the classic layers is replaced by a residual mapping. Figure 2.3 illustrates a building block of the residual networks. The shortcut connections perform identity mapping, and their outputs are added to the outputs of the stacked layers, giving the block output F(x) + x. In this notation, F is the transformation applied by the block, while x is its input.

Figure 2.3 – 2-layer residual block [28].

By doing so, residual networks address the two previous bottlenecks of plain networks: they are easier to optimize and show lower training error when the depth increases greatly. For the skip connection to be mathematically valid, the input and the output of the elementary block must have the same dimension. Otherwise, it would not be possible to add x to its transformation F(x). To lift this limitation, the skip connection can apply a simple convolution to x to change its dimension whenever the block modifies the output's dimension.
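As a minimal sketch of the residual mapping described above, the following pure-Python example computes F(x) + x for a two-layer block acting on a plain vector. The function names, the use of fully connected layers instead of convolutions, and the optional projection matrix `w_proj` (standing in for the convolution on the shortcut) are illustrative assumptions, not the implementation of [28].

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def residual_block(x, w1, w2, w_proj=None):
    """Return F(x) + x, with F(x) = w2 * relu(w1 * x).

    w_proj is a hypothetical projection applied to the shortcut when the
    block changes the dimension of its input.
    """
    fx = matvec(w2, relu(matvec(w1, x)))
    shortcut = x if w_proj is None else matvec(w_proj, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and F(x) must have the same dimension")
    return [f + s for f, s in zip(fx, shortcut)]
```

With all weights set to zero the block reduces to the identity, which illustrates why adding such blocks should not make a deeper network harder to fit than a shallower one.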

The architecture of the network is presented in the appendix figure A.3.

The identity shortcut is very attractive because it does not increase the total number of parameters while improving performance.

Each ResNet block is either 2 layers deep (used in small networks such as ResNet-18 and ResNet-34) or 3 layers deep (ResNet-50, 101 and 152). The ResNet-50 model consists of 5 stages, each with a convolution block and an identity block. Each convolution block has 3 convolution layers, and each identity block also has 3 convolution layers. ResNet-50 has over 23 million trainable parameters.

2.1.4 Recent Architectures

More recently, new architectures have emerged in the deep learning field. Building on the Faster R-CNN baseline, Mask R-CNN [29] added a branch that predicts high-quality instance segmentation masks in parallel with the existing bounding box inference. It runs at a low frame rate (about 5 fps in general) but has the advantage of being easy to train, and it can be extended to other tasks such as human pose estimation.


Although the two-stage framework described above shows state-of-the-art accuracy on most benchmark sets, it is still disadvantaged by its slow predictions compared to one-stage pipelines. The great imbalance between background and objects to detect makes the training process hard, as the loss gets overwhelmed by the many well-classified examples. To tackle this issue, Lin et al. [30] designed a new loss, replacing the typical cross-entropy with the Focal loss, which down-weights the loss assigned to well-classified examples. In addition, they proposed a new one-stage architecture called RetinaNet, showing new state-of-the-art performance in both accuracy and speed.
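The down-weighting can be sketched as follows. The example scales the cross-entropy of the true class by a factor (1 - p_t)^gamma, where p_t is the probability assigned to the true class. The function names and the default gamma = 2 are assumptions made for illustration, and the class-balancing weight also used in [30] is left out for brevity.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for the probability p_t assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma; gamma = 0 recovers it."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

For a well-classified example with p_t = 0.9 and gamma = 2, the loss is scaled by (0.1)^2, i.e. reduced by a factor of about one hundred, while a hard example with p_t = 0.1 keeps most of its loss.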

In the same spirit, the line of work of the above-mentioned [26] keeps making great progress on single-stage frameworks, with YOLO Nano [31] as one of its newest versions. Its small model size makes this solution well suited to embedded systems such as mobile phones. Despite its small number of parameters, it shows good results on the benchmark datasets for object detection and is competitive with two-stage architectures.

2.2 Cost reduction methods

Although this work focuses only on active learning as a way to reduce the annotation cost, other methods have also been studied to make the training of neural networks more efficient. Data augmentation was designed decades ago to generate new samples from data already annotated. By introducing transformations (flips, zooms, etc.), the samples can be modified and used during training to help the network generalize. The recent work of Cubuk et al. [10], called RandAugment, combines a multitude of transformations applied randomly to the samples and shows state-of-the-art improvements. It improves on existing methods by designing an augmentation that requires only two hyper-parameters to tune, providing a fast and efficient implementation.

When only a few labels are available, as in active learning, other techniques consider the reverse problem: labelling the samples with high certainty without any human in the loop. The pseudo-label is then used for training, a process called semi-supervised learning. It can be combined with existing data augmentation methods, as done in FixMatch [5]: when the network is highly confident on a weakly augmented sample (i.e. flip only), its prediction is used as the label for training on a strongly augmented version of the same sample (with the method of [10]). Hence the amount of labelled data is artificially increased and the learning reaches better accuracy.
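The confidence-based selection of pseudo-labels can be sketched as below. The function name, the input format (one list of class probabilities per weakly augmented sample) and the threshold value are assumptions made for illustration; the augmentation and training steps are not shown.

```python
def pseudo_labels(weak_predictions, threshold=0.95):
    """Return {sample index: class index} for confident predictions only."""
    selected = {}
    for index, probs in enumerate(weak_predictions):
        confidence = max(probs)
        if confidence >= threshold:
            # The most likely class becomes the pseudo-label of the sample.
            selected[index] = probs.index(confidence)
    return selected
```

Samples whose confidence stays below the threshold are simply left out of the pseudo-labelled set for this training step.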

Fully unsupervised methods tackle situations where no labels are available. It is possible to learn the feature space without annotation, as shown in the work of Gidaris et al. [8], by first training the network to predict image rotations. This first training cycle forces the network to learn a representation and a good understanding of the features in the distribution. The network can be fine-tuned afterwards with fewer labelled samples. In the same way, the pipeline described by Liu et al. [7] learns the latent space using generative adversarial networks (GANs) for image-to-image applications. The setting is fully unsupervised: the original data has no pairs of annotated images, but their method is still able to learn a marginal distribution of the latent space for both the input and output distributions.

Some methods relate more closely to object detection, as in our study, with weakly-supervised techniques. In this scenario, labels are available for only one of the tasks. Wan et al. [6] improve the state of the art for such applications by controlling the randomness often observed in the localization of the boxes. They minimize the entropy in the latent space during training, showing great improvements over existing techniques.

Finally, knowledge transfer, or knowledge distillation, tries to take advantage of an already-trained network (the teacher) as a source of information for a target, or student, network. The tasks of the source and the target can differ, but the already-trained layers of the teacher can offer crucial information if the transfer is done successfully, reducing the amount of annotated data required for training. Ahn et al. [9] introduced a principled knowledge transfer framework that maximizes the mutual information between the two networks, based on a variational information maximization technique.

2.3 Active learning

Given the large success of machine learning in the last decade, followed by the emergence of deep learning, the task of labelling data has become more and more tedious. Active learning appears to be one of the promising candidates to deal with this issue, alongside other techniques such as unsupervised or semi-supervised learning. Here is a brief description of this particular process.

2.3.1 Core principle

The hypothesis behind active learning is that if a learning algorithm can choose the data from which it learns, it will perform better with less data. This is a very desirable property when one wants to reduce the amount of labelled data used during training. Many problems in deep learning nowadays require thousands (if not hundreds of thousands) of samples [20], whose labels come from difficult, time-consuming and expensive work.

To tackle this problem, active learning formulates queries: unlabelled samples are selected to be labelled by an oracle, which is most of the time a human expert. The model is then trained with those few samples, completing one cycle of active learning (cf. figure 2.4). The process may then be repeated until the accuracy is deemed sufficient or the budget of queries is exhausted.

The active learning cycles use and modify two distinct sets of data. Queries are made inside the set of unlabelled instances, usually denoted U. After being labelled by the oracle, those samples are moved to the second set, the labelled instances L.

Active learning may be used in different scenarios, creating various types of queries. The two main ones are pool-based and stream-based sampling. For a detailed account of the principles behind active learning, one may refer to the survey of Settles [14].

Stream-based selective Sampling

In selective sampling [32], also called sequential active learning, the samples are drawn from the unlabelled pool one after the other, and the learner decides either to request the label or to discard the sample. The decision is made instantly and does not depend on the other samples forming U.

With this strategy, the learning depends heavily on the data distribution. Queries may be very sensitive to it if the distribution is non-uniform and unknown, as samples are considered one by one regardless of the others.

In this setup, the labelling decision is often based on the informativeness measured for the sample. The more informative the sample is, the more likely it is to be labelled. A naive approach to the problem is to put a threshold on the informativeness and discard all samples below this limit. This delimits, in a certain way, a region of uncertainty: the learner wants to label the samples falling into it.

This approach seems well suited for real-world applications, first by reducing the annotation effort, but also by reducing the size of the database needed (typically the size of U). For road scenes and autonomous systems, one can imagine a camera providing a stream without saving any sequence. The frames to be annotated would be chosen with a selective sampling strategy.
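The naive threshold rule can be sketched as below. The function name, the `informativeness` callable and the labelling `budget` are hypothetical; any heuristic of informativeness could be plugged in.

```python
def stream_based_sampling(stream, informativeness, threshold, budget):
    """Query samples from a stream whose score reaches the threshold."""
    queried = []
    for sample in stream:
        if len(queried) >= budget:
            break  # the labelling budget is exhausted
        # The decision uses only the current sample, never the others.
        if informativeness(sample) >= threshold:
            queried.append(sample)
    return queried
```

Because each decision depends only on the current sample, the stream never needs to be stored, which matches the camera scenario described above.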


Pool-based sampling

The most popular active learning setting is pool-based sampling, where we assume a small labelled set L initially and a large collection of samples forming the unlabelled set U. The learner has a view of the entire set of unlabelled samples, which is considered closed: no new data is gathered between cycles.

Queries can now be made with knowledge of the entire set U. Thus samples are queried according to the informativeness evaluated for all the instances in the pool. Figure 2.4 illustrates the process of pool-based sampling.

Figure 2.4 – Active learning pool-based process [14].

This technique is widely used in the literature, in contrast to the previously described one. The main difference is that decisions are not made individually (as in stream-based sampling) but require ranking the entire content of U to finally take the best sample(s). Pool-based sampling thus allows more precise and selective strategies to be designed and is attractive on paper. For many real-world applications, however, stream-based methods seem more appropriate due to limited memory or processing power.
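A pool-based cycle can be sketched as the loop below. All names are hypothetical: `train` builds a model from the labelled set, `acquisition` scores a sample for a given model (higher meaning more informative), and `oracle` returns the label of a sample. Samples are assumed to be hashable so that the labelled set can be stored as a dictionary.

```python
def pool_based_active_learning(labelled, unlabelled, oracle, train,
                               acquisition, batch_size, n_cycles):
    """Run several query/label/train cycles over a closed unlabelled pool."""
    labelled = dict(labelled)      # L: sample -> label
    unlabelled = list(unlabelled)  # U: samples without a label
    model = train(labelled)
    for _ in range(n_cycles):
        if not unlabelled:
            break
        # Rank the whole pool, most informative samples first.
        ranked = sorted(unlabelled,
                        key=lambda x: acquisition(model, x), reverse=True)
        for sample in ranked[:batch_size]:
            labelled[sample] = oracle(sample)  # query the oracle
            unlabelled.remove(sample)          # move the sample from U to L
        model = train(labelled)
    return model, labelled, unlabelled
```

The full ranking of U at every cycle is what distinguishes this setting from stream-based sampling, and also what makes it more demanding in memory and computation.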

2.3.2 Query strategies

The active learning strategies employed today for deep learning are strongly inherited from their machine learning predecessors. To denote the query formally, we refer to the query algorithm, also called the acquisition function $a(U, \theta)$, which returns the best candidate $x_a^*$ or a set of candidates $X_a^*$ that are the most informative for the model $\theta$ according to $a$. This function is a heuristic of the samples' informativeness, as the real information cannot be directly computed.

For a strategy to be interesting for implementation, it has to be better than random sampling. This means that for a fixed goal accuracy, the strategy must reach it with fewer samples compared to adding samples picked randomly and training until reaching the goal.

Uncertainty sampling

The first and simplest method implemented to query is the uncertainty sampling [33]. From this uncertainty measure, many strategies can be derived.

Least Confidence [34]. Using the posterior probability of the model θ, the strategy queries the sample of U with the smallest confidence. This way of doing is very popular as it is the simplest to implement, taking into account only the label with the highest probability for sample i, $\hat{y}_i = \arg\max_y P_\theta(y|x_i)$.

$$x^*_{LC} = \arg\min_{x_i} P_\theta(\hat{y}_i|x_i) \quad (2.1)$$

Minimum Margin [35]. Considering this time the two best labels predicted $\hat{y}_i^1$ and $\hat{y}_i^2$ for the sample i, this strategy estimates the uncertainty via the margin between the two scores. As the margin tends to get smaller if the model is unsure, this strategy will select the most ambiguous samples.

$$x^*_{MM} = \arg\min_{x_i} \left[ P_\theta(\hat{y}_i^1|x_i) - P_\theta(\hat{y}_i^2|x_i) \right] \quad (2.2)$$

Entropy [36]. Finally, an estimation of the uncertainty can be calculated with the entropy of the prediction. This approach makes more sense when the predictions are done over more than just two labels. This computation is a way to estimate the "impurity" of the prediction, and is considered as a heuristic of informativeness.

$$x^*_H = \arg\max_{x_i} -\sum_j P_\theta(y_i^j|x_i) \log P_\theta(y_i^j|x_i) \quad (2.3)$$
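The three uncertainty criteria of Eq. 2.1 to 2.3 can be sketched in a few lines. The function names and the representation of the pool as a list of class-probability vectors (one per unlabelled sample) are assumptions made for illustration.

```python
import math


def least_confidence(probs):
    """Confidence of the most likely label (Eq. 2.1 minimizes this)."""
    return max(probs)


def margin(probs):
    """Gap between the two most likely labels (Eq. 2.2 minimizes this)."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def entropy(probs):
    """Entropy of the prediction (Eq. 2.3 maximizes this)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def query(pool, strategy):
    """Return the index of the sample selected from a pool of predictions."""
    indices = range(len(pool))
    if strategy == "least_confidence":
        return min(indices, key=lambda i: least_confidence(pool[i]))
    if strategy == "margin":
        return min(indices, key=lambda i: margin(pool[i]))
    if strategy == "entropy":
        return max(indices, key=lambda i: entropy(pool[i]))
    raise ValueError("unknown strategy: " + strategy)
```

The three criteria do not always agree: a prediction such as (0.5, 0.5, 0.0) has a zero margin, yet a lower entropy than a prediction spread over the three labels.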

Query-by-committee

Another well-suited strategy for active learning is query-by-committee (QBC) [37, 38]. This approach is less straightforward to implement as it requires forming a committee of several models $\{\theta^{(1)}, \dots, \theta^{(C)}\}$. All of the models are trained on L; however, they differ slightly from each other (e.g. different initializations, randomness during training, etc.). Therefore, their predictions over one sample can differ. The strategy selects the samples on which the committee disagrees the most, which often means that the sample is close to a decision boundary and therefore very informative for training. The consensus probability of assigning label $y^i$ to the sample $x$ is defined by Eq. 2.4 and can be used by the acquisition function to make the labelling decision.

$$P_C(y^i|x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y^i|x) \quad (2.4)$$

This technique is theoretically promising but also implies some drawbacks when implemented. The larger the committee is, the better the quality of the sampling will be, but also the more computational power will be required.
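A minimal sketch of the consensus probability of Eq. 2.4 is given below, together with one possible disagreement score, the entropy of the consensus distribution. The choice of this score and the function names are assumptions; other disagreement measures can be built on the same consensus.

```python
import math


def consensus_probability(committee_probs):
    """Average the class probabilities of the C committee members (Eq. 2.4)."""
    n_members = len(committee_probs)
    n_classes = len(committee_probs[0])
    return [
        sum(member[k] for member in committee_probs) / n_members
        for k in range(n_classes)
    ]


def consensus_entropy(committee_probs):
    """Disagreement score: entropy of the consensus distribution."""
    consensus = consensus_probability(committee_probs)
    return -sum(p * math.log(p) for p in consensus if p > 0.0)
```

Each call requires one forward pass per committee member, which is where the computational cost mentioned above comes from.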

More specific to deep learning

The aforementioned baselines were first designed to tackle classic machine learning problems, but active learning faces new challenges when applied specifically to deep learning. Indeed, the complexity of the models' structures and the high dimension of the input data often make the training more complex. Therefore, many new ways of applying active learning have been designed to match current applications.

For image classification, Aghdam et al. [23] and Jiang et al. [39] showed in their works that the measure of uncertainty can be applied directly as described in the previous part, by computing the entropy and the minimum margin respectively. The classification problem in deep learning structures maps directly onto the classic machine learning problem, and they show that this measure is consistent and beneficial for such problems. In a similar vein, Ducoffe et al. [40] analyse the minimum margin through adversarial attacks to label the sensitive samples.

To bring ensemble theory into practice, several approaches have appeared. For the same problem of image classification, Beluch et al. [41] give a good overview of the benefits of applying such a theory to deep learning. Two ways of applying it are described and compared. First, the ensemble is created from several models trained separately, and the consensus between them is the key element of the acquisition function. This approach is also studied in [42] and is extremely promising. The ensemble was shown to select very informative samples and also to naturally correct the class imbalance during the active learning cycles. However, having several models in the committee is not always feasible: it requires great computational power to evaluate the whole committee, and memory to store the models. This gave rise to the second technique, Monte Carlo Dropout. In [43], only one model is trained, which is sufficient to make the ensemble theory work. Indeed, once the model is trained, several variants of it are created by applying random dropout to its layers. Each variant is a new member of the committee, and predictions differ thanks to the "noise" of the connections turned off. However, [41] shows that a few fully-trained networks make a better committee than many created by dropout.

While the uncertainty of the prediction is a good candidate for a heuristic of informativeness, the work of Sener et al. [44] gives great value to the representativeness of the samples. The core-set criterion maximizes the coverage of the samples in the data space to obtain a better representation of all the features in the labelled set. The authors show that an upper bound on the expected loss can be estimated and reduced by solving a clustering problem in the feature representation of the samples. However, such spatial computation methods do not scale well with high-dimensional data and/or vast pools of samples.
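The coverage idea can be sketched with a greedy k-center selection, a common approximation for this kind of clustering problem. The function name and the use of raw coordinate tuples as features are assumptions; in practice the points would be feature vectors extracted by the network, and the labelled set is assumed to be non-empty.

```python
import math


def k_center_greedy(labelled, pool, budget):
    """Greedily pick pool indices that maximize the distance to the
    closest already-selected point (labelled set plus earlier picks)."""
    min_dist = [min(math.dist(p, c) for c in labelled) for p in pool]
    selected = []
    for _ in range(min(budget, len(pool))):
        best = max(range(len(pool)), key=lambda i: min_dist[i])
        selected.append(best)
        # The new pick becomes a center: refresh the distances to it.
        for i, p in enumerate(pool):
            min_dist[i] = min(min_dist[i], math.dist(p, pool[best]))
    return selected
```

Each pick requires a pass over the whole pool with one distance computation per point, which hints at why such methods become costly for high-dimensional features and very large pools.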

The representativeness criterion is a useful way of creating batches with diverse content [45] (samples very different from each other) and is thus very attractive for deep learning training techniques. Considering only the uncertainty may select very similar images in the unlabelled set and thus harm the training. New methods have tried to combine both criteria to create informative batches directly. The method of Kirsch et al. [46], called BatchBALD, takes a small committee to perform Bayesian active learning and selects the samples carefully to create a diverse batch (see also [47, 48, 49]).

As the original idea of active learning is to reduce the time and cost of the labelling process, some researchers have proposed cost-driven approaches. The cost-effective methods [50, 51, 52] consider a labelling cost for every sample, in addition to its informativeness. The acquisition function then strikes a balance by selecting low-cost and highly informative images to annotate. This idea is often used when the cost of the labels is very high and requires a qualified expert.

Finally, in their recent work, Yoo et al. [53] implemented a learning loss module to learn the acquisition function. This approach is very similar to the concept of reinforcement learning, as the queries are learned in an online manner, with the rewards defined as the accuracy gained. Although the theory seems promising, this technique is difficult to implement in real-world scenarios. A similar approach was presented in [54], which implements a reinforcement learning policy to decide when to select a sample to be annotated by an expert and when to make the prediction. This module is applied to image classification on videos and shows great improvement in the design of hybrid active learning, where the strategy is very flexible. In this sense, the queries are made automatically and adapt when a new subject or difficulty appears in the videos to be classified. In the same mindset, [55] implemented an adversarial approach to learn a latent space for both the labelled and unlabelled data. The sampling strategy is then based on the scores given by a discriminator that tries to distinguish between labelled and unlabelled samples. In other words, samples are selected based on the likelihood of their representativeness w.r.t. other samples the discriminator thinks belong to the unlabelled pool.

From this brief review of the state of the art, it is clear that active learning takes many forms. Lipton et al. [21] analyze the obstacles to applying such techniques in real scenarios. The results of these studies suggest that active learning is promising and can be largely applied to real problems. However, with so many different ways to implement it, it is hard to find a technique that generalizes to all problems or all network structures.

2.3.3 Active Learning meeting object detection

While the predictions of image classification problems can be directly interpreted to compute the uncertainty of a sample, object detection brings new possibilities for this estimation. Indeed, the uncertainty of the label can remain the same, but the box prediction is a new parameter to include. Also, the number of objects in the frames may modify the scores given by the acquisition function.

In the work of Brust et al. [56], many aggregation techniques are detailed. The prediction scores of all the predicted boxes can be combined to assess a final value of the sample's informativeness. A direct approach is to apply the baseline techniques to each prediction and average the heuristics to obtain a more general acquisition function. The scores can be combined in other ways, taking the sum, the max, etc., and this work provides a comparison of them all.
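The aggregation step can be sketched as follows. The function name and the convention of returning zero for an image without any detection are assumptions made for illustration.

```python
def aggregate(box_scores, mode="mean"):
    """Combine per-detection uncertainty scores into one score per image."""
    if not box_scores:
        return 0.0  # assumption: an image without detections scores zero
    if mode == "sum":
        return sum(box_scores)
    if mode == "mean":
        return sum(box_scores) / len(box_scores)
    if mode == "max":
        return max(box_scores)
    raise ValueError("unknown mode: " + mode)
```

The choice matters: the sum favours images containing many objects, the mean removes this dependence on the number of detections, and the max focuses on the single most uncertain detection.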

To tackle the object detection task more specifically, Roy et al. [22] propose an active learning technique dedicated to object detection. The architecture of the network is used to get predictions at several scales in the images. Those predictions are then treated as an ensemble. Their method uses both the uncertainty value and the representativeness to select the frames to annotate. However, this technique requires knowing and studying the architecture of the network itself, and thus may suffer from inconsistency if the backbone changes.

Taking inspiration from real-world applications of object detection, Bengar et al. [18] create a new selection criterion, different from uncertainty or representativeness. Indeed, object detectors are often used with a video feed, for autonomous cars or fixed street cameras. Therefore, the objects should appear in several frames while moving around the images. Thus, the predictions of the same object from one frame to the next should not vary much and should be positioned at roughly the same coordinates. To this end, the boxes predicted for the same object can be tracked, and anomalies can be detected, for example a frame not showing the object when the surrounding frames identified one. The samples showing predictions that are not coherent in time are then selected for labelling.
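The example of a missing detection between two positive frames can be sketched very simply. This is only an illustration of the idea, not the method of [18]: the function name is hypothetical, and the input is assumed to be one boolean per frame telling whether a given tracked object was detected in that frame.

```python
def incoherent_frames(detected):
    """Return indices of frames where a tracked object is missing although
    both neighbouring frames contain it."""
    return [
        i
        for i in range(1, len(detected) - 1)
        if not detected[i] and detected[i - 1] and detected[i + 1]
    ]
```

A frame flagged in this way is a likely false negative of the detector and therefore a natural candidate for annotation.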

Finally, [19] focuses more on image segmentation, but the idea can be generalized to detection and is similar to the work of [18] above. They detail an acquisition function for scenarios where the same object is seen from different views across multiple frames. Their sampling strategy is to compute the entropy of the predictions over all the views and to query when the object presents high entropy on all the views. The frame to annotate is then picked inside the group of frames showing the object, taking the most informative one according to another acquisition function. This work can also be related to that of Jin et al. [17], where the detection defects are analysed in the same way to query the samples. By doing template matching on the surrounding area of the detections, they manage to find the incoherent frames, and they call this idea Hard Sample Mining.

On more complex tasks, active learning has been applied to strengthen semi-supervised learning. For videos in particular, [57] designs a network to find the best sample to annotate before training in a semi-supervised fashion. The query strategy itself is then learned, and samples are selected according to their prediction performance. They strongly improved on the state of the art of existing methods, where the first and only annotated sample was selected manually. More precisely, this refers to 'cold-start' learning and is very relevant for many applications. Active learning usually sidesteps this situation by randomly selecting a first set to label, which makes the process easier but not more accurate.


2.4 Temporal coherence and deep object detection

As described in section 2.1, the current baselines for deep object detectors are built for the general case, where temporal coherence is not considered. However, training on video feeds has been explored recently with the challenges offered by [20]. This kind of application relates closely to real-world scenarios and is a rich research area. New network architectures are designed to take temporal coherence into account in the learning and prediction process.

To answer this challenge, Liu et al. [16] propose to combine single-image object detection with convolutional long short-term memory (LSTM) layers. They also propose the Bottleneck-LSTM, a new bottleneck layer that reduces the computational cost and achieves temporal awareness by refining and propagating feature maps across frames. This approach is very interesting, as the computation speed remains attractive even though the structure of the network is more complex than that of classic object detectors. They achieve better accuracy than these baselines on the ImageNetVID challenge.

The work of Kang et al. [15] also incorporates temporal information, thanks to a novel tubelet proposal network that efficiently generates spatiotemporal proposals and a long short-term memory (LSTM) network that incorporates temporal information from the tubelet proposals to achieve high object detection accuracy in videos. They also experiment on the large-scale ImageNetVID dataset and demonstrate the effectiveness of their framework for object detection in videos.

Faster R-CNN [2] stays competitive even as new techniques gain in accuracy for object detection in videos. This network provides fast predictions and is a commonly used framework for object detection in videos as well. The temporal knowledge is sacrificed in favour of more speed and accuracy on single images.


Chapter 3 Methods

This thesis builds on the work of Bengar et al. [18]. Their analysis of the importance of temporal coherence for measuring the informativeness of a sample is very promising. In this spirit, we aim to reproduce their results and further develop a sampling strategy based on temporal coherence.

Figure 3.1 – Overall framework of our method Temporal Flow.

Our active learning approach first uses the bounding boxes detected by a pre-trained object detector to initialize an object tracker. Then, as the tracker propagates the boxes to the neighbouring frames in time, the relations between all detections can be represented as a graphical model. Incorporating potential scores and measures of confidence, this graph is used to predict the most likely false positive/negative boxes in a sequence as well as the overall flow. Queries for active sampling are finally made to select the frames to be annotated by an expert based on those measures. The entire process of our active learning strategy, called Temporal Flow, can be seen in figure 3.1.

After introducing some notations, the object detector will be presented in Section 3.2. Then, the object tracker is described in Section 3.3 before presenting the graphical model in detail in Section 3.4.

3.1 Notations and metrics

To get a good understanding of the upcoming sections, we first introduce the common notations and metrics that can be employed in computer vision.

When referring to a bounding box in an image, we can assign a label to describe its nature:

— true positive (TP), a correct detection with both label and box matching with the ground truth.

— false positive (FP), an incorrect detection. The label associated with the box is wrong and does not match with any ground truth.

— true negative (TN), something not detected and where there is indeed nothing to detect.

— and false negative (FN), a missed detection, the object was not given a box or a label according to the ground truth.

In section 3.4 we will separate the bounding boxes into different categories.

— b is the general descriptor for a bounding box,

— $d_i^k$ is the k-th bounding box in frame $I_i$ generated by the network during inference (d stands for detection),

— $c_j^{i,k}$ is a bounding box in frame $I_j$ added during mapping (not generated during inference) and linked to $d_i^k$ (c stands for candidate).

One metric commonly used is the intersection over union, IoU. It evaluates the overlap between two boxes, here b1 and b2. The computation is simple, given by the overlapping area between the bounding boxes divided by the area of union between them:

$$\mathrm{IoU}(b_1, b_2) = \frac{\mathrm{area}(b_1 \cap b_2)}{\mathrm{area}(b_1 \cup b_2)}$$
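The ratio above can be computed as in the following sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2. This box format and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

The value ranges from 0 for disjoint boxes to 1 for identical boxes.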


3.2 Object detector

As described in section 2.1.2, the object detector used in this thesis is a Faster R-CNN [2]. The network is a single-image detector, as it does not take the previous frames into account when predicting the boxes. This property is very important for our work, as it differs from the structures described in section 2.4. Indeed, the predictions for two different samples are independent of each other, which makes it possible to derive information from the temporal aspect: we can study the variations of the predictions knowing that one does not influence the other.

The key element of the network’s framework is the backbone. This part extracts the features of the samples and is the base for the label assignment as well as the region proposals. As commonly used in the literature, we will implement a ResNet-50 [28] backbone for the feature extraction (cf. section 2.1.3). This network is relatively small in terms of parameters (around 23 million) and provides a good accuracy after a few epochs of training.

As they are often used in the literature, those networks are regularly trained on the same datasets. The majority of them are publicly available with weights pre-trained on the benchmark datasets (e.g. MS-COCO [58], ImageNet [20] and others). This makes it possible to run experiments by only fine-tuning the networks on a new learning problem, without having to start the training from scratch. We will take advantage of this to converge faster in our experiments.

3.3 Object tracking

When working with videos, one must find a way to follow an object from one frame to the next. This process is called object tracking and can be implemented in many ways. The first approaches use algorithms to find the best match of a region of interest from a source image in the destination one. To work reasonably well, the trackers require a decent frame rate: it is easier to find the object if the two samples are similar, i.e. not separated by a long time step. The frame rate (FPS) is therefore very important for the choice of the tracker. Recently, new learning-based methods have appeared, with networks designed to compute the flow between two samples (often based on optical flow). The movements of the entire image can be predicted pixel-wise, which offers a detailed description of the motions in the scene.


In [18] the boxes are tracked with PWC-Net [59]. This network predicts the flow for each pixel of the entire image. Then, to know the position of a box in the next frame, its coordinates can be translated with the flow, but some additional computation is required. As the object does not occupy the entire box, the pixels used to compute the translation must be carefully selected. Indeed, the corners of the box often belong to the background when the object inside is in the foreground. Handling the deformation of the box also needs more work and would require a proper segmentation of the object. However, this kind of tracker has the great advantage of being independent of any box initialization. The flow can be computed once and for all, for all the image pairs in the set.

Classic visual trackers based on algorithms are also interesting, as they can modify the box directly to surround the object in the next image. Most of them work at a decent speed. The commonly used tracking algorithms are generally more readily available than the learning methods mentioned above¹. Hence, they are easy to integrate and are optimized for computational performance. Besides, they report whether the object was retrieved (sometimes with a confidence value as well), which is crucial to know if the object was found.

Figure 3.2 – Tracking objects back and forth in a sequence. The tracker MIL [60] is used. In green, the original boxes taken to initialize the tracker. In blue the resulting boxes by tracking backward and in red forward.

We choose in this work to use a classic tracker and not the flow prediction networks existing. Indeed, those trackers provide the box deformation directly and are not adding much computation time compared to the predictions.

In the design of our sampling strategy, the shape of the box will play an important role. For some trackers, a confidence value can also be evaluated, which would add valuable information to the queries. With these criteria in mind, we choose to use the OpenCV library. An illustration of tracking objects in a small sequence is given in figure 3.2.

1. For an implementation example see https://docs.opencv.org/3.4/d2/d0a/tutorial_introduction_to_tracker.html

Different algorithms can be used for tracking. We call the tracker used $T$, and denote by $T^{i \to j}(b)$ the transformation of a box $b$ from frame $i$ to frame $j$ with this tracker. In addition to the output box, the tracker also gives a confidence score, denoted $q_T$, which represents the confidence of retrieval. In some cases it is a real number between 0 and 1, in others it is just a boolean (0 or 1) indicating whether the tracking succeeded. The various trackers have different sensitivities to object retrieval, and the reliability of $q_T$ depends greatly on the tracker adopted.
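As a minimal sketch of this notation, the wrapper below turns any tracking function into one that returns both $T^{i \to j}(b)$ and a confidence $q_T$ expressed as a float in $[0, 1]$, whether the underlying tracker reports a boolean or a score. The names `TrackResult` and `wrap_tracker` are hypothetical, and `raw_tracker` stands in for a real tracker, it is not an OpenCV call.

```python
from typing import Callable, NamedTuple, Tuple, Union

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
RawTracker = Callable[[Box, int, int], Tuple[Box, Union[bool, float]]]


class TrackResult(NamedTuple):
    box: Box    # T^{i->j}(b)
    q_t: float  # confidence of retrieval q_T, in [0, 1]


def wrap_tracker(raw_tracker: RawTracker) -> Callable[[Box, int, int], TrackResult]:
    """Return a function (b, i, j) -> TrackResult with q_T always a float."""

    def track(b: Box, i: int, j: int) -> TrackResult:
        box, raw_q = raw_tracker(b, i, j)
        # A boolean success flag becomes 0.0 or 1.0; a score is clipped to [0, 1].
        q = min(1.0, max(0.0, float(raw_q)))
        return TrackResult(box, q)

    return track
```

With such a wrapper, the rest of the pipeline can treat boolean and real-valued trackers uniformly, although a boolean $q_T$ naturally carries less information.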

Trackers

Name         Accuracy (%)       Speed (fps)
CSRT [61]    Very good (85.8)   Moderate (7)
KCF [62]     Good (63.2)        Fast (65)
MOSSE [63]   Mediocre (-)       Very fast (>100)

Table 3.1 – Trackers’ characteristics from the study [64] performed on the object tracking benchmark OTB2015 [65]. The numbers in parentheses are the results of this study, when available.

Among the various possible trackers, we choose to focus on three of them, whose different characteristics are shown in table 3.1. As the entire method relies on the tracking ability, we aim to assess the influence of the tracking algorithm used.

3.4 Graphical model

This part is probably the most important of our active learning process.

Using object tracking as described in the previous section, we design a graph that represents the temporal coherence, or energy flow, between the predictions.

3.4.1 Mapping

In this section, we explain how the links between the boxes of a sequence are created. The method is made of three parts. First, we want to establish the relationship between all detections (the boxes output by the network during inference). Then, we want to generate candidates, meaning the boxes that were likely missed during inference. Finally, we filter some boxes out.

To perform these steps, we use a tracker $T$ as introduced in section 3.3.

As we only search for possible links in the neighbouring frames of a box, we set $r$ as the tracking length. Thus, a box can only be linked to boxes in frames at most $r$ time steps away.
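The tracking window can be made concrete with a short sketch. It assumes 0-indexed frames and excludes the frame $i$ itself. The helper name `neighbour_frames` is hypothetical.

```python
def neighbour_frames(i, n_frames, r):
    """Indices j != i with |i - j| <= r inside a sequence of n_frames frames."""
    first = max(0, i - r)
    last = min(n_frames - 1, i + r)
    return [j for j in range(first, last + 1) if j != i]
```

For example, with `r = 2` in a sequence of 10 frames, frame 5 is compared with frames 3, 4, 6 and 7, while frame 0 only has frames 1 and 2 as neighbours.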

We also introduce the notation $R$ for the binary relation between the boxes of a sequence. Typically, $R(b_1, b_2)$ equals 1 when the boxes are related, and 0 otherwise.

Linking two detections

For a frame $I_i$, we initialize a new tracker on a detection $d_i^k$ contained in it. We can then track this box backward and forward in the neighbouring frames. When the output of the tracker overlaps greatly with a detection $d_j^{k'}$ in another frame, we associate the two boxes if they have the same label. We repeat this procedure for all detections in all frames of the sequence.
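The overlap test used for linking can be sketched as follows, assuming boxes are given as `(x1, y1, x2, y2)` tuples and using the IoU threshold of 0.5 from equation (3.1). The function names `iou` and `is_linked` are illustrative, not taken from the implementation of this work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_linked(tracked_box, tracked_label, det_box, det_label, threshold=0.5):
    """True if the tracked box and the detection share a label and overlap enough."""
    return tracked_label == det_label and iou(tracked_box, det_box) > threshold
```

Here `tracked_box` stands for the tracker output in frame $j$ and `det_box` for a detection of that frame. The label check reflects the condition stated in the text above.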

Formally, two boxes $d_i^k$ and $d_j^{k'}$, from frames $I_i$ and $I_j$ respectively, are linked to each other if they satisfy the following condition.

$$\forall i, j : |i - j| \le r, \quad R(d_i^k, d_j^{k'}) = 1 \iff \mathrm{IoU}\left(T^{i \to j}(d_i^k),\, d_j^{k'}\right) > 0.5 \tag{3.1}$$

Generating candidates

When performing the first step, we initialised the tracker on the detection $d_i^k$ and identified its links with existing detections. However, in some of the neighbouring frames, the output of the tracker might not overlap with any existing detection. If this is the case in frame $I_j$, we create a candidate $c_j^{i,k}$, which may later be the key to finding the missed detections. This candidate is a box whose coordinates are the output of the tracker initialised before: $T^{i \to j}(d_i^k)$. The candidate inherits its label from the originating detection (as it is supposed to represent the same object).

We can then complete the mapping by adding the relations between detections and candidates.

$$\forall i, j : |i - j| \le r, \quad R(d_i^k, c_j^{i,k}) = 1 \iff \forall k',\ \mathrm{IoU}\left(T^{i \to j}(d_i^k),\, d_j^{k'}\right) < 0.5 \tag{3.2}$$
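The two rules (3.1) and (3.2) can be combined in a minimal sketch for a single detection. It makes several assumptions: frames are 0-indexed, `detections` maps a frame index to a list of `(box, label)` pairs, and `track` is any callable standing in for $T^{i \to j}$. The confidence $q_T$ is ignored here, and all names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def map_detection(d_box, d_label, i, detections, track, r, n_frames, thr=0.5):
    """Links and candidates for one detection d_i^k.

    Returns (links, candidates), where links holds (j, k') index pairs of
    linked detections and candidates holds (j, box, label) triples.
    """
    links, candidates = [], []
    for j in range(max(0, i - r), min(n_frames, i + r + 1)):
        if j == i:
            continue
        tracked = track(d_box, i, j)  # T^{i->j}(d_i^k)
        frame_dets = detections.get(j, [])
        overlaps = [iou(tracked, box) for box, _ in frame_dets]
        # (3.1): link to overlapping detections that share the label.
        for k, (_, label) in enumerate(frame_dets):
            if overlaps[k] > thr and label == d_label:
                links.append((j, k))
        # (3.2): create a candidate only if no detection overlaps the tracked box.
        if all(o < thr for o in overlaps):
            candidates.append((j, tracked, d_label))
    return links, candidates
```

Following (3.2) literally, the candidate test considers every detection in frame $j$ regardless of its label, so a tracked box that overlaps a detection with a different label yields neither a link nor a candidate.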