
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2017

Visual Tracking with Deformable Continuous Convolution Operators


Joakim Johnander
LiTH-ISY-EX--17/5047--SE

Supervisor: Martin Danelljan, ISY, Linköping University
Examiner: Fahad Khan, ISY, Linköping University

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2017 Joakim Johnander


Abstract

Visual Object Tracking is the computer vision problem of estimating a target trajectory in a video given only its initial state. A visual tracker often acts as a component in the intelligent vision systems seen in, for instance, surveillance, autonomous vehicles or robots, and unmanned aerial vehicles. Applications may require robust tracking performance on difficult sequences depicting targets undergoing large changes in appearance, while satisfying a real-time constraint.

Discriminative correlation filters have shown promising tracking performance in recent years, and have consistently improved the state-of-the-art. With the advent of deep learning, new robust deep features have improved tracking performance considerably. However, methods based on discriminative correlation filters learn a rigid template describing the target appearance. This implies an assumption of target rigidity which is not fulfilled in practice. This thesis introduces an approach which integrates deformability into a state-of-the-art tracker. The approach is thoroughly tested on three challenging visual tracking benchmarks, achieving state-of-the-art performance.


Abstract (Swedish summary)

Visual tracking is a computer vision problem in which an object is to be followed in a video given only its initial position. A visual tracker often acts as a component in the intelligent vision systems found in, for instance, surveillance, autonomous vehicles or robots, and unmanned aerial vehicles. Applications often require robustness under difficult conditions where objects may undergo large changes in appearance; at the same time there is often a real-time requirement.

In recent years, discriminative correlation filters have shown promising results and have improved the state-of-the-art in visual tracking year after year. With the advent of deep learning, new features have been extracted and used, improving performance considerably. However, methods based on discriminative correlation filters learn a rigid template to describe the target appearance. To some extent it has been assumed that targets are rigid, which is not true in reality. This thesis introduces a method which integrates deformability into the learning framework of an existing state-of-the-art visual tracker. The proposed method is tested on three challenging datasets, where its performance is measured to be good.


Acknowledgments

This thesis has been written at the Computer Vision Laboratory at ISY, to which I am grateful for having me. I want to extend my deepest gratitude to Dr. Fahad Shahbaz Khan for providing me with an exciting thesis proposal and very kindly sharing his insights. I am equally grateful to Martin Danelljan, who provided ideas and assistance whenever needed, discussed the topic and related topics with me, and patiently answered my inquiries. I also want to thank Goutam Bhat for helping me set up the tracking framework and discussing various topics with me.

Linköping, May 2017 Joakim Johnander


Contents

Notation xiii
1 Introduction 1
1.1 Brief Background . . . 1
1.2 Problem Formulation . . . 3
1.3 Motivation . . . 4
1.3.1 Practical Applications . . . 4
1.3.2 Recent Research . . . 4
1.4 Thesis Outline . . . 5
2 Background 7
2.1 Visual Tracking Problem . . . 7

2.1.1 Datasets . . . 8

2.1.2 Attributes . . . 8

2.2 Features . . . 9

2.2.1 Color Names . . . 11

2.2.2 CNN Features . . . 11

2.3 Prior Work in Visual Tracking . . . 12

2.3.1 Current State-of-the-Art . . . 12

2.3.2 Relation to Video Segmentation . . . 14

2.3.3 Deep Learning for Visual Tracking . . . 14

2.3.4 Part-based Approaches . . . 14

2.4 Discriminative Correlation Filters . . . 15

2.4.1 Simple Correlation Filter . . . 15

2.4.2 Closed Form Solution . . . 17

2.4.3 Equivalence of Convolution- and Correlation Operators . . . 17

2.4.4 Multiple Features . . . 18
2.4.5 Detection . . . 19
2.4.6 Scale Estimation . . . 20
2.4.7 Windowing of Features . . . 21
2.4.8 Spatial Regularization . . . 21
2.5 Continuous Formulation . . . 23
2.5.1 Definitions . . . 23


2.5.2 Filter Application . . . 24

2.5.3 Objective Functional . . . 24

2.5.4 Filter Training . . . 25

2.5.5 The Label Function . . . 27

2.5.6 The Interpolation Function . . . 28

2.5.7 Projection Estimation for Features . . . 29

3 Method 33
3.1 Creating a Deformable Filter . . . 34

3.2 Objective Functional . . . 35

3.3 Optimization Scheme . . . 37

3.3.1 Fourier Domain Formulation . . . 38

3.3.2 Filter Training . . . 39

3.3.3 Subfilter Displacement Estimation . . . 41

3.4 Subfilter Initialization . . . 42

3.4.1 Practical Details . . . 42

3.4.2 Subfilter Selection . . . 42

4 Experiments and Results 45
4.1 Evaluation Method . . . 45

4.1.1 Performance Metrics . . . 45

4.1.2 Running the Trackers . . . 47

4.1.3 Datasets . . . 47

4.2 Baseline Experiments . . . 48

4.2.1 Features . . . 48

4.2.2 Subfilter Position Regularization . . . 49

4.3 Qualitative Assessment . . . 50
4.4 Quantitative Evaluation . . . 50
4.4.1 OTB-2015 . . . 51
4.4.2 TempleColor . . . 51
4.4.3 VOT2016 . . . 52
4.4.4 Attribute-based Comparison . . . 52
5 Discussion 53
5.1 Results . . . 53
5.2 Weaknesses . . . 54
5.2.1 Ghost Peaks . . . 54

5.2.2 Stuck in Local Minima . . . 54

5.2.3 Spatial Extent of the Spatial Regularization . . . 55

5.2.4 Increased Number of Parameters . . . 55

5.2.5 Instability . . . 55

5.3 Future Work . . . 56

5.3.1 Subfilter Displacement Estimation as Detection . . . 56

5.3.2 Optimization Process Analysis . . . 56

5.3.3 Reducing the Subfilter Size . . . 56



5.3.5 Dynamic Part Introduction and Removal . . . 57

5.3.6 Increasing Deformability . . . 57

6 Conclusion 59
A Derivations and Mathematical Details 63
A.1 Filter Coefficient Optimization From Sec. 2.4.2 . . . 63

A.2 Normal Equations . . . 64

A.3 Joint Normal Equations . . . 65

A.4 Fourier Transform of a Gaussian . . . 66

A.5 Fourier Coefficients of a Function Repetition . . . 68

A.6 Fourier Transform of the Interpolation Function . . . 68


Notation

SETS

Notation    Explanation
Z           The set of integers
R           The set of real numbers
C           The set of complex numbers
×           Cartesian product
K^N         The set K × K × · · · × K of any field K, equipped with the standard inner product
K_+         The set {x ∈ K : x > 0}
L²(R)       The set of functions which are Lebesgue integrable on R
L²(R^N)     The set of functions of N variables which are Lebesgue integrable on R^N
L²_T(R)     The set of periodic functions which are Lebesgue integrable on [0, T), where T ∈ R_+, equipped with the standard inner product
L²_T(R^N)   The set of periodic functions of N variables which are Lebesgue integrable on [0, T)^N
ℓ²          The space of square-summable sequences

LINEAR ALGEBRA

Notation    Explanation
u           Finite complex-valued vectors, u ∈ C^N, are denoted with boldface
U           Finite complex-valued matrices, U ∈ C^{N×M}, are denoted with capital letters
Uu          Matrix-vector product
UV          Matrix product
u^T         Transpose
u^H         Conjugate transpose


FUNCTIONS AND OPERATORS

Notation    Explanation
F{u}        The Fourier transform for u ∈ L²(R), or the operator granting the Fourier coefficients of u for u ∈ L²_T(R)
F⁻¹{u}      The inverse of F
DFT         The discrete Fourier transform
DFT⁻¹       The inverse discrete Fourier transform
û           The Fourier coefficients of u ∈ L²_T(R^N), the Fourier transform of u ∈ L², or the DFT of a discrete function
u · v       Element-wise multiplication
u ∗ v       Convolution
u ⋆ v       Cross-correlation
⟨u, v⟩      The standard inner product between vectors u and v
‖u‖         The standard norm induced by the inner product
‖u‖_F       The Frobenius norm, defined for any finite matrix u
ǔ           The mirroring of a function, ǔ(t) = u(−t) for t ∈ R
ū           The complex conjugation of a function or vector u

ABBREVIATIONS

Notation    Explanation
AUC         Area Under the Curve
CNN         Convolutional Neural Network
DCF         Discriminative Correlation Filter


1 Introduction

This thesis studies the computer vision problem of visual tracking. The aim is to estimate the trajectory of a target in a video. There are numerous applications, and the problem has received much attention. This thesis describes the problem and a state-of-the-art method to tackle it. The proposed method is based on discriminative correlation filters.

1.1 Brief Background

Computer vision is an interdisciplinary field of research treating problems where information is extracted from some type of visual input. Visual input includes, but is not limited to, RGB images, RGB video, thermal images, three-dimensional laser scanners, event-based cameras, computed tomography images, and grayscale images. The output is often high-level, semantic information such as the presence of an object category in an image, automatic image captioning, tumor development assessment, driving lane detection, or target trajectory estimation. Generic visual object tracking is a computer vision problem where one is given a video and the initial position of an object as a bounding box. The goal is to estimate the object trajectory as it moves throughout the video. In practice, tracking is often part of a larger computer vision system where the initial bounding boxes are found by a detector picked for the current application. There are several subproblems, including occlusions, illumination changes, scale changes, in- and out-of-plane rotations, non-rigid deformations, difficult background, and scene contamination. A tracker has to utilize the information given in the first frame to build a model which discriminates the target from the background. Usually the model is updated in subsequent frames under some kind of assumption that the target location is predicted correctly in the new frame. Figure 1.1 shows the result of the tracker, which is later proposed in this thesis, on two example videos.

Arguably the simplest solution is to save the pixel values of the patch in the target rectangle, and then find the target in the next frame by using some kind of similarity


measure, for instance the mean square error, to match the extracted patch with all other possible patches. In fact, a basic signal processing technique to locate a given signal part in an unknown signal is to use correlation. As the image is a two-dimensional signal a very simple tracker can be constructed as a correlation filter. In the first frame it is set to the patch to be matched, and applied to subsequent frames where the largest response is used as an estimate of the target location. These methods are very simple, and would fail to track the target through appearance changes induced by for instance rotations and deformations.

The matched correlation filter is a simple example of a model of the target appearance, often referred to as an appearance model. For most practical applications more sophisticated appearance models are required. One fairly simple example is to use a histogram of the target colors or some other information. Another example is to model the target appearance with probability distributions such as a Gaussian Mixture Model. In general, the tracking problem is often modelled using probabilistic methods [39][31][29] or as an optimization problem [5][22][13].

Visual tracking can be tackled with a simple model using good features. Such trackers take feature maps as input, rather than the raw grayscale or RGB images. In computer vision a feature is a vector calculated at a point in an image via the application of some mapping. Often the features are calculated using a local region around the point. A feature map consists of these features sampled on a grid and is therefore an image with a number of channels. An example of a simple feature is the image gradient at a certain point. Another example which is commonly used is the color names (CN) [45] descriptor, a vector of size 11 where each element is a probability that the pixel is of a certain color. The most well-known feature in computer vision is arguably SIFT [32], but this is rarely used in tracking. A feature reminiscent of SIFT is Histogram of Oriented Gradients (HOG) [9], which has seen widespread use in tracking algorithms as it is fast to compute and of small size.

In this thesis a discriminative correlation filter (DCF) based tracker is used as the baseline. This is motivated by the fact that DCF-based methods have achieved state-of-the-art performance across several tracking datasets [27][14][12][21][34][4]. A DCF is reminiscent of the previously described matched correlation filters. Target detection is performed in each frame via a two-dimensional correlation filter applied to the input image. The filter response is treated as a detection score, where the location of the maximum value is used as an estimate of the target position. In contrast to the matched filter, a DCF is trained on several previous frames before estimating the target position in the current frame. Training the filter refers to using previous images and target positions to find a filter, usually by minimization or maximization of some quantity. An example of this quantity is

$$\sum_{c=1}^{C} \| f \star x_c - y_c \|_2^2 \qquad (1.1)$$

where $f$ is the filter, $\star$ is cross-correlation, $\|\cdot\|_2$ is the Euclidean norm, $x_c$ is the feature map extracted from image frame $c$, and $y_c$ is a Gaussian with a sharp peak located at the estimated or given target position. The quantity is minimized over the filter coefficients $f$. By using Parseval's formula the problem is solved very efficiently in the Fourier domain.
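As an illustration of the underlying idea, the following is a minimal sketch (not the implementation used in this thesis) of locating a template by circular cross-correlation computed with the FFT. NumPy, grayscale patches of equal size, and the function name are assumptions for the example:

```python
import numpy as np

def matched_filter_track(template, frame_patch):
    """Locate a template in a same-sized search patch by circular cross-correlation.

    template:    N1 x N2 grayscale patch cut out around the target in the first frame
    frame_patch: N1 x N2 grayscale search region in a later frame
    Returns the (row, col) of the correlation peak and the full score map.
    """
    # Correlation theorem: DFT{f ? x} = conj(DFT{f}) * DFT{x}
    F = np.fft.fft2(template)
    X = np.fft.fft2(frame_patch)
    score = np.real(np.fft.ifft2(np.conj(F) * X))
    peak = np.unravel_index(np.argmax(score), score.shape)
    return peak, score
```

As noted above, such a matched filter fails as soon as the target appearance changes appreciably; the discriminative filters of Chapter 2 instead fit the filter to several training samples.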

Despite reaching state-of-the-art performance, the discriminative correlation filters contain an underlying assumption that targets are rigid and do not undergo rotations.


Figure 1.1: Three frames of three sequences from the OTB-2015 dataset. In the initial frame the bounding box is given, and a tracker attempts to estimate bounding boxes in the subsequent frames. These estimates are marked with green boxes.

This assumption is often violated, for instance when the video depicts a running human or whenever the video contains changes of perspective relative to the object. These violations pose no problem as long as large parts of the target appearance remain the same, or whenever powerful features yield sufficient invariance to these transformations.

1.2 Problem Formulation

This thesis studies filter deformability in the context of visual tracking. Deformability shall be introduced into a state-of-the-art DCF-based tracker. There are various considerations to be made here. While the proposed method must allow deformations in the appearance model, prior knowledge of plausible deformations must most likely be incorporated. Furthermore, the sophisticated online learning process of the baseline must be adapted to the introduced deformability. The proposed method shall be tested extensively to measure tracking performance. The method shall be tested using several commonly used datasets developed by the tracking community. Cases where tracking performance is increased or decreased should be identified and discussed. Formulated in terms of questions, we ask:


How can deformability be introduced into a state-of-the-art DCF-based tracker? What are the effects on tracking performance?

1.3 Motivation

To motivate this study, two questions are posed. First, is visual tracking something needed by society or industry; more bluntly, can it yield something commercially viable? Second, is the study of deformable discriminative correlation filters of interest to the research community?

1.3.1 Practical Applications

Generic visual object tracking is a problem with numerous applications. In recent years, large companies such as Google and Tesla have looked into autonomous driving. Swedish companies such as Volvo and Autoliv have also shown interest in the field. For autonomous driving, generic tracking proves useful by estimating trajectories of detected objects. It is also an area where being able to handle every scenario, including deformations and rotations, is imperative and can be the deciding factor between life and death.

Unmanned Aerial Vehicles (UAV), or drones, have lately received attention from companies and hobbyists alike. These often come equipped with a camera, allowing the UAVs to make intelligent decisions on their own based on visual input. Amongst the deluge of more or less viable applications there are some notable examples. Drones could be used to find and follow survivors of natural disasters. They could also be used to assist the police in finding and following suspects. Another use would be UAVs collecting footage for movies or sports: the UAV could track and follow an actor or actress in a movie scene, or the player running to score in football.

Intelligent surveillance systems can also benefit from visual tracking. Surveillance systems can refer both to systems used by companies, persons and the police to prevent or solve crime. It can also refer to systems used by the industry during production. In both cases it may be desirable to count and keep track of objects, which can be done by detecting and tracking them.

1.3.2 Recent Research

Recently a large number of papers have been published in the area of visual object tracking. The area has generated much attention and is well represented at notable computer vision conferences such as ICCV, ECCV and CVPR [27].

Several state-of-the-art trackers rely on DCF-based approaches, and these have shown consistent performance increases in several visual tracking benchmarks. Despite this, they often rely on an assumption that the target is rigid and unable to rotate. By using features containing powerful semantic information they are often, but not always, able to keep track of targets undergoing large changes in appearance. Common targets, such as humans performing various tasks, may prove difficult for these rigid DCF-based approaches. It is therefore desirable to introduce deformability into the method. Staple [4] proposes to use color histograms, which are fairly invariant to deformations, but not robust to changes in illumination. Several recent works aim at introducing part-based information into the DCF framework [31][29][33].


This often involves applying several filters and merging the information they provide via some additional components. These works have shown increases in performance, however. This thesis studies how to introduce deformability by rewriting the filter as a linear combination of subfilters, and merging the information in a single joint objective functional. The merit is that the tracker uses few components, solving a single optimization problem to track.

1.4 Thesis Outline

This thesis introduces the tracking problem in Sec. 2.1 and the challenges associated with it. Section 2.3 briefly reviews some current state-of-the-art tracking methods, some methods for the closely related problem of video object segmentation, and lastly some methods attempting to explicitly add deformability to DCF-based trackers. Discriminative correlation filters are introduced in Sec. 2.4, which begins with a simple formulation and continues with several improvements. Sec. 2.5 contains thorough theory of a learning framework utilized by a state-of-the-art tracker. In chapter 3 this framework is altered to introduce deformability to DCF-based filters. Various considerations and method choices are briefly discussed and motivated. Chapter 4 describes the experiments performed and lists results. The results and future work are discussed in chapter 5. The thesis is concluded with chapter 6.


2 Background

A discriminative correlation filter can be used to build a simple solution to the problem of visual tracking. This chapter describes the visual tracking problem and some important approaches to solve it. In Sec. 2.4 the theory of discriminative correlation filters is introduced. The section begins with a simple formulation, used for instance by Bolme et al. [5]. This is followed by descriptions of several successful improvements, which are needed to attain competitive performance.

With the advent of deep learning, powerful convolutional neural networks trained for image classification tasks could be used as a pre-processing step for visual tracking. These resulted in large improvements to tracking performance, but combining the information provided by the different parts of the classification network proved challenging. The learning framework introduced with C-COT [13] provided a method tackling this problem and attained top performance across several benchmarks. The method is based on discriminative correlation filters and utilizes the improvements described in Sec. 2.4. Section 2.5 provides a thorough description of this learning framework, including two improvements to C-COT introduced in [14].

2.1 Visual Tracking Problem

Generic visual object tracking is a fundamental computer vision problem which has received much attention by the industry and researchers alike. Given a video and the object state (x, y, height, width) in the first frame, a tracker should estimate the object state for all future frames. The problem is generic in the sense that no prior knowledge of the object to be tracked is available. This is in contrast to scenarios where the type of target is known beforehand.


2.1.1 Datasets

The applications of visual tracking are many and diverse, with each application having its own set of requirements. These can be real-time speed constraints, or performance requirements for a special class of videos. In practice, picking the best tracker is application specific and will imply some kind of trade-off.

However, to be able to test and compare generic visual object trackers properly and fairly, the community has developed datasets and evaluation metrics. Any publication within the field is expected to evaluate its method on these collective datasets. The method proposed in this thesis is evaluated on three such datasets: OTB-2015 [48] consisting of 100 videos, TempleColor [30] consisting of 128 videos, and VOT2016 [27] consisting of 60 videos. The datasets are diverse and challenging, containing a large set of scenes and targets. Some experiments of this thesis are performed using OTB-2015 or TempleColor. Evaluation of the final tracker is, however, done across all three datasets.

2.1.2 Attributes

To be able to analyze a tracker it is necessary to understand why a tracker may fail. The datasets are rich with instances where the target to be tracked undergoes large changes in appearance or when the scene contains difficult background. This section briefly describes several types of such tracking challenges. In the literature, these are often referred to as attributes.

Background Clutter In many cases the features of a target may look much like the background. Consider tracking a face when there is a crowd in the background. Small changes in target appearance may result in a part of the background looking more similar to the target than the target itself. If the tracker sees and is trained on a face for several frames, a simple out-of-plane rotation of the target may yield a situation where a face in the crowd looks more like the target than the target itself. Any situation where the background contains features similar to those of the target can result in a tracker failure.

Deformations Non-rigid deformations form a large set of transformations which a target can undergo. One of the most common examples is a target consisting of fairly rigid parts which move and rotate with respect to each other, such as a human. Humans are deformable, but under most circumstances they are piecewise rigid. Another example of piecewise rigidity is a flying bird, where each wing and the center part form three fairly rigid parts. An example where the target contains little to no rigidity is a swimming octopus.

Fast Motion Several videos depict targets which move very quickly with respect to their size. An oftentimes utilized prior in the visual tracking problem is that a target rarely moves very far between two subsequent frames. This prior information can stop a tracker from switching between two similar targets, which can otherwise be the case for instance in a video showing competing 100 m runners. However, utilizing this information in videos showing fast motion, such as several scenes of "The Matrix", may cause a tracker to lose its target.


Illumination Variations The tracked target may be subject to spatial and temporal illumination variations. The scene may contain different lighting conditions in different places and the target may suffer from events such as lightning flashes or moving lights. The impact of such effects can be attenuated by utilizing features invariant to changes in illumination, such as HOG [9].

In-plane Rotations In-plane rotations are rotations occurring in the 2D image plane. An example of an in-plane rotation is a motorcyclist performing a back flip, seen from the side. There is prior information available for targets undergoing such transformations, namely that the appearance remains the same, just rotated. This is however rarely utilized, and in-plane rotations prove challenging for many trackers. The DCF-based trackers contain an assumption that targets do not rotate, and other trackers such as LOT [39] view all changes in appearance as noise.

Low Resolution Another challenging attribute is low resolution. Low resolution reduces the available information about the target, which may reduce the ability to discriminate between two targets. It is however common, both due to cheap cameras and due to large distances to targets.

Motion Blur Motion blur occurs when a target moves quickly, drastically changing the target appearance into a smudged mess.

Occlusions Targets are often covered by something else. This can be seen in sequences showing a person walking behind a car, or a tiger moving behind some vegetation. This attribute is referred to as occlusion, and occlusions may be partial, in the sense that only a part of the target is occluded, or full.

Out-of-Plane Rotations Contrary to in-plane rotations, out-of-plane rotations are not in the image plane, that is, the rotation vector is not perpendicular to the image plane. This results in some parts of the target disappearing, and others appearing.

Out-of-View Videos may show a target moving beyond the image border, disappearing for a set of frames, displaying the out-of-view attribute. Many trackers become unstable when the target disappears and move around the video looking for the target, but are unable to locate the target when it reappears.

Scale Variations Often the target moves closer to or further away from the camera, changing in scale. This problem is commonly remedied by an attempt to estimate the scale and then rescaling the input image.

2.2 Features

For many tasks in computer vision special kinds of features are used. These are elements which are extracted from the image. A feature is a vector calculated at a point in the input image using the local region. A feature map consists of these features sampled in a grid. Feature extraction can therefore be seen as a map from the input-image space to $\mathbb{R}^{a \times b \times c}$, where $a$ and $b$ are the number of pixels in each row and column respectively, and $c$ is the number of feature dimensions. Hence each image is transformed into another image, possibly with another number of channels, and possibly with another resolution.

Figure 2.1: Five sequences where a state-of-the-art tracker attempts to estimate the target trajectory. The first column contains an early frame of each shown sequence. The first row (from above) shows four frames of a scene from the movie "The Matrix". It displays illumination variation and background clutter. The second row shows a "Transformer", an example of a deformable target. The third row consists of four frames depicting "Ironman", displaying the attributes fast motion and motion blur. The second and third frames are subsequent. The fourth row shows a low-resolution diver performing in-plane rotations. The fifth row shows a sequence containing scale variations, occlusions, and out-of-plane rotations.

There are various popular choices of features. Some examples are color names [46], Histogram of Oriented Gradients (HOG) [9], and features extracted from deep convolutional neural networks (CNN). In this thesis, color names and features extracted from a CNN are used.


2.2.1 Color Names

The idea of color names [46] is to yield a linguistic description of the color at a given position in the image. At each position, that is, at each point in the feature map there is an 11 element feature vector. Each element of the vector corresponds to a color and is the probability that this given point in the image would be referred to as that color. The small size of this feature and its discriminative power [11] makes it attractive for visual tracking.

2.2.2 CNN Features

During recent years convolutional neural networks (CNN) have shown remarkable results in several computer vision research areas. A CNN is a sequence of operations where the type of operation is specified in advance, but its parameters are trained on large datasets. The sequence of operation types is usually referred to as the network architecture. The architecture is divided into layers where each layer performs one operation type. The most common layers follow:

Convolution Layer - Contains several convolution filters. Each filter is applied to the input, one image per feature channel, and results in one image per convolution filter. The size of the convolution filters can vary between layers, but is usually fairly small, such as 3 × 3.

Max Pooling Layer - Usually yields a downsampled output. For each pixel in the downsampled image, look at a neighbourhood centered at the corresponding point in the input image. The value of this pixel is picked as the maximum in the corresponding neighbourhood. The size of the neighbourhood is typically 2 × 2, and the downsampling factor is typically 2. The operation is illustrated in Fig. 2.2.

Flatten Layer - Reshapes the input into a vector, preserving the elements. The input is usually a set of images, viewed as a third order tensor, which has been passed through a sequence of convolution and max pooling layers.

Fully Connected Layer - Can be viewed as a general linear mapping given by a matrix.

Non-linear Activation Function - Each convolutional and fully connected layer is followed by a non-linear activation function such as the commonly used rectified linear unit (ReLU). ReLU is defined as $f(x) = \max(0, x)$. The activation function is applied element-wise to the input. For some applications, such as image classification, the desired output is a set of probabilities. Each element in the output would correspond to a class and would be viewed as the probability that the classified image is of that class. This is obtained with the softmax function, defined as $f(x_j) = e^{x_j} / \sum_k e^{x_k}$.

Each convolution layer contains a set of convolution filters. Each filter coefficient is a parameter in the network. In a similar fashion, each fully connected layer contains a matrix, which represents the linear mapping. Each element of this matrix is a parameter in the network. These parameters are learnt on a dataset via some optimization scheme.


Figure 2.2: The max pooling operation applied to a 4 × 4 input image (left), using a 2 × 2 neighbourhood and a downsampling factor of 2, yielding the 2 × 2 output (right).

As an example, in image classification a batch of several images is fed into the network, each labeled with a correct class. By calculating the gradient of the class prediction error with respect to the parameters, the parameters can be updated by for instance gradient descent. In practice a modified gradient descent is usually used. Figure 2.3 shows the VGG16 [43] architecture which is used for image classification. It is trained on ImageNet [15] which contains over a million images.

For visual tracking, the input image can be fed into a convolutional neural network already trained for some other computer vision task. The outputs after different convolutional layers are images with a varying number of feature channels. Usually the resolution decreases while the number of feature channels increases as the image is propagated deeper through the network. Recent state-of-the-art trackers have shown that using the output of a convolutional layer as a feature map yields great gains in performance [35][26].
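As a concrete illustration, a shallow convolutional feature map can be extracted from a pretrained classification network. The following is a minimal sketch using PyTorch and recent torchvision (not the feature pipeline used in this thesis); the chosen layer index, the weights enum, and the normalization requirement are assumptions:

```python
import torch
import torchvision

# ImageNet-pretrained VGG16; keep only the first convolutional block
# (conv-relu-conv-relu-maxpool in torchvision's layer layout).
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
shallow = vgg.features[:5]
shallow.eval()

def extract_features(image_bchw: torch.Tensor) -> torch.Tensor:
    """image_bchw: (1, 3, H, W) float tensor, ImageNet-normalized.
    Returns a (1, 64, H/2, W/2) feature map, usable as the channels x_d of a sample."""
    with torch.no_grad():
        return shallow(image_bchw)
```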

2.3 Prior Work in Visual Tracking

Visual object tracking receives much attention from the research community, and the approaches used in the state-of-the-art methods vary greatly. The research regarding the closely related problem of video segmentation is important to consider, as any solution to that problem implies a solution to visual object tracking.

2.3.1 Current State-of-the-Art

Currently, the top performing trackers use a wide and varied set of techniques. The MLDF and SSAT [36] trackers train a CNN online to predict the target position in each frame, using features from another network trained for image classification. TCNN [38] locates the target by averaging predictions made by multiple CNNs. Staple [4] merges information provided by a discriminative correlation filter and a color histogram. C-COT [13] is a continuous correlation filter using features extracted from a CNN trained for image classification, adding a spatial regularization to the filter learning. These five trackers were the top performers in the recent Visual Object Tracking (VOT) challenge 2016 [27].


Figure 2.3: An example CNN architecture. This is the VGG16 image classification network. To the left are the performed operation types. To the right are intermediate data sizes. A size 224 × 224 RGB image is used as input. As RGB images have 3 channels it is viewed as an array of size 224 × 224 × 3. Output is a vector of length 1000.


In this thesis the latter will be used as a foundation, adding deformability to the correlation filter.

2.3.2 Relation to Video Segmentation

Furthermore, a natural extension to the problem of estimating bounding boxes is to estimate a full object segmentation in each frame. With a segmentation a bounding box is easily estimated, and the segmentation in itself can be useful for some applications. There are several techniques for video segmentation. NLC [16], JOTS [47] and Object Flow [44] are graph-based techniques. FusionSeg [23] consists of two CNNs, one receiving an RGB image as input, and the other taking a precalculated optical flow as input. MaskTrack [24] trains a CNN which predicts an object mask in each frame. Their CNN receives both an RGB image and a segmentation mask from the previous frame.

These methods are however usually very slow, often requiring several seconds or minutes per frame. In visual tracking, real-time efficiency is not always attained, but often desired. Methods that are too slow are deemed infeasible. Furthermore, the datasets used for evaluation in this field are different from the challenging datasets seen in visual tracking. Commonly used are SegTrack-v2 [28], DAVIS [40] and YoutubeObjects [6][41]. The former two are fairly small datasets, both in terms of number of sequences and in terms of sequence length. The latter is large, but comprises only ten classes. These datasets also often contain situations where the target is larger than the frame, which is generally not the case in visual tracking. The scope of this thesis is limited and little attention is given to video segmentation. Instead we focus only on visual tracking, which is closely related.

2.3.3 Deep Learning for Visual Tracking

Recently many computer vision problems have seen a surge of approaches based on deep learning. Recent advances in the field and the massive processing power of graphics cards have led to the possibility of training complex models with big data. Image classification tasks can be solved by a Convolutional Neural Network (CNN) trained on millions of images. The research community has shown much interest in whether neural networks can outperform the current state-of-the-art trackers, and how this could be done. The current state-of-the-art trains a CNN online to classify the target as foreground and anything else as background. Problems such as occlusions and background clutter will however require some additional component to be incorporated into such a tracker. To solve this, a time component would need to be incorporated into the model. One intuitive idea would be to investigate Recurrent Neural Networks (RNN). A common question is whether the tracking problem can be solved end-to-end by training a neural network.

2.3.4 Part-based Approaches

A recurrent idea in visual tracking is to divide the target into parts and have several small trackers track each part. Tracking a smaller part of the image deteriorates performance however, as there are fewer good features to utilize. This can easily be seen by considering for instance the problem of tracking a leg and the problem of tracking an entire body in a video showing a crowd. Additionally, partial occlusions may become full occlusions, the parts may lack discriminative features to track, and the parts may move very quickly relative to their size.


To reduce the effects of these problems, information from the different parts needs to be shared, for instance by constraining their movements relative to each other with a system of springs.

The work of [31] utilizes several kernelized correlation filters (KCF) [22], each tracking a part of the target. The filters are combined in a probabilistic framework by considering the part filter locations as an observation of the target state. By weighting the part filter responses depending on the peak-to-sidelobe ratio (PSR) and disabling part filter updates whenever the PSR is too low, they gain robustness against partial occlusions. Li et al. utilize several KCFs, where a particle filter is used to merge information. The deformable parts tracker (DPT) [33] utilizes several KCFs constrained by a system of springs. The predicted part positions are merged with the spring model, combining target appearance with a geometric regularization. These methods all build on the KCF, which is a highly computationally efficient filter with good tracking performance. Additional components are added to merge the responses provided by the filters.

2.4 Discriminative Correlation Filters

Trackers based on discriminative correlation filters have recently achieved great success in several object tracking benchmarks [27]. These trackers rely on learning a set of 2-dimensional correlation filters in the first frame. The tracker will then predict a bounding box in subsequent frames and usually also updates the filters as new boxes are predicted.

2.4.1 Simple Correlation Filter

Arguably, the simplest correlation filter is a single matrix [5], denoted $f[k_1, k_2]$. Each video frame a new sample $x$ is received. In this case, the sample corresponds to a two-dimensional array corresponding to a part of a grayscale image. Denote the sample size as $N_1 \times N_2$. $f$ is of the same type and size, that is, an $N_1 \times N_2$ array. The filter $f$ is trained on the hitherto received samples. Denote this set of samples $\{x^1, x^2, \dots, x^C\}$. Here, we note that training typically occurs every $m$ frames during the whole sequence. The set of samples can be all hitherto received samples, or an intelligently selected subset.

Define a detection score operator

$$S_f\{x\}[k_1, k_2] = (f \star x)[k_1, k_2], \qquad (2.1)$$

where $\star$ denotes circular cross-correlation. We desire the score operator to produce a sharp peak at the target location in the frame corresponding to the sample $x$. An example of this can be seen in Fig. 2.4. For efficiency the score is calculated in the Fourier domain, using the Discrete Fourier Transform (DFT),

$$S_f\{x\}[k_1, k_2] = \mathrm{DFT}^{-1}\{\overline{\mathrm{DFT}\{f\}} \cdot \mathrm{DFT}\{x\}\}[k_1, k_2], \qquad (2.2)$$

where $\overline{\,\cdot\,}$ denotes the complex conjugate operation. The detection score is used to estimate the target position by locating its maximum. As the parameter space is very small, a simple grid search can do this. Via interpolation, sub-pixel accuracy can be achieved; that is, rather than finding the maximum of $S_f\{x\}[k_1, k_2]$ one finds the maximum of an interpolated score.


Figure 2.4: The score operator $S_f$ applied to the sample, overlaid on the input image (left). The label function $y$ overlaid on the input image (right). $S_f$ was trained to track the diver on the initial frame.

By performing all calculations in the Fourier domain, an implicit interpolation is gained.

Hitherto a filter $f$ has been used to detect the target position. Before doing so, $f$ must be trained. This can be done by training the score operator applied to sample $x^c$ corresponding to frame $c$, $S_f\{x^c\}$, to yield a Gaussian function with a peak centered at the target location. That is, define the Gaussian label

$$y[k_1, k_2] = \frac{1}{\sqrt{2\sigma^2}}\, e^{-\frac{k_1^2 + k_2^2}{2\sigma^2}} \qquad (2.3)$$

and shifted versions

$$y^c[k_1, k_2] = \tau_{p^c}\, y[k_1, k_2] = y[k_1 - p_1^c,\, k_2 - p_2^c], \qquad (2.4)$$

where $\tau_{p^c}$ is a translation operator which centers the peak at the target location $p^c = (p_1^c, p_2^c)$ in frame $c$. To train the tracker we find the filter $f$ which minimizes the objective functional

$$\epsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| S_f\{x^c\} - y^c \right\|_2^2 = \sum_{c=1}^{C} \alpha^c \left\| (f \star x^c) - y^c \right\|_2^2 \qquad (2.5)$$

where $\alpha^c$ is a weight applied to each frame, and $\|\cdot\|_2$ is the Euclidean norm. For computational efficiency we apply Parseval's theorem to get

$$\epsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| \overline{\hat f} \cdot \hat x^c - \hat y^c \right\|_2^2, \qquad (2.6)$$

which is a linear least squares problem with a closed form solution. Here $\hat{\cdot}$ denotes the Discrete Fourier Transform (DFT) of a finite discrete signal.
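As an illustration, the shifted Gaussian label of (2.3)-(2.4) can be generated on the sample grid. The following is a minimal NumPy sketch (the wrapped displacement, the default sigma, and the function name are assumptions, not taken from the thesis):

```python
import numpy as np

def gaussian_label(shape, peak_pos, sigma=2.0):
    """Shifted Gaussian label y^c on an N1 x N2 grid (cf. (2.3)-(2.4)).

    shape:    (N1, N2) size of the sample
    peak_pos: (p1, p2) target position in frame c
    """
    n1, n2 = shape
    # Signed, wrapped distances to the peak so the label is periodic, as the DFT assumes.
    k1 = np.arange(n1).reshape(-1, 1) - peak_pos[0]
    k2 = np.arange(n2).reshape(1, -1) - peak_pos[1]
    k1 = (k1 + n1 // 2) % n1 - n1 // 2
    k2 = (k2 + n2 // 2) % n2 - n2 // 2
    return np.exp(-(k1 ** 2 + k2 ** 2) / (2.0 * sigma ** 2)) / np.sqrt(2.0 * sigma ** 2)
```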

What has been described is a simple correlation filter which can be used as a visual tracker. There are two parts to this. When the initial sample is received, the target location $(p_1^1, p_2^1)$ is known. The filter coefficients $f$ are then calculated by minimizing (2.6). In the subsequent frame, a new sample is received. The estimated filter is then used to estimate $(p_1^2, p_2^2)$, followed by an estimation of $f$. This is repeated and yields a visual tracker which may work well for simple cases. However, most real sequences display one or several of the attributes described in Sec. 2.1.2, and this will cause the tracker to fail.

2.4.2 Closed Form Solution

The objective (2.6) is minimized using linear least squares. A similar derivation is performed by Bolme et al. [5]. To find a stationary point of the objective, the derivative is taken with respect to each of the filter coefficients and set to zero. Note that the objective is real, and the derivative is taken with respect to both the real part and the imaginary part. Explicitly writing out the norm and changing the order of summation in (2.6) yields

$$\epsilon(f) = \sum_{k_1,k_2} \sum_{c=1}^{C} \alpha^c \left| \overline{\hat f[k_1,k_2]}\, \hat x^c[k_1,k_2] - \hat y^c[k_1,k_2] \right|^2 \qquad (2.7)$$

For a given $k_1, k_2$ the term

$$\sum_{c=1}^{C} \alpha^c \left| \overline{\hat f[k_1,k_2]}\, \hat x^c[k_1,k_2] - \hat y^c[k_1,k_2] \right|^2 \qquad (2.8)$$

depends only on $\hat f[k_1,k_2]$ and no other filter coefficient. As shown in A.1, it is minimized if and only if $\hat f[k_1,k_2]$ satisfies

$$0 = \sum_{c=1}^{C} \alpha^c \left( |\hat x^c[k_1,k_2]|^2\, \hat f[k_1,k_2] - \hat x^c[k_1,k_2]\, \overline{\hat y^c[k_1,k_2]} \right). \qquad (2.9)$$

Utilizing this, each optimal filter coefficient is found as

$$\hat f[k_1,k_2] = \frac{\sum_{c=1}^{C} \alpha^c\, \hat x^c[k_1,k_2]\, \overline{\hat y^c[k_1,k_2]}}{\sum_{c=1}^{C} \alpha^c\, \hat x^c[k_1,k_2]\, \overline{\hat x^c[k_1,k_2]}}. \qquad (2.10)$$

This closed form solution is available for this simple case. More intricate approaches to the visual tracking problem will require iterative optimization methods.
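The closed-form update (2.10) is cheap to compute per Fourier coefficient. The following NumPy sketch implements it for a single feature channel; the small constant added to the denominator is an assumption for numerical stability and is not part of (2.10), and the conjugate placement follows the reconstruction above:

```python
import numpy as np

def train_filter_closed_form(samples, labels, weights, eps=1e-8):
    """Closed-form filter of (2.10) for one feature channel, in the Fourier domain.

    samples: list of N1 x N2 arrays x^c (training patches)
    labels:  list of N1 x N2 arrays y^c (shifted Gaussian labels)
    weights: list of frame weights alpha^c
    """
    num, den = 0.0, 0.0
    for x, y, a in zip(samples, labels, weights):
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        num = num + a * X * np.conj(Y)
        den = den + a * X * np.conj(X)
    return num / (den + eps)  # \hat{f}[k1, k2]
```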

2.4.3 Equivalence of Convolution- and Correlation Operators

For mathematical convenience it is useful to use convolution operators rather than correlation operators. We shall show that the problem solved and the result are equivalent when exchanging the correlation operators for the more convenient convolution operators. Consider $u, v \in L^2_T(\mathbb{R})$. Since

$$(u \star_T v)(t) = \frac{1}{T}\int_0^T \overline{u(\tau)}\, v(t+\tau)\, d\tau \qquad (2.11)$$

and

$$(u \ast_T v)(t) = \frac{1}{T}\int_0^T u(\tau)\, v(t-\tau)\, d\tau \qquad (2.12)$$

we have

$$(u \star_T v)(t) = \frac{1}{T}\int_0^T \overline{\check u(-\tau)}\, v(t+\tau)\, d\tau = \frac{1}{T}\int_0^T \overline{\check u(\tau)}\, v(t-\tau)\, d\tau = (\overline{\check u} \ast_T v)(t) \qquad (2.13)$$

where $\check{\cdot}$ denotes the mirroring operation. This means that a convolution filter is just the mirrored (and conjugated) version of an equivalent correlation filter. Instead consider $u, v : \mathbb{Z} \to \mathbb{C}$ such that $(u \ast v)[k]$ is defined, and where $v$ is periodic with period $K$. This yields

$$(u \star v)[k] = \sum_{l=0}^{K-1} \overline{\check u[-l]}\, v[k+l] = \sum_{l=-(K-1)}^{0} \overline{\check u[l]}\, v[k-l] = \sum_{l=0}^{K-1} \overline{\check u[l]}\, v[k-l] = (\overline{\check u} \ast v)[k]. \qquad (2.14)$$

Again, this means that by mirroring (and conjugating) the first element in the convolution, correlation is achieved. This shall be used to instead train a convolution filter. Note that this also applies in two dimensions, that is, when $u, v \in L^2(\mathbb{R}^2)$ or $u, v : \mathbb{Z}^2 \to \mathbb{C}$ such that $(u \ast v)[k_1, k_2]$ is defined, and where $v$ is periodic with periods $K_1, K_2$.

2.4.4 Multiple Features

The convolution filter can be improved by allowing multiple feature channels. Consider the sample $x$ to be a tensor of order 3, of size $N_1 \times N_2 \times D$. We denote $x = (x_1, x_2, \dots, x_D)$. This sample can for instance be a part of an RGB image with $D = 3$. In practice this will correspond to some other feature map extracted from the image, such as HOG, color names, or features extracted by propagating the image through the initial layers of a convolutional neural network. We generalize the detection score operator to the multiple feature case as

$$S_f\{x\}[k_1, k_2] = \sum_{d=1}^{D} (f_d \ast x_d)[k_1, k_2]. \qquad (2.15)$$

The individual feature responses in each feature dimension are illustrated in Fig. 2.5. The filter optimization is again performed in the Fourier domain for reasons of efficiency. The DFT of the detection score is found as

$$\widehat{S_f\{x\}}[k_1, k_2] = \sum_{d=1}^{D} \hat f_d[k_1, k_2]\, \hat x_d[k_1, k_2] \qquad (2.16)$$


Figure 2.5: The detection score $S_f$ is found by summing the filter responses $(f_d \ast x_d)$ of each dimension $d$. Here the filter responses in four dimensions are shown, overlaid on the input image. In this case, the feature maps $x_d$ are extracted from the first convolutional layer of a classification CNN.

yielding the objective

$$\epsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| \widehat{S_f\{x^c\}} - \hat y^c \right\|_2^2 = \sum_{c=1}^{C} \alpha^c \left\| \left( \sum_{d=1}^{D} \hat f_d\, \hat x^c_d \right) - \hat y^c \right\|_2^2. \qquad (2.17)$$

The current state-of-the-art visual trackers show that it is necessary to use some kind of multi-dimensional features which are robust to changes in appearance. While this improves tracking performance, the training process becomes more complex and there is no closed-form solution corresponding to (2.10) available. Instead some other optimization method is utilized. In the learning framework proposed with C-COT the method of least squares is utilized together with a linear system of equations solver.
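For illustration, the multi-channel detection score of (2.15)/(2.16) can be evaluated entirely in the Fourier domain. The following is a minimal NumPy sketch, assuming the filter DFTs and the feature channels share the same spatial size; the function name is an assumption:

```python
import numpy as np

def detection_score(f_hats, x_feats):
    """Multi-channel detection score: sum of per-channel convolutions (cf. (2.15)-(2.16)).

    f_hats:  list of D filter DFTs, each N1 x N2
    x_feats: list of D feature channels x_d, each N1 x N2
    """
    score_hat = np.zeros_like(f_hats[0])
    for f_hat, x in zip(f_hats, x_feats):
        score_hat += f_hat * np.fft.fft2(x)  # convolution becomes a product in the DFT domain
    return np.real(np.fft.ifft2(score_hat))
```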

2.4.5 Detection

When receiving a new frame, indexed $c$, the trained filter should be applied to that frame to predict the target position. This is followed by a filter update, which requires labels for all hitherto received frames. The new label $\hat y^c$ is created by centering a Gaussian function at the estimated target position.


Utilizing the detection scores, we find the target position as

$$\arg\max_{k_1, k_2}\; S_f\{x^c\}[k_1, k_2] \qquad (2.18)$$

by performing a grid search over $k_1, k_2$. The solution is then refined with sub-pixel accuracy by applying a few steps of Newton's method. To do this, define a continuous score operator by performing the calculations in the Fourier domain, yielding an implicit interpolation,

$$S_f(t_1, t_2) = \frac{1}{N_1 N_2} \sum_{k_1,k_2} \widehat{S_f\{x\}}[k_1, k_2]\, e^{i2\pi\left(\frac{k_1}{N_1}t_1 + \frac{k_2}{N_2}t_2\right)}. \qquad (2.19)$$

We find the gradient

$$\nabla S_f(t_1, t_2) = \frac{i2\pi}{N_1 N_2} \sum_{k_1,k_2} \widehat{S_f\{x\}}[k_1, k_2]\, e^{i2\pi\left(\frac{k_1}{N_1}t_1 + \frac{k_2}{N_2}t_2\right)} \begin{pmatrix} k_1/N_1 \\ k_2/N_2 \end{pmatrix}, \qquad (2.20)$$

the Hessian

$$\nabla^2 S_f(t_1, t_2) = \frac{(i2\pi)^2}{N_1 N_2} \sum_{k_1,k_2} \widehat{S_f\{x\}}[k_1, k_2]\, e^{i2\pi\left(\frac{k_1}{N_1}t_1 + \frac{k_2}{N_2}t_2\right)} \begin{pmatrix} \frac{k_1^2}{N_1^2} & \frac{k_1 k_2}{N_1 N_2} \\ \frac{k_1 k_2}{N_1 N_2} & \frac{k_2^2}{N_2^2} \end{pmatrix}, \qquad (2.21)$$

and can now use Newton's method to refine the coarsely estimated optimum found with the grid search.
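A minimal NumPy sketch of this refinement, assuming the DFT of the detection score is available; the fixed iteration count is an assumption rather than a convergence criterion:

```python
import numpy as np

def refine_peak(score_hat, t0, iters=5):
    """Sub-pixel refinement of a score maximum with Newton's method (cf. (2.19)-(2.21)).

    score_hat: N1 x N2 DFT of the detection score
    t0:        (t1, t2) coarse peak position from the grid search
    """
    n1, n2 = score_hat.shape
    k1 = np.fft.fftfreq(n1).reshape(-1, 1)  # k1 / N1, with correct negative frequencies
    k2 = np.fft.fftfreq(n2).reshape(1, -1)  # k2 / N2
    t = np.array(t0, dtype=float)
    for _ in range(iters):
        e = np.exp(2j * np.pi * (k1 * t[0] + k2 * t[1])) * score_hat / (n1 * n2)
        grad = (2j * np.pi * np.array([np.sum(e * k1), np.sum(e * k2)])).real
        hess = ((2j * np.pi) ** 2 * np.array(
            [[np.sum(e * k1 * k1), np.sum(e * k1 * k2)],
             [np.sum(e * k1 * k2), np.sum(e * k2 * k2)]])).real
        t = t - np.linalg.solve(hess, grad)  # Newton step toward the stationary point
    return t
```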

2.4.6 Scale Estimation

The tracked target may move towards or away from the camera, resulting in a change of scale. In such a scenario, the simple DCF may lose track of its target, as its rigid template no longer matches. Even if it would successfully track its target, for instance due to robustness provided by the features, the resulting bounding box will be of the wrong size. Both of these problems can be solved by estimating the target scale and then rescaling the given sample [10].

During the detection stage where the target location is estimated, we also estimate the target scale. This is done by resampling the given image in different scales. Feature maps are extracted, yielding a sample for each scale. The score operator Sf is then applied

to each sample. For each corresponding detection score, the target location is estimated as described in Sec. 2.4.5. The sample with the highest detection score is utilized. The corresponding scale is used as an estimate of the target scale. The following filter optimizations will use this sample. By doing this, the target will remain of the same size in all samples. The resampled images and their corresponding detection scores are seen in Fig. 2.6.
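A minimal sketch of this scale search for a single-channel filter, assuming OpenCV is available for resampling; the scale set, the patch extraction and the boundary handling are simplifications, not the procedure used in the thesis:

```python
import numpy as np
import cv2  # assumed available; any resampling routine works

def estimate_scale(f_hat, frame, center, base_size, scales=(0.98, 1.0, 1.02)):
    """Pick the scale whose resampled patch gives the highest correlation response.

    f_hat:     DFT of a single-channel filter of size base_size
    frame:     grayscale image
    center:    (row, col) current target center
    base_size: (N1, N2) patch size the filter was trained on
    """
    best = (None, -np.inf, None)
    for s in scales:
        h, w = int(base_size[0] * s), int(base_size[1] * s)
        r0, c0 = int(center[0] - h // 2), int(center[1] - w // 2)
        patch = frame[max(r0, 0):r0 + h, max(c0, 0):c0 + w]
        patch = cv2.resize(patch, (base_size[1], base_size[0]))  # back to the filter size
        score = np.real(np.fft.ifft2(np.conj(f_hat) * np.fft.fft2(patch)))
        if score.max() > best[1]:
            best = (s, score.max(), score)
    return best  # (scale, peak value, score map)
```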


Figure 2.6: The upper row shows the input image sampled at three different scales. The lower row shows the detection scores $S_f\{x\}$ for each scale. The rightmost image yields the maximum score, and its scale is used as the estimate.

2.4.7 Windowing of Features

There is prior information about the target movement. Positioning systems often model this prior by assuming a motion model and then fusing this information with the measurements, which in our case would be the estimated target positions. When working with generic tracking, the motion model may however be unreliable for two reasons: we are tracking generic targets which may move very differently, and the camera itself may also move.

Another issue is border effects. Since computations are performed in the Fourier domain, there is an implicit periodic extension of the video frame. This means that close to the frame borders the scores will display artifacts due to this periodic assumption. In signal processing this effect is commonly addressed by applying a window to the signal, which removes energy from samples near the borders. Primarily, this removes border effects. It also results in a prior that the target will not move too far between two consecutive frames.
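As an illustration, a window can be applied to each channel of the extracted feature map. A minimal NumPy sketch; the choice of a Hann window is an assumption, any smooth window serves the same purpose:

```python
import numpy as np

def window_features(feature_map):
    """Apply a Hann window to every channel of an N1 x N2 x D feature map.

    This suppresses energy near the patch borders, reducing periodic boundary
    artifacts and implicitly encoding the prior that the target stays near the center.
    """
    n1, n2 = feature_map.shape[:2]
    win = np.outer(np.hanning(n1), np.hanning(n2))
    return feature_map * win[:, :, None]
```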

2.4.8 Spatial Regularization

Furthermore, the tracker may learn parts of the background as parts of the target. In the case of discriminative correlation filters, high weights may be assigned to filter coefficients corresponding to background. To alleviate this, the DCF-based tracker SRDCF [12] introduced a spatial regularization. In their learning formulation, a cost was added to each coefficient which increased with the distance from the filter center.


Figure 2.7: To the left is an input image. In the middle is one dimension of a feature map extracted from a convolutional neural network. To the right is the windowed feature map.

Figure 2.8: An example of the filter in one dimension, $f_d$, overlaid on the input image. Here the filter was trained on the region containing the coke.

Another perk of this is that boundary effects may be reduced. Alter (2.17) by adding a regularization term

$$\epsilon_s = \sum_{d=1}^{D} \| w \cdot f_d \|^2 \qquad (2.22)$$

to the objective, which then becomes

$$\epsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| \widehat{S_f\{x^c\}} - \hat y^c \right\|_2^2 + \sum_{d=1}^{D} \left\| \hat w \ast \hat f_d \right\|_2^2. \qquad (2.23)$$

Here $w$ is a window with high values close to the borders, heavily regularizing filter coefficients placed there. Note how $\epsilon_s$ becomes a convolution when expressed in the Fourier domain. It is desirable to only represent the filter coefficients in the Fourier domain and not be forced to switch domains. Hence, for computational efficiency it is important to be able to represent the spatial regularization window with few coefficients.
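For illustration, one common choice (used, for instance, in SRDCF) is a quadratic regularization weight that is small over the expected target area and grows towards the borders. A minimal NumPy sketch with assumed parameter names and values:

```python
import numpy as np

def spatial_reg_window(shape, target_size, inner=1e-3, outer=1e2):
    """A simple quadratic regularization weight w (cf. (2.22)-(2.23)): small over the
    target area, growing towards the borders of the filter support.

    shape:       (N1, N2) filter support
    target_size: (h, w) expected target extent on the same grid
    """
    n1, n2 = shape
    r = (np.arange(n1) - n1 / 2).reshape(-1, 1) / (target_size[0] / 2)
    c = (np.arange(n2) - n2 / 2).reshape(1, -1) / (target_size[1] / 2)
    return inner + outer * (r ** 2 + c ** 2)
```

Such a smooth, low-order weight is one way to keep the regularization window representable with few Fourier coefficients, as the text above requires.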


2.5 Continuous Formulation

A key strategy developed by Danelljan et al. [13] is to define the filter as a set of continuous periodic functions, one for each feature dimension. These are applied to the samples by defining an interpolation function which transforms them to the continuous domain. Previously, a major challenge was how to merge information obtained from feature maps of different resolutions. Powerful discriminative features are often of different scales, as is the case for features extracted from different layers of a convolutional neural network. The shallow layers contain high-resolution, low-level information such as edges, and the deeper layers contain low-resolution, high-level information. The shallow layers have been shown to contain information which is very suitable for tracking, while features extracted from the deeper layers can be very robust to large changes in target appearance.

2.5.1 Definitions

Several new definitions, as well as some redefinitions, are needed:

Description                    Notation
Continuous variables           t = (t1, t2) ∈ R²
Discrete variables             n = (n1, n2) ∈ Z²
Fourier domain variables       k = (k1, k2) ∈ Z²
Number of discrete variables   N^d = (N1^d, N2^d) ∈ N²
Fourier coefficient limit      K^d = (K1^d, K2^d) ∈ Z²
Support region                 T = (T1, T2) ∈ R²
Sample index                   c = 1, . . . , C
Feature dimension index        d = 1, . . . , D
Filter                         f = (f1, f2, . . . , fD) ∈ L²_T(R²)^D
Sample space                   X = (X1, . . . , XD) = (R^{N1¹N2¹}, . . . , R^{N1^D N2^D})
Sample                         x^c_d ∈ X_d
Spatial regularization         w ∈ L²_T(R²)
Interpolation function         b_d ∈ L²_T(R²)
Interpolation operator         J_d : X_d → L²_T(R²)
Score operator                 S_f : X → L²_T(R²)
Label                          y^c ∈ L²_T(R²)
Sample weight                  α^c ∈ [0, 1)
Filter position                q^c ∈ R²


2.5.2 Filter Application

To be able to apply the continuous filter to the discrete samples, an interpolation operator J is applied to the samples

$$J_d\{x_d\}(t_1, t_2) = \sum_{n_1=0}^{N_1^d-1} \sum_{n_2=0}^{N_2^d-1} x_d[n_1, n_2]\, b_d\!\left(t_1 - \frac{T_1}{N_1^d} n_1,\; t_2 - \frac{T_2}{N_2^d} n_2\right). \qquad (2.24)$$

Here $x^c_d$ is the sample taken at time instance $c$, along feature dimension $d$. $b_d$ is the interpolation function. $N^d = (N_1^d, N_2^d)$ is the number of points of that sample, along both spatial dimensions. $T = (T_1, T_2)$ is the arbitrary but fixed period of the interpolated samples and the filters. $t = (t_1, t_2)$ is the continuous spatial variable and $n = (n_1, n_2)$ the discrete spatial variable. The detection score operator is redefined as

$$S_f\{x\} = \sum_{d=1}^{D} f_d \ast J_d\{x_d\} \qquad (2.25)$$

where $f_d \in L^2(T)$ is the filter and $D$ the number of feature dimensions. This results in a continuous score for any given input sample, which will be used to localize the target.

2.5.3 Objective Functional

The objective is redefined as

$$\epsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| S_f\{x^c\} - y^c \right\|^2 + \sum_{d=1}^{D} \left\| w \cdot f_d \right\|^2 \qquad (2.26)$$

where $w \in L^2(T)$ is the spatial regularization, $\alpha^c \in [0, 1)$ is the learning rate for the sample taken at time instance $c$, $y^c \in L^2(T)$ is the label function, for instance a Gaussian centered about the target, and $f_d \in L^2(T)$ is the convolution filter in dimension $d$.

Given a set of samples and labels, the functional shall be minimized. As previously, this is done in the Fourier domain. This time, however, the Fourier coefficients are utilized rather than the DFT. This is necessary since a discrete representation is required. First we find the Fourier coefficients of the score function as

$$\widehat{S_f\{x^c\}} = \sum_{d=1}^{D} \hat f_d\, \widehat{J_d\{x^c_d\}} \qquad (2.27)$$

where $\hat f_d$ and $\widehat{J_d\{x^c_d\}}$ are the Fourier coefficients of $f_d$ and $J_d\{x^c_d\}$ respectively. The latter is found as

$$\widehat{J_d\{x^c_d\}}[k_1, k_2] = \mathcal{F}\left\{ \sum_{n_1=0}^{N_1^d-1} \sum_{n_2=0}^{N_2^d-1} x^c_d[n_1, n_2]\, b_d\!\left(t_1 - \frac{T_1}{N_1^d} n_1,\; t_2 - \frac{T_2}{N_2^d} n_2\right) \right\} = \sum_{n_1=0}^{N_1^d-1} \sum_{n_2=0}^{N_2^d-1} x^c_d[n_1, n_2]\, \mathcal{F}\left\{ b_d\!\left(t_1 - \frac{T_1}{N_1^d} n_1,\; t_2 - \frac{T_2}{N_2^d} n_2\right) \right\} = \hat b_d[k_1, k_2] \sum_{n_1=0}^{N_1^d-1} \sum_{n_2=0}^{N_2^d-1} x^c_d[n_1, n_2]\, e^{-i2\pi\left(\frac{k_1 n_1}{N_1^d} + \frac{k_2 n_2}{N_2^d}\right)} = \hat b_d[k_1, k_2]\, \mathrm{DFT}\{x^c_d\}[k_1, k_2]. \qquad (2.28)$$

The objective can now be written in the Fourier domain, using Parseval’s formula,

\varepsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| \sum_{d=1}^{D} \mathrm{DFT}\{x_d^c\}\, \hat{b}^d \hat{f}^d - \hat{y}^c \right\|^2 + \sum_{d=1}^{D} \left\| \hat{w} * \hat{f}^d \right\|^2. \qquad (2.29)

In practice the filter representation must be finite. The objective (2.29) is already discrete in the Fourier coefficient index and is truncated to fulfill this requirement.

2.5.4 Filter Training

Equation (2.29) is minimized using the normal equations and the method of conjugate gradients. The first step is to vectorize the objective. The Fourier coefficients representing the filter are truncated. This leads to a finite approximation, which is intuitive as the coefficients close to the center tend to contain the most energy. We use 2K^d + 1 coefficients centered about 0, where K^d = \lfloor N^d / 2 \rfloor. Also define K = \max_d K^d. Define a \sum_{d=1}^{D} (2K_1^d + 1)(2K_2^d + 1)-sized vector

\hat{f} = \begin{pmatrix} \hat{f}^1 \\ \hat{f}^2 \\ \vdots \\ \hat{f}^D \end{pmatrix} \qquad (2.30)

where

\hat{f}^d = \begin{pmatrix} \hat{f}^d[-K_1^d, -K_2^d] \\ \vdots \\ \hat{f}^d[-K_1^d, K_2^d] \\ \vdots \\ \hat{f}^d[K_1^d, K_2^d] \end{pmatrix}, \qquad (2.31)


and a C(2K_1 + 1)(2K_2 + 1) \times \sum_{d=1}^{D} (2K_1^d + 1)(2K_2^d + 1) sized matrix

A = \begin{pmatrix} A^1 \\ A^2 \\ \vdots \\ A^C \end{pmatrix} \qquad (2.32)

where

A^c = \begin{pmatrix} A^{c,1} & A^{c,2} & \cdots & A^{c,D} \end{pmatrix}, \quad A^{c,d} = \mathrm{diag}\begin{pmatrix} \mathrm{DFT}\{x_d^c\}\,\hat{b}^d[-K_1^d, -K_2^d] \\ \vdots \\ \mathrm{DFT}\{x_d^c\}\,\hat{b}^d[-K_1^d, K_2^d] \\ \vdots \\ \mathrm{DFT}\{x_d^c\}\,\hat{b}^d[K_1^d, K_2^d] \end{pmatrix}. \qquad (2.33)

Here, “diag” is an operator transforming a vector to the corresponding diagonal matrix. Furthermore, define a C(2K_1 + 1)(2K_2 + 1)-sized vector

\hat{y} = \begin{pmatrix} \hat{y}^1 \\ \hat{y}^2 \\ \vdots \\ \hat{y}^C \end{pmatrix} \qquad (2.34)

where

\hat{y}^c = \begin{pmatrix} \hat{y}^c[-K_1, -K_2] \\ \vdots \\ \hat{y}^c[-K_1, K_2] \\ \vdots \\ \hat{y}^c[K_1, K_2] \end{pmatrix}. \qquad (2.35)

Also define

\Gamma = \begin{pmatrix} \alpha^1 I & & & \\ & \alpha^2 I & & \\ & & \ddots & \\ & & & \alpha^C I \end{pmatrix} \qquad (2.36)

where each identity block I has the same number of rows as \hat{y}^c.

Lastly, the spatial regularization is rewritten as

\sum_{d=1}^{D} \left\| \hat{w} * \hat{f}^d \right\|^2 = \left\| W \hat{f} \right\|^2 \qquad (2.37)

where W is the block-diagonal matrix in which each block is a convolution matrix containing the elements of \hat{w} and corresponds to the convolution with \hat{w}.

We can now rewrite the objective functional (2.29) into

\varepsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| A^c \hat{f} - \hat{y}^c \right\|_2^2 + \left\| W \hat{f} \right\|_2^2 = \left\| \sqrt{\Gamma}\,(A\hat{f} - \hat{y}) \right\|_2^2 + \left\| W\hat{f} \right\|_2^2 \qquad (2.38)

which can be solved using the method of least squares. Here \sqrt{\cdot} denotes the element-wise square root. As shown in Sec. A.2, the objective is minimized by solving the normal equations

(A^H \Gamma A + W^H W)\,\hat{f} = A^H \Gamma \hat{y}. \qquad (2.39)
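On a toy-sized dense problem (the dimensions and random data below are purely illustrative), the normal equations can be sanity-checked against an equivalent stacked least-squares solve:

import numpy as np

rng = np.random.default_rng(0)
C_rows, n = 40, 12                       # toy dimensions, far smaller than in practice
A = rng.normal(size=(C_rows, n)) + 1j * rng.normal(size=(C_rows, n))
W = rng.normal(size=(n, n))
y = rng.normal(size=C_rows) + 1j * rng.normal(size=C_rows)
gamma = rng.uniform(0.1, 1.0, size=C_rows)   # per-sample weights alpha^c
Gamma = np.diag(gamma)

# Solution of the normal equations (2.39)
f_normal = np.linalg.solve(A.conj().T @ Gamma @ A + W.conj().T @ W,
                           A.conj().T @ Gamma @ y)

# Equivalent stacked least-squares formulation of (2.38)
A_stacked = np.vstack([np.sqrt(Gamma) @ A, W])
y_stacked = np.concatenate([np.sqrt(Gamma) @ y, np.zeros(n)])
f_lstsq, *_ = np.linalg.lstsq(A_stacked, y_stacked, rcond=None)

assert np.allclose(f_normal, f_lstsq)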

In practice this system may contain a number of equations in the order of 10^5, which renders exact solvers such as the Cholesky decomposition, or methods based on for instance Strassen's algorithm, infeasible. The Conjugate Gradient (CG) method [42] yields a solution with a time complexity that is linear in the number of non-zero elements in the system, with the number of iterations required depending on the condition number of the left-hand side. Therefore CG is used to efficiently solve the system.
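A compact sketch of the Conjugate Gradient iteration for (2.39) is given below. The operator lhs is assumed to be supplied as a function that applies (A^H Γ A + W^H W) to a vector without forming the matrix; a matrix-free version of such an operator is sketched later in this section. The iteration count and tolerance are arbitrary choices.

import numpy as np

def conjugate_gradient(lhs, rhs, x0, n_iter=50, tol=1e-6):
    """Solve lhs(x) = rhs for a Hermitian positive definite operator lhs."""
    x = x0.copy()
    r = rhs - lhs(x)                 # residual
    p = r.copy()                     # search direction
    rs_old = np.vdot(r, r).real
    for _ in range(n_iter):
        Ap = lhs(p)
        step = rs_old / np.vdot(p, Ap).real
        x += step * p
        r -= step * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x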

2.5.5 The Label Function

The shape of the label function y^c is chosen freely. Between the different frames the function changes only in translation. It should produce a single sharp peak, resulting in easy localization and attenuation of false detections during training. No other conditions are used. For mathematical convenience we will look at a one-dimensional case. This one-dimensional label function y^c can be extended to a two-dimensional version as y^c(t_1, t_2) = y_1^c(t_1)\, y_2^c(t_2).

An approximate Gaussian function with period T is picked,

y^c(t) = \sum_{n=-\infty}^{\infty} z^c(t + nT) \qquad (2.40)

where

z^c(t) = e^{-(t - q^c)^2 / (2\sigma^2)}. \qquad (2.41)

Here q^c is the position of the target in time instance c. The Fourier transform is found as

\hat{z}^c(\omega) = \sigma\sqrt{2\pi}\, e^{-i\omega q^c} e^{-\omega^2\sigma^2/2} \qquad (2.42)

by using the Fourier transform for the case q^c = 0 (see Appendix A.4) and the translation property of the Fourier transform. Since y^c(t) is defined as a periodic summation of z^c(t), we can use this transform to obtain the Fourier coefficients

\hat{y}^c[k] = \frac{1}{T}\sigma\sqrt{2\pi} \exp\left( -\frac{2\pi^2\sigma^2}{T^2} k^2 - i\frac{2\pi q^c}{T} k \right). \qquad (2.43)

A derivation is available in Appendix A.5.
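As a small illustration, the numpy snippet below tabulates the coefficients \hat{y}^c[k] of (2.43) for one spatial dimension and forms a two-dimensional label as a separable product; the parameter values in the usage lines (K, T, σ, q) are arbitrary and only chosen for the example.

import numpy as np

def label_fourier_coefficients(K, T, sigma, q):
    """Fourier coefficients y_hat[k], k = -K..K, of the periodically repeated
    Gaussian label in (2.43), for one spatial dimension with target position q."""
    k = np.arange(-K, K + 1)
    return (np.sqrt(2 * np.pi) * sigma / T
            * np.exp(-2 * (np.pi * sigma * k / T) ** 2 - 2j * np.pi * q * k / T))

# A 2D label is separable: y_hat[k1, k2] = y_hat_1[k1] * y_hat_2[k2].
y_hat_2d = np.outer(label_fourier_coefficients(25, 1.0, 1.0 / 16, 0.1),
                    label_fourier_coefficients(25, 1.0, 1.0 / 16, 0.1))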

2.5.6 The Interpolation Function

The interpolation function b^d(t), which transfers the samples to the continuous domain, is picked as the cubic spline

b(t) = \begin{cases} (a+2)|t|^3 - (a+3)t^2 + 1 & |t| \le 1 \\ a|t|^3 - 5at^2 + 8a|t| - 4a & 1 < |t| \le 2 \\ 0 & 2 < |t| \end{cases} \qquad (2.44)

where a is a constant. We follow the original paper and use a = −0.75 in our experiments. The Fourier transform

\mathcal{F}\{b\}(\omega) = (6a + 12)\frac{1}{\omega^4} - 12\frac{1}{\omega^4}\cos(\omega) - 6a\frac{1}{\omega^4}\cos(2\omega) + 8a\frac{1}{\omega^3}\sin(2\omega) \qquad (2.45)

is straightforward but messy to calculate. The details are found in Appendix A.6. The interpolation kernel is then rescaled and its origin shifted by half a sample interval, T/(2N^d), to align it with the center of the feature maps,

\tilde{b}^d(t) = b\!\left( \frac{N^d}{T}\left( t - \frac{T}{2N^d} \right) \right) \qquad (2.46)

which has the Fourier transform

\hat{\tilde{b}}^d(\omega) = \frac{T}{N^d} e^{-i\frac{T}{2N^d}\omega}\, \hat{b}\!\left( \frac{T}{N^d}\omega \right). \qquad (2.47)

Finally we define the periodic interpolation kernel

b^d(t) = \sum_{n=-\infty}^{\infty} \tilde{b}^d(t - nT) \qquad (2.48)

which, using the same reasoning as in (A.34), yields the Fourier coefficients

\hat{b}^d[k] = \frac{1}{T}\hat{\tilde{b}}^d\!\left(\frac{2\pi k}{T}\right) = \frac{1}{N^d} e^{-i2\pi k/(2N^d)}\, \hat{b}\!\left(\frac{2\pi k}{N^d}\right). \qquad (2.49)

Here we have again reasoned for a one-dimensional case. A two-dimensional interpolation function is defined separably, with Fourier coefficients \hat{b}^d[k_1, k_2] = \hat{b}^d[k_1]\,\hat{b}^d[k_2].
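The coefficients (2.49) can be tabulated once per feature channel. The sketch below follows (2.45) and (2.49) for one dimension; treating the removable singularity at ω = 0 by the kernel's unit integral (the spline in (2.44) integrates to 1 for any a) is my own handling and is not spelled out in the text.

import numpy as np

def spline_hat(omega, a=-0.75):
    """Fourier transform (2.45) of the cubic spline kernel (2.44)."""
    omega = np.asarray(omega, dtype=float)
    out = np.ones_like(omega)                    # F{b}(0) = integral of b = 1
    nz = omega != 0
    w = omega[nz]
    out[nz] = ((6 * a + 12) / w**4
               - 12 * np.cos(w) / w**4
               - 6 * a * np.cos(2 * w) / w**4
               + 8 * a * np.sin(2 * w) / w**3)
    return out

def interpolation_fourier_coefficients(K, N, a=-0.75):
    """Coefficients b_hat^d[k], k = -K..K, of (2.49) for a feature map with
    N samples along one spatial dimension."""
    k = np.arange(-K, K + 1)
    return np.exp(-1j * np.pi * k / N) * spline_hat(2 * np.pi * k / N, a) / N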



2.5.7 Projection Estimation for Features

Many applications of visual tracking have a real-time constraint. The described learning framework performs considerable work every frame. It was recently proposed to reduce the number of feature dimensions D by estimating a projection matrix [14], which can then be applied to the feature maps. The projection estimation is only done in the first frame, where a single sample has been received. The motivation for this is two-fold. Intuitively this should be good enough, as the same object will be tracked and the relevant features should remain about the same. Furthermore, doing this for every sample would increase the computational cost of the method. Introducing this strategy into the previously described continuous formulation alters the initial filter training process. During subsequent frames however, only the samples change, as they are projected to a new, smaller sample space.
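At the feature-map level the dimensionality reduction amounts to a per-pixel matrix product. The minimal sketch below assumes an (H, W, channel) array layout, which is an illustrative choice and not mandated by the thesis.

import numpy as np

def project_feature_map(x, P):
    """Project a (H, W, D_prime) feature map down to (H, W, D) channels with
    the learned projection matrix P of shape (D_prime, D)."""
    H, W, D_prime = x.shape
    return x.reshape(-1, D_prime).dot(P).reshape(H, W, -1)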

Adding a Projection Matrix

We define this projection P = (p_1 \cdots p_D) as a D' \times D matrix, where D is the number of projected feature dimensions and D' > D is the number of feature dimensions of the input feature maps. Each p_d is a vector used to project a sample onto feature dimension d. We define a new operator for the detection scores as

S_{Pf}\{x\} = \sum_{d=1}^{D} f^d * \left(P^T J\{x\}\right)^d \qquad (2.50)

where J\{x\} = (J_1\{x_1\}, \ldots, J_{D'}\{x_{D'}\}) denotes the interpolated feature channels stacked as a D'-dimensional vector.

Also denote z^d = J_d\{x_d\}, with Fourier coefficients

\hat{z}^d[k_1, k_2] = \hat{b}^d[k_1, k_2]\, \mathrm{DFT}\{x_d\}[k_1, k_2] \qquad (2.51)

and by treating the sample x and filter f as D'- and D-dimensional vectors of elements in L^2(T), we can write the Fourier coefficients of the score as

\widehat{S_{Pf}\{x\}} = \sum_{d=1}^{D} \hat{f}^d \left(P^T \hat{z}\right)^d = \hat{z}^T P \hat{f}. \qquad (2.52)
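Expression (2.52) is a channel mixing followed by a per-frequency weighting, which can be written as a single einsum. The array layout below is again an illustrative assumption.

import numpy as np

def projected_score_fourier(z_hat, P, f_hat):
    """Fourier coefficients of S_{Pf}{x} as in (2.52), i.e. z_hat^T P f_hat
    evaluated independently for every frequency (k1, k2).

    z_hat : (D_prime, K1, K2) coefficients of the interpolated sample channels.
    P     : (D_prime, D) real projection matrix.
    f_hat : (D, K1, K2) filter coefficients.
    """
    return np.einsum('ekl,ed,dkl->kl', z_hat, P, f_hat)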

Define a regularization term for this projection,

\varepsilon_P(P) = \|P\|_F \qquad (2.53)

where \|\cdot\|_F is the Frobenius norm. Adding the regularization to the objective functional yields the new objective

\varepsilon(f, P) = \left\| \hat{z}^T P \hat{f} - \hat{y} \right\|_{\ell^2}^2 + \sum_{d=1}^{D} \left\| \hat{w} * \hat{f}^d \right\|_{\ell^2}^2 + \lambda_P \left\| P \right\|_F^2. \qquad (2.54)

Unlike the previously mentioned objectives, this one is non-linear. It is therefore optimized with the iterative Gauss-Newton algorithm. The estimated parameters in step i are denoted (\hat{f}^i, P^i) and are updated as \hat{f}^{i+1} = \hat{f}^i + \Delta\hat{f} and P^{i+1} = P^i + \Delta P. Each step is derived by using Taylor's theorem to get

\hat{z}^T (P^i + \Delta P)(\hat{f}^i + \Delta\hat{f}) \approx \hat{z}^T P^i (\hat{f}^i + \Delta\hat{f}) + \hat{z}^T \Delta P\, \hat{f}^i \qquad (2.55)

which yields the linear least squares problem

\min_{(\Delta\hat{f}, \Delta P)} \; \left\| \hat{z}^T P^i (\hat{f}^i + \Delta\hat{f}) + \hat{z}^T \Delta P \hat{f}^i - \hat{y} \right\|_{\ell^2}^2 + \sum_{d=1}^{D} \left\| \hat{w} * (\hat{f}_d^i + \Delta\hat{f}_d) \right\|_{\ell^2}^2 + \lambda_P \left\| P^i + \Delta P \right\|_F^2. \qquad (2.56)

By finding the normal equations, the Gauss-Newton step can be performed efficiently with the conjugate gradients method.

Optimization

Define

A_P = \begin{pmatrix} 0_1 & \cdots & 0_D \\ \mathrm{diag}\begin{pmatrix} \hat{z}[-K_1^1, -K_2^1]^T p_1 \\ \vdots \\ \hat{z}[-K_1^1, K_2^1]^T p_1 \\ \vdots \\ \hat{z}[K_1^1, K_2^1]^T p_1 \end{pmatrix} & \cdots & \mathrm{diag}\begin{pmatrix} \hat{z}[-K_1^D, -K_2^D]^T p_D \\ \vdots \\ \hat{z}[-K_1^D, K_2^D]^T p_D \\ \vdots \\ \hat{z}[K_1^D, K_2^D]^T p_D \end{pmatrix} \\ 0_1 & \cdots & 0_D \end{pmatrix} \qquad (2.57)

where 0_d is a zero matrix which pads the feature channels of lower resolution. It is of size \big(2(K_1 K_2 - K_1^d K_2^d) + (K_1 - K_1^d) + (K_2 - K_2^d)\big) \times (2K_1^d + 1)(2K_2^d + 1), where (K_1^d, K_2^d) is the number of Fourier coefficients used for the projected feature dimension d, and (K_1, K_2) is the maximum number of Fourier coefficients used for any feature dimension. Define the vectorizations p = (p_1^T \cdots p_D^T)^T and \Delta p = (\Delta p_1^T \cdots \Delta p_D^T)^T. Furthermore define

\hat{f} = \begin{pmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_D \end{pmatrix}, \quad \hat{f}_d = \begin{pmatrix} \hat{f}_d^i[-K_1^d, -K_2^d] + \Delta\hat{f}_d[-K_1^d, -K_2^d] \\ \vdots \\ \hat{f}_d^i[-K_1^d, K_2^d] + \Delta\hat{f}_d[-K_1^d, K_2^d] \\ \vdots \\ \hat{f}_d^i[K_1^d, K_2^d] + \Delta\hat{f}_d[K_1^d, K_2^d] \end{pmatrix} \qquad (2.58)

and


B_f = \begin{pmatrix} B_{1,1} & \cdots & B_{1,D'} & \cdots & B_{D,D'} \end{pmatrix}, \quad B_{d,d'} = \begin{pmatrix} \hat{z}_{d'}[-K_1, -K_2]\, \hat{f}_d^i[-K_1, -K_2] \\ \vdots \\ \hat{z}_{d'}[-K_1, K_2]\, \hat{f}_d^i[-K_1, K_2] \\ \vdots \\ \hat{z}_{d'}[K_1, K_2]\, \hat{f}_d^i[K_1, K_2] \end{pmatrix}. \qquad (2.59)

We can then rewrite the optimization problem into

\min_{\hat{f}, \Delta p} \; \left\| A_P \hat{f} + B_f \Delta p - \hat{y} \right\|_2^2 + \left\| W \hat{f} \right\|_2^2 + \lambda_P \left\| p + \Delta p \right\|_2^2 \qquad (2.60)

with normal equations

\begin{pmatrix} A_P^H A_P + W^H W & A_P^H B_f \\ B_f^H A_P & B_f^H B_f + \lambda_P I \end{pmatrix} \begin{pmatrix} \hat{f} \\ \Delta p \end{pmatrix} = \begin{pmatrix} A_P^H \hat{y} \\ B_f^H \hat{y} - \lambda_P p \end{pmatrix}. \qquad (2.61)

For a full derivation, see Sec. A.3. As earlier stated, the Gauss-Newton method is employed to find an optimal projection matrix P and filter coefficients \hat{f}, and in each step a least squares subproblem is handled by letting the method of Conjugate Gradients solve the normal equations.
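The overall optimization loop can be summarized as in the sketch below. The helpers build_normal_eq and solve_cg are placeholders for routines that assemble (2.61) and run a Conjugate Gradient solver such as the one sketched earlier; they are assumptions for illustration, not part of the thesis text.

import numpy as np

def gauss_newton_joint(f_hat, p, build_normal_eq, solve_cg, n_gn_iter=10):
    """Outer Gauss-Newton loop for the non-linear objective (2.54).

    f_hat           : current filter coefficients as one flattened complex vector.
    p               : current projection matrix, flattened to a real vector.
    build_normal_eq : callable (f_hat, p) -> (lhs, rhs) assembling the normal
                      equations (2.61) of the linearized problem (2.56).
    solve_cg        : callable (lhs, rhs) -> solution vector.
    """
    for _ in range(n_gn_iter):
        lhs, rhs = build_normal_eq(f_hat, p)
        step = solve_cg(lhs, rhs)
        # The unknowns in (2.61) are the updated filter and the increment in p.
        n = f_hat.size
        f_hat = step[:n]
        p = p + np.real(step[n:])    # p is real; drop any imaginary residue
    return f_hat, p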

Optimization Problem Size

It is worth mentioning the size of the described optimization problem. The vectorization (2.38) of the objective functional (2.29) yields the desired linear least squares problem, but one of a large size. For the baseline, or when using a single subfilter, a typical size of the matrix A is 300000 × 180000. Computing the product A^H A explicitly would be infeasible. To make the problem tractable, its sparsity is utilized. The matrix A has a very clear structure, with the majority of its elements being zero. Furthermore, no matrix-matrix multiplications are performed, only matrix-vector multiplications. As a last speedup, only half of the Fourier coefficients are utilized, as the Fourier coefficients of a real-valued function are conjugate symmetric.
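As an illustration of how this structure can be exploited, the sketch below applies the left-hand side of (2.39) as a matrix-free operator. It assumes, for simplicity, that all channels share one resolution, ignores the conjugate-symmetry trick, and simplifies the boundary handling of the regularization; the array shapes are assumptions.

import numpy as np

def normal_eq_lhs(f_hat, z_hat, alpha, w_hat):
    """Apply (A^H Gamma A + W^H W) from (2.39) to f_hat without forming A or W.

    f_hat : (D, K) complex filter coefficients per channel.
    z_hat : (C, D, K) Fourier coefficients of the interpolated samples.
    alpha : (C,) sample weights.
    w_hat : (L,) Fourier coefficients of the spatial regularization, L <= K.
    """
    # A f_hat: per-sample scores, an element-wise product summed over channels.
    scores = np.einsum('cdk,dk->ck', z_hat, f_hat)
    # A^H Gamma (A f_hat): weight each sample and project back onto the channels.
    data_term = np.einsum('cdk,c,ck->dk', np.conj(z_hat), alpha, scores)
    # W^H W f_hat: convolve with w_hat, then correlate with w_hat
    # (boundary handling is simplified compared to the true convolution matrix).
    reg_term = np.empty_like(f_hat)
    for d in range(f_hat.shape[0]):
        tmp = np.convolve(f_hat[d], w_hat, mode='same')
        reg_term[d] = np.convolve(tmp, np.conj(w_hat[::-1]), mode='same')
    return data_term + reg_term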

Baseline

The C-COT tracker [13], using two of the features introduced with ECO [14], is used as a baseline in this thesis. It utilizes the previously described feature projections and updates the filter every six frames. In this thesis, the baseline is extended and compared against.


References
