
DEGREE PROJECT IN ELECTRICAL ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

One Shot Object Detection

for Tracking Purposes

TIJMEN VERHULSDONCK


Abstract

One of the things augmented reality depends on is object tracking, a problem classically found in cinematography and security. However, the algorithms designed for those classical applications are often too computationally expensive or too complex to run on simpler mobile hardware. One way to do object tracking is with a trained neural network; this has already led to great results but unfortunately still runs into some of the same problems as the classical algorithms. For this reason, a neural network designed specifically for object tracking on mobile hardware needs to be developed. This thesis proposes two different neural networks designed for object tracking on mobile hardware. Both are based on a siamese network structure, and methods to improve their accuracy using filtering are also introduced. The first network is a modified version of "CNN architecture for geometric matching" that utilizes an affine regression to perform object tracking. This network was shown to underperform in the MOT benchmark as well as the VOT benchmark and was therefore not developed further. The second network is an object detector based on "SqueezeDet" in a siamese network structure, utilizing the performance-optimized layers of "MobileNets". The accuracy of the object detector network is shown to be competitive in the VOT benchmark, placing 16th compared to trackers from the 2016 challenge. It was also shown to run in real time on mobile hardware. Thus the one shot object detection network, used in a tracking application, can improve the experience of augmented reality applications on mobile hardware.

Keywords: Object tracking, Deep learning, Siamese neural network, Affine



Sammanfattning

Augmented reality depends, among other things, on object tracking, a problem classically found in cinematography and security. However, the algorithms designed for those classical applications are often too computationally expensive or too complex to run on simpler mobile hardware. One way to do object tracking is with a trained neural network; this has already led to good results but unfortunately still runs into some of the same problems as the classical algorithms. For this reason, a neural network designed specifically for object tracking on mobile hardware must be developed. This thesis proposes two different neural networks intended for object tracking on mobile hardware. Both are based on a siamese network structure, and methods to improve their accuracy with filtering are also introduced. The first network is a modified version of "CNN architecture for geometric matching" that uses an affine regression to perform object tracking. This network was shown to underperform in the MOT benchmark as well as the VOT benchmark and was therefore not developed further. The second network is an object detector based on "SqueezeDet" in a siamese network structure, using the performance-optimized layers of "MobileNets". The accuracy of the detector network is shown to be competitive in the VOT benchmark, placing 16th compared to trackers from the 2016 challenge. It was also shown to run in real time on mobile hardware. Thus, the one shot object detection network used for a tracking application can improve the experience of augmented reality applications on mobile hardware.

Keywords: Object tracking, Deep learning, Siamese neural network,



Acknowledgements

I want to extend my gratitude to all the people involved in this project, but I feel that I should mention a couple of people who have been very closely involved. First and foremost I want to thank my advisor Kenneth van Hoey from ETH Zurich for his continued support throughout the project, guiding me and helping me navigate the various obstacles. Without him the thesis would not have been what it is now. Secondly I want to thank Maximilian Schneider from Viorama GmbH for his guidance and continued trust in my research, even though the first results were disappointing. I also want to thank Bichen Wu from UC Berkeley for his help in determining a course for this thesis, and Associate Professor Jim Dowling for his great course on deep learning at KTH and for being my second advisor. Finally I want to thank Professor Magnus Boman from KTH for supporting me in this project and allowing me to do my research abroad.

I would also like to thank the Kungliga Tekniska högskolan (KTH) for allowing me to pursue this research, and Viorama Ltd. for hosting me and supporting me wherever they could.

Of course none of this would have happened without the continued support from my family and my girlfriend. They made the difficult moments during the making of this thesis a lot more bearable. Thank you.

Tijmen Verhulsdonck,


Contents

List of Acronyms
1 Introduction
  1.1 Motivation
  1.2 Background
    1.2.1 Deep learning
    1.2.2 State of the art in tracking algorithms
    1.2.3 Problem
  1.3 Research Methodology
  1.4 Research Contributions
  1.5 Thesis Organization
2 Background
  2.1 Neural Networks
    2.1.1 Fully connected networks (FCN)
    2.1.2 Training process
    2.1.3 Inference stage
    2.1.4 Convolutional neural networks (CNN)
    2.1.5 Performance
    2.1.6 Siamese network
    2.1.7 Image classifiers
  2.2 Machine Learning APIs
    2.2.1 Tensorflow
    2.2.2 Metal
  2.3 Datasets
  2.4 Project Goals and Specifications
    2.4.1 Problem
    2.4.2 Goal
    2.4.3 Proposed solution
3 Related work
  3.1 One shot learning
  3.2 Tracking
    3.2.1 Tracking using deep regression
    3.2.2 Tracking using a CNN and recurrent layers
    3.2.3 Fully convolutional neural network for object tracking
    3.2.4 Learnet
    3.2.5 Visual Tracking by Reinforced Decision Making
    3.2.6 Correlation Filter based tracking
    3.2.7 Tracking using Recurrent net and LSTM Cells
    3.2.8 Tracking by detection
  3.3 Optimizing network performance
    3.3.1 Deep compression
    3.3.2 SqueezeNet
    3.3.3 SqueezeDet
    3.3.4 MobileNets
  3.4 Affine Transformations
    3.4.1 Spatial Transformer Networks
    3.4.2 CNN architecture for geometric matching
  3.5 Datasets and Benchmarks
    3.5.1 Imagenet video dataset
    3.5.2 Multiple object tracking benchmark & dataset
    3.5.3 Visual object tracking benchmark
4 Tracking algorithm
  4.1 Evaluating related works
    4.1.1 SqueezeDet performance
    4.1.2 Fully Convolutional Siamese Tracker
  4.2 Affine regression tracker
    4.2.1 Modifications
    4.2.2 Tracking algorithm
  4.3 One shot learning object detector
    4.3.1 Object detection
    4.3.2 Loss function
    4.3.3 Tracking Algorithm
5 Technical details
  5.1 Affine regression network
    5.1.1 Training
    5.1.2 Tracking
  5.2 One shot object detector
    5.2.1 Training
    5.2.2 Tracking
6 Evaluation and Results
  6.1 MOT Challenge
  6.2 VOT Benchmark
    6.2.1 Comparison with the VOT 2016 Challenge results
  6.3 Performance
7 Conclusion
  7.1 Discussion
  7.2 Future work

A Fire Module swift implementation
B SqueezeDet network architecture
C Expected Average Overlap results on the VOT benchmark
D Accuracy ranking on the VOT benchmark
E Speed of different tracking algorithms on the VOT benchmark
F Robustness ranking on the VOT benchmark


List of Acronyms

GPU: Graphics processing unit
CPU: Central processing unit
API: Application programming interface
SGD: Stochastic gradient descent
FCN: Fully connected network
CNN: Convolutional neural network
FPS: Frames per second
VOT: Visual object tracking
MOT: Multiple object tracking
Maccs: Multiply accumulates
NaN: Not a number


Chapter 1

Introduction

1.1 Motivation

Recognizing people or objects in an image when presented with an example of the object or person is a trivial task for humans. For machines, however, this is not the case. Tracking has a range of applications in fields such as cinematography, security, self-driving vehicles and many more. In some of these, tracking is still done by humans, because computer-based trackers are not yet accurate enough. This illustrates that there is still room for improvement.

Even when tracking is automated, the algorithms are often executed on devices with a lot of computational power and an unlimited source of energy. With the growing popularity of augmented reality on mobile platforms such as iOS or Android devices, there is a need for good tracking algorithms designed for these mobile platforms. This means the algorithm needs to be designed with computational limitations and a limited energy supply in mind. This goal is only becoming more relevant with the increasing popularity of mobile platforms.

1.2 Background

An automated tracking application or program can be imagined as a black box that is given an exemplar of the target to track (e.g. an image of a person or object), together with a new image from a camera or video sequence in which that same target should be found. The objective of the black box is to locate the target within the new image, also known as the search window (seen in fig. 1.1).

The output of the black box needs to be the smallest box that, when overlaid on the search window, fully encompasses the target. So the desired output is a point in 2D space locating the center of the box, and a width and height defining its size and aspect ratio.




Figure 1.1: Goal of a tracking algorithm.

1.2.1 Deep learning

In the past decade deep learning has become an established research field; it uses training data to teach a generic algorithm to perform a certain function. In essence, deep learning is training a predefined black box with annotated data to produce a desired output. The algorithm is defined by a neural network (explained in section 2.1) trained with annotated data; this differs from a manually designed and implemented algorithm. Neural networks are nothing new, but with the introduction of big data and the use of GPUs for increased computational power, they have started to outperform classical algorithms; the first example of that was "Alexnet" [1]. Alexnet was one of the first networks to be completely trained and executed on GPU hardware, and it beat the competition in the Imagenet classification challenge [2] with a lead of 10.8 percentage points in top-5 accuracy. These days neural networks can even beat humans in games like chess and Go [3].

1.2.2 State of the art in tracking algorithms

these speeds on a much weaker mobile CPU, and is therefore not suited for mobile applications. Even more recent is CFNet, released in April of this year (2017). Instead of the hand-crafted correlation filters that classical trackers use, it learned them using deep learning [6]. CFNet runs at 52 FPS on a GPU; it has yet to appear in the VOT benchmark, so it is hard to compare it to other trackers.

1.2.3 Problem

Even the most state-of-the-art tracking algorithms are often not able to do anisotropic scaling and are not designed for execution on mobile hardware. This prevents the application of these tracking algorithms to augmented reality on mobile hardware, as explained in section 2.4.

1.3 Research Methodology

The goal of the thesis is to develop a tracker that has good performance and accuracy, while also being able to run on mobile hardware, i.e. compact in memory and fast to execute. For this reason a literature study of currently available tracking algorithms is performed; a selection of papers will be examined that are focused not only on accuracy but also on performance. The keywords used to find papers within this scope are:

• Object detection neural networks
• Tracking algorithms
• Energy efficient neural networks
• Single shot learning algorithms

All papers will be evaluated on a quantitative basis, that is, whether they reach state-of-the-art results on either accuracy or speed. Speed will be measured in required computations and memory usage. Accuracy will be measured with the help of a benchmark; a relevant benchmark will be selected in order to compare different algorithms. Based on the results of the quantitative evaluation, selected works will be assessed against the following qualitative requirements: does the algorithm allow anisotropic scaling, and is the design of the algorithm simple?

Following this evaluation, a novel or modified neural network architecture will be developed with the main goals of providing state-of-the-art accuracy combined with the ability to run on mobile hardware.

1.4 Research Contributions

This thesis contributes an evaluation of current neural network tracking algorithms, and of the feasibility of implementing a neural network tracker on mobile hardware. Based on those findings, two networks are proposed, implemented and evaluated. One network performs a prediction of an affine transformation, which is shown to decrease accuracy and is not competitive on accuracy or performance. The other proposed and implemented network is a one shot learning object detector that can do single-target object detection based on an exemplar. This network is shown to achieve a competitive accuracy on a popular single object tracking benchmark while being simple in structure and efficient in performance. In addition, a number of filtering algorithms are used to increase the capabilities of the two networks when applied to tracking.

1.5 Thesis Organization

This thesis is organized in seven chapters as follows:

1. Introduction: Explains the motivation, background and contributions of this thesis.

2. Background: Explains the concepts and technologies used in this thesis.

3. Related work: An overview of related work used and referenced in this thesis.

4. Tracking algorithm: A description of the two proposed network designs, and an evaluation of selected works.

5. Technical Details: Explains all the specific technicalities of the implementation.

6. Evaluation and Results: Presents the results of the two networks, evaluated on accuracy and speed.

7. Conclusion: Discussion of the results and future work.


Chapter 2

Background

This chapter explains the concepts behind this thesis: a short introduction to neural networks (section 2.1), the APIs used for deep learning and for implementation on a mobile platform (section 2.2), and the use of big data in the form of datasets (section 2.3). The final section (2.4) presents the specifications and requirements that the final result should fulfill.

2.1 Neural Networks

Artificial intelligence - and by extension neural networks - has been a science for many years, dating as far back as the 1940s [7, Chapter 2], but has seen a sharp spike in scholarly interest in the last decade. This is mainly due to the practical application of deep learning and the large amounts of data available for training. Neural nets are essentially a series of mathematical operations that learn to produce the correct output when presented with an input. Unlike classical mathematics, where all operations are defined beforehand by a mathematician, neural nets work by defining a structure of basic building blocks consisting of simple mathematical operations. These building blocks (hereinafter referred to as layers) can be stacked and combined to create an advanced neural network. The layers in this neural network are then trained to produce a desired output when presented with a certain input. This means that while the structure of a neural network is known, what each layer actually does after training is not; for this reason the layers of a neural net are commonly known as hidden layers. The exact process of training a neural net is explained further in section 2.1.2.



2.1.1 Fully connected networks (FCN)

Among the most general layer designs used in neural networks are fully connected layers, consisting of artificial neurons. These artificial neurons are inspired by the biological neurons (seen in fig. 2.1) of the human brain [8]. Their function is to transmit a signal as a function of their inputs.


Figure 2.1: A 2d representation of a biological neuron (image adapted from [9])

The artificial neuron works by calculating a weighted sum of its inputs x, adding a bias value b, and applying an activation function f. This process can be written as eq. (2.1), where N_inputs is the number of inputs.

$$y = f\Big(b + \sum_{i=0}^{N_{inputs}} x_i \cdot w_i\Big) \tag{2.1}$$

A visual representation of this function can be seen in fig. 2.2. The weights can be used to adjust the influence a certain input has on the final result; this can in effect tune a neuron to produce a desired output when presented with a collection of inputs.


Figure 2.2: A visual representation of an artificial neuron
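As a concrete illustration of eq. (2.1), the following minimal Python sketch implements a single artificial neuron; the sigmoid activation and all numeric values are arbitrary choices made for this example.

import numpy as np

def sigmoid(z):
    # A common activation function f; squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus a bias, passed through the
    # activation, exactly as in eq. (2.1).
    return sigmoid(np.dot(x, w) + b)

x = np.array([0.5, -1.0, 2.0])   # three arbitrary inputs
w = np.array([0.1, 0.4, -0.2])   # one weight per input
b = 0.3                          # bias value

print(neuron(x, w, b))           # a single scalar output y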



2.1.1.1 Layer structure & performance

A fully connected layer consists of an arbitrary number of artificial neurons having the same inputs but different outputs. It is called a fully connected or densely connected layer because all neurons are connected to all inputs. Combining two fully connected layers with an input layer, into which data is fed, and an output layer, where the results are presented, results in a simple neural network as seen in fig. 2.3.

Figure 2.3: A simple neural net with 2 fully connected layers (from [10])

While the network presented in fig. 2.3 is relatively simple, neural nets can contain many layers and many neurons per layer. It is important to note that, though simple, fully connected layers scale badly with the number of inputs. The number of parameters (n_parameters) that need to be stored per neuron based on eq. (2.1) is N_inputs + 1. The number of parameters that need to be stored per layer can therefore be calculated as n_parameters = n_neurons · n_inputs + n_neurons. Immediately it can be seen that increasing the number of neurons or inputs linearly increases the number of variables; this becomes an issue when dealing with large numbers of inputs, for example when inputting an image. A simple color image of 250 × 250 pixels results in 250 · 250 · 3 = 187500 inputs. With a single layer containing the same number of neurons, this results in n_parameters = 187500² = 35,156,250,000 weights. With the computations requiring equal scaling, it can be concluded that fully connected layers are not suitable for large numbers of inputs, as is the case with images. Convolutional layers (presented in section 2.1.4) handle large inputs better.
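This scaling argument is easy to verify numerically. The helper below is a hypothetical function written only for this illustration; it reproduces the count for the 250 × 250 color image.

def fc_parameters(n_inputs: int, n_neurons: int) -> int:
    # n_parameters = n_neurons * n_inputs + n_neurons (weights plus biases)
    return n_neurons * n_inputs + n_neurons

n_inputs = 250 * 250 * 3            # flattened RGB image: 187,500 inputs
# Weights alone: 187500**2 = 35,156,250,000, as in the text; the biases
# add another 187,500 on top of that.
print(fc_parameters(n_inputs, n_inputs))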

2.1.2 Training process


A collection of such annotated examples is called a training dataset, and is used to train a neural network in a method called supervised learning.

Supervised learning is done by feeding an example to the input layer and performing the calculations of all neurons and layers to produce an output (also known as the forward-pass or inference stage). The output produced by the network is then compared to the desired output by calculating the deviation using a loss function. The goal of the training is to minimize the output of the loss function by updating the weights and biases; the most popular way of doing this is by performing stochastic gradient descent (SGD). SGD works by calculating the gradient vector of the loss function and repeating this for all neurons in the network using the gradient of the previous layer and the delta rule [11]. This process is called gradient back-propagation or the backward-pass. The calculated gradients are then used in the update phase to adjust the weights and biases by a certain magnitude, called the learning rate, in such a way that the loss function is minimized [12]. The steps shown below summarize this process:

1. Initialize all weights and biases with random values
2. Feed an example to the input of the neural net
3. Execute all neurons and layers to produce an output (forward-pass)
4. Calculate the deviation of the output from the label using the loss function
5. Back-propagate the gradient through the network (backward-pass)
6. Update all weights and biases based on their respective gradient (update phase)
7. Repeat from step 2 until convergence

These steps are repeated over and over in order to approach the desired output for an example as closely as possible. A small variant on the above process is to feed not just one example but multiple examples as a mini-batch; this has the advantage of faster convergence to a minimum due to less noise in the gradient. Using multiple examples to compute the gradient update can also be executed in parallel, and is therefore especially beneficial when running on a GPU.
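The seven steps map almost one-to-one onto a training loop. The sketch below is a minimal NumPy version for a single linear layer trained with mini-batch SGD on synthetic data; the layer size, learning rate, and data are arbitrary illustrative choices, not anything used in the thesis.

import numpy as np

rng = np.random.default_rng(0)

# Step 1: initialize all weights and biases with random values.
W = rng.normal(scale=0.1, size=(3, 1))
b = np.zeros(1)

# Synthetic annotated data: inputs X with labels y (a known linear mapping).
X = rng.normal(size=(256, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.3

learning_rate = 0.1
for step in range(200):
    # Step 2: feed a mini-batch of examples to the network.
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    # Step 3: forward-pass (a single linear layer, no activation).
    pred = xb @ W + b
    # Step 4: deviation from the label via a mean squared-error loss.
    err = pred - yb
    loss = np.mean(err ** 2)
    # Step 5: backward-pass, gradients of the loss w.r.t. W and b.
    grad_W = 2 * xb.T @ err / len(xb)
    grad_b = 2 * err.mean(axis=0)
    # Step 6: update phase, scaled by the learning rate.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b
    # Step 7: repeat until convergence.

print(loss, W.ravel(), b)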

2.1.3 Inference stage


Inference, a single forward-pass to produce an output, requires far fewer computations than the whole training process with gradient back-propagation. For this reason inference can often be executed on much weaker hardware at respectable speeds. These two separate processes are also called the offline stage, when training the neural net, and the online stage, when using the neural net for inference.

2.1.4 Convolutional neural networks (CNN)

Convolutional neural networks are based on convolutional layers instead of fully connected ones. Convolutional layers are mainly popular in image processing applications, as they are designed to exploit the strong spatial correlation present in images. They are inspired by the biological eye, which uses cells only sensitive to a small part of the image, called a receptive field, but tiles them to cover the whole image. A convolutional layer imitates the biological cell with something called a filter. A convolution works similarly to a fully connected neuron, but instead of having connections to every input it only has connections to the inputs in its receptive field. The receptive field of each convolution is small and constant, but by tiling many partially overlapping convolutions the receptive field effectively covers the whole input, just like in the human eye. While the receptive field of each individual convolution is constant and unique, the convolution itself is not: convolutions share weights and are executed with a scanning behaviour where the weights are reused for multiple separate convolutions. This scanning behaviour can be executed with a 2d filter, also known as a kernel. A convolution with a 2d filter can be used to scan a 2d plane or an image. A kernel can be of arbitrary size, but popular sizes are 3 × 3 and 1 × 1. A 3 × 3 kernel, presented as a matrix in eq. (2.2), holds 9 different weights which can be tuned during training. In addition to the kernel, a convolution can also use a bias value that is added to the output of the kernel convolution.

$$\begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \end{bmatrix} \tag{2.2}$$

Using a 3 × 3 kernel to apply a convolution to an input image produces an output image of similar size, as seen in fig. 2.4.



Figure 2.4: A diagram of a 2d kernel used to apply a convolution to a 2d plane. (from [13])

The output of a convolutional layer is known as an output channel, also called a feature map. A color image also consists of multiple channels, namely a Red, Green and Blue (RGB) channel, and can therefore also be considered a feature map. In order to act on the increased dimensionality of feature maps, a convolution has to have multiple kernels acting on the different input channels. This results in a 3d kernel with an added dimension called depth; the depth of a kernel must match the number of input channels. This is true for a normal convolution; special depthwise convolutions exist but are explained later in section 3.3.4. A visual illustration of a convolution with 3d kernels can be seen in fig. 2.5.

Figure 2.5: An illustration showing a 3d convolution. h_in, w_in and ch_in describe the size of the input feature map; the output feature map is described by h_out, w_out and ch_out. The number of kernels k per output channel is equal to ch_in. (source: [14])


Increasing the number of output channels increases the capacity of a convolution: it adds kernels, and thus the ability to detect a wider variety of features. Multiple convolutional layers can also be used, with each convolution using the output feature map of the previous convolution as input.

2.1.4.1 Pooling layers

Pooling layers are similar to convolutional layers but simpler. A pooling layer has a stride and kernel size just like a convolutional layer, but it does not contain any weights and is therefore not trainable [1]. A pooling layer is a constant mathematical operation that is applied similarly to a convolution; popular pooling layers are max pooling and average pooling. The output of a max pooling layer is the maximum value within each kernel window; in the case of average pooling it is the average value. Pooling layers are often used to reduce the size of the feature map, as they do not require a lot of computations. In more recent works pooling layers are often dropped in favor of regular convolutions with a stride greater than one [15].

2.1.5 Performance

Using convolutions on images takes advantage of the inherent spatial correlations between the pixels and their respective locations. This enables the convolutional layer to operate efficiently on image data where a fully connected layer would be impractical. The number of parameters in a convolutional layer can be calculated as n_parameters = K_width · K_height · input_channels · output_channels + output_channels (with K being the kernel size). As one can see, the number of variables does not depend on the number and size of the inputs. The number of computations required is, however, dependent on the number and size of the inputs; bigger input images require more computations. The computations consist of multiply-accumulates, also known as Maccs, and the formula to calculate the number of Maccs can be seen in eq. (2.3) (from [16]), where S is the stride, K_w and K_h the kernel width and height, I_w and I_h the input width and height, and C_in and C_out the number of input and output channels.

$$\text{Maccs} = \Big((K_w \cdot K_h) \cdot \frac{I_w \cdot I_h}{S} \cdot C_{in}\Big) \cdot C_{out} \tag{2.3}$$


For the 250 × 250 color image from section 2.1.1.1, the required parameters and computations of a convolutional layer are much smaller. With a 3 × 3 kernel and 64 output channels, the number of parameters is n_parameters = 3 · 3 · 3 · 64 = 1728 weights, and the number of Maccs is Maccs = ((3 · 3) · (250 · 250)/1 · 3) · 64 = 108 · 10⁶. Compared to the performance of a fully connected layer on the same input, this is a reduction of around 325 times in the number of Maccs, and the number of parameters is reduced around 2 · 10⁷ times. It must be noted that the two layer types differ so much in their functioning that it is hard to compare them directly. It does show, however, that convolutional layers are much more practical than fully connected layers for inputs consisting of images.
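Both worked examples can be checked with a small calculator following eq. (2.3). The function names below are hypothetical, written for this illustration; the bias term is reported separately because the worked example above counts weights only.

def conv_parameters(k_w, k_h, c_in, c_out):
    # n_parameters = K_w * K_h * C_in * C_out + C_out (weights plus biases)
    return k_w * k_h * c_in * c_out + c_out

def conv_maccs(k_w, k_h, i_w, i_h, c_in, c_out, stride=1):
    # Maccs = ((K_w * K_h) * (I_w * I_h) / S * C_in) * C_out, eq. (2.3)
    return ((k_w * k_h) * (i_w * i_h) // stride * c_in) * c_out

print(conv_parameters(3, 3, 3, 64))       # 1,792 = 1,728 weights + 64 biases
print(conv_maccs(3, 3, 250, 250, 3, 64))  # 108,000,000 Maccs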

2.1.6 Siamese network

Classically a neural network has a single data path from input to output; this is because the networks are highly specialized and fine-tuned for performing a single task. However, some tasks require adaptation of the network based on a given input. A relatively new neural network design called the siamese neural network [17] is designed with these kinds of tasks in mind. These networks contain two or more distinct inputs that are combined somewhere later in the neural network. A simple example can be seen in fig. 2.6, where two distinct inputs are processed by two separate hidden layers and combined in a combination layer, which is connected to the output layer.

Figure 2.6: An illustration of a siamese network with two distinct inputs A and B resulting in one output.


The purpose of the combination layer is to combine the output of the two separate branches in a meaningful way. The exact combination layer to use differs per application, but some very basic combination layers include concatenation, fully connected layers (as shown in fig. 2.6), and addition or subtraction.
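As an illustration of fig. 2.6, the following Keras-style sketch (arbitrary layer sizes, chosen for the example) builds two branches and merges them with a concatenation combination layer.

import tensorflow as tf
from tensorflow.keras import layers, Model

# Two distinct inputs, A and B, each processed by its own hidden layer.
input_a = layers.Input(shape=(64,), name="input_a")
input_b = layers.Input(shape=(64,), name="input_b")

hidden_a = layers.Dense(32, activation="relu", name="hidden_a")(input_a)
hidden_b = layers.Dense(32, activation="relu", name="hidden_b")(input_b)

# Combination layer: here a simple concatenation of the two branches,
# followed by a fully connected output layer.
combined = layers.Concatenate(name="combination")([hidden_a, hidden_b])
output = layers.Dense(1, activation="sigmoid", name="output")(combined)

model = Model(inputs=[input_a, input_b], outputs=output)
model.summary()

Note that many siamese designs share the weights between the two branches (one layer applied to both inputs); the variant above with separate hidden layers mirrors the figure.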

2.1.7 Image classifiers

Image classifiers are algorithms that can recognize or predict what is represented in an image. Neural networks can also be trained to perform classification. Classification requires a neural network to output the probability of a certain class of object being in an image. This is a core function of computer vision, and it has therefore become a very popular field of research. A trained classifier performs a number of low-level and high-level feature extractions: it looks for edges, shapes and even specific objects like heads. These features are used in almost all computer vision applications, and for this reason a pre-trained classifier can also be reused. When reusing a pre-trained classifier as a feature extractor, the network is used to extract generic features which are then processed by another neural network trained for a different application. Another reason classifiers are popular is the annual Imagenet Challenge [18]. The Imagenet Challenge compares the top-1 and top-5 accuracy of different classifiers. Many of the big players in the field of artificial intelligence have participated in one way or another, and the top-ranking networks sometimes differ from each other by as little as 0.05% accuracy.

2.2 Machine Learning APIs


For training the networks in this work, Tensorflow was used. Because Tensorflow does not yet run on mobile hardware, another API called Metal was used. Metal is developed by Apple and only runs on iOS devices; it provides a small API with most of the commonly used layers in neural networks. Metal is currently the only neural network API that can run all of its operations on the GPU of a mobile phone, which is required to run inference of any computer vision network at real-time frame rates.

2.2.1 Tensorflow

In this work Tensorflow is used for training the neural network. Tensorflow is an API with a python interface. Python is an interpreted language, which means it interprets every line of code during execution; this becomes very inefficient for repeated executions of the same bit of code. For this reason Tensorflow uses a graph describing the path of the training data, the computations applied to it, and different data modification operations; these operations and paths are also called tensors (inspired by mathematical tensors). A graph of tensors is set up using Tensorflow API calls from python; no actual data is processed during this setup phase of the graph. In this setup phase the inputs of the neural network are defined, as well as the tensors that act on these inputs; the tensors can be strung together to create complex computation graphs [21]. After the graph is set up as required for a certain neural network, a Tensorflow session is started. This session can then be used to feed data into the graph and evaluate certain tensors; in this session the graph cannot be changed anymore. Because the graph is fixed during execution, Tensorflow can not only use native implementations of certain operations but can also optimize the execution order and data path between the different operations. As a result, Tensorflow can be used to implement complex neural networks in a high-level language like python, while still taking advantage of a very low-level optimized implementation.
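A minimal example of this setup-then-execute pattern, written against the TensorFlow 1.x graph-and-session API used at the time of this work (reachable through the compat.v1 module in current TensorFlow releases):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Setup phase: define the graph; no data is processed here.
x = tf.placeholder(tf.float32, shape=(None, 3), name="input")
w = tf.Variable(tf.random_normal((3, 1)), name="weights")
y = tf.matmul(x, w, name="output")

# Execution phase: the graph is fixed; data is fed in and tensors evaluated.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]})
    print(result)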

2.2.2 Metal


Apple controls both the hardware and the software on their devices; this enables them to support running neural networks on the GPU of all devices from the iPhone 6 onwards. Of course, more recent devices have a more powerful GPU, able to run a neural network at a higher frame rate than older devices, so the frame rate varies from device to device.

2.3 Datasets

As stated in section 1.2.1, neural networks are often trained using big sets of annotated data [23]. For training very complex neural networks that are many layers deep, very big datasets are needed to enable the network to find complex similarities between pictures. With a small set of training data, a very deep neural network could start to learn irrelevant and tiny features specific to the training set; this behaviour is called over-fitting. The problem with over-fitting is that it can make a network non-generalizable, meaning that the network performs well on the training data but not on unseen data. To prevent over-fitting, the annotated data must be of a significant size to enable the neural network to find big-picture features and not focus on features only found in the training set. Reducing over-fitting allows for better generalization, meaning that when the neural net is used for inference on unseen data it produces better results.



2.4 Project Goals and Specifications

This section establishes the goals and specifications of the project, and also determines the design constraints and limitations.

2.4.1 Problem

In a video or live stream a subject can move around, and often the camera moves as well. This results in transformations of the subject in the 3D world; these transformations include translations, scaling and changes in shape. Projected onto the 2D plane of a camera image, this results in translations, anisotropic scaling and shape changes. Shape detection is part of a research field called segmentation and is not within the scope of most tracking algorithms. Most state-of-the-art object tracking algorithms are able to accurately detect the translations and isotropic scaling of a target. Unfortunately, most trackers are unable to handle anisotropic scaling. In practice this means that the bounding box of a target during the tracking process can only change in size by using the same scaling factor for both width and height. This behaviour is inherent to the design of most trackers, as they are often initialized with a subset of an original image containing only the target, called a patch. This patch is then compared to a search window, which can be the full original image or a subset of it; the location where the comparison returns the greatest activations is then assumed to be the new location of the target. To identify scale changes of the target, a scale pyramid is often used. A scale pyramid is a set of images containing the search window and slightly bigger and smaller versions of it, as can be seen in fig. 2.7. The comparison described earlier is performed on each of the different scales, and whichever scale has the greatest activations is assumed to be the new scale of the target [25]. While tracking algorithms using a scale pyramid have been shown to be very effective, they are not able to do anisotropic scaling, or any other transformations like rotation or shearing. The approach is also inefficient, as running the comparison for each scale linearly increases the computational complexity with the number of scales.
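To make the cost argument concrete, the sketch below runs the same patch comparison once per scale and keeps the strongest response; the compare scoring function is a hypothetical placeholder supplied by the caller, and every added scale is one more full comparison pass.

import numpy as np

def scale_pyramid_search(search_window, patch, compare,
                         scales=(0.95, 1.0, 1.05)):
    """Run compare(window, patch) once per scale; the scale with the
    greatest activation is assumed to be the new scale of the target."""
    best_scale, best_score = None, -np.inf
    for s in scales:
        h, w = search_window.shape[:2]
        # Resize the search window by the scale factor (nearest-neighbour
        # index lookup keeps this sketch dependency-free).
        rows = np.clip((np.arange(int(h * s)) / s).astype(int), 0, h - 1)
        cols = np.clip((np.arange(int(w * s)) / s).astype(int), 0, w - 1)
        scaled = search_window[np.ix_(rows, cols)]
        score = compare(scaled, patch)   # one full comparison per scale
        if score > best_score:
            best_scale, best_score = s, score
    return best_scale, best_score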



Figure 2.7: A scale pyramid showing a search window in original scale, slightly upscaled, and slightly downscaled to detect scale changes. (from [26])

2.4.2 Goal

The performance problems highlighted earlier are often an afterthought when it comes to applying trackers on powerful hardware. Tracking algorithms are comparatively light and can be executed at very high frame rates, especially when run on a GPU [4]. Maintaining high frame rates starts to become a problem when executing trackers on less powerful hardware, for example a mobile phone. Another problem encountered when executing complex algorithms on a phone is an increased load on the battery, draining it faster than is acceptable from an end user's perspective. Mobile hardware is a relevant platform for the execution of tracking algorithms; such devices all contain video cameras, which can be used for augmented reality applications. Solving the problems that come with running complex tracking algorithms can enable more advanced augmented reality applications. It is therefore an important research subject that can enable advancement in the field of augmented reality and its applications on mobile hardware. With the problems of section 2.4.1 in mind, the main goals of this research are:

Performance: Minimize the computational complexity of the tracker while maintaining a respectable accuracy.

Scaling: Enable anisotropic tracking of a target's scale.

Simplicity: The tracker must be simple in design to allow for efficient implementation on mobile hardware.


These goals should be achieved with either a limited loss in accuracy or, preferably, an increase in accuracy.

2.4.3 Proposed solution

Currently, one of the most state-of-the-art tracking algorithms is a neural-net-based tracker [27]. This tracker is based on a siamese network structure and can track a target over multiple scales with very high accuracy. It uses a neural net architecture to extract meaningful feature maps for comparison. Unfortunately, this tracker still uses a scale pyramid to track the target's scale, which prevents any anisotropic scaling. It also has increased computational complexity, as the neural net uses the structure of Alexnet [1] as a feature extractor, which is much more demanding in terms of required computations than hand-crafted feature extractors. This tracker would not achieve real-time frame rates when run on mobile hardware [28]. Nonetheless, the idea is a promising one and has potential for improvement, possibly allowing it to be made suitable for mobile hardware.


Chapter 3

Related work

3.1 One shot learning

Neural networks classically contain a single data path from input to output. This is because neural networks are often trained to perform a single task, and this task does not change during the inference stage. If the task changes during inference, the neural network needs to be retrained in order to perform the new task. Often it is not an option to retrain a neural network for each new task, because there might be a lack of training data or the hardware used for inference is not powerful enough to perform SGD. This problem of having limited training data, or even only one example, is known as one shot learning. Tracking is in essence a one shot learning problem, as an algorithm is given one example of the target and asked to track this target over multiple frames.

One of the first papers attempting to solve the problem of one shot learning by re-training a pre-trained neural network was released in 2013 [31]. It showed that a pre-trained network was better at generalizing to a new class than a network trained from scratch. A more recent paper utilizes a Neural Turing Machine to perform one shot learning; the Turing machine consists of a controller, such as a feed-forward network or a recurrent neural network [32], that interacts with an external memory module [33]. This machine has long-term storage in the network weights, which are slowly updated, and short-term storage in the form of the aforementioned external memory module. This structure achieved better results than a human on a few shot problem using the Omniglot [34] dataset, and was a big step forward compared to comparable methods at that time. The "matching networks for one shot learning" paper [35] showed that, besides designing for one shot learning as in [33], training for one shot learning can improve results even further. The authors proposed a network which was designed to be trained for one shot learning and showed significant improvement over the previously described methods, with 98% accuracy on 5-way challenges after 1-shot learning. These papers have laid out some of the best-practice design methods which are now used as a base by many other researchers working on one shot learning. The applications are widespread, from simple visual recognition [36] all the way to object segmentation in video [37].

3.2 Tracking

This section presents the current state of research concerning tracking algorithms. The tracking algorithms presented here were selected based on their tracking performance, whether they work on a frame-by-frame basis, and whether they can achieve real-time frame rates.

3.2.1 Tracking using deep regression

In [38] a method for subject tracking using a feed-forward neural network is described. The network uses the search window of the previous frame containing the target and a search window of the new frame as inputs. The neural network then applies a number of convolutions to both inputs and combines the outputs of the convolutions using fully connected layers. The neural network is trained to predict translations as well as anisotropic scaling of the target from search window to search window. This is an architecturally simple method which achieves a high speed (100 FPS) on a Titan X GPU. The shortcoming of this method, and why it won't be used, is that it cannot look further back than one frame, so any occlusion longer than that will result in the target being lost and never recovered.

3.2.2 Tracking using a CNN and recurrent layers

A network called ROLO (recurrent YOLO), described in [39], utilizes the well-known YOLO network [40] and combines it with a layer of LSTM [32] cells to improve tracking of a single subject compared to individual detections each frame. This network bested most of the competitors in the OTB-30 benchmark [41]. The drawback of the network is its computational intensity: where the YOLO network by itself was already an expensive network to run, ROLO adds another layer to it. There might be an option to replace YOLO with a lighter architecture, e.g. SqueezeDet. The network is inefficient by design, as it takes the final predictions of a different network, designed for another purpose, and adds to them. Besides the architecture being computationally expensive, there is also no method of re-acquiring a target after it has been lost for a longer period of time.


3.2.3 Fully convolutional neural network for object tracking

In [42] a fully convolutional siamese tracker is proposed, in which a feature map of the exemplar is cross-correlated with a feature map of every search image to find the target to be tracked. This system uses Alexnet [1] for the convolutions and shows state-of-the-art quantitative and qualitative results, while also running at a high FPS. It remains to be seen, however, how sensitive the network is to a change in the target's pose, as the feature map of the target is never updated. Due to the good results on the VOT benchmark and the use of neural networks, this network is evaluated in section 4.1.2.

3.2.4 Learnet

Learnet [43] is by some of the same authors as the network presented in [42]. It proposes another siamese network structure, but it not only uses the siamese branches as feature extractors: one of the branches is trained to update the weights of a convolution in the other branch, as seen in fig. 3.1.

Figure 3.1: The structure of the Learnet (from [43])

The weight matrix M of the convolution that is changed during inference can be generated using the equation M = v · diag(d) · hᵀ. During inference only the diagonal d is updated; v and h, learned during offline training, stay the same. This greatly reduces the number of parameters to update. The result of updating the weights of a convolution is an improved tracking accuracy compared to a siamese network whose weights do not change during inference. The siamese design proposed in this paper is an interesting idea, but due to its complexity it is not a candidate for implementation on mobile hardware.

3.2.5 Visual Tracking by Reinforced Decision Making


Drawbacks of this method are the complexity of the network and the requirement to recompute a feature map of the subject on every pass through.

3.2.6 Correlation Filter based tracking

In [6], a method is presented to adapt the correlation filter algorithm [45] to an end-to-end neural net training process. It is a step forward from hand-crafted correlation filters to correlation filters learned using training data; the paper shows that even a shallow neural network using correlation layers can achieve a similar or better level of precision than deeper neural networks. The paper was only released in April of this year, and since the correlation filter used is non-standard, it will be a challenge to implement and debug.

3.2.7 Tracking using Recurrent net and LSTM Cells



3.2.8 Tracking by detection

This method utilizes frame-by-frame detection and adds a method of data association to track multiple targets, similar to [48]. It can be implemented fairly easily by taking any of the currently available detectors and adding an algorithm such as the solution path algorithm [49] for tracking. This can produce good results, as shown in [49], and potentially improve the qualitative performance of a tracking-by-detection algorithm. It will, however, not be used in this thesis, due to its focus on multi-object tracking.

3.3 Optimizing network performance

This section presents the papers that focus on optimizing the performance of neural networks rather than accuracy; this means a trade-off between memory pressure or computations and total accuracy was considered. The techniques identified in these papers might help optimize the performance of a neural net tracker.

3.3.1 Deep compression



3.3.2 SqueezeNet

In [29] a network architecture is proposed that maintains the same accuracy as AlexNet [1] on the Imagenet dataset, while reducing the model size by up to 500 times. To achieve this performance increase while maintaining the same accuracy, the authors proposed a new module called the fire module (seen in fig. 3.2). The fire module reduces the number of expensive 3 × 3 convolutions, using a combination of 1 × 1 and 3 × 3 convolutions after a squeeze layer. The cheaper 1 × 1 convolutions used in the squeeze layer reduce the number of channels in the feature map to a preset value called s_1×1. After this, 1 × 1 and 3 × 3 convolutions are used to expand the number of channels in the feature map to preset values called e_1×1 and e_3×3 respectively. This prevents any convolution from having both a large number of input channels and a large number of output channels, which is very costly according to section 2.1.5. The combination of a 1 × 1 and a 3 × 3 convolution works especially well as they tend to cooperate: the 1 × 1 convolutions focus more on channel relationships, and the 3 × 3 convolutions focus more on spatial information.

Figure 3.2: The Fire Module proposed in SqueezeNet (from [29])

The main contribution of SqueezeNet was the fire module, which showed that reducing model size and computations can be done not only by compression but also by making smart decisions in the architecture. To achieve the aforementioned 500 times model size reduction, SqueezeNet also utilized Deep Compression, explained in section 3.3.1.
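A Keras-style sketch of a fire module as described above; the filter counts are the values the SqueezeNet paper uses for its fire2 module, and the input shape is an arbitrary example.

import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, s1x1, e1x1, e3x3):
    # Squeeze layer: cheap 1x1 convolutions reduce the channel count to s1x1.
    squeeze = layers.Conv2D(s1x1, 1, activation="relu")(x)
    # Expand layers: parallel 1x1 and 3x3 convolutions restore the channels.
    expand_1 = layers.Conv2D(e1x1, 1, activation="relu")(squeeze)
    expand_3 = layers.Conv2D(e3x3, 3, padding="same",
                             activation="relu")(squeeze)
    # The two expand outputs are concatenated along the channel axis.
    return layers.Concatenate()([expand_1, expand_3])

inputs = tf.keras.Input(shape=(96, 96, 64))
outputs = fire_module(inputs, s1x1=16, e1x1=64, e3x3=64)
model = tf.keras.Model(inputs, outputs)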

3.3.3 SqueezeDet


The last 3 layers are called ConvDet in the paper. SqueezeDet reduced energy consumption 84× compared to a previous work called "Faster R-CNN" (FRCNN) [53], while achieving a similar accuracy and running at real-time speeds (57.2 FPS). It combines ideas from FRCNN and YOLO [40], as it only uses convolutions for its output layers and uses the k-means clustered anchors (explained in section 3.3.3.1) proposed in FRCNN. The output layer does classification as well as region proposal, similar to the YOLO network.

3.3.3.1 Anchors

The anchors of SqueezeDet are default bounding boxes determined by k-means clustering of the bounding boxes in the annotated data. This method has a statistical advantage over simple square boxes, as it takes into account the specific size and aspect ratio of the classes. The anchors are arranged in a grid, with every default box repeated at every grid position. One of the goals of the network is to regress which anchors to use when presented with an input. To accommodate more fine-grained localization, the network also predicts deltas for every anchor, such that every anchor can be adjusted slightly, as can be seen in fig. 3.3. The total number of anchors is n_anchors = output_width · output_height · k_clusters. Every anchor is assigned a certain probability of a class being there, based on a class and confidence regression. This probability can be used to filter the output and select which of the predicted bounding boxes to keep.



3.3.3.2 Loss function

The loss function used by SqueezeDet to train the network was another improvement over FRCNN, as it enabled end-to-end training of the neural net as opposed to a four-step training strategy [53]. The loss function of SqueezeDet consists of 3 parts. The first part is the deltas loss, which calculates the loss of the predicted deltas δ_kj^pred compared to the ground truth deltas δ_kj^GT. This loss is a sum of the square distances between the respective deltas, seen in eq. (3.1). The deltas loss is normalized with respect to the number of objects N_obj, and an input mask I is used to only train relevant deltas. The λ_bbox factor is used later when combining the different parts of the total loss function.

$$\text{Deltas}_{loss} = \frac{\lambda_{bbox}}{N_{obj}} \sum_{k=1}^{n_{anchors}} \Big( I_k \sum_{j=1}^{4} \big(\delta^{GT}_{kj} - \delta^{pred}_{kj}\big)^2 \Big) \tag{3.1}$$

The second part of the loss function, seen in eq. (3.2), is the confidence loss, which trains the neural network to select the right anchor for a detected object. The predicted confidence γ_k^pred and ground truth confidence γ_k^GT are compared using a square distance; the loss function also penalizes any confidence that does not correspond to a ground truth anchor. To adjust the influence of the positive and negative confidence loss, the factors λ_conf_pos and λ_conf_neg are used.

$$\text{Confidence}_{loss} = \sum_{k=1}^{n_{anchors}} \Big( \frac{\lambda_{conf\,pos}}{N_{obj}} I_k \big(\gamma^{GT}_{k} - \gamma^{pred}_{k}\big)^2 + \frac{\lambda_{conf\,neg}}{n_{anchors} - N_{obj}} \bar{I}_k \big(\gamma^{pred}_{k}\big)^2 \Big) \tag{3.2}$$

The last part is the class loss, seen in eq. (3.3); this is used to train the network to detect different classes of objects. The output of the neural network is normalized with a softmax activation function, and the loss is a simple cross-entropy loss for classification, where l_c^k is a one-hot encoded ground truth vector and p_kc the output of the softmax.

$$\text{Class}_{loss} = \frac{1}{N_{obj}} \sum_{k=1}^{n_{anchors}} \sum_{c=1}^{n_{classes}} I_k\, l^{k}_{c} \log(p_{kc}) \tag{3.3}$$

The three separate equations (3.1), (3.2) and (3.3) are summed together, and the lambda factors are used to adjust their effect on the final output. The factors used in the SqueezeDet paper are λ_bbox = 5, λ_conf_pos = 75 and λ_conf_neg = 100. The confidence loss and anchor deltas loss are also used in section 4.3 of this thesis.
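A NumPy sketch of the combined loss as described above, assuming precomputed per-anchor arrays (the array layout is an assumption made for this illustration). Eq. (3.3) is transcribed with the conventional minus sign so the cross-entropy term is positive; the λ values are the ones quoted from the paper.

import numpy as np

def squeezedet_loss(deltas_pred, deltas_gt, conf_pred, conf_gt,
                    class_pred, class_gt, mask,
                    lam_bbox=5.0, lam_conf_pos=75.0, lam_conf_neg=100.0):
    """mask: 1 for anchors assigned to a ground-truth object, else 0.
    deltas_*: (n_anchors, 4); conf_*: (n_anchors,);
    class_*: (n_anchors, n_classes), with class_pred being softmax
    outputs (strictly positive) and class_gt one-hot vectors."""
    n_anchors = mask.shape[0]
    n_obj = mask.sum()
    # Eq. (3.1): squared distance between predicted and ground-truth
    # deltas, only for masked anchors, normalized by the object count.
    deltas_loss = lam_bbox / n_obj * np.sum(
        mask[:, None] * (deltas_gt - deltas_pred) ** 2)
    # Eq. (3.2): confidence regression, with a separate penalty for
    # anchors that do not correspond to a ground-truth object.
    conf_loss = np.sum(
        lam_conf_pos / n_obj * mask * (conf_gt - conf_pred) ** 2
        + lam_conf_neg / (n_anchors - n_obj) * (1 - mask) * conf_pred ** 2)
    # Eq. (3.3): cross-entropy classification loss over masked anchors.
    class_loss = -np.sum(mask[:, None] * class_gt
                         * np.log(class_pred)) / n_obj
    return deltas_loss + conf_loss + class_loss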

3.3.4 MobileNets


Depthwise separable convolutions had been used before [55], but never for mobile-specific applications. MobileNets' focus on mobile applications also shows in the structure of the network: it is a simple design, making it easier to implement on mobile hardware, which has a reduced instruction set compared to desktop hardware.

3.3.4.1 Depthwise separable convolution

A depthwise separable convolution splits a normal convolution into two parts: every input channel is first convolved with a single kernel (3 × 3 or greater) per channel, after which a normal 1 × 1 convolution combines the depthwise convolutions and, if necessary, increases the number of channels. This way the number of parameters is only n_parameters = (3 · 3 · n_in_channels) + (1 · 1 · n_in_channels · n_out_channels), instead of the number of parameters shown in section 2.1.4. As a drop-in replacement, the depthwise separable convolution is only slightly worse than a normal convolution; MobileNets saw a drop of 1.1% when using depthwise separable convolutions instead of normal convolutions.
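The saving is easy to quantify. The comparison below contrasts a normal 3 × 3 convolution with its depthwise separable replacement for the same channel counts, ignoring biases; the channel sizes are arbitrary example values.

def normal_conv_params(c_in, c_out, k=3):
    # A normal convolution: one k*k*c_in kernel per output channel.
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    # Depthwise: one k*k kernel per input channel; pointwise: a 1x1
    # convolution combining them, as in the formula above.
    return (k * k * c_in) + (1 * 1 * c_in * c_out)

c_in, c_out = 64, 128
print(normal_conv_params(c_in, c_out))      # 73,728
print(separable_conv_params(c_in, c_out))   # 8,768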

3.4 Affine Transformations

The objective of the tracking network is to track a target and update the bounding box of that target in consecutive frames; most of these updates can be described in the form of affine transformations [56]. An affine transformation works by recalculating the position of a point through a matrix multiplication with an affine matrix, commonly referred to as θ, seen below.

$$\theta \cdot \begin{bmatrix} x_{old} \\ y_{old} \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \cdot \begin{bmatrix} x_{old} \\ y_{old} \\ 1 \end{bmatrix} = \begin{bmatrix} x_{new} \\ y_{new} \end{bmatrix}$$
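Applying θ to a point is a single matrix multiplication, as the small NumPy example below shows; the matrix scales x by 2 and translates by (3, 1), arbitrary values chosen for illustration.

import numpy as np

# theta = [[a11, a12, a13], [a21, a22, a23]]; here: scale x by 2,
# leave y untouched, and translate by (3, 1).
theta = np.array([[2.0, 0.0, 3.0],
                  [0.0, 1.0, 1.0]])

point_old = np.array([5.0, 4.0, 1.0])   # homogeneous coordinates (x, y, 1)
point_new = theta @ point_old
print(point_new)                         # [13.  5.]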



3.4.1 Spatial Transformer Networks

Spatial transformer networks, proposed in [57], were designed to overcome the lack of scaling and rotation invariance in CNNs. The spatial transformer can be added between layers in any convolutional network, as seen in fig. 3.4. The spatial transformer network transforms the input according to a predicted affine transformation before feeding it to the rest of the neural network. It requires no extra modifications and can be trained in place; thus it is very versatile and could work in a lot of networks.

Figure 3.4: Spatial transformer network (from: [57])

The localisation network aims to predict an affine matrix that transforms a grid of points to cover only the important part of an image. This grid is then used to sample the original image and output an image that is correctly scaled, rotated and translated. The authors suggested that a convolutional or fully connected network could be used for the localisation network. The paper showed an accuracy increase of up to 2% when using multiple spatial transformer networks between convolutions, but this also meant adding a substantial amount of complexity to the network; for that reason we do not use it.

3.4.2 CNN architecture for geometric matching



Figure 3.5: The siamese network structure of the geometric matching network (from: [58])

The authors demonstrated their matching layer to outperform more common combination layers used in siamese networks, such as subtraction and concatenation. We use the matching layer in both networks proposed in this thesis, due to its well-argued performance and relative simplicity.

3.4.2.1 Loss function

In order to train for different geometric transformations, the loss function was designed so it could be used for any geometric transformation. The authors did so by not training directly on the parameters of a transformation, but rather expressing the loss function on a transformed grid of points, which is transformation agnostic. The score grid was an evenly spaced grid of 400 points in an image whose top-right corner is at (1, 1) and bottom-left corner at (−1, −1). The loss, seen in eq. (3.4), calculates the square distance between the two score grids, known as τ: one transformed with the ground truth transformation θ_GT, the other with the predicted transformation θ̂. The distances between all the points are summed and divided by the number of points, giving the average square distance between the points. This loss function is also used in the loss function of section 4.2.

$$\mathcal{L}(\hat{\theta}, \theta_{GT}) = \frac{1}{N} \sum_{i=1}^{N} d\big(\tau_{\hat{\theta}}(i), \tau_{\theta_{GT}}(i)\big)^2 \tag{3.4}$$
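A sketch of this transformation-agnostic loss: both affine transformations are applied to the same 20 × 20 grid of 400 points spanning (−1, −1) to (1, 1), and the mean squared point distance is returned. The function names are made up for this illustration.

import numpy as np

def make_grid(n=20):
    # Evenly spaced grid of n*n = 400 points covering [-1, 1] x [-1, 1],
    # in homogeneous coordinates (x, y, 1).
    xs, ys = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
    return np.stack([xs.ravel(), ys.ravel(), np.ones(n * n)])  # (3, 400)

def grid_loss(theta_pred, theta_gt, grid=make_grid()):
    # Eq. (3.4): mean squared distance between the two transformed grids.
    pts_pred = theta_pred @ grid       # shape (2, 400)
    pts_gt = theta_gt @ grid
    return np.mean(np.sum((pts_pred - pts_gt) ** 2, axis=0))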

3.5 Datasets and Benchmarks

In order to train a neural network for a tracking application, a dataset containing annotated video sequences is required. The dataset must be of a significant size to prevent over-fitting of the neural network and to allow for better performance on unseen video sequences. In general, a dataset used for tracking purposes should preferably contain the following data:

• A sufficient number of different videos
• Frame-by-frame annotated bounding boxes, preferably with rotation
• A dense mask or single value describing the visibility of an object

A benchmark is important to compare different tracking algorithms; a good benchmark for a tracking algorithm uses a wide range of different video sequences. It is also important that the measured performance indicators are well argued and explained. Finally, a benchmark should preferably be used in a yearly challenge that compares tracking algorithms.

3.5.1 Imagenet video dataset

The Imagenet dataset [18] is most well known for its use in the "Imagenet Large Scale Visual Recognition Challenge" (ILSVRC). The dataset contains around 1.2 million training images with 200 different classes of objects. Less well known is the video dataset, used in the object detection from video challenge. As of 2017 this dataset consists of 4000 different training sequences with annotated bounding boxes. The dataset also contains an additional 1314 validation sequences of annotated data, which can be used to test performance on data not seen during training. The training and validation sequences both contain the following 30 classes of objects: airplane, antelope, bear, bicycle, bird, bus, car, cattle, dog, domestic cat, elephant, fox, giant panda, hamster, horse, lion, lizard, monkey, motorcycle, rabbit, red panda, sheep, snake, squirrel, tiger, train, turtle, watercraft, whale, zebra. The annotations consist of a square bounding box without rotation, and a rudimentary occlusion flag that is set when part of the object is occluded. The annotations are frame by frame, and there can be multiple annotated objects in a single sequence. It must be noted that this dataset does not contain annotations of people. It will still be used to train both networks, as it is the largest and most varied dataset available.

3.5.2 Multiple object tracking benchmark & dataset

The multiple object tracking benchmark (MOT) [47] is used to compare tracker performance on simultaneous tracking of multiple objects. The ground truth of the benchmark annotates multiple trajectories per frame, where a trajectory is the path of a single target during the whole sequence. The benchmark tests a number of different metrics, but the most important one is MOTA, which stands for multiple object tracking accuracy. MOTA is calculated using equation (3.5), where the number of false negatives FN_t is the number of targets annotated in frame t that are not being tracked, FP_t is the number of tracked targets that do not correspond to any target annotated in frame t, and IDSW_t is the number of identity switches, i.e. trackers that track a target annotated with a different number in the current frame than the one tracked by the same tracker in the previous frame. Finally, GT_t is the number of ground truth targets annotated in frame t.

MOTA = 1 - \frac{\sum_t (FN_t + FP_t + IDSW_t)}{\sum_t GT_t}    (3.5)

The authors of [47] note that while MOTA is a good indicator of overall performance, it is debatable whether this number alone can serve as a good performance measure. Another metric used to measure performance is the multiple object tracking precision, which measures the average overlap of all tracked bounding boxes that have been matched to a ground truth annotated box [47].
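As an illustration, the following sketch computes MOTA from per-frame counts according to equation (3.5). It assumes the matching of tracker output to ground truth, which produces the per-frame counts, has already been done; the function name and input format are illustrative.

def mota(per_frame_counts):
    """per_frame_counts: iterable of (fn, fp, idsw, gt) tuples, one per frame."""
    fn = fp = idsw = gt = 0
    for fn_t, fp_t, idsw_t, gt_t in per_frame_counts:
        fn += fn_t; fp += fp_t; idsw += idsw_t; gt += gt_t
    return 1.0 - (fn + fp + idsw) / gt

# Example: three frames of five targets, with two misses, one false positive
# and one identity switch in total
print(mota([(1, 0, 0, 5), (0, 1, 1, 5), (1, 0, 0, 5)]))  # ~0.733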

The MOT benchmark is used to compare the performance of all the trackers submitted to the yearly MOT Challenge; the 2017 challenge was open for entries until May 31st. To aid participants of the challenge, a dataset is made available to train competing neural networks on. The dataset consists of 21 video sequences for training, in which a total of 1638 different people are annotated. The annotations consist of frame by frame bounding box annotations for each person in frame, and a visibility percentage from 0 to 1 describing how much of a person is visible. The dataset will be used to train the network presented in section 4.2 for the MOT challenge [47] submission.

3.5.3 Visual object tracking benchmark


Chapter 4

Tracking algorithm

The problem description in section 2.4 described the problems with current tracking implementations. They often do not support anisotropic scaling, which means the aspect ratio of the bounding box of the target is constant, and changes in the target's scale are often detected with the use of a scale pyramid [26] containing differently scaled search windows. The problem with using a scale pyramid is that it linearly increases the computations required, because the tracking algorithm has to run for each scale. A tracker using a scale pyramid is also harder to implement on mobile hardware due to the lack of available API calls to perform scaling operations and batching of the different scales. This chapter will first present an evaluation of “SqueezeDet” and the “Fully-Convolutional Siamese Networks for Object Tracking” in section 4.1. Then a network for regression of an affine transformation adapted for tracking applications will be presented in section 4.2. Finally, a novel network performing object detection based on a single exemplar will be presented, including the filtering required to apply it to a tracking application, in section 4.3.

4.1 Evaluating related works

In section 2.4.2 the main goals of the thesis were determined to be: increasing the performance regarding the speed of the tracker, enabling tracking of scale changes and anisotropic scaling, and creating an algorithm that is simple to implement. Some of the related works already focus on these goals; for this reason two related works are evaluated in depth. The first is “SqueezeDet” (SQDet), as it is one of the lightest and fastest networks [54] that does object detection. The goal is to implement it in iOS using the Metal API; this should show whether or not a state-of-the-art network (like SQDet) designed to be lightweight is able to run in real-time on mobile hardware. The second is the “Fully-Convolutional Siamese Networks for Object Tracking” [42]; this should give insights into the challenges regarding the implementation and training of a Siamese tracker in Tensorflow, and whether the network in the paper can be altered to fulfill the requirements of section 2.4.2.

(43)

Release date        iPhone model   Speed in fps
19 September 2014   iPhone 6       14
9 September 2015    iPhone 6S      47
21 March 2016       iPhone SE      47
16 September 2016   iPhone 7       57

Table 4.1: Comparing the speed of SqueezeDet on different iPhone models

4.1.1 SqueezeDet performance

At the time of release SqueezeDet was the lightest and fastest network for object detection, and thus a perfect candidate for an experimental implementation in iOS. The Metal API natively supports convolutional layers, and though the fire module is not natively supported, it could be implemented using an intermediate image and an image offset, as can be seen in appendix A. It was possible to execute the whole neural net on the GPU using the Metal API; this way only the final bounding box calculation and filtering needed to be done on the CPU. To filter the bounding boxes, first the anchors with a confidence score below a threshold of 0.4 were excluded from consideration. For the remaining anchors the bounding boxes were calculated using the predicted deltas, and the final bounding boxes were then sorted by confidence. Starting with the bounding box that had the highest confidence score, the IOU with every other bounding box was calculated, and any box with an IOU above 0.4 was dropped. This process of filtering the detections is known as non-maximum suppression (NMS) [63]. Doing the NMS on the phone made the FPS fluctuate by ±5 FPS, depending on how many detections were kept after the confidence threshold. The neural network and bounding box filtering were executed on different iPhone models that support the Metal API; the results can be seen in table 4.1.
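A minimal sketch of this filtering step is shown below, with the confidence and IOU thresholds of 0.4 described above; the box format and helper names are illustrative and not the thesis code.

import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.4, iou_thresh=0.4):
    # Drop low-confidence detections, then greedily keep the highest-scoring
    # box and suppress any remaining box that overlaps it too strongly.
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep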

4.1.1.1 Conclusion



4.1.2 Fully Convolutional Siamese Tracker

As stated in section 1.2.2, CFNet [6] is one of the most recent tracking algorithms utilizing deep learning. Since this tracking algorithm was only released in April 2017, after the work on this thesis was started, CFNet has not been evaluated in this thesis. The performance of CFNet is state-of-the-art, but its use of novel layers in its network architecture will not be supported by the Metal API in the foreseeable future. For this reason an earlier paper by the same authors, which introduced some of the ideas used in the CFNet paper, will be evaluated instead, namely [42]. In that paper a smaller version of AlexNet is used in a tracking application.

4.1.2.1 Tracking algorithm

The tracking algorithm in the paper uses a siamese network (seen in fig. 4.1) to perform a cross correlation of an exemplar and a search window. The output of the cross correlation is a score map, where the highest activation represents the location of the target in the search window.

Figure 4.1: The Siamese network structure used in [42]. φ represents the neural network (from: [42])

In order to perform the cross correlation, a feature map of the target is first generated. This is done by cropping a part of the image known to contain the target; the crop needs to include some amount of context to enable robust performance [6]. The size of the crop including the context can be calculated using equation (4.1), where p is a parameter determining the amount of context. The p value used in this paper is 0.5, resulting in 50% of the crop being context.

context = p \cdot (target_{width} + target_{height})

size_{crop_{ex}} = \sqrt{(target_{width} + context) \cdot (target_{height} + context)}    (4.1)
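As a worked example of equation (4.1), the following sketch uses p = 0.5 as in the text; the variable names are illustrative and this is not code from [42].

import math

def exemplar_crop_size(target_w, target_h, p=0.5):
    # Equation (4.1): the context grows with the target's dimensions
    context = p * (target_w + target_h)
    return math.sqrt((target_w + context) * (target_h + context))

# Example: a 100x50 target gets 75 pixels of context in each dimension
print(exemplar_crop_size(100, 50))  # approx 147.9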


The crop is resized to 127 × 127 pixels and fed as exemplar z to the neural net φ; the output is a feature map of 6 × 6 × 128 [42]. The search window is created in a similar manner; its size can be determined from the exemplar crop size and the ratio of the respective input sizes, size_{crop_{se}} = size_{crop_{ex}} \cdot (255/127).

The crop of the search window is done at the last known location of the target. It is resized to 255 × 255 pixels and fed as search window x into the neural net; the resulting feature map is 22 × 22 × 128. The exemplar feature map is cross correlated with the feature map of the search window, resulting in a feature map of size 17 × 17 × 128. This feature map is reduced by adding all channel values at each 2D position, resulting in a score map of 17 × 17 × 1. The score map is upscaled using bilinear interpolation to increase accuracy.
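The cross correlation and channel reduction can be sketched in plain NumPy as follows; the authors implement this as part of the network itself, and the naive loops and names here are purely illustrative.

import numpy as np

def score_map(exemplar_feat, search_feat):
    # exemplar_feat: 6x6x128, search_feat: 22x22x128
    eh, ew, _ = exemplar_feat.shape
    sh, sw, _ = search_feat.shape
    out = np.zeros((sh - eh + 1, sw - ew + 1))  # 17x17
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = search_feat[y:y + eh, x:x + ew, :]
            # Correlate per channel and sum over all channels and positions
            out[y, x] = np.sum(window * exemplar_feat)
    return out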

To enable detection of scale changes, a scale pyramid of search windows is used. The scale of the target in the new frame is assumed to be that of the score map with the highest activation, while the position is the location of the highest activation on that specific score map. The size and location of the bounding box are adjusted based on the detected position and scale, and the same happens to the search window. It must be noted that due to the use of a scale pyramid, the neural net is not able to detect anisotropic scaling.
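Scale selection over the pyramid then reduces to an argmax over the per-scale score maps, sketched below; any damping or penalty on scale changes used in practice is omitted, and the names are illustrative.

import numpy as np

def best_scale_and_position(score_maps, scales):
    # score_maps: list of upscaled 2D score maps, one per scale in `scales`
    peaks = [m.max() for m in score_maps]
    best = int(np.argmax(peaks))
    y, x = np.unravel_index(np.argmax(score_maps[best]), score_maps[best].shape)
    return scales[best], (y, x)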

4.1.2.2 Neural network

As stated before, [42] uses a smaller version of AlexNet, the structure of which can be seen in table 4.2. It is important to note that the network uses no padding around the edges, which means that a convolution will only be applied on a receptive field containing image data. A side effect is that the output size of a layer is not only determined by its stride but also by the kernel size, as seen in equation (4.2).

output_{width} = \frac{input_{width} - kernel_{width}}{stride} + (kernel_{width} \bmod 2)

output_{height} = \frac{input_{height} - kernel_{height}}{stride} + (kernel_{height} \bmod 2)    (4.2)
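Equation (4.2) can be checked against the exemplar column of table 4.2 with a short script. This is only a sketch; it relies on the odd kernel sizes used in this network, for which the kernel mod 2 term plays the role of the usual +1 in the valid-convolution formula.

def output_size(input_size, kernel, stride):
    # Equation (4.2) for 'valid' convolutions/pooling with odd kernels
    return (input_size - kernel) // stride + (kernel % 2)

# Reproduce the exemplar column of table 4.2: 127 -> 59 -> 29 -> 25 -> 12 -> 10 -> 8 -> 6
size = 127
for kernel, stride in [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]:
    size = output_size(size, kernel, stride)
    print(size)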

4.1.2.3 Training method

The neural network is trained using an exemplar taken from a random sequence at a random frame, and a search window taken from a random frame (up to 50 frames later) in the same sequence. The exemplar is generated as usual, but the search window is generated in such a way that the target is always in the middle of the window. For this reason the ground truth labels are a 17 × 17 × 1 map named v, with ones within a radius R around the center and -1 outside R. This score map can then be compared with the predicted score map using the loss in equation (4.3).


Table 4.2: The layers of the neural network used in the “fully convolutional siamese tracker” (adapted from [42])

                                 Activation size
Layer   Kernel   Stride   Exemplar   Search window   Channels
Input                     127x127    255x255         3
Conv1   11x11    2        59x59      123x123         96
Pool1   3x3      2        29x29      61x61           96
Conv2   5x5      1        25x25      57x57           256
Pool2   3x3      2        12x12      28x28           256
Conv3   3x3      1        10x10      26x26           192
Conv4   3x3      1        8x8        24x24           192
Conv5   3x3      1        6x6        22x22           128

Loss = \frac{1}{D} \sum_{i=0}^{D} \log(1 + \exp(-y_i v_i))    (4.3)

This method of training is only possible when no padding is used in the network. If padding were used, the network could over-fit on the zeros used to pad the image. Since the score map does not change, the network could learn to produce a perfect score using only the padded zeros. But since no padding is present, the neural network can only act on the image data, and thus does not develop a bias.
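To make the training targets and loss concrete, the sketch below builds the ground truth map v and evaluates equation (4.3) against a predicted score map. The radius R in score-map cells and all names are assumptions for illustration; this is NumPy, not the Tensorflow training code.

import numpy as np

def ground_truth_map(size=17, radius=2):
    # +1 within radius R of the center, -1 elsewhere
    c = size // 2
    ys, xs = np.mgrid[:size, :size]
    dist = np.sqrt((ys - c) ** 2 + (xs - c) ** 2)
    return np.where(dist <= radius, 1.0, -1.0)

def logistic_loss(pred, label):
    # Equation (4.3): mean elementwise logistic loss over the map positions
    return np.mean(np.log1p(np.exp(-label * pred)))

# Example: a confident, correct prediction yields a loss near zero
v = ground_truth_map()
print(logistic_loss(10.0 * v, v))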

4.1.2.4 Performance
