
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Video Saliency Detection

Using Deep Learning

JAKOB WIESINGER

KTH ROYAL INSTITUTE OF TECHNOLOGY


Video Saliency Detection

using Deep Learning

JAKOB WIESINGER

Master in Computer Science
Date: January 24, 2019
Supervisor: Håkan Lane
Examiner: Jonas Beskow



Abstract


Sammanfattning


Contents

Preliminaries
    Abstract
    Sammanfattning
    Contents

1 Introduction
    1.1 Video saliency detection
    1.2 Scientific contribution
    1.3 Report outline

2 Background
    2.1 Video saliency detection
        2.1.1 Evaluating VSD accuracy
    2.2 Deep neural networks
        2.2.1 Optimization with neural networks
        2.2.2 Convolutional Neural Networks (CNN)
        2.2.3 LSTM and ConvLSTM
    2.3 Optical flow

3 Related Work
    3.1 Previous VSD
    3.2 Innovations on convolutions
    3.3 Pre-trained neural networks for feature extraction
    3.4 Integrated spatiotemporal features

4 Methods
    4.1 Overview of method
    4.2 Designing the first model
        4.2.1 Feature extraction
        4.2.2 Two streams
        4.2.3 Attention module
        4.2.4 ConvLSTM
        4.2.5 Loss function, regularization, batch normalization etc.
    4.3 Training schedule
    4.4 Acquiring training data
        4.4.1 Video datasets
        4.4.2 Image datasets
        4.4.3 An in-house dataset
        4.4.4 Data augmentation
    4.5 Generating optical flow
        4.5.1 Data for explicit training of motion stream
    4.6 Evaluating the model
        4.6.1 Benchmarking model performance
        4.6.2 Evaluation for use in model refinement
    4.7 Refining model design
    4.8 Final training

5 Experiments
    5.1 Learning rate
    5.2 Regularization
    5.3 ResNet-50 vs VGG-16
    5.4 Explicit training of motion stream
    5.5 Interconnecting attention modules
    5.6 Sequence length

6 Results
    6.1 Convergence of final training
    6.2 Examples of predicted saliency
    6.3 Saliency metrics benchmarks compared to other models

7 Conclusion
    7.1 Future work
    7.2 Societal and ethical considerations

Bibliography


Chapter 1

Introduction

1.1

Video saliency detection

Video saliency detection is, simply explained, gaze prediction for an average human viewer, given only the video being watched. A more precise description follows after defining some important terms. The saliency of a medium is a term quantifying its tendency to draw attention from humans. A dynamic saliency map of a video specifies a saliency value for each spatial and temporal part of the video, i.e. for each pixel in each frame. Video Saliency Detection (VSD) is the process of creating a dynamic saliency map based on the video.

The true saliency map would perfectly specify the amount of attention attracted by each pixel, for an average human viewer. But how is attention defined? Given what is known about the acuity of human vision, the allotment of visual attention from a viewer to their visual field is commonly modelled as a symmetric 2D Gaussian centered at the spot of eye fixation. The width of the Gaussian is determined experimentally.

The typical way to approximate the true saliency map is to measure the fixation locations of n viewers with an eye tracking device, put a Gaussian on each fixated spot, take the average of the n maps, and normalize by dividing by the maximum value. (While it seems intuitive to pick as normalization criterion that the integral of the saliency map should equal 1, this has some disadvantages of its own and it is not the convention.)
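As a rough illustration, the minimal Python sketch below follows this procedure; the fixation coordinates, frame size and the value of sigma are arbitrary placeholders, and summing the fixations before blurring is equivalent to averaging the per-viewer maps up to the final normalization.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, height, width, sigma):
    """Approximate a ground-truth saliency map from eye fixations.

    fixations: list of (x, y) pixel coordinates, one per recorded fixation.
    sigma: standard deviation of the 2D Gaussian, in pixels.
    """
    saliency = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        saliency[int(y), int(x)] += 1.0          # binary fixation map
    saliency = gaussian_filter(saliency, sigma)  # Gaussian around each fixation
    if saliency.max() > 0:
        saliency /= saliency.max()               # normalize by the maximum value
    return saliency

# Example: three viewers fixating near the centre of a 480x640 frame.
gt = fixation_map([(320, 240), (310, 250), (330, 235)], 480, 640, sigma=25.0)
```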

Note that the whole pursuit is contingent on the assumption that the gaze location of viewers correlates with video characteristics; the strength of this correlation sets a theoretical limit on VSD performance.

An example of predicted saliency can be seen in Figure 1.1. Example videos can be found at the following link:

http://www.jakobwiesinger.com/link/saliency2018.

Figure 1.1: Heat map of predicted saliency overlaid on the image used for prediction. This is in fact a single frame from a video of such predicted frames.

1.2

Scientific contribution


Experimentation was performed on a number of possible approaches. The final trained model is shown to perform on par with state-of-the-art VSD. Ideas for future investigations are proposed.

This master thesis project was carried out with the video consulting company Entecon. The VSD system was combined with the work of another thesis student, who built a system for video compression which encodes different parts of the image with different quality based on saliency maps. Using the two modular systems in combination, the result is a system for "saliency enhanced video compression". Video regions which are likely to attract viewer attention have a higher visual quality than regions of less importance.

1.3

Report outline


Chapter 2

Background

2.1

Video saliency detection

A system for video saliency detection does the following:

Input: a video consisting of a number of equidistant frames, each being a 2D image with a well defined resolution.

Output: a saliency map for each input frame, with the same resolution as the input, specifying a saliency value for each pixel.

Several different approaches to saliency detection have been investigated. Static VSD processes each input frame independently, as if there was no temporal correlation between frames. This assumption reduces the VSD problem to a number of image saliency detection problems. Dynamic VSD on the other hand attempts to determine saliency for a specific frame using not only that frame, but also other frames in the video.

Like many other topics of computer vision, saliency detection has recently undergone a shift from approaches relying on handcrafted features and models to methods based on deep neural networks. This new approach has achieved state of the art performance for both image and video saliency detection.


2.1.1

Evaluating VSD accuracy

An intuitive way to evaluate VSD is to take a set of test videos and compare the predicted saliency maps with measured gaze fixation maps from a set of experimental subjects watching the test videos. Recording more than one viewer helps reduce the influence of individual viewer characteristics and random error of different kinds.

However, it is not obvious how to quantify the similarity between a predicted saliency map and a ground truth fixation map. Different evaluation metrics have been proposed, and an evaluation of these (in the context of image saliency detection) resulted in the recommendation to use multiple metrics [22], which is a common approach [31, 2].

Some metrics are based on the well known Receiver Operating Characteristic (ROC) curve, which plots true positive rate against false positive rate (in this case for the classification salient/not salient). Area Under Curve (AUC) measures the area under the ROC. There are multiple variations on AUC, such as Shuffled AUC (sAUC), AUC-Judd and AUC-Borji.

Some other relevant metrics are Normalized Scanpath Saliency (NSS), Earth Mover Distance (EMD), Pearson's Correlation Coefficient (CC), Similarity or Histogram Intersection (SIM), Kullback-Leibler Divergence (KL), and Information Gain (IG).

The saliency metric evaluation by Bylinskii et al. [6] argues in favor of selecting a saliency metric with the end application of the system in mind. The choice should also depend on the specific assumptions of the model. For example, shuffled AUC (sAUC), a metric that can compensate for the center bias which is common in saliency datasets, is considered most appropriate when the saliency models involved are not expected to account for this bias.
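The following minimal sketch shows how three of these metrics (CC, KL and NSS) could be computed for a single frame, assuming the predicted and ground truth maps are numpy arrays of equal shape; the exact normalization conventions and epsilon handling vary between published implementations.

```python
import numpy as np

EPS = 1e-8

def cc(pred, gt):
    """Pearson's Correlation Coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    g = (gt - gt.mean()) / (gt.std() + EPS)
    return float((p * g).mean())

def kl_divergence(pred, gt):
    """KL divergence, treating both maps as probability distributions."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return float((g * np.log(g / (p + EPS) + EPS)).sum())

def nss(pred, fixation_points):
    """Normalized Scanpath Saliency: mean normalized saliency at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    return float(np.mean([p[y, x] for x, y in fixation_points]))
```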


2.2

Deep neural networks

Neural networks are universal function approximators, i.e. with enough neurons any function can be approximated [17]. The computing architecture is inspired by biological brain circuits. The computation is distributed in the sense that each neuron performs a very simple task and all intelligence and complexity emerges from the interaction. Neurons are typically arranged in successive layers, where signals are propagated forward from an input layer through hidden layers, ending at an output layer. A neuron in a hidden layer receives input as weighted activations from a number of neurons in the previous layer. These activations, floating point scalars weighted by multiplicative weight parameters, are summed and passed through an activation function (such as a sigmoid, tanh or rectified linear unit), which results in an output to a number of neurons in the next layer.
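As a toy illustration of this forward computation, the sketch below applies one fully connected layer with a sigmoid activation; the weights and inputs are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(x, W, b):
    """One layer: weighted sum of the previous layer's activations, then activation."""
    return sigmoid(W @ x + b)

# Toy example: 3 input neurons feeding 2 hidden neurons.
x = np.array([0.5, -1.2, 3.0])
W = np.random.randn(2, 3) * 0.1   # one row of weights per hidden neuron
b = np.zeros(2)
hidden = dense_forward(x, W, b)
```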

The expressive power of networks comes from the architectural design decisions and from the very large parameter space of the weights [15].

2.2.1

Optimization with neural networks

In order to produce the desired output for a given input, a neural network first needs a suitable architecture, and then the weight parameters need to be set to suitable values. The weights are set in an iterative training procedure by gradient descent on a loss function. In the case of supervised learning, input samples for which the desired output is known are passed through the network (forward pass). The network output is then compared to the ground truth desired output by a loss function which quantifies the dissimilarity. A consistently small loss value L means that the network output is similar to the desired output, therefore optimization of the weights W means approximating

W_{\text{optimal}} = \arg\min_{W} L. \qquad (2.1)


In gradient descent, small updates proportional to the negative gradient of the loss are repeatedly applied to the weights. The gradient can be computed analytically by utilizing the chain rule through layers (backward pass). The forward pass and backward pass together constitute the backpropagation algorithm.

To improve generalization, i.e. the capability of the network to perform well also on samples not seen in the training phase, regularization methods such as the following are employed: a penalty on large weight values (a complexity penalty motivated by a low Bayesian prior) directly in the loss function (resulting in a "cost function", see Equation 4.3 for an example), early stopping of training by checking performance on a validation set, dropout, and batch normalization [15].
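A schematic sketch of a single gradient-descent update with an L2 weight penalty (compare Equation 4.3) is given below; the loss_grad callable and the coefficient values are placeholders rather than anything used in the thesis.

```python
import numpy as np

def sgd_step(W, loss_grad, lr=0.01, weight_decay=1e-4):
    """One gradient-descent update on a weight matrix W.

    loss_grad: callable returning dL/dW for the current batch.
    weight_decay: coefficient of the L2 penalty added to the loss,
                  which contributes an extra 2 * weight_decay * W to the gradient.
    """
    grad = loss_grad(W) + 2.0 * weight_decay * W
    return W - lr * grad

# Toy usage: minimize ||W||^2 itself, so the gradient is simply 2W.
W = np.random.randn(3, 3)
for _ in range(100):
    W = sgd_step(W, loss_grad=lambda w: 2.0 * w)
```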

2.2.2

Convolutional Neural Networks (CNN)

Many problems where the input samples extend in two spatial dimensions support the assumption of spatial equivariance, i.e. no performance drop is expected by forcing the model components to behave similarly across the spatial extension. A convolutional layer preserves the spatial extension of the preceding layer and makes this assumption, resulting in weight sharing across spatial locations. The additional constraint of the output being dependent only on inputs in spatial locations "nearby" the output location means that most weights between neurons in the two layers can be zero. Both these properties drastically reduce the number of learnable parameters and thus the model complexity of CNN.

Doing this multiple times in parallel with separate sets of weights results in multiple convolutional filters, each of which during training can learn to detect a certain feature.

Images (and single video frames) are well suited for convolutional layers [15].
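The sketch below, written with the tf.keras API used elsewhere in this work, illustrates how few parameters a convolutional layer needs thanks to weight sharing and locality; the layer sizes are arbitrary.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(224, 224, 3))

# 64 filters of size 3x3, with the same weights reused at every spatial
# location: 3*3*3*64 weights + 64 biases = 1,792 parameters in total.
conv = layers.Conv2D(64, kernel_size=3, padding="same")(inp)

# A fully connected layer mapping the same input to an output of the same
# size would need one weight per (input value, output unit) pair, i.e.
# hundreds of billions of parameters.
models.Model(inp, conv).summary()
```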

2.2.3

LSTM and ConvLSTM


Recurrent neural networks, of which the Long Short-Term Memory (LSTM) network is a well known variant, are designed to model temporal aspects. Predicting time-series-related data is a typical use case. LSTMs in particular have a memory cell and learnable input, output and forget gates [15].

The ConvLSTM unit is an extension of the LSTM which also relies on the convolution operation [24]. In this way it can preserve spatial dependencies while also modeling temporal ones.
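A minimal Keras sketch of a ConvLSTM layer operating on a sequence of feature maps is given below; the spatial size and channel counts are placeholders rather than the exact values used later in the thesis model.

```python
from tensorflow.keras import layers, models

# Input: sequences of 2D feature maps, shape (time, height, width, channels).
seq_in = layers.Input(shape=(None, 24, 32, 512))

# ConvLSTM: an LSTM whose internal transformations are convolutions, so it can
# model temporal dependencies while preserving spatial structure.
x = layers.ConvLSTM2D(filters=256, kernel_size=3, padding="same",
                      return_sequences=True)(seq_in)

model = models.Model(seq_in, x)
model.summary()
```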

2.3

Optical flow

In simplest terms, optical flow approximates movement between frames in a video. Given two successive frames, typically called the reference frame and the target frame, the optical flow specifies for each pixel in the reference frame an estimated vector of movement, as judged by where that pixel seems to have ended up in the target frame. So optical flow is a vector field with the same spatial extension as the underlying frames, but instead of specifying color and brightness information at each point, it specifies 2D motion.

The problem of computing optical flow is ill-posed and under-constrained, so there is no single definitive solution. The suitability of an approach is influenced by the optical flow use case.


Chapter 3

Related Work

The recent prevalence of deep neural networks in computer vision has affected VSD as well. Image saliency detection has seen multiple successful attempts to improve results using neural networks. As with the classification problem, video lags behind images, but a number of papers have recently started exploring VSD with these new techniques, and surpassed the previous state of the art [31, 2, 18]. A recent evaluation of several methods is presented in Wang et al. [31].

The similarity of VSD to image and video classification means that promising techniques can be borrowed from those more studied domains.

3.1

Previous VSD

The current state of VSD research is briefly summarized in Figure 3.1.

Motion is salient. To explicitly provide the neural network model with motion information it is possible to precompute optical flow (pixel motion between successive frames, section 2.3), as in Bak et al. [2], which combines a motion stream (optical flow as input) with an appearance stream (image/video frames as input) in a "two-stream" approach. Each stream separately performs convolutional feature extraction; the motion and appearance features are then stacked and merged by 1x1 convolution, followed by more convolutional layers.

Figure 3.1: Overview of VSD research, dividing models into non-deep-learning and deep learning models as well as static and dynamic models, with deep learning techniques including attention modules, two streams, ConvLSTM, training also with static data, pre-trained feature extraction, extracting features at multiple levels, and more and better training data.

Similarly, Jiang, Xu, and Wang [18] use separate streams for motion and appearance, but features are extracted at multiple levels in the streams. High level appearance features are used to weight some layers in the motion stream, in order to extract motion of objects specifically. In addition, two ConvLSTM layers are used at the end of the model to enforce the observed smoothness in gaze, i.e. the tendency for gaze to stay in a location nearby its previous location.

Convolutional features are combined with non-deep-learning features in Wang et al. [32]. A Support Vector Machine (SVM) is then used to generate a saliency map from the collection of features. Temporal aspects of saliency are accounted for by using convolutional features from not only the current frame but also from the six preceding frames.

In Zhu and Xu [35], which goes on to propose a model of saliency aware video compression, appearance saliency is computed by CNN. Computation of motion saliency relies on block movement information available from the HEVC compression standard. The two saliency types are combined by a spatiotemporally adaptive fusion formula.

A residual attention module [28] was included in the neural network designed by Wang et al. [31]. The attention module was trained partly using static saliency datasets, i.e. still images with corresponding saliency maps, relieving the VSD problem of too little available training data. The model uses a pre-trained CNN for feature extraction, a common occurrence also among the previously mentioned works. For temporal consistency, a ConvLSTM was used at the end of the network.


3.2

Innovations on convolutions

For the purposes of advancing VSD, the first obvious path is to innovate on high-level aspects such as network design or training regimen. But additionally it could potentially pay off to focus on a lower level of detail, such as the convolutional layer itself. Depthwise separable convolutions [8] have been shown to perform very well, with the main selling point perhaps being the reduced computational burden. Other possible variations include tiled convolutions or 3D convolutions.

While such improvements could plausibly improve VSD, there are other domains more suitable for the development of these techniques. VSD has some disadvantages when it comes to evaluating convolutional techniques, such as the comparatively small amount of available ground truth data and less agreement on standards for evaluation; compare with image classification, which has ImageNet [23]. Therefore it seems more reasonable to perform data intensive core research in related fields, and on the VSD side focus on translating established or very promising approaches. Making use of pre-trained CNNs is an example of this.

3.3

Pre-trained neural networks for feature extraction

As established above, neural network models for VSD often incorporate a feature extracting CNN, with weights pre-trained in some other work. This type of transfer learning avoids the computationally intensive process of training from scratch.

Networks such as ResNet-50 [16] achieve high accuracy while still requiring relatively modest computational resources.

3.4

Integrated spatiotemporal features


Chapter 4

Methods

4.1

Overview of method

An initial neural network architecture was proposed; many design choices were based on previous results in VSD. It was trained on eye tracking data from videos, and used as input both video frames and the computed optical flow (section 2.3) of those frames. A submodel of the full model was additionally trained on static eye tracking data from images.

Data was mainly sourced online, but a small dataset of videos was generated with an eye tracker for demonstration and testing purposes.

Two cloud servers (AWS instances p2.xlarge and p3.2xlarge, with an NVIDIA K80 GPU and an NVIDIA Tesla V100 GPU respectively) were configured for all storage and compute intensive tasks, and used interchangeably. Code for optical flow generation and other preprocessing code was written. All data was preprocessed on the cloud server.

The neural network model was implemented using Keras [7]. Extra attention was required to ensure that the multiple types of training data were used correctly and efficiently.

Ways to evaluate model performance were established, partly relying on established saliency metrics and partly relying on manual inspection of model output. These evaluation methods were used to guide design changes and to benchmark the final model.

Initial training commenced on the cloud server, with the single purpose of refining model architecture and hyperparameter choices. Repeated experimentation on a number of hypothetically useful ideas resulted in a candidate model for final training.

The candidate model was trained for multiple days. Manual intervention was required to ensure training stability.

4.2

Designing the first model

The design choices of the first proposed model, before any experimentation or iterative refinement, were based on previous work in saliency detection, computer vision and deep learning. A schematic of the model is shown in Figure 4.1, and a Keras model summary is presented in Appendix A.


Figure 4.1: High-level view of the proposed neural network architecture. Two streams, each with feature extraction and a trainable attention module. Convolutional fusion and ConvLSTM. The trainable parts are blue.

4.2.1

Feature extraction

ResNet-50 [16] was initially chosen for feature extraction, partly because it requires fewer operations per inference [1] than VGG-16 [25], while performing better on the ImageNet challenge [23]. A comparison with VGG-16 is relevant because it was successfully used for feature extraction in a saliency context by Wang et al. [31].

The desired visual features can be obtained from these pre-trained networks by removing the last few layers, since those are specific to the image classification task for which they are originally trained.
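A sketch of this kind of truncation with the tf.keras API is shown below, using the bundled VGG-16 as an example; the input size, the choice of cut-off layer and the TimeDistributed wrapping are illustrative assumptions, not the exact configuration described in the following sections.

```python
from tensorflow.keras import applications, layers, models

# Load VGG-16 trained on ImageNet, without the final classification layers.
backbone = applications.VGG16(weights="imagenet", include_top=False,
                              input_shape=(192, 256, 3))

# Optionally cut off even earlier, e.g. at an intermediate convolutional block,
# to keep feature maps at a larger spatial resolution.
features = models.Model(inputs=backbone.input,
                        outputs=backbone.get_layer("block4_conv3").output)

# Wrap in TimeDistributed so the same extractor runs on every frame in a sequence.
seq_in = layers.Input(shape=(None, 192, 256, 3))
seq_features = layers.TimeDistributed(features)(seq_in)
```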

The ResNet-50 architecture was further modified in order to get feature maps of larger spatial size, because the default downsampling with a factor 32 from the input resolution was deemed too aggressive. These modifications included changing the stride of the last conv-block from (2, 2) to (1, 1), and finally upsampling the final features using bilinear interpolation.

After a change of architecture like this, there is a risk that the trained weights no longer fit as well as they used to. As a coarse but quick way to test this, the final global pooling and classification layers were re-activated: this modified ResNet-50 was then compared to the original on a subset of the ImageNet dataset. The classification accuracy was indeed higher for the original ResNet-50 but the modified version performed reasonably. The same quick test was performed on the original VGG-16 vs the VGG-16 used by Wang et al. [31] (the last two pooling layers were removed), and in that case also there was a decrease in classification accuracy for the modified model. Since that modified VGG-16 performed well on the saliency task, the modified ResNet-50 seemed like a good choice for the first model.

Experimentation with the choice of feature extractor was done in the model refinement (section 4.7), the results of which are available in section 5.3.

4.2.2

Two streams


While the feature extractors are not trained on optical flow frames, the working hypothesis is that the domain of optical flow frames is at least similar enough to the domain of pictures that the large number of feature channels extracted by a pre-trained network can be combined by later stages of the model into useful encodings of saliency.

The two streams would be combined by convolutional fusion of the features, motivated not only by the ubiquitous application of 1x1-convolution to make features more condensed but also by previous findings that it performs better than other types of fusion [2].

4.2.3

Attention module

Because it worked well for Wang et al. [31], the model would make use of residual attention module(s) [28] to further enhance the value of the features. Each stream would have an attention module.

The attention outputs are 2D matrices that specify where the neural network should focus its attention. They can be thought of as "draft saliency predictions" and can be manually inspected as such, which will be beneficial when evaluating performance of different parts of the model (section 4.6).

The two attention modules are architecturally identical and consist of convolutional layers (with batch normalization) and max pooling layers which reduce the spatial extension from 32x24 (as output by the feature extractor) to 8x6. This downscaling facilitates a large receptive field, meaning that each pixel depends on a large number of pixels in the model input, which is necessary for the attention module to coherently take the full input image into account. The final convolutional layer in the attention modules has only one filter, resulting in a "draft output" of size 8x6x1 per image or optical flow sample. This is then upsampled back to the original feature size of 32x24, and used to weight all feature maps in a residual fashion:

\hat{F} = F \odot (1 + A), \qquad (4.1)

where F denotes the stack of extracted feature maps and A the upsampled attention output.
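A schematic Keras sketch of such an attention module is given below; the number of convolutional layers, filter counts and kernel sizes are simplified assumptions, and only the overall structure (downscaling to 8x6, a one-filter draft output, upsampling, and residual weighting of the features) follows the description above.

```python
from tensorflow.keras import layers, models

def attention_module(features):
    """features: tensor of shape (24, 32, 512) from the feature extractor."""
    a = layers.Conv2D(64, 3, padding="same", activation="relu")(features)
    a = layers.BatchNormalization()(a)
    a = layers.MaxPooling2D()(a)                       # 24x32 -> 12x16
    a = layers.Conv2D(128, 3, padding="same", activation="relu")(a)
    a = layers.BatchNormalization()(a)
    a = layers.MaxPooling2D()(a)                       # 12x16 -> 6x8
    a = layers.Conv2D(1, 1, activation="sigmoid")(a)   # one-filter "draft saliency"
    a = layers.UpSampling2D(size=4, interpolation="bilinear")(a)  # back to 24x32
    # Residual weighting: F_hat = F * (1 + A), broadcasting A over the channels.
    weighted = layers.Lambda(lambda t: t[0] * (1.0 + t[1]))([features, a])
    return weighted, a

feat_in = layers.Input(shape=(24, 32, 512))
weighted_features, attention_out = attention_module(feat_in)
model = models.Model(feat_in, [weighted_features, attention_out])
```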


The attention module in the appearance stream can be trained explicitly on static eye tracking data (images), in addition to being trained implicitly when the full model is trained. Explicit training is done by training directly on the output from the attention module. This constitutes the submodel training mentioned in the overview (section 4.1). The motion attention would mainly be trained implicitly. Experimentation with explicit training of motion attention was performed during model refinement (section 4.7).

4.2.4

ConvLSTM

Up to this point, each mechanism in the model operates on only a single video/optical flow frame. In this final stage, a ConvLSTM module is employed to model the tendency of viewers to keep looking where they were previously looking. As in Wang et al. [31], only a single layer of ConvLSTM is used. The initial parameters of the module were based on that previous work.

4.2.5

Loss function, regularization, batch normalization etc.

The same loss function was used for all types of training, its exact formulation guided by the evaluation of saliency metrics [6] and by previous deep learning VSD [18, 31]. Kullback-Leibler divergence (KL) was used as the main part of the loss function. A small term of Pearson's Correlation Coefficient (CC) was added as well, since that measure also says something meaningful about saliency [6]. The loss function is the weighted sum of the two metrics, with these weights:

L = 1.0 \cdot L_{\text{KL}} + 0.1 \cdot L_{\text{CC}}. \qquad (4.2)

Note that the "dissimilarity version" of CC is used, i.e. L_{\text{CC}} = 1 - CC_{\text{sim}}, if CC_{\text{sim}} is the standard formulation of the metric which quantifies similarity.


An L2 penalty on the weights was added to the loss, resulting in the cost function

J = L + \lambda \sum_{W \in \mathcal{W}} \|W\|^{2}, \qquad (4.3)

where W is a weight matrix in the set of all model weight matrices \mathcal{W}.

This cost function J was minimized by the Adam optimizer [20]. Batch normalization was added after each convolutional layer. Dropout was used in the ConvLSTM module with a rate of 0.4, i.e. each neuron has a 40 % chance to be inactive in each batch.
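A sketch of how the combined loss could be written against the Keras backend and attached to the Adam optimizer is shown below; it assumes outputs of shape (batch, height, width, 1) and a particular epsilon handling, so it is illustrative rather than the exact implementation used in the thesis.

```python
import tensorflow.keras.backend as K

EPS = K.epsilon()

def kl_loss(y_true, y_pred):
    """KL divergence between ground-truth and predicted saliency distributions."""
    t = y_true / (K.sum(y_true, axis=(1, 2, 3), keepdims=True) + EPS)
    p = y_pred / (K.sum(y_pred, axis=(1, 2, 3), keepdims=True) + EPS)
    return K.sum(t * K.log(t / (p + EPS) + EPS), axis=(1, 2, 3))

def cc_loss(y_true, y_pred):
    """Dissimilarity version of Pearson's CC: 1 - CC."""
    t = y_true - K.mean(y_true, axis=(1, 2, 3), keepdims=True)
    p = y_pred - K.mean(y_pred, axis=(1, 2, 3), keepdims=True)
    cc = K.sum(t * p, axis=(1, 2, 3)) / (
        K.sqrt(K.sum(t * t, axis=(1, 2, 3)) * K.sum(p * p, axis=(1, 2, 3))) + EPS)
    return 1.0 - cc

def total_loss(y_true, y_pred):
    # Weighted sum from Equation 4.2.
    return 1.0 * kl_loss(y_true, y_pred) + 0.1 * cc_loss(y_true, y_pred)

# Hypothetical usage on an already built model:
# model.compile(optimizer="adam", loss=total_loss)
```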

The sequence length of a video sample (the ConvLSTM deals with sequences) was initially 20 frames. The batch size for a video batch was then determined experimentally to utilize all available GPU memory on the cloud server. The static image batch size for training the attention module was not picked to maximize GPU memory utilization, since that would result in a much larger amount of distinct empirical data per batch compared to the video batches. This is because each video sample is very self-similar between its frames.

4.3

Training schedule

The model has three outputs, each of which allows a loss value to be calculated. The complete model was optimized by repeated consecutive optimization of the submodels defined by these outputs. While neural network weights are shared between the submodels, optimizer momentum parameters are not shared. In order to avoid the problem of discontinuities in optimizer momentum, it was a priority to intermingle the different types of training as finely as possible, i.e. on the batch level.¹

¹ A number of custom modifications were made to Keras in order to make this possible.
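A schematic sketch of such batch-level intermingling is given below; full_model, app_model and mot_model are assumed to be compiled Keras models that share layers and weights, and the generators yielding batches are placeholders. The exact scheduling and Keras modifications used in the thesis are not reproduced here.

```python
def train_interleaved(full_model, app_model, mot_model,
                      video_batches, image_batches, flow_batches, n_steps):
    """Alternate between the three training types as finely as possible."""
    for step in range(n_steps):
        x_img, y_img = next(image_batches)
        app_model.train_on_batch(x_img, y_img)        # appearance attention

        x_flow, y_flow = next(flow_batches)
        mot_model.train_on_batch(x_flow, y_flow)      # motion attention

        x_vid, y_vid = next(video_batches)
        full_model.train_on_batch(x_vid, y_vid)       # full model on sequences
```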


4.4

Acquiring training data

Three types of training data were used: video, image and optical flow. Each type was used in conjunction with ground truth saliency maps.

The way to train the full proposed model is to input a sequence of video frames and a corresponding sequence of optical flow frames, and compare the sequence of predicted saliency maps to the sequence of ground truth saliency maps.

Training the appearance stream requires only individual images: the predicted appearance attention output is compared to corresponding ground truth saliency maps.

Training of the motion stream is similar, but the input is individual optical flow frames, and it is the predicted motion attention output which is compared to corresponding ground truth saliency maps.

Eye fixation recordings are used as ground truth saliency maps for all training types. A number of viewers watch the stimuli (video or image), and a 2D Gaussian is centered at each fixation point, resulting in a blurry heat map approximating the intrinsic saliency of the stimuli. For the sake of consistency, the size of the Gaussian was determined by the condition that the ratio

\frac{\text{standard deviation of Gaussian}}{\text{length of image diagonal}}

should equal the corresponding ratio in the DHF1K dataset, which provides precomputed saliency maps.


4.4.1

Video datasets

The following four public video datasets with eye tracking data were used in the training process: DHF1K [31], LEDOV [18], Hollywood-2 [21], UCF-Sports [21].

In addition to training data, these datasets include testing data which can be used for model evaluation (section 4.6).

DHF1K and LEDOV are both designed with the purpose of training models for VSD. The content is purposely diverse and there are generally no scene changes in the individual videos.

Hollywood-2 and UCF-Sports were originally created for the domain of action recognition, but were extended by Mathe and Sminchisescu [21] to also include fixation data. The videos originally contain scene changes, but in the work by Wang et al. [31] the videos were split on scene changes. These modified versions of the datasets were used in this work.

Scene changes are unhelpful in training data because they introduce temporal complexity of another kind than the motion which the ConvLSTM is intended to account for. Appropriate response to scene changes is not, from the perspective of the designed model, considered learnable to the same degree as the appropriate response to continuity between frames.

Hollywood-2 is made up of videos extracted from 69 movies, grouped into classes of human actions such as DriveCar, Eat and HandShake. UCF-Sports is made up of videos of various sports.

The total amount of video training data constituted 703747 frames from 3930 videos.

4.4.2

Image datasets


In total, 14331 images were used for training (the rest were used for validation).

4.4.3

An in-house dataset

In addition to the public datasets, an in-house video dataset was created at Entecon. 14 videos of high quality (resolution 1280x720) were assembled and eye fixations were recorded from 17 people who were instructed to watch the videos normally (i.e. free viewing in saliency terms). All videos would automatically play in sequence, with 5 seconds of black screen between videos to reset the gaze. Test subjects were seated with their eyes approximately 75 cm from a 24 inch monitor with 1920x1200 resolution; fixations were recorded using a Tobii Eye Tracker 4C (not meant for very precise scientific measurements). In total, fixations were recorded for 13262 frames. The advantage over the publicly available datasets is the higher resolution of the videos.

This dataset was used for subjective evaluation and demonstration purposes, including testing of saliency enhanced video compression (section 1.2).

4.4.4

Data augmentation

The image training data was augmented by horizontal flipping. Each epoch in the image domain used each image twice: once in the original orientation and once after flipping horizontally. The ground truth saliency was flipped in the same way. This is a typical augmentation strategy which artificially extends the dataset in a meaningful way: convolution operations are not equivariant to flipping or rotation.
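A minimal sketch of this augmentation, assuming the image and saliency map are numpy arrays with the image width along axis 1:

```python
import numpy as np

def augment_pair(image, saliency):
    """Return the original pair plus a horizontally flipped copy.

    Flipping the ground-truth saliency map the same way keeps the
    correspondence between pixels and saliency values.
    """
    flipped = (np.flip(image, axis=1), np.flip(saliency, axis=1))
    return [(image, saliency), flipped]
```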


4.5

Generating optical flow

For each of the acquired video datasets, optical flow frames were generated from the video frames. The TV-L1 formulation of optical flow [34] was calculated on the cloud server GPU using a Python wrapper [33] for the OpenCV [4] C++ GPU implementation. Optical flow cannot be generated for the first frame in a video because there is no previous frame with which to compare movement, so the flow of the second frame was used for the first frame as well.
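The sketch below illustrates the per-video procedure with OpenCV; Farneback flow from core OpenCV is used here as a stand-in for the TV-L1 GPU implementation referenced above, and the parameter values are common defaults rather than those used in the thesis.

```python
import cv2

def video_optical_flow(frames):
    """Dense optical flow for a list of BGR frames of shape (H, W, 3).

    Each flow field has shape (H, W, 2): a 2D motion vector per pixel of the
    reference frame.
    """
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    # No previous frame exists for the first frame, so reuse the second frame's flow.
    return [flows[0]] + flows if flows else []
```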

4.5.1

Data for explicit training of motion stream

When training the motion stream explicitly, individual optical flow frames with corresponding eye tracking data are needed. The optical flow frames generated from the video datasets were used, but horizontal flipping was applied to the input and ground truth output in order to not use the exact same data as when training the full model.

4.6

Evaluating the model

4.6.1

Benchmarking model performance

In order to benchmark the performance of the final model, established metrics for saliency similarity (subsection 2.1.1) were used to quantify the similarity between predicted saliency and ground truth saliency. This was done on a number of previously published datasets, which allows comparison to previously suggested models.

As mentioned in subsection 2.1.1, different metrics are relevant for different applications, and therefore a number of metrics were used. In particular, the metrics AUC-Judd, Pearson's Correlation Coefficient, Normalized Scanpath Saliency and Histogram Intersection were used on the testing part of the datasets Hollywood-2 [21] and UCF-Sports [21]. These results are presented in section 6.3.


Existing Python implementations of the metrics were rather inefficient, so a speedier version was written. This implementation also avoids a small issue appearing in the reference Matlab implementation, which in some test cases caused a deviation in the fourth significant digit from what is probably meant as the true value. The new implementation will be made available at the following link:

http://www.jakobwiesinger.com/link/saliency2018.

4.6.2

Evaluation for use in model refinement

When refining the model by experimenting with architecture and hyperparameters, it is not feasible to use benchmarks on large datasets as the single way to evaluate performance, because it is too computationally expensive and takes too long. Additionally, the single number does not help much in understanding the shortcomings of the model. Particularly helpful ways of evaluation included benchmarking on smaller datasets (validation during training) and manual inspection of model output.

A program was written to overlay a heat map of predicted saliency on the input video. A number of videos were selected to showcase model behavior in certain interesting scenarios, e.g. lots of movement, faces, multiple separate areas of interest, etc. Manual inspection of the predicted saliency heat map overlaid on these handpicked videos was educational in regard to understanding the modes of failure of the model. This process could be repeated on multiple proposed models and compared. Figure 4.2 shows an example frame generated in such a way.
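A minimal sketch of such an overlay for a single frame, assuming a BGR frame and a saliency map in [0, 1] of the same size:

```python
import cv2
import numpy as np

def overlay_saliency(frame, saliency, alpha=0.4):
    """Overlay a predicted saliency map as a heat map on a BGR video frame."""
    heat = cv2.applyColorMap((saliency * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(heat, alpha, frame, 1.0 - alpha, 0)
```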

By comparing the outputs from appearance attention, motion atten-tion and full model, it was possible to investigate specific parts of the model.

4.7

Refining model design


Figure 4.2: Frame from video of predicted saliency map overlaid on a video with interacting objects.

Hyperparameters and other common deep learning properties:

• loss function
• learning rate schedule and values
• amount of L2 regularization (if any)
• how to use batch normalization
• how to use dropout
• choice of optimizer, momentum
• activation functions
• kernel initializers
• length of video sequence in training (influences ConvLSTM)
• batch size (for video, image and flow)
• input size of frames (W x H in pixels)
• whether to train motion stream explicitly
• if so: number of optical flow epochs per video epoch
• which feature extractor to use (VGG or ResNet)
• number of merge channels
• number of ConvLSTM channels
• whether to interconnect the attention modules or in other ways create spatiotemporal features (section 3.4)

There was repeated experimentation with these properties, in parallel with development of the training regimen. Some results from this process can be seen in Chapter 5. This large number of properties cannot feasibly be tested in all possible combinations, so it was often taken as a simplifying assumption that changing the model in one aspect does not invalidate conclusions from previous comparisons of other model properties.

4.8

Final training

When a final model configuration had been decided upon, this configuration was trained until as good a convergence as possible was reached.

The properties of the final model were as follows: VGG-16 was used for feature extraction. The motion stream was trained explicitly in the initial part of the training, but for the majority of the training it was only trained implicitly. The purpose of the initial explicit training was to quickly enforce meaningful motion features that later stages of the model could learn to use. There was no interconnection of attention modules. The features from the two streams were merged to 512 channels. The ConvLSTM module had 256 channels and was left very similar to the one used by Wang et al. [31].

Details about the final model architecture can be seen in Appendix A.


Chapter 5

Experiments

5.1

Learning rate

To find a suitable learning rate, an experiment was performed in order to approximate upper and lower bounds. As explained in [26], training was started with a very low learning rate. For each trained batch the learning rate was increased. The increase was exponential in order to more quickly go through a large interval of possibly suitable values. During this training the loss function was monitored. When the loss starts to change, the lower bound has been reached. When the loss becomes very unstable or goes up, the upper bound has been reached.
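A sketch of this learning rate range test as a Keras callback (TF 2.x style) is given below; the start and end learning rates and the number of batches are placeholders.

```python
import numpy as np
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Increase the learning rate exponentially each batch and record the loss."""

    def __init__(self, start_lr=1e-7, end_lr=1.0, n_batches=1000):
        super().__init__()
        self.lrs = np.geomspace(start_lr, end_lr, n_batches)
        self.step = 0
        self.history = []  # (learning rate, training loss) pairs for plotting

    def on_train_batch_begin(self, batch, logs=None):
        lr = self.lrs[min(self.step, len(self.lrs) - 1)]
        tf.keras.backend.set_value(self.model.optimizer.lr, lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = self.lrs[min(self.step, len(self.lrs) - 1)]
        self.history.append((float(lr), (logs or {}).get("loss")))
        self.step += 1
```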

Figure 5.1 shows an example output from this process. It is the training of the full model which diverges first. This observation motivated a lower learning rate for the full model than for the attention modules.

Repeated experimentation, using this method and small training attempts with proposed values, resulted in separate sets of values which worked well for training the full model and the appearance attention module respectively.

Two learning rate scheduling approaches were evaluated: triangular cyclical learning rate [26] and a simple linearly decaying learning rate. Initially the cyclical learning rate seemed to perform very well, but on closer inspection it did not converge faster, and it was less stable than the linearly decaying learning rate.

Figure 5.1: Learning rate finding. The learning rate was increased for each batch, from a very low value until divergence. As a consequence of the full model divergence around batch number 360, the appearance attention diverged around batch number 510. Vertical lines on the validation loss curves show the standard deviation within the respective validation sets. In stable training the standard deviation tended to remain constant.

The optical flow learning rate was taken to be the same as the image learning rate. This decision was motivated by the fact that the architectures of the two streams are the same.

5.2

Regularization

Experiments indicated that using L2 regularization (the λ in Equation 4.3) performed worse than not using any L2 regularization at all.

Perhaps the regularizing effect of batch normalization and dropout [15] was enough. Batch normalization was applied after each standard convolutional layer. Dropout was used only in the ConvLSTM module.

5.3

ResNet-50 vs VGG-16

The initial decision to use ResNet-50 for feature extraction rather than VGG-16 was reevaluated in the experimentation phase. After removing downsampling layers located at the very end of the stack, ResNet-50 outputs 1024 feature channels at a resolution which downsamples the input by a factor of 32. The similarly truncated VGG-16, as used by Wang et al. [31], outputs 512 feature channels at a factor 16 downsampling rate. Since the output of the VSD task relies very much on spatial information, perhaps more so than it needs a large number of features, it seems that (512, 1/16) would be better suited than (1024, 1/32).

Initial experiments did not show a noticeable difference in performance between the two, so the decision was settled by the fact that VGG-16 initialized much quicker, meaning that the first few batches started training in less than a quarter of the time it took for the ResNet-50 version of the model.

5.4

Explicit training of motion stream

When the motion stream was only trained implicitly, the full model tended to rely on the appearance stream, which caused the gradients propagated to weights in the motion stream to become very small, and therefore further training would hardly improve the motion stream.

In other words, the model could get stuck in local minima of disre-garding motion features.

Explicitly training the motion stream caused the motion attention to improve drastically, but the convergence of the full model became less stable. This is perhaps an effect of the increased complexity of three parallel training regimens.

A compromise was possible: train the motion stream explicitly only during the initial parts of training, after which only appearance stream training and full model training would take place. This seemed to work: the motion attention became good enough so that the model avoided the local minima and instead kept utilizing both streams.

5.5

Interconnecting attention modules

As mentioned in section 3.4, combining spatial and temporal features has been tried successfully before. Low-level integration between the two streams could provide good results but would be time consuming to implement and evaluate. A simpler way of utilizing information from the appearance stream in the motion stream is to connect the residual attention of the appearance stream to the motion stream as well.


5.6

Sequence length


Chapter 6

Results

6.1

Convergence of final training

The final training was monitored during the multiple days of computation. The monitoring included loss function values and values of Pearson's Correlation Coefficient. All three outputs were monitored. The scheduled learning rates were included in the same graphs in order to allow inspection of how the learning rate affects performance changes.

The initial 10000 batches of training included explicit training of the motion stream. This can be seen in Figure 6.1, which shows the initial 16000 batches of training. Note that the loss function of the motion stream is much higher than that of the appearance stream. Metric values of attention outputs cannot be readily compared to full model output in these graphs, due to the full model outputting batches with sequences of frames rather than batches with individual frames. An example of how the network diverged during final training is shown in Figure 6.2. Commonly, as is the case here, the loss in the appearance stream went up first and diverged, after which the full model soon diverged as well. A solution to this problem was explained in section 4.8.

Figure 6.3 shows an attempt to circumvent instability in appearance attention training by training only the full model. This did not result in any noticeable decrease in video validation loss, so the training was stopped and a checkpoint with low video validation loss was used as the final model.

Figure 6.1: First 16000 batches of final training. Motion stream loss is much higher than all the other metrics. Note also that it is unstable; this was a recurring pattern during experimentation as well.

6.2

Examples of predicted saliency

Figure 6.4 shows the prediction on one frame of a video in the UCF-Sports dataset. Figure 6.5 demonstrates a set of predictions from different videos in the Hollywood-2 and UCF-Sports datasets.


Figure 6.2: The appearance stream (and then the full model) diverge after around 137000 batches. Note that instability in the appearance stream surprisingly starts after over 80000 batches of stable training.

6.3

Saliency metrics benchmarks compared to other models


                  Hollywood-2 [21]                UCF-Sports [21]
                  AUC-J   SIM     CC      NSS     AUC-J   SIM     CC      NSS
ACL [31]          0.913   0.542   0.623   3.086   0.897   0.406   0.510   2.567
Two-stream [2]    0.863   0.276   0.382   1.748   0.832   0.264   0.343   1.753
OM-CNN [18]       0.887   0.356   0.446   2.313   0.870   0.321   0.405   2.089
Fang et al. [11]  0.859   0.272   0.358   1.667   0.845   0.307   0.395   1.787
DVA [29]          0.886   0.372   0.482   2.459   0.872   0.339   0.439   2.311
Full model        0.912   0.473   0.578   2.819   0.901   0.425   0.509   2.432
App_attention     0.892   0.322   0.452   1.925   0.865   0.292   0.391   1.664
Mot_attention     0.873   0.212   0.406   1.687   0.866   0.203   0.405   1.716


Chapter 7

Conclusion

Our trained model performs best of all surveyed models on 2 of the 8 benchmarks. The other 6 top scores go to the ACL model by Wang et al. [31], where our model performs second best. Note that the difference in performance between these two best performing models is in the third decimal for 3 of the 8 benchmarks.

The ACL model also uses VGG-16 for feature extraction, also has an attention module and also models temporal aspects with a ConvLSTM. The largest architectural difference is the additional motion stream in our model. However, there were also differences in training data, hyperparameters, input image size and other implementation properties. This means that differences between the two models cannot be completely attributed to the motion stream.

To the extent that it can, however, it is relevant to point out that our model compares better to ACL on the UCF-Sports dataset than on the Hollywood-2 dataset. Since motion likely correlates higher with saliency on the domain of sports than the domain of Hollywood movies, this could be indicative of our model putting motion info to good use. This is also supported by the observation that the motion stream compares much more favorably to the appearance stream on the UCF-Sports dataset than on the Hollywood-2 dataset.

Another angle of comparison is towards previous two-stream models (Two-stream and Fang et al.), where our model performs strictly better. This is also true for OM-CNN, which does model motion but does not use typical optical flow as input.

During the final training, the validation loss for the full model did not keep decreasing throughout the training. Even when the appearance attention saw significant improvements, direct effects on the video validation loss could not be observed. This phenomenon justifies the question of whether the final parts of the model architecture, i.e. the feature merging and ConvLSTM which affect only the full model training, could be improved.

7.1

Future work

The use of two streams in deep learning models for saliency detection cannot be considered a solved problem. Interesting future work would be to individualize and specialize the two streams according to the data they process. In particular, it would be worth investigating whether the optical flow data can be used more efficiently with something other than feature extraction trained on regular images (subsection 4.2.2).

7.2

Societal and ethical considerations

One potential domain of societal impact of higher quality video saliency detection is the video compression application. Viewers who stream video on limited broadband connections could experience a higher subjective video quality. More efficient compression, in terms of higher subjective video quality per bitrate, generally leads to a reduction in the resources required for video storage and transmission.


Bibliography

[1] Samuel Albanie. convnet-burden. https://github.com/albanie/convnet-burden. 2018. URL: https://github.com/albanie/convnet-burden (cit. on pp. 13, 18).

[2] Cagdas Bak et al. “Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction”. In: IEEE Transactions on Multimedia 20.7 (2018), pp. 1688–1698. DOI: 10.1109/TMM.2017.2777665. URL: https://ieeexplore.ieee.org/document/8119879 (cit. on pp. 5, 10, 18, 19, 41).

[3] Ali Borji and Laurent Itti. “CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research”. In: CVPR 2015 workshop on “Future of Datasets” abs/1505.03581 (May 2015). URL: https://arxiv.org/abs/1505.03581 (cit. on p. 23).

[4] Gary Bradski. Open Source Computer Vision Library. https://github.com/opencv/opencv. 2018. URL: https://github.com/opencv/opencv (cit. on p. 25).

[5] Zoya Bylinskii et al. “Intrinsic and extrinsic effects on image memorability”. In: Vision Research (2015). ISSN: 18785646. DOI: 10.1016/j.visres.2015.03.005 (cit. on p. 23).

[6] Zoya Bylinskii et al. “What do different evaluation metrics tell us about saliency models?” In: IEEE Transactions on Pattern Analysis and Machine Intelligence (Apr. 2018). URL: http://arxiv.org/abs/1604.03605 (cit. on pp. 5, 20).

[7] François Chollet et al. Keras. https://keras.io. 2015. URL: https://keras.io (cit. on p. 15).

[8] François Chollet. “Xception: Deep Learning with Depthwise Separable Convolutions”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 1800–1807. URL: https://ieeexplore.ieee.org/document/8099678 (cit. on p. 13).

[9] Guanqun Ding and Yuming Fang. “Video Saliency Detection by 3D Convolutional Neural Networks”. In: CoRR abs/1807.04514 (July 2018). URL: http://arxiv.org/abs/1807.04514 (cit. on p. 12).

[10] Shaojing Fan et al. “Emotional Attention: A Study of Image Sentiment and Visual Attention”. In: 2017 IEEE International Conference on Image Processing (ICIP) (2017). DOI: 10.1109/ICIP.2017.8296357 (cit. on p. 23).

[11] Yuming Fang et al. “Video saliency incorporating spatiotemporal cues and uncertainty weighting”. In: IEEE Transactions on Image Processing (2014). ISSN: 10577149. DOI: 10.1109/TIP.2014.2336549 (cit. on p. 41).

[12] Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. “Spatiotemporal Multiplier Networks for Video Action Recognition”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 7445–7454. DOI: 10.1109/CVPR.2017.787 (cit. on p. 14).

[13] Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. “Spatiotemporal Residual Networks for Video Action Recognition”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 7435–7444. DOI: 10.1109/CVPR.2017.786. URL: http://arxiv.org/abs/1611.02155 (cit. on p. 14).

[14] Joel Gibson and Oge Marques. “Optical flow and trajectory methods in context”. In: SpringerBriefs in Computer Science. 2016. ISBN: 978-0-85709-471-1; 978-0-85709-470-4. DOI: 10.1007/978-3-319-44941-8_2 (cit. on p. 8).

[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org.

[16] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778. DOI: 10.1109/CVPR.2016.90. URL: http://arxiv.org/abs/1512.03385 (cit. on pp. 13, 14, 17).

[17] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators”. In: Neural Networks (1989). ISSN: 08936080. DOI: 10.1016/0893-6080(89)90020-8 (cit. on p. 6).

[18] Lai Jiang, Mai Xu, and Zulin Wang. “Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM”. In: CoRR abs/1709.06316 (2017). URL: http://arxiv.org/abs/1709.06316 (cit. on pp. 10, 12, 20, 23, 41).

[19] Ming Jiang et al. “SALICON: Saliency in Context”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 1072–1080. DOI: 10.1109/CVPR.2015.7298710. URL: https://ieeexplore.ieee.org/document/7298710 (cit. on p. 23).

[20] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015) (2015). URL: https://arxiv.org/abs/1412.6980 (cit. on p. 21).

[21] Stefan Mathe and Cristian Sminchisescu. “Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2015). ISSN: 01628828. DOI: 10.1109/TPAMI.2014.2366154 (cit. on pp. 23, 25, 41).

[22] Nicolas Riche et al. “Saliency and human fixations: State-of-the-art and study of comparison metrics”. In: Proceedings of the IEEE International Conference on Computer Vision. 2013. ISBN: 9781479928392.

[23] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (2015). ISSN: 15731405. DOI: 10.1007/s11263-015-0816-y (cit. on pp. 13, 18).

[24] Xingjian Shi et al. “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting”. In: NIPS (2015). URL: https://arxiv.org/abs/1506.04214 (cit. on p. 8).

[25] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: (Sept. 2014). URL: http://arxiv.org/abs/1409.1556 (cit. on pp. 13, 18).

[26] Leslie N. Smith. “Cyclical Learning Rates for Training Neural Networks”. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017), pp. 464–472. URL: http://arxiv.org/abs/1506.01186 (cit. on p. 30).

[27] Tilke Judd et al. “Learning to predict where humans look”. In: Proceedings of the IEEE International Conference on Computer Vision. 2009. ISBN: 9781424444205. DOI: 10.1109/ICCV.2009.5459462 (cit. on p. 23).

[28] Fei Wang et al. “Residual attention network for image classification”. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 2017. ISBN: 9781538604571. DOI: 10.1109/CVPR.2017.683 (cit. on pp. 12, 19).

[29] Wenguan Wang and Jianbing Shen. “Deep Visual Attention Prediction”. In: IEEE Transactions on Image Processing (2018). ISSN: 10577149. DOI: 10.1109/TIP.2017.2787612 (cit. on p. 41).

[30] Wenguan Wang, Jianbing Shen, and Ling Shao. “Video Salient Object Detection via Fully Convolutional Networks”. In: IEEE Transactions on Image Processing 27.1 (2018), pp. 38–49. DOI: 10.1109/TIP.2017.2754941. URL: https://ieeexplore.ieee.org/document/8047320 (cit. on p. 12).

[32] Zheng Wang et al. “A deep-learning based feature hybrid framework for spatiotemporal saliency detection inside videos”. In: Neurocomputing (2018). ISSN: 18728286. DOI: 10.1016/j.neucom.2018.01.076 (cit. on p. 12).

[33] Gengshan Yang. cv-gpu-py. https://github.com/gengshan-y/cv-gpu-py. 2017. URL: https://github.com/gengshan-y/cv-gpu-py (cit. on p. 25).

[34] C. Zach, T. Pock, and H. Bischof. “A Duality Based Approach for Realtime TV-L1 Optical Flow”. In: Pattern Recognition. 2007. ISBN: 0302-9743; 978-3-540-74933-2. DOI: 10.1007/978-3-540-74936-3_22 (cit. on p. 25).

[35] Shiping Zhu and Ziyao Xu. “Spatiotemporal visual saliency guided perceptual high efficiency video coding with neural network”. In: Neurocomputing (2018). ISSN: 18728286.

Appendix A

Keras model summary

More details regarding the architecture of the full model can be seen in the following, which is the output of the Keras command model.summary().

The first two dimensions in the output shape correspond to batch size and sequence length respectively, which are dynamic in the sense that the model does not require a particular value. Layer type "TD" is TimeDistributed.

Layer (type)                          Output Shape          Param #   Connected to
AttMod_mot_8_Conv2D_3 (TD)            (-, -, 6, 8, 64)      8256      AttMod_mot_7_MaxPooling2D_2[0][0]
AttMod_app_9_BN_3 (TD)                (-, -, 6, 8, 64)      256       AttMod_app_8_Conv2D_3[0][0]
AttMod_mot_9_BN_3 (TD)                (-, -, 6, 8, 64)      256       AttMod_mot_8_Conv2D_3[0][0]
AttMod_app_10_Conv2D_4 (TD)           (-, -, 6, 8, 128)     73856     AttMod_app_9_BN_3[0][0]
AttMod_mot_10_Conv2D_4 (TD)           (-, -, 6, 8, 128)     73856     AttMod_mot_9_BN_3[0][0]
AttMod_app_11_BN_4 (TD)               (-, -, 6, 8, 128)     512       AttMod_app_10_Conv2D_4[0][0]
AttMod_mot_11_BN_4 (TD)               (-, -, 6, 8, 128)     512       AttMod_mot_10_Conv2D_4[0][0]
AttMod_app_12_Conv2D_5 (TD)           (-, -, 6, 8, 1)       129       AttMod_app_11_BN_4[0][0]
AttMod_mot_12_Conv2D_5 (TD)           (-, -, 6, 8, 1)       129       AttMod_mot_11_BN_4[0][0]
AttMod_app_13_BN_5 (TD)               (-, -, 6, 8, 1)       4         AttMod_app_12_Conv2D_5[0][0]
AttMod_mot_13_BN_5 (TD)               (-, -, 6, 8, 1)       4         AttMod_mot_12_Conv2D_5[0][0]
AttMod_app_14_UpSampling2D_1 (TD)     (-, -, 24, 32, 1)     0         AttMod_app_13_BN_5[0][0]
AttMod_mot_14_UpSampling2D_1 (TD)     (-, -, 24, 32, 1)     0         AttMod_mot_13_BN_5[0][0]
AttMod_app_15_Flatten (TD)            (-, -, 768)           0         AttMod_app_14_UpSampling2D_1[0][0]
AttMod_mot_15_Flatten (TD)            (-, -, 768)           0         AttMod_mot_14_UpSampling2D_1[0][0]
AttMod_app_16_RepeatVector (TD)       (-, -, 512, 768)      0         AttMod_app_15_Flatten[0][0]
AttMod_mot_16_RepeatVector (TD)       (-, -, 512, 768)      0         AttMod_mot_15_Flatten[0][0]
AttMod_app_17_Permute (TD)            (-, -, 768, 512)      0         AttMod_app_16_RepeatVector[0][0]
AttMod_mot_17_Permute (TD)            (-, -, 768, 512)      0         AttMod_mot_16_RepeatVector[0][0]
AttMod_app_18_Reshape (TD)            (-, -, 24, 32, 512)   0         AttMod_app_17_Permute[0][0]
AttMod_mot_18_Reshape (TD)            (-, -, 24, 32, 512)   0         AttMod_mot_17_Permute[0][0]
AttMod_app_19_Multiply (Multiply)     (-, -, 24, 32, 512)   0         AttMod_app_1_TimeDistributed_vgg[0][0], AttMod_app_18_Reshape[0][0]
AttMod_mot_19_Multiply (Multiply)     (-, -, 24, 32, 512)   0         AttMod_mot_1_TimeDistributed_vgg[0][0], AttMod_mot_18_Reshape[0][0]
AttMod_app_20_Add (Add)               (-, -, 24, 32, 512)   0         AttMod_app_1_TimeDistributed_vgg[0][0], AttMod_app_19_Multiply[0][0]
AttMod_mot_20_Add (Add)               (-, -, 24, 32, 512)   0         AttMod_mot_1_TimeDistributed_vgg[0][0], AttMod_mot_19_Multiply[0][0]
Merged_1_Merge (Merge)                (-, -, 24, 32, 1024)  0         AttMod_app_20_Add[0][0], AttMod_mot_20_Add[0][0]
Merged_2_Conv2D (TD)                  (-, -, 24, 32, 512)   524800    Merged_1_Merge[0][0]
Merged_3_BN (TD)                      (-, -, 24, 32, 512)   2048      Merged_2_Conv2D[0][0]
ConvLSTM_1_ConvLSTM2D (ConvLSTM2D)    (-, -, 24, 32, 256)   7078912   Merged_3_BN[0][0]


TRITA-EECS-EX-2019:87
