
Five years after the Deep Learning revolution of computer vision: State of the art methods for online image and video analysis

Michael Felsberg

Institutionen för systemteknik

Linköpings universitet

michael.felsberg@liu.se

December 14, 2017

1 Introduction

The purpose of this document is to reflect on novel and upcoming methods for computer vision that might be relevant for application in robot vision and video analytics. The document covers many different sub-fields of computer vision, most of which have been addressed by our research activity at the Computer Vision Laboratory. The report has been written at the request of, and with support from, FOI.

2 Historical Development

The area of computer vision has recently seen a Deep Learning revolution: performance in, e.g., object classification on ImageNet [74] improved vastly from a top-5 error of 26% in 2011 to 16% in 2012, see figure 1. The major paradigm shift has been the move from engineered image features ("pixel fucking" [47]) to learned Deep Features.

Many people name the increase in computational power and the availability of huge amounts of data as the enablers of this success, missing the fact that decades of intensive research have formed the basis for this breakthrough. Also, some fundamental methodological insights have triggered progress during the past five years [29], and the present document is an attempt to reflect on their relevance for future development.

The progress of Deep Learning is usually accredited to work from the late 1990s [52], missing the fact that deep hierarchical structures had been suggested long before [56], also in combination with optimization or learning methods in nested frames or receptive fields [30, pp. 8-9].

3 Relevant Sub-fields of Computer Vision

The field of computer vision can be subdivided into many sub-areas; however, no consistent taxonomy exists. Some areas are named after application domains, others after generic problem formulations, and yet others after methodological approaches.

In the sections below, standard categories have been used and recent trends, mostly in conjunction with Deep Learning approaches, are reflected upon.


Figure 1: Top-5 error of the winners of the ImageNet Large Scale Visual Recognition Challenge 2010–2015. Image source http://paddlepaddle.org/.

3.1 Surveillance

Surveillance is central to protecting critical infrastructure and to detecting inappropriate behavior. The typical setting is to cover all relevant areas, or the relevant volume, by the fields of view of a set of cameras. The classical approach is to have human operators check a number of screens displaying the respective camera streams and to record those streams.

Automating the surveillance process has a number of advantages, such as reduction of operator load (and costs), fewer missed events caused by drowsiness of operators, and fusion of information from an arbitrary number of streams. However, automated processes have to rely on methods in sections 3.2 and 3.5–3.10, all subject to potential failures.

Those failures can be type-1 errors (false positives) or type-2 errors (false negatives). Depending on the particular case, errors of these types might have high or low impact, and optimizing one type of error comes at the cost of the other. However, using the idea of boosting, i.e., having a cascade of weak classifiers [23], helps to reduce the type-2 error without increasing the type-1 error.
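As a minimal numerical sketch of the cascade idea (the stage rates below are assumptions for illustration, not figures from [23]): if each stage is tuned for a very low per-stage miss rate, the overall miss rate stays low while false alarms are reduced multiplicatively over the stages.

    # Hypothetical cascade: K stages, each passing true events with rate d
    # (i.e., a low per-stage type-2 error) and false alarms with rate f.
    K = 5       # assumed number of cascade stages
    d = 0.99    # per-stage detection rate
    f = 0.50    # per-stage false-alarm rate

    overall_detection = d ** K   # ~0.95: type-2 error remains small
    overall_false_pos = f ** K   # ~0.03: type-1 error shrinks sharply
    print(f"detection: {overall_detection:.3f}, false alarms: {overall_false_pos:.3f}")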

In current systems, a human operator typically acts as the final stage in the cascade, i.e., the automated system reports with a very low false-negative rate and a reasonable false-positive rate, and the human operator verifies all detections reported by the system. This approach reduces the operator load and the risk of missed events, but does not fully address the fusion of multiple streams.

As Deep Learning methods achieve super-human performance on narrow visual tasks [34], it is expected that most surveillance systems can be fully automated in the near future. However, if the surveillance problem is very generic and requires semantic understanding of the situation, human operators will still be needed. In order to improve fusion capabilities in those cases, systems for situation awareness will help the human operator.

3.2 Situation Awareness

Similar to surveillance (section 3.1), situation awareness aims at using sensors, such as cameras, to acquire information about the environment for further decision making. Besides the fact that situation awareness might use sensors other than cameras, there is also a major focus on the fusion of acquired information. This section mostly reflects on camera-based situation awareness.


In the simplest case, several camera views are simply stitched together to form a panoramic view [57], or dynamic views based on the operator's viewpoint are rendered in a VR headset [8]. Both display techniques can be combined with stereo vision, e.g. by using multiple cameras, or with additional depth sensors, such as time-of-flight. The combination with stereo is more common for VR headsets, but still suffers from latency-induced VR sickness.

Also the problem of situation awareness has recently seen a significant move towards automation, mostly by deep learning methods [29]. When learning-based methods are used, the generation of stitched panoramas or latency-free 3D visualization is no longer relevant, as algorithms can directly and more efficiently work on the input data streams. However, problems regarding type-1 and type-2 errors as elaborated in section 3.1 will still limit the use of fully automated systems in the near future.

The most common use-cases for situation awareness will presumably be to support human operators by automated pre-processing and tailored display of information, e.g., by marking objects (with the help of methods in sections 3.6 and 3.8) or replacing parts of the environment (segmented with methods from sections 3.7 and 3.9) with alternative content. This approach to augmented reality (AR) will also be relevant for human-machine interaction in vision-based navigation and control, sections 3.3 and 3.4.

3.3 Vision-based Navigation

Vision-based navigation is highly relevant in two cases: if GPS (or other infrastructure-based navigation) fails, and for understanding or predicting human navigation (which is mostly based on vision). The field of vision-based navigation is classically dominated by geometry-driven approaches, such as visual odometry [66], structure from motion [36], and visual SLAM [17]. Structure from motion is closely related to pose estimation, see section 3.11, with the relative pose between camera and background as the state to be estimated.

The geometry-based approaches have achieved a high level of functionality and usability, but still require substantial computational resources, high-quality cameras that are accurately calibrated, and a predominantly static environment with texture-rich surfaces. The geometry inference from monocular cameras is ill-posed, and if the inference fails, it commonly collapses completely.

The risk of failures in the geometry estimation is mitigated by the use of depth sensors, such as time-of-flight sensors, or stereo camera setups [66]. Segmenting out independent motion (section 3.9) helps to relax the requirement for a static environment. However, similar to situation awareness (section 3.2), the generation of the 3D geometry is just an intermediate step that can be avoided if a direct mapping from sensor streams to positions on a map is learned.

In the DeepDriving project at Princeton [10] and in ALVINN [68], the generation of 3D models is simply skipped. Instead, direct perception and behavior reflexes are learned and applied to navigation at a local scale. The connection to navigation at a global scale is however lost by these approaches and needs to be re-established by a hierarchical map. The position at the respective coarser level of the map can be obtained by various inference methods, e.g. particle filters [2].
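As a toy illustration of such an inference step (a minimal 1-D particle filter in NumPy; the motion and measurement models and all numbers are assumptions, not taken from [2] or [10]):

    import numpy as np

    rng = np.random.default_rng(0)
    particles = rng.uniform(0, 100, 500)   # hypothesized positions along a 1-D map

    def pf_step(particles, control, measurement, motion_std=1.0, meas_std=3.0):
        # predict: propagate each particle with the (noisy) motion model
        particles = particles + control + rng.normal(0, motion_std, particles.size)
        # weight: likelihood of the position measurement given each particle
        w = np.exp(-0.5 * ((measurement - particles) / meas_std) ** 2)
        w /= w.sum()
        # resample: draw particles in proportion to their weights
        return particles[rng.choice(particles.size, particles.size, p=w)]

    particles = pf_step(particles, control=1.0, measurement=42.0)
    print("position estimate:", particles.mean())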

3.4 Vision-based Control

Navigation at a local scale, see section 3.3, is closely related to vision-based control. Examples of local navigation are path following [22] and virtual leashing [38], but also visual servoing for grasping [51]. Thus, physical platforms include all types of vehicles (ground, airborne, underwater, surface), walking robots, and manipulators.

The control of most of these systems is nowadays, at least for unmanned systems, performed by means of tele-operation. This applies, e.g., to drones, such as the Predator as known from the news, marine vehicles, such as the Piranha from SAAB, and walking robots, such as the CENTAURO system, https://www.centauro-project.eu/. Future initiatives will move from tele-operated to autonomous systems. Since vehicles and robots will share their workspace with humans in many cases, and since humans perceive predominantly by vision, vision-based control will be essential to those systems.


The largest progress on vision-based control has presumably been made for ground vehicles, driven by major initiatives for autonomous driving. The omni-hyped Tesla cars make use of Mobileye's system to support their autopilot, as became known after the fatal Tesla accident and the subsequent dispute about responsibility (https://arstechnica.com/?post_type?post&p=929773). The high-end models from Daimler make use of a stereo vision system nominated for a major innovation award (http://www.deutscher-zukunftspreis.de/de/node/251), and Autoliv's NightVision system aids many drivers with IR-based detection of road users and animals at night, e.g. by automatic braking.

With regard to civil drone development, Amazon has made major progress regarding the use of drones for the delivery of parcels, and their landing is visually guided. Slightly differently from vision-guided cars, where the detection of obstacles initiates a single braking action, i.e., open-loop control, landing the Amazon drone is performed in a closed-loop approach, with continuous visual feedback. Also for grasping objects with a robotic hand, either of these two approaches can be applied, and the research community still debates which is the golden solution.
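A minimal sketch of the closed-loop idea (proportional visual servoing on a detected image offset; the detector stand-in, gain, and plant model are toy assumptions, not any specific product):

    import numpy as np

    rng = np.random.default_rng(0)
    state = np.array([40.0, -25.0])   # true image-plane offset to the landing marker (pixels)
    gain = 0.2                        # proportional control gain

    def detect_marker(true_offset):
        # stand-in for a vision-based detector: a noisy observation of the offset
        return true_offset + rng.normal(0, 0.5, 2)

    for _ in range(50):
        offset = detect_marker(state)   # continuous visual feedback
        command = -gain * offset        # steer so the marker moves towards the image centre
        state = state + command         # toy plant: the command directly shifts the offset

    print("remaining offset:", state)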

Irrespective of whether closed-loop or open-loop control is applied, object detection (section 3.6) and object pose estimation (section 3.11) are central vision capabilities that are required. In most failure cases of vision-based control, object detection failed or the estimated pose is too inaccurate, i.e., both robustness and accuracy are still major challenges. If closed-loop control is applied, object tracking (section 3.8) is a further required functionality, and it is believed that the superior performance of fusing detection results and tracking results [38], as well as filtering the pose state, will make closed-loop control favorable in the next decade.

3.5 Object Classification

The section title has been chosen to avoid confusion regarding terminology. Often, the term object recognition is used, but since this is often confused with object detection (section 3.6), object affordances [19], or the generation of generic descriptions of image contents, we suggest avoiding it. Another often-used term is image classification, but this is also slightly misleading, as typical benchmarks require classification of images according to a fixed number of object classes.

Object classification is the task of assigning one or several object labels out of a known set to an image, and it is considered successful if an object of the reported label is visible in the image. Prior to 2012, the most successful classification methods in the ImageNet Large Scale Visual Recognition Challenge [74] followed the typical three-step pipeline of a) feature extraction (e.g. SIFT [55], HOG [13], LBP [67]); b) feature representation (e.g. vector quantization [78], sparse coding [62], Fisher vectors [65]); and c) classification (e.g. SVMs [75], Random Forests [5]).

Since 2012, Deep Learning techniques applying Convolutional Neural Networks (CNNs [52]) have been dominating. As illustrated in figure 1, the first successful deep network was AlexNet [49]. Later approaches such as VGG [9] increased the requirements in memory and computational power significantly (see figure 2). In the further development, a paradigm shift occurred, and more powerful networks such as GoogleNet [80] apply structural or topological decompositions (e.g. Network in Network, NiN [53]) to limit memory requirements. In 2015, ResNet [35] achieved super-human performance, assuming human performance to be about 5% top-5 error.
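As a minimal usage sketch of such a classifier (assuming PyTorch/torchvision with pre-trained weights is available; the choice of ResNet-50 and the input file name are illustrative, not taken from the report):

    import torch
    from torchvision import models, transforms
    from PIL import Image

    model = models.resnet50(pretrained=True).eval()   # ImageNet-trained residual network
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("example.jpg")).unsqueeze(0)   # hypothetical input image
    with torch.no_grad():
        probs = model(img).softmax(dim=1)
    top5 = torch.topk(probs, k=5)   # top-5 hypotheses, as scored in ILSVRC
    print(top5.indices, top5.values)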

Major progress has been made due to the introduction of Batch Normalization [39], the replacement of fully-connected (FC) layers with convolutional layers followed by average pooling over the same number of channels as categories [53], and inception blocks that stack features from different convolutional layers in combination with dimensionality reduction [80]. GoogleNet also uses auxiliary classifiers in a compound loss [80], ResNet additionally makes use of short-cuts [35], and XNOR-net makes use of reduced bit-depths [69].
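A minimal PyTorch sketch of these ingredients (toy layer sizes assumed, not a published architecture): a residual block with batch normalization, a 1x1 convolution instead of an FC layer, and global average pooling.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)   # batch normalization [39]
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            y = torch.relu(self.bn1(self.conv1(x)))
            y = self.bn2(self.conv2(y))
            return torch.relu(x + y)              # short-cut connection as in ResNet

    class TinyClassifier(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.stem = nn.Conv2d(3, 32, 3, padding=1)
            self.block = ResidualBlock(32)
            self.head = nn.Conv2d(32, num_classes, 1)   # 1x1 conv replaces the FC layer

        def forward(self, x):
            x = self.block(torch.relu(self.stem(x)))
            return self.head(x).mean(dim=(2, 3))        # global average pooling

    logits = TinyClassifier()(torch.randn(1, 3, 32, 32))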

The common denominator of these approaches is to reduce the number of parameters and the computational load, while maintaining a highly non-linear structure. This trend will presumably continue, and in particular inference with very powerful networks will work in real time on embedded platforms with low power consumption. In the area of autonomous vehicles, Nvidia's PX-series illustrates the development, with currently up to one Deep Learning Tera-OPS per watt (Xavier and Pegasus, see https://www.nvidia.com/en-us/self-driving-cars/drive-px/).


Figure 2: ImageNet classification error as a function of memory consumption and computational load. Figure from [7].

3.6 Object Detection and Localization

The ILSVRC contains, besides classification, two other challenges: detection and localization [73]. The terminology easily leads to confusion here, but following the assessment in ILSVRC, detection requires reporting bounding boxes and confidences (possibly zero) for a fixed set of object classes, whereas localization focuses on the bounding box of one single object of unknown class. Similar to the classification challenge, five hypotheses are considered in ILSVRC localization, and thus object detection is the more relevant problem in the context of this report.

In contrast to classification, detection (and localization) is a regression problem, as parameters of bounding boxes and confidences need to be estimated. In the detection task, the number of bounding boxes to be regressed is unknown, and therefore the methodology of region proposals [27] is applied to CNNs. The resulting RCNNs perform, for each region proposal, bounding-box regression and classification by means of an SVM [27].

The two successors to RCNNs, Fast RCNN [26] and Faster RCNN [71], speed up the detection by regressing and classifying based on the same feature vector and FC layers (Fast RCNN), and further by also determining the region proposals from a convolutional network (Faster RCNN). The common limitation of all these approaches is the staged processing and the sequential classification and regression of all proposals.

This is different for single shot detectors, such as Yolo [70] and SSD [54]. The Yolo (You Only Look Once) detector applies a single, customized CNN model to the resized image, and its output is filtered with non-max suppression. The SSD (Single Shot MultiBox Detector) extracts feature maps with a standardized network, such as VGG, and then applies multiple layers at different scales. The output of these layers is stacked and fed through non-max suppression.
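A minimal sketch of the non-max suppression step mentioned above (boxes assumed in [x1, y1, x2, y2] format; the threshold is an illustrative choice):

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedily keep the highest-scoring boxes, dropping boxes that overlap them too much."""
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            # intersection of the best box with the remaining ones
            x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter)
            order = rest[iou <= iou_threshold]   # discard strongly overlapping boxes
        return keep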

Despite the fact that methods get faster, more robust, and more accurate, object detection is still a challenging problem and, in contrast to classification, has not surpassed human-level performance yet. However, this is expected to happen within the next five years, although only if the visual input is of sufficient quality [25]. Current trends point towards object detection by instance segmentation though, i.e., labeling pixels instead of regressing bounding boxes, see section 3.7.


3.7 Object and Instance Segmentation

Object or instance segmentation is a major development that has its roots partly in semantic segmentation, where the task is to segment the whole image according to the image content, and partly in object detection, but with the bounding-box output replaced by pixel-wise labeling. Recently, the trend has been to move from bounding boxes to pixel-wise labeling, because the assessment of accuracy using the Jaccard index (intersection over union) is strongly influenced by background pixels inside the bounding box and object pixels outside the bounding box. A similar trend is starting to develop in object tracking, see section 3.8.
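A minimal sketch of this effect (the toy masks below are assumptions for illustration): the Jaccard index of pixel masks can be low even when the corresponding bounding boxes overlap almost perfectly.

    import numpy as np

    def jaccard(a, b):
        """Intersection over union of two boolean masks."""
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    def bbox_mask(m):
        """Fill the axis-aligned bounding box of a mask."""
        ys, xs = np.nonzero(m)
        box = np.zeros_like(m)
        box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
        return box

    # toy ground truth and a shifted prediction of a thin diagonal object
    gt = np.zeros((100, 100), bool)
    pred = np.zeros((100, 100), bool)
    for i in range(100):
        gt[i, max(0, i - 2):i + 3] = True
        pred[i, max(0, i - 4):i + 1] = True

    print("mask IoU:", jaccard(gt, pred))                       # clearly below 0.5
    print("box IoU:", jaccard(bbox_mask(gt), bbox_mask(pred)))  # close to 1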

Similar to object detection, the focus here will be on machine learning approaches, and previous techniques, such as conditional random fields [48], will be considered in the context of their integration into deep networks. In contrast to the previously discussed deep networks, segmentation requires output at the same resolution as the input. For this purpose, final FC layers are replaced with convolutional layers with learnable upsampling, forming fully convolutional neural networks (FCNs) [76].

Due to the intermediately reduced resolution, the output of the single-stream FCN lacks fine details and accurate contours. This problem is addressed by introducing additional shortcuts that connect previous, higher-resolution layers with the upsampling layers close to the network output. These shortcuts are similar to inception modules or ResNet (compare section 3.5), but instead of concatenation, fusion is applied.
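A minimal sketch of such an FCN-style head (layer sizes are assumptions, not those of [76]): the coarse prediction is upsampled with a learnable transposed convolution and fused by addition with a score map from a higher-resolution skip connection.

    import torch
    import torch.nn as nn

    class TinyFCNHead(nn.Module):
        def __init__(self, coarse_ch, skip_ch, num_classes):
            super().__init__()
            self.score_coarse = nn.Conv2d(coarse_ch, num_classes, 1)
            self.score_skip = nn.Conv2d(skip_ch, num_classes, 1)
            # learnable 2x upsampling of the coarse score map
            self.up = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

        def forward(self, coarse, skip):
            up = self.up(self.score_coarse(coarse))   # upsample the coarse prediction
            return up + self.score_skip(skip)         # fuse (add) the higher-resolution scores

    head = TinyFCNHead(coarse_ch=256, skip_ch=128, num_classes=21)
    coarse = torch.randn(1, 256, 16, 16)   # low-resolution deep features
    skip = torch.randn(1, 128, 32, 32)     # higher-resolution shallow features
    print(head(coarse, skip).shape)        # -> torch.Size([1, 21, 32, 32])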

Besides using shortcuts, also deconvolution, or transposed convolution, can be applied [60]. Training the deconvolutional network is computationally very demanding, and the network requires a special unpooling layer that records the maximum positions. A further, very recent approach that achieves top results on, e.g., the COCO benchmark (http://cocodataset.org/#detections-challenge2017) is Mask R-CNN [33]. Mask R-CNN extends Faster RCNN (see section 3.6) with masks for pixel-wise segmentation.

Despite significant progress on instance segmentation, machine learning approaches have not surpassed human-level performance on this problem either. For the next few years, further progress will presumably be made by combining approaches and concepts from object detection and segmentation, e.g. single shot detectors and masks, or by incorporating CRFs. Instance segmentation will eventually dominate over detection with bounding boxes. This transition is also discussed in the context of object tracking, see section 3.8.

3.8 Object Tracking

Object tracking is similar to object detection, repeated on each frame of a sequence and reporting a bounding box for the object to be tracked. The main difference is that object categories are usually not known beforehand and the appearance of the object has to be inferred from one single sample (bounding box) at the beginning of the sequence.

Despite much effort, deep learning has seen only limited success in object tracking when compared to image classification, section 3.5, and object detection, section 3.6. This is attributed to three main causes: (1) limited online training data is challenging for the, in general, data-hungry deep learning methods; (2) the absence of prior knowledge about the target implies that the features must generalize to unseen object classes; (3) the problem of leveraging both accuracy and robustness by exploiting the complementary properties of shallow and deep features.

Still, most of the more current tracking methods exploit deep features by: (a) employing a classification setting [87, 58, 82] where a pre-trained CNN model is updated online to consider target-specific appearance changes during tracking; (b) learning a similarity metric in a Siamese architecture [32, 4, 81]; (c) employing a discriminative correlation filter (DCF) based framework [16, 84, 14]. These DCF trackers learn a correlation filter in an online fashion from example image patches to predict the target location. By exploiting the FFT algorithm, the least-squares regression formulation exhibits a particularly sparse structure, enabling tractable online implementations. Previously, conventional DCF methods [15, 37] applied hand-crafted features, such as HOG [13] and Color Names [85].
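A minimal single-channel sketch in the spirit of these DCF trackers (a MOSSE-like filter in NumPy, not the specific formulations of [15, 16, 14]; patch handling, regularization, and learning rate are assumptions): the closed-form least-squares solution becomes element-wise in the Fourier domain, and the online update is a running average of its numerator and denominator.

    import numpy as np

    def gaussian_label(shape, sigma=2.0):
        """Desired correlation output: a Gaussian peak, stored in the Fourier domain."""
        h, w = shape
        ys, xs = np.mgrid[0:h, 0:w]
        g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
        return np.fft.fft2(np.fft.ifftshift(g))

    class CorrelationFilterTracker:
        def __init__(self, patch, lam=1e-2, lr=0.1):
            self.lam, self.lr = lam, lr
            self.Y = gaussian_label(patch.shape)
            X = np.fft.fft2(patch)
            self.A = self.Y * np.conj(X)   # numerator of the closed-form solution
            self.B = X * np.conj(X)        # denominator (element-wise in the Fourier domain)

        def detect(self, patch):
            """Correlate the filter with a new patch; the response peak indicates the target shift."""
            X = np.fft.fft2(patch)
            response = np.real(np.fft.ifft2(self.A / (self.B + self.lam) * X))
            return np.unravel_index(np.argmax(response), response.shape)

        def update(self, patch):
            """Online update: running average of numerator and denominator."""
            X = np.fft.fft2(patch)
            self.A = (1 - self.lr) * self.A + self.lr * self.Y * np.conj(X)
            self.B = (1 - self.lr) * self.B + self.lr * X * np.conj(X)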


Recently however, DCF based trackers have shown state-of-the-art performance on standard tracking benchmarks, such as the VOT challenge, by integrating deep convolutional features [16, 14]. The VOT (Visual Object Tracking) challenge (http://www.votchallenge.net/) attempts to become a driving force for tracking, similar to ILSVRC for image classification and localization/detection and COCO for object detection and segmentation.

As indicated in the previous section 3.7, object tracking also tends to move away from the classical concept of bounding boxes and towards pixel-wise labeling. It is hoped that tracking accuracy can be assessed more correctly by computing the Jaccard index of segments instead of bounding boxes, where the current state of the art achieves about 1/3 intersection over union. This is expected to further drive the development of object tracking methods towards motion segmentation, see section 3.9.

3.9 Motion Segmentation

Motion segmentation is the task of segmenting instances or objects in image sequences based on their relative motion. The variety of sub-problems is large due to different modalities, such as video or scene flow and dense or sparse data, e.g. [89]. Methods can be based on geometric approaches, optical flow, or machine learning. Classical approaches to motion segmentation typically fall behind human-level performance [88]; therefore, the focus here lies on machine learning approaches, too. Techniques can be based on unsupervised or semi-supervised learning, where the former is the practically more relevant case and will be considered here.

The DAVIS (Densely Annotated VIdeo Segmentation) challenge, http://davischallenge.org/, has triggered a significant amount of research, and the leaderboard is dominated by Deep Networks, with a few exceptions, e.g. [63, 20]. Among the Deep Learning based approaches, [83] and [41] are two of the most successful ones. The main difference is the way they are trained; whereas the former uses synthetic sequences with ground-truth segmentation, the latter uses a heuristic method to generate annotations for real sequences. The manual annotation of motion segmentation requires prohibitively large effort and is thus a fundamental problem of learning motion segmentation.

The method by Tokmakov et al. [83] integrates a two-stream process and visual memory. The streams consist of deep networks addressing appearance and motion features, respectively. The memory module updates the streams in a temporally consistent way. The work by Jain et al. [41] suggests combining a segmentation network with an appearance stream and trains a layer to combine the predictions of the two, however without temporal processing to enforce consistency. Other recent approaches such as [6, 64] address the semi-supervised motion segmentation problem and are practically less relevant.

Yet another category of machine learning approaches is based on recurrent neural networks or improved variants such as GRU [11] and ConvLSTM [21]. Such methods can be used for frame prediction [79] or for video representation [3]. The latter method, ConvGRU, has been applied as visual memory component in [83]. Video representation is also closely connected to action recognition [59], see also section 3.10.

Motion segmentation is currently one of the most rapidly emerging sub-fields of Deep Learning for computer vision, and the future development will presumably be very dynamic. Besides the integration of spatio-temporal aspects of the input data, other, more implicit information, such as dynamical models of the segmented objects, will be subject of research. A prediction of the state of the art even for the near future is rather impossible, but it is definitely important to follow up on.

3.10 Action Recognition

As pointed out in section 3.9, motion segmentation methods are closely related to action recognition [59]. In general, the field of action recognition can be split, similar to object recognition, into action classification and action detection. However, because these sub-fields are much smaller than the corresponding object-related counterparts, both areas are summarized below.


Also, action recognition can be performed on image sequences and still images [46], but the focus here is on action recognition in image sequences, as this is more relevant to the application area.

CNNs have also led to significant improvements in the state of the art in action recognition [77]. In this work, the authors propose a two-stream architecture where separate networks are trained to process spatial (RGB) and temporal (optical flow) information. Other approaches following the two-stream idea make use of pose information [12] or region proposals [28]. An essential aspect in multi-stream processing is the choice of fusion method: early, late, and attention-based fusion [45]. The latter method has been applied to two-stream based action recognition [1].
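A minimal sketch of late fusion in a two-stream setup (the stream networks below are trivial placeholders, not the architecture of [77]): one stream scores an RGB frame, the other scores a stack of optical-flow fields, and the class scores are averaged.

    import torch
    import torch.nn as nn

    num_classes = 101
    spatial_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, num_classes))
    temporal_net = nn.Sequential(nn.Flatten(), nn.Linear(2 * 5 * 64 * 64, num_classes))

    rgb_frame = torch.randn(1, 3, 64, 64)        # appearance stream input
    flow_stack = torch.randn(1, 2 * 5, 64, 64)   # 5 flow fields (x and y components)

    # late fusion: average the per-class scores of the two streams
    scores = 0.5 * spatial_net(rgb_frame).softmax(1) + 0.5 * temporal_net(flow_stack).softmax(1)
    action = scores.argmax(1)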

As pointed out in the beginning, action recognition is tightly connected to motion segmentation approaches, and the presumed development of the latter, see section 3.9, will strongly influence the development of action recognition methods. The methods from this area are however still significantly below human-level performance, and it will presumably take more than five years to reach a level of maturity. Among other aspects, efficient methods for pose estimation (see section 3.11) will also need to be integrated.

3.11 Pose Estimation

The area of pose estimation covers classical topics of computer vision, such as human pose estimation and 6D pose estimation of rigid objects. For the former task, a new challenge has recently emerged, the PoseTrack challenge (https://posetrack.net/workshops/iccv2017/index.html). It covers three categories, depending on the number of people and views. As pointed out in section 3.10, results from human pose estimation are useful for action recognition [12]. However, for the application in mind here, human pose estimation for action recognition might be interesting, but the pose estimation itself is of limited relevance.

Thus, this section focuses on pose estimation for rigid objects, relevant to all types of control tasks, see section 3.4. Even for this particular problem, the literature offers a vast number of methods [50]. However, when the focus lies on robust methods, machine learning approaches often compare favorably [42, 61]. With the advent of Deep Learning, pose estimation has also been addressed with CNNs, for instance with the PoseNet approach [44] to estimate the 6D camera pose.

It turned out that regression problems with geometric output require special loss functions [43] in order to achieve performance similar to geometry-based approaches that exploit available 3D models, e.g. terrain models [31]. Notably, although camera pose estimation and object pose estimation are related (if instance segmentation is performed first, object pose estimation is simply camera pose estimation on the image segment), rigid object pose estimation is rarely addressed by Deep Learning methods. More concretely, Garon and Lalonde claim that their very recent work is the first end-to-end deep learning method [24].
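A minimal sketch of such a loss in the spirit of [44, 43] (illustrative, not the exact published formulation): translation and rotation errors live on different scales and are therefore combined with a weighting factor; [43] instead learns this weighting or replaces it with a reprojection loss.

    import torch

    def pose_loss(t_pred, q_pred, t_gt, q_gt, beta=250.0):
        """Weighted sum of translation error and unit-quaternion orientation error."""
        t_err = torch.norm(t_pred - t_gt, dim=-1)
        q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)   # normalize the predicted quaternion
        q_err = torch.norm(q_pred - q_gt, dim=-1)
        return (t_err + beta * q_err).mean()

    t_pred, t_gt = torch.randn(4, 3), torch.randn(4, 3)
    q_pred = torch.randn(4, 4)
    q_gt = torch.nn.functional.normalize(torch.randn(4, 4), dim=-1)
    print(pose_loss(t_pred, q_pred, t_gt, q_gt))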

Similar to action recognition, see section 3.10, pose estimation has only been addressed to a limited extent with Deep Learning methods, and a highly dynamic development is expected for the near future. Combining machine learning and geometric accuracy is a fundamental challenge, and novel loss functions have to be developed. For applications in the near future, classical approaches will presumably remain the first choice.

4 A Remark on Synthetic Data

Generally, deep networks require large amounts of labeled data for training. Recent works have investigated the use of data synthesis as a solution to train CNNs when only limited data is available. The work of Dosovitskiy et al. [18] generated a synthetic dataset to learn optical flow with CNNs. Their work showed that deep networks trained on unrealistic data are able to generalize well to existing real datasets. In the case of scene text recognition, the work of Jaderberg et al. [40] proposed an approach to train deep networks solely on data produced by a synthetic text generation engine. In the context of human pose estimation, recent works [72, 86] have investigated training CNNs on synthetic human data capturing rich variation in poses, clothing, hair styles, body shapes, occlusions, viewpoints, motion blur, and other factors.


5 Conclusions

This overview report summarizes the development in computer vision research over the past five years, which has been dominated by the effects of the Deep Learning revolution. Many application areas of computer vision, such as surveillance, situation awareness, vision-based navigation, and vision-based control, have been influenced by major progress in classical areas such as object classification, object detection and localization, object and instance segmentation, object tracking, motion segmentation, action recognition, and pose estimation. However, several of the latter problems still have large potential for future improvements.

References

[1] R. M. Anwer, F. S. Khan, J. van de Weijer, and J. Laaksonen. Top-Down Deep Appearance Attention for Action Recognition, pages 297–309. Springer International Publishing, Cham, 2017.

[2] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

[3] N. Ballas, L. Yao, C. Pal, and A. C. Courville. Delving deeper into convolutional networks for learning video representations. CoRR, abs/1511.06432, 2015.

[4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCV workshop, 2016.

[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[6] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2017.

[7] A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications. CoRR, abs/1605.07678, 2016.

[8] J. Chandaria, G. Thomas, B. Bartczak, K. Koeser, R. Koch, M. Becker, G. Bleser, D. Stricker, C. Wohlleber, M. Felsberg, F. Gustafsson, J. Hol, T. B. Schön, J. Skoglund, P. J. Slycke, and S. Smeitz. Real-time camera tracking in the MATRIS project. SMPTE Motion Imaging Journal, 116:266–271, 2007.

[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.

[10] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the 15th International Conference on Computer Vision, 2015.

[11] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.

[12] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3218–3226, Dec 2015.

[13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893 vol. 1, June 2005.

[14] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. ECO: Efficient convolution operators for tracking. In CVPR, 2017.

[15] M. Danelljan, F. S. Khan, M. Felsberg, and J. van de Weijer. Adaptive color attributes for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[16] M. Danelljan, A. Robinson, F. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Computer Vision – ECCV 2016, Part V, Lecture Notes in Computer Science, pages 472–488. Springer International Publishing, 2016.


[17] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):116, 2007.

[18] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, Dec 2015.

[19] L. Ellis, M. Felsberg, and R. Bowden. Affordance mining: Forming perception through action. In R. Kimmel, R. Klette, and A. Sugimoto, editors, Computer Vision – ACCV 2010, volume 6495 of Lecture Notes in Computer Science, pages 525–538. Springer Berlin / Heidelberg, 2011.

[20] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014.

[21] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 64–72, 2016.

[22] K. Öfjäll, M. Felsberg, and A. Robinson. Visual Autonomous Road Following by Symbiotic Online Learning. In Intelligent Vehicles Symposium (IV), 2016 IEEE, pages 136–143, 2016.

[23] Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.

[24] M. Garon and J. F. Lalonde. Deep 6-DOF tracking. IEEE Transactions on Visualization and Computer Graphics, 23(11):2410–2418, Nov 2017.

[25] R. Geirhos, D. H. J. Janssen, H. H. Schütt, J. Rauber, M. Bethge, and F. A. Wichmann. Comparing deep neural networks against humans: object recognition when the signal gets weaker. CoRR, abs/1706.06969, 2017.

[26] R. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, Dec 2015.

[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, Jan 2016.

[28] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 759–768, 2015.

[29] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[30] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, Dordrecht, 1995.

[31] B. Grelsson, M. Felsberg, and F. Isaksson. Highly accurate attitude estimation via horizon detection. Journal of Field Robotics, accepted, 2016.

[32] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning dynamic siamese network for visual object tracking. In ICCV, 2017.

[33] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[34] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1026–1034, Washington, DC, USA, 2015. IEEE Computer Society.

[35] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[36] J. Hedborg, P.-E. Forssén, M. Felsberg, and E. Ringaby. Rolling shutter bundle adjustment. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[37] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 37(3):583–596, 2015.

[38] G. Häger, G. Bhat, M. Danelljan, F. S. Khan, M. Felsberg, P. Rudl, and P. Doherty. Combining visual tracking and person detection for long term tracking on a UAV. In ISVC, 2016.


[39] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456. JMLR.org, 2015.

[40] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vision, 116(1):1–20, Jan. 2016.

[41] S. D. Jain, B. Xiong, and K. Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2126, July 2017.

[42] E. Jonsson and M. Felsberg. Accurate interpolation in appearance-based pose estimation. In Proc. 15th Scandinavian Conference on Image Analysis, volume 4522 of LNCS, pages 1–10, 2007.

[43] A. Kendall and R. Cipolla. Geometric loss functions for camera pose regression with deep learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[44] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[45] F. Khan, R. Muhammad Anwer, J. van de Weijer, A. Bagdanov, A. Lopez, and M. Felsberg. Coloring action recognition in still images. International Journal of Computer Vision, pages 1–17, 2013.

[46] F. S. Khan, R. M. Anwer, J. van de Weijer, M. Felsberg, and J. Laaksonen. Deep semantic pyramids for human attributes and action recognition. In SCIA, 2015.

[47] J. J. Koenderink and A. J. van Doorn. Image processing done right. In Proceedings European Conference on Computer Vision, pages 158–172, 2002.

[48] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 109–117. Curran Associates, Inc., 2011.

[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[50] G. l. Shan, B. Ji, and Y. f. Zhou. A review of 3D pose estimation from a monocular image sequence. In 2009 2nd International Congress on Image and Signal Processing, pages 1–5, Oct 2009.

[51] F. Larsson, E. Jonsson, and M. Felsberg. Simultaneously learning to recognize and control a low-cost robotic arm. Image and Vision Computing, 27(11):1729–1739, 2009.

[52] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.

[53] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.

[54] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector, pages 21–37. Springer International Publishing, Cham, 2016.

[55] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[56] D. Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY, USA, 1982.

[57] G. Meneghetti, M. Danelljan, M. Felsberg, and K. Nordberg. Image alignment for panorama stitching in sparsely structured environments. In SCIA, 2015.

[58] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.

[59] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition, 2015.

[60] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1520–1528, Washington, DC, USA, 2015. IEEE Computer Society.

[61] K. Öfjäll and M. Felsberg. Approximative coding methods for channel representations. Journal of Mathematical Imaging and Vision, 2017.


[62] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

[63] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV '13, pages 1777–1784, Washington, DC, USA, 2013. IEEE Computer Society.

[64] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Computer Vision and Pattern Recognition, 2017.

[65] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification, pages 143–156. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[66] M. Persson, T. Piccini, M. Felsberg, and R. Mester. Robust stereo visual odometry from monocular techniques. In IEEE IV, 2015.

[67] M. Pietikäinen and G. Zhao. Two decades of local binary patterns: A survey. CoRR, abs/1612.06795, 2016.

[68] D. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Proc. of NIPS, 1989.

[69] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[70] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, June 2016.

[71] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.

[72] G. Rogez and C. Schmid. MoCap-guided data augmentation for 3D pose estimation in the wild. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3108–3116, 2016.

[73] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[74] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575, 2014.

[75] B. Schoelkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, 1997.

[76] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, Apr. 2017.

[77] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.

[78] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 1470–1477 vol. 2, Oct 2003.

[79] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. ICML, 2015.

[80] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. Conf. Computer Vision and Pattern Recognition, pages 1–9, 2015.

[81] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese instance search for tracking. In CVPR, 2016.

[82] Z. Teng, J. Xing, Q. Wang, C. Lang, S. Feng, and Y. Jin. Robust object tracking based on temporal and spatial deep networks. In ICCV, 2017.

[83] P. Tokmakov, K. Alahari, and C. Schmid. Learning Video Object Segmentation with Visual Memory. In ICCV - IEEE International Conference on Computer Vision, Venice, Italy, Oct. 2017.


[84] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.

[85] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. Image Processing, IEEE Transactions on, 18(7):1512–1523, 2009.

[86] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from Synthetic Humans. In CVPR, 2017.

[87] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.

[88] L. Zappella, X. Lladó, and J. Salvi. Motion segmentation: A review. In Proceedings of the 2008 Conference on Artificial Intelligence Research and Development: Proceedings of the 11th International Conference of the Catalan Association for Artificial Intelligence, pages 398–407, Amsterdam, The Netherlands, 2008. IOS Press.

[89] V. Zografos, R. Lenz, E. Ringaby, M. Felsberg, and K. Nordberg. Fast segmentation of sparse 3d point trajectories using group theoretical invariants. In Proc Asian Conference Computer Vision, LNCS, 2014.
