DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Detection of Humans in Video Streams Using Convolutional Neural Networks

HUIJIE WANG

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Machine Learning
Date: July 31, 2017
Supervisor: Mårten Björkman, André Lovtjärn
Examiner: Danica Kragic

Swedish title: Detektion av människor i videoströmmar med hjälp av convolutional neural networks


Abstract

This thesis is focused on human detection in video streams using Convolutional Neural Networks (CNNs).

In recent years, CNNs have become common methods for various computer vision problems, and object detection is one popular application. The performance of CNNs on the detection problem has undergone a rapid increase in both accuracy and speed. In this thesis, we focus on a specific sub-domain of detection: human detection. The problem is made more challenging as the data are extracted from video streams captured by a head-mounted camera, and therefore include difficult viewpoints and strong motion blur.

Considering both accuracy and speed, we choose two models with typical structures, You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD), to examine how robustly the models perform on the human domain with motion blur, and how the differences between their structures may influence the results.

Several experiments are carried out in this thesis. With a better-designed structure, SSD outperforms YOLO in various aspects. This is further confirmed as we fine-tune YOLO and SSD300 on the human data in the Pascal VOC 2012 trainval dataset, showing the efficiency of SSD when trained with fewer classes. As for the motion blur problem, the experiments show that SSD300 has a good ability to learn blurred patterns.


Sammanfattning

This degree project investigates the problem of detecting humans in video streams using convolutional neural networks (CNNs).

In recent years the use of CNNs has grown, bringing large improvements in accuracy and computational speed. CNNs are now a popular method for various computer vision and image recognition problems. In this project we focus on a specific sub-domain: the detection of humans. The problem is made harder by the fact that our video data is recorded from a head-mounted camera, which means that our system needs to handle unusual viewing angles and motion blur.

After taking computational speed and detection quality into account, we chose to investigate two different CNN models: You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD). The experiments were designed to show how robust the methods are at detecting humans in images with motion blur. We have also investigated how modifications to the network structures can affect the final results.

Several experiments were conducted in this project. We show that SSD gives better results than YOLO in many respects, owing to SSD's better-designed network structure. By fine-tuning YOLO and SSD on the image collection in Pascal VOC 2012, we show that SSD works well even when trained on fewer object classes. SSD300 also has a good ability to learn patterns affected by blur. We also analyze how the choice of positions and scales of the predefined search regions affects the results of SSD300.


Contents

1 Introduction
   1.1 Background and motivation
   1.2 Research question
   1.3 Thesis organization
2 Literature Review
   2.1 Human detection
   2.2 Convolutional Neural Networks
       2.2.1 Popular CNNs structures
       2.2.2 CNNs for detection problems
3 Methodology
   3.1 Datasets
       3.1.1 Pascal VOC
       3.1.2 Shopping and Office dataset
   3.2 Convolutional Neural Networks (CNNs)
   3.3 Architecture
       3.3.1 You Only Look Once (YOLO)
       3.3.2 Single Shot MultiBox Detector (SSD)
   3.4 Transfer learning and Fine-tuning
   3.5 Evaluation of accuracy
4 Experiments and results
   4.1 Brief introduction to experiments
   4.2 Generalization ability of YOLO and SSD on Shopping dataset
   4.3 Fine-tuning on Pascal VOC 2012 human data
       4.3.1 Fine-tuning YOLO
       4.3.2 Fine-tuning SSD
   4.4 Further experiments on SSD
       4.4.1 Simulating motion blur on Pascal VOC 2012
       4.4.2 K-means on SSD default boxes
       4.4.3 SSD performance on different scales and locations of objects
   4.5 Samples of detection results on Video Streams
   4.6 Fine-tuning on Shopping dataset
   4.7 Detection speed
5 Discussion and Conclusion
   5.1 Ethical and social implications
   5.2 Limitations and possible improvements
       5.2.1 Datasets
       5.2.2 YOLO
       5.2.3 SSD
   5.3 Overfitting problem in YOLO and SSD
   5.4 Conclusion
Bibliography
A Plots of accuracy


Chapter 1

Introduction

1.1 Background and motivation

The applications of convolutional neural networks (CNNs) have grown rapidly since [1] achieved significantly high accuracy in the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)[2]. As deeper and larger neural networks have been applied, the errors on the ILSVRC datasets have undergone a rapid decrease, and much research has endeavored to find ways to transfer such models to various applications.

One of the major transferred tasks is object detection. The detection problem can be regarded as a task of labeling objects in an image with the correct class as well as predicting bounding boxes associated with real-valued confidence. Many researchers have proposed different structures of deep neural networks for this problem, such as Overfeat[3], DeepMultiBox[4], Region CNN[5], YOLO[6], SSD[7], etc. Detection also shows up as a sub-task in two of the computer vision challenges, the Pascal Visual Object Classes (VOC) Challenge[8] and the Large Scale Visual Recognition Challenge (ILSVRC). These two challenges conveniently provide researchers with valuable datasets, as well as elaborately annotated ground truth. Furthermore, detailed evaluation methods are designed in the challenges, which make the methods comparable to each other.

Taking advantage of these rapidly developing methods, we here focus on a specific detection task: human detection in video streams. We are also interested in how CNN models understand human patterns in natural viewing behavior. In human behavior analysis, one of the key elements that affect human attention is daily interaction with other people. Therefore, it is highly relevant to be able to identify and track the position of every person visible in a video stream recorded in alignment with human vision. In our work, we take a pure detection approach to human localization in video streams, so that this work can serve as a prior study for human tracking and human behavior analysis.

For research on human behavior, naturalness is one of the most important factors in data quality. Fortunately, we have access to one of the most advanced tools, Tobii Pro Glasses 2, a pair of glasses weighing 45 grams, designed to capture natural viewing behavior. The videos are recorded with a head-mounted camera assembled in the glasses. As the agile glasses allow highly flexible movement of vision, the video streams usually contain fast motion, unpredictable rotation, and strong motion blur, making the problem more challenging.

Given prior research, which will be discussed in Chapter 2, it can be seen that CNNs are superior to other models for computer vision problems due to their wide and deep structures.


Therefore, it is believed possible to gain reasonable accuracy and speed via deep CNNs for human detection in challenging video streams.

1.2 Research question

This thesis adopts single-frame detection of humans in video streams from head-mounted cameras, i.e. the detection is based only on the content of each frame and no temporal information is included. In this way, it can serve as a prior study for the human tracking problem. On the other hand, owing to the promising development of CNNs, we strive to research how robustly different CNN structures (You Only Look Once (YOLO)[6] and Single Shot MultiBox Detector (SSD)[7]) perform on challenging videos with fast motion and strong blur, and how the structures may influence the results.

The trend of CNN-based solutions to detection problems can be summarized from Chapter 2. First of all, the accuracy of CNNs on detection problems is improving with carefully designed network structures. Most detection models are modified from classification models. However, different from the classification problem, as more factors are to be predicted, prior knowledge can be important when designing networks, for example the scale, region proposals and aspect ratios of objects. Secondly, rather than a domain-specific model, a model with generalization ability is more popular in recent development. Redmon and Farhadi [9] adopt a joint training strategy over two datasets, which even extends the network with the ability to detect objects of unseen classes. Thirdly, current technologies are already capable of detecting objects at high speed with relatively high accuracy, a factor of high importance in practical applications.

Two CNN structures are studied in this thesis. However, the research question in our work differs from common detection problems in the CNN area in the following aspects:

1. Our detection target is only for a specific domain: human.

Lacking annotations in video streams, the data size is limited and the available annotation is also limited to humans. However, deep neural networks benefit from huge datasets. In recent research, stand-alone human detection is not so common with CNNs; instead, humans appear as an important class integrated in the models (e.g. [10], [6], [7], etc.). It is therefore interesting to know how the reduction of the number of classes would affect the performance of the models, which is not commonly studied.

2. Our target dataset is different.

Commonly used datasets for detection, such as ImageNet[2], Pascal VOC[8] and Microsoft COCO[11], are mainly based on static photographs. Some pedestrian detection benchmarks, such as Caltech [12] and ETH [13], are recorded from a vehicle-mounted camera, which has a fixed point of view. Also, some object tracking benchmarks, such as MOT[14] and the YouTube-BoundingBoxes dataset[15], are not practicable for us due to different ways of annotation: the former annotates all kinds of objects, including non-human objects, while the latter is designed for single-object tracking, so that only one object is annotated in each video.


Therefore, in this master thesis, our research studies how general models can handle a domain-specific detection problem, including the ability to generalize to the blurred data in video streams, domain transfer by fine-tuning the models, and the influence of certain factors in structure design. Experiments with quantitative evaluation of accuracy and computational cost are carried out. Considering both accuracy and speed, we choose YOLO and SSD as two benchmarks. The two models also have different CNN structures: YOLO mainly follows the classical classification model GoogLeNet with small modifications, while SSD adopts a fully-convolutional structure with clear definitions of scales and aspect ratios for different feature maps. Both models have versions pre-trained on the Pascal VOC dataset. Our methodology and experiments on these two models are introduced in Chapters 3 and 4. Furthermore, this thesis can support further tracking methods that use temporal information.

1.3 Thesis organization

This chapter provides a brief description of the thesis. Chapter 2 gives a detailed introduction to previous studies and related research. The methodology is introduced in Chapter 3, including the datasets related to our thesis, basic technologies in CNNs, and the two main networks, YOLO and SSD. We also introduce the evaluation methods we choose.

Chapter 4 gives detailed explanations and analyses of the experiments carried out on both the YOLO and SSD models. Our training strategy is discussed in this chapter. The performance of both the original models and our fine-tuned models is evaluated. Meanwhile, detailed tests examine factors of the networks which may have significant influence on our research question. Furthermore, a brief test of the inference speed of the models shows their ability to perform real-time detection.


Chapter 2

Literature Review

This chapter gives a detailed introduction to related work. Human detection is a classical detection problem, and various methods have been introduced to improve its performance. Nowadays, deep learning methods show further possibilities for solving this problem. As this thesis strives to research how CNNs perform on the human detection problem, the discussion is focused on previous studies of human detection, as well as different structures of CNNs. Especially, as detection networks are to some extent a transfer (the definition of transfer learning will be introduced in Section 3.4) from classification CNNs, many of the structures are developed based on classical classification architectures. Therefore, the related work on CNNs also includes commonly-used structures for the classification problem.

2.1 Human detection

Human detection is a task of localizing all objects regarded as humans in static images or video streams. Generally, the process of human detection is performed through the following steps: extracting prospective regions of interest from an image, describing the regions with descriptors, and then classifying the regions into human or non-human, followed by post-processing methods[16].

In traditional methods, human descriptors are usually constructed from locally extracted features, such as edge-based shape features (e.g. [17]), appearance features (e.g. color[18], texture[19]), motion features (e.g. temporal differences[20], optical flows[21]), and their combinations [22]. In these methods, features are mostly manually designed, which benefits from simplicity of definition and ease of intuitive understanding. They have also been proven to work well on small training datasets.

The previous state-of-the-art method for detection was the Deformable Part-based Model (DPM)[23]. The method can be regarded as an extension of histograms of oriented gradients (HOG). Predicted objects are scored according to both a coarse global template of the entire object and six parts of the object at a higher resolution, each described with HOG. In this way, the multiple HOG models can deal with the problem of changing points of view. During training, a latent support vector machine (latent SVM) is used, reducing the detection problem to a classification of regions, with the positions of parts treated as latent variables. The method was highly influential due to its robustness.

However, manually defined features can fail to describe more complex information about objects. In particular, they are challenged by motion blur, occlusion, changes of illumination, background, etc. As a result, deep learning methods have recently come to be considered more effective for human detection, since they are able to learn more complex features from images. Related research includes [24], [25] and [26]. These early deep models already showed some advantages compared to classical ones, but their features were still manually designed and the main ideas were extensions of previous models. Some research also uses deep CNNs as feature extractors, such as [27], where the extracted features are followed by complexity-aware cascade training for pedestrian detection.

Recent deep learning methods treat the detection problem in different ways. One of the most successful structures is the Convolutional Neural Network (CNN). Deep CNNs have the ability to learn object features by themselves, so they are also less dependent on object classes. On the other hand, as deep learning requires large amounts of training data, training class-independent models gives access to more data than training a domain-specific model. Only a few papers address the human domain using CNN methods. Tian et al. [28] use a CNN to learn segmentation attributes of humans (e.g. backpack, hat, etc.), but the network only helps to improve prediction performance by re-classifying predicted objects as positive or negative, instead of producing predictions directly. Li et al. [29] develop a network based on Fast R-CNN, adding a sub-network to handle objects of small scale. Zhang et al. [30] directly test how Faster R-CNN, a cross-class CNN detection model, performs on stand-alone pedestrian detection, and report a positive result. Of the three methods, apart from [28] which does not provide detection directly, [29] and [30] both experiment on the basis of cross-class detection models. The development of CNNs and popular architectures is introduced in the following section.

2.2 Convolutional Neural Networks

2.2.1 Popular CNNs structures

A convolutional neural network is a specific type of artificial neural network, using convolution instead of vector/matrix operations in at least one layer[31]. Initially, the CNN was designed for image classification problems. The prototype of CNNs, known as LeNet, was proposed in 1998 by LeCun et al. for the task of handwriting recognition[32]. LeNet is a 7-layer network, including 3 convolutional layers and a fully-connected layer, which was of great depth at that time. However, despite the good results it produced, the method was largely limited by the available computational resources and amount of data. As a result, deep learning methods applied to computer vision did not come into common use until 2012, when Alexnet[1] achieved great success in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012[2].


Figure 2.1: The architecture of Alexnet (reprint from [1]).

Alexnet has later been popularly used as a backbone architecture for various problems such as object detection and semantic segmentation. Meanwhile, the architecture has been further analyzed and extended, so that more efficient CNNs have been introduced. Simonyan and Zisserman [33] investigated how the depth of CNNs affects classification accuracy on the ImageNet dataset and proposed the VGG nets. The research found that, using small convolutional filters (3 × 3), accuracy increased greatly when the depth was pushed to 16-19 layers. GoogLeNet[34], also inspired by LeNet[32], achieved the state-of-the-art result in ILSVRC 2014. The network is a 22-layer CNN with multi-scale processing of images, making it suitable for both classification and detection. Furthermore, although the depth and width of the network are both increased, the computational complexity remains constant. However, as stated in [35], the degradation of training accuracy becomes a severe problem as CNNs go deeper, which makes deep neural networks hard to train. In order to solve this problem, ResNet was proposed in that paper. Instead of optimizing the network layer by layer directly, it optimizes a residual mapping of a block of layers, so that the optimization process is much easier and a network can reach a very deep level. This again proves that deeper structures still have the potential to improve classification accuracy, as shown in Figure 2.2, which gives a comparison of network depths and the corresponding mean average precision (mAP) for some state-of-the-art methods.

Figure 2.2: Comparison of network depth and accuracy on state-of-the-art methods [36].


These structures were all originally designed for classification problems. As a result, these classical classification-oriented structures are largely used as backbones for other computer vision problems, such as detection, segmentation and image colouring.

2.2.2 CNNs for detection problems

Different from classification, detection is a more complicated case. The detection problem is a regression problem, which requires real-valued predictions. A detection task is to label objects in an image with three values: the most probable class, the bounding box, and the associated real-valued confidence. Inspired by the satisfactory performance of CNNs on classification, many researchers have proposed methods to adapt such deep neural network structures to this problem. It is not surprising that CNNs turned out to largely outperform the previous state-of-the-art method, the Deformable Part-based Model (DPM)[23].

Early CNN models for detection were largely domain-specific, such as Overfeat[3] and DetectorNet[37]. Overfeat[3] was the winner of the localization task of ILSVRC 2013; it applied sliding windows and multi-scale methods to input images, with integrated functions for classification, localization and detection. In this method, the network is first modified from Alexnet to be multi-resolution capable. Six scales of each image, ranging from 245 × 245 to 461 × 569, with 5 random crops, are used as input, leading to varying resolutions in the feature output layer (the fifth convolutional layer). On the other hand, the originally fully-connected layers in Alexnet are here treated as convolutional, so that multi-resolution inputs can be handled. The localization function is then added to the network by sharing the first five convolutional layers with the classification model (used as feature descriptor) and adding an extra regression layer, trained with an L2 loss, to predict the location of a bounding box. DetectorNet[37] adopts a similar architecture, except that it produces prediction masks of target objects instead of directly regressing the parameters of bounding boxes. For each detection, five binary masks (top, bottom, left, right and the whole mask of an object) are trained as an ensemble. The computational complexity was to some extent reduced by avoiding the sliding window method, but the CNNs were still domain-specific, and as the detection of each domain requires aggregating 5 models, the demand for storage is high. In conclusion, both Overfeat and DetectorNet take full advantage of the success of CNNs on classification and consider detection as a problem of classification plus localization, in which the localization part has to be domain-specific. In order to fulfill the detection task, they have to integrate multiple models from all of the classes.

DeepMultiBox[4] further improves these methods and makes the model class-independent by building the CNN in an inverse way: firstly, a class-independent localization model is trained to find possible regions of interest, and then the model classifies the regions into the most probable classes. In other words, it regards the detection problem as a pipeline of localization plus classification, so that the method guarantees good cross-class generalization. What's more, in order to reduce the complexity caused by sliding windows, prior matching is applied as a replacement.


Region CNN (R-CNN)[5] follows this localization-plus-classification pipeline. Firstly, the region proposal step is category independent and only around 2000 regions of interest (RoI) remain after selective search, which is far fewer than the number of regions searched in the sliding window method. Secondly, features are extracted by CNNs to predict bounding boxes. This is done by domain-specific fine-tuning on the Pascal VOC 2012 dataset, based on supervised pre-training on the ImageNet dataset (using Alexnet as backbone). Finally, the predicted bounding boxes are labeled through an SVM classifier.

This method achieved a mean average precision (mAP) of 53.7% on Pascal VOC 2010, almost double the result of the previous state-of-the-art methods. However, the method severely suffers from computational complexity, since training is a multi-stage pipeline that is expensive in space and time. Inference is also slow, as it may take 10 to 45 seconds to detect a single image. The method was then improved step by step in several aspects: using Spatial Pyramid Pooling to extract feature maps from the entire image instead of from each RoI[38], and integrating the SVM into the CNN by replacing it with a parallel softmax output layer (Fast R-CNN[39]). With Faster R-CNN[10], all the pipeline stages were integrated into one CNN and the computational complexity was largely reduced: a deep region proposal network (RPN) is introduced as a substitute for selective search, generating region proposals by sliding an n × n window over a shared feature map. It achieved state-of-the-art mAP on VOC 2007, 2010, and 2012. It is notable that the RPN shares its convolutional layers with Fast R-CNN. However, to reach this goal, a three-step training is adopted in the Faster R-CNN model:

1. Train the RPN for region proposals: fine-tune the RPN from an ImageNet pre-trained CNN. The outputs are proposals with binary (object/non-object) labels.

2. Train the detection network: fine-tune the Fast R-CNN detector from an ImageNet pre-trained CNN, with the RPN proposals as input.

3. Integrate the RPN into the detection network: fine-tune the RPN based on the Fast R-CNN detector.

The series of region-proposal based methods treat the detection problem as classification: classifying region proposals into the correct classes. Such methods exploit the fact that CNNs perform extremely well on classification problems rather than detection problems. However, in this way, end-to-end training is hard to realize. On the other hand, despite the improvements in speed, the models are still far from real-time detection. "You Only Look Once" (YOLO)[6] realized online detection, processing about 155 frames per second, by taking detection as a regression problem and directly predicting bounding boxes, corresponding classes and confidence levels within a single network. YOLO uses GoogLeNet as a backbone structure, replacing the final output layer with a regression layer. In this way, the model can be trained end-to-end, which benefits speed at the cost of accuracy.

"Single Shot MultiBox Detector" (SSD)[7] adopts a similar idea to YOLO and performs online detection with accuracy as high as Faster R-CNN. It further replaces the fully-connected layers and adds several extra feature layers. Multiple feature maps are extracted from different layers in order to deal with multi-scale detection. Although the model is still not good enough at small-object detection, its design appears more natural and functional.

YOLO was later improved, as shown in YOLO9000[9], which jointly trains on classification and detection data to cover a much larger set of objects. In particular, YOLO9000 uses joint training with Microsoft COCO and ImageNet and is reported to be able to identify 9000 classes, including ones unseen during training. However, discussion of the generalization ability for unseen classes is not included in our scope.


Chapter 3

Methodology

This chapter gives an introduction to the methodology of the thesis. Firstly, the datasets used for the experiments are introduced and some characteristics of the different datasets are discussed. Secondly, the technologies and strategies used in CNNs are covered, and then the detailed architectures of YOLO and SSD are shown and compared. Finally, it is important to define an evaluation method to give quantitative comparisons between different models. In this thesis, the evaluation criteria are defined according to the Pascal VOC challenge, which is described in detail in this chapter.

3.1 Datasets

In this section, the datasets used in our work are introduced: Pascal VOC[8] and the datasets extracted from video streams. The video datasets are clearly the target of our work, both for training and evaluation. However, we also choose Pascal VOC for the following reasons:

1. The models we choose, YOLO and SSD, are pre-trained on Pascal VOC. As our target is limited to humans, training on the same but reduced dataset (limited to human-related data) can help us understand the ability of the networks to learn a stand-alone task, compared with the pre-trained models.

2. Pascal VOC is larger, with more complete annotation. Deep learning methods benefit from large amounts of data. In [40], experiments show that CNNs have the ability to learn the information in a large dataset even when the dataset is randomly made up. Training on large datasets helps models learn more of the variance of the training patterns, which provides better generalization ability and reduces the risk of overfitting. Therefore, we also use Pascal VOC as a benchmark in our work. The differences between the two datasets are discussed in this section.


3.1.1 Pascal VOC

Pascal VOC[8] was an annual challenge and workshop from 2005 to 2012. Five challenges were included: classification, detection, segmentation, action classification and person layout. The challenge provides a publicly available dataset of images as well as corresponding annotations of reasonably high quality. In the classification and detection dataset, 20 classes of objects (including "person") are annotated. The main dataset for classification and detection is divided into training-validation (trainval) and test sets. The size of the dataset increased year by year; in Pascal VOC 2012[41], the total number of images reaches 11,540 with 31,561 objects, of which 4,087 images are annotated with the class "person", containing 10,129 person objects. Fortunately, Pascal VOC 2007[42] has also made its test set public, in which 2,007 images are annotated with "person" objects. Some examples of the Pascal VOC dataset are shown in Figure 3.1.

The annotations of the images are saved as XML files. In each file, the root nodes include folder, filename, source, size and object, and for each object the nodes include the name of its class, pose, truncation sign, difficulty and bounding box. The bounding boxes are annotated by the top-left and bottom-right corners of the objects, denoted as [xmin, ymin, xmax, ymax].
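As an illustration of this format, the following minimal sketch reads the "person" bounding boxes from one annotation file with Python's standard library; the file path in the example call is only an assumed placeholder.

```python
import xml.etree.ElementTree as ET

def load_person_boxes(xml_path):
    """Return [xmin, ymin, xmax, ymax] for every "person" object in one file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        if obj.findtext("name") != "person":
            continue  # skip the 19 non-person classes
        bb = obj.find("bndbox")
        boxes.append([int(float(bb.findtext(tag)))
                      for tag in ("xmin", "ymin", "xmax", "ymax")])
    return boxes

# Example call; the path below is hypothetical.
print(load_person_boxes("VOCdevkit/VOC2012/Annotations/example.xml"))
```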

It is important that the Pascal VOC challenge also provides a commonly used evaluation method for the detection problem, which is introduced in Section 3.5.

Figure 3.1: Examples of the Pascal VOC "person" dataset. The blue bounding boxes show the class "person".

3.1.2 Shopping and Office dataset


The annotations of the video streams are restricted to the specific human domain. The training dataset is developed from several videos of scenario tasks, recorded in two different supermarkets/convenience stores. In these videos the participants are required to walk around in the stores and do some shopping, with Tobii Pro Glasses 2 recording the whole process. Limited by the time available for annotation, and since temporal information is not assumed in our methods, we decided to extract only about 100-300 frames (with/without humans) from each video as training data. This amounts to 2,914 frames from 12 videos, of which 500 are left out for quantitative evaluation. In this thesis, we refer to the training dataset as the Shopping dataset. Figure 3.2 shows some examples of the training data.

We also have a test dataset for qualitative research. The test dataset consists of videos recorded with our team in an office environment, which ensures significant differences in scenes and participants between the training and testing data. Similarly, we refer to the test data as the Office dataset.

As can be seen, compared to the Pascal VOC dataset, the images are of high resolution but include difficult viewpoints and strong motion blur during fast head movement. The illumination also usually differs from Pascal VOC, since the videos are all taken indoors.


3.2 Convolutional Neural Networks (CNNs)

Goodfellow et al. [31] describe CNNs as "neural networks that use convolution in place of general matrix multiplication in at least one of their layers". Convolution layers are therefore of high importance in the structure. A typical convolutional layer includes a convolution stage, an activation function, and a pooling stage. An example of a convolutional layer is shown in Figure 3.3.

Figure 3.3: An example of convolutional layer (reprint from [31]).

In convolution, instead of connecting layers with simple weights over the whole matrix, the connections go through convolutional kernels. Applying small convolutional kernels to an image preserves spatial information, since the kernels only convolve locally. The kernels also act as feature extractors, and different kernels extract different information from the images. As a result, in lower convolutional layers the learned features are simple, such as lines and gradients, while in higher layers, by convolving through the layers, more complicated features such as textures and complex shapes can be learned. The convolution operation is usually followed by a non-linear activation function, which in CNNs is frequently the rectified linear unit (ReLU), given by f(x) = max(0, x), as stated in [43]. Since ReLU simply discards the negative responses of an input, it reduces the activations by about 50%. Pooling layers also play an important part in CNNs. A pooling stage is inserted between convolutional operations and downsamples the feature maps. A 2 × 2 pooling suppresses the size of a layer by 75%, keeping the network at a reasonable computational level[44].
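The two operations just described are small enough to verify numerically. The following sketch (a hypothetical NumPy illustration, not code from this thesis) shows ReLU discarding negative responses and a 2 × 2 max pooling reducing an 8 × 8 feature map to 4 × 4, i.e. by 75%.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative responses are discarded
    return np.maximum(0.0, x)

def max_pool_2x2(fmap):
    # Keep the maximum of each non-overlapping 2x2 block of an (H, W) map,
    # retaining 1 out of every 4 values (a 75% size reduction).
    h, w = fmap.shape
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.random.randn(8, 8)
print(max_pool_2x2(relu(fmap)).shape)  # -> (4, 4)
```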

Recall from Figure 2.1 that a typical classification CNN usually ends with one or several fully-connected layers, which flatten the convolutional maps into ordinary neural network layers. Adding fully-connected layers also reduces the dimension of the last few layers, making the network easier to train.

However, such networks are expensive to train: Alexnet, for example, contains 60 million parameters and 650,000 neurons, taking several days to train using cross-GPU parallelization[1].

3.3 Architecture

3.3.1 You Only Look Once (YOLO)

Before YOLO, the state-of-the-art methods were mainly developed as classifier-based systems, such as Fast R-CNN[39] and Faster R-CNN[10]. These methods divide the detection problem into a pipeline of region proposal extraction, bounding box regression, and class classification, which yields good accuracy. However, the multi-stage pipeline makes it hard for classifier-based methods to learn the required information directly from sample data (end-to-end training), and the detection speed is also hard to improve. The YOLO network[6] was inspired by the fact that, in the real world, human eyes can detect objects at one glance, so it is more natural to treat the problem as pure regression and directly predict bounding boxes as well as the corresponding classes and confidence levels with a single network. As a result, YOLO realized online detection, reported as processing about 155 frames per second.

The YOLO network is designed on the backbone of GoogLeNet[34]. As shown in Figure 3.4, the structure consists of 24 convolutional layers and 2 fully-connected layers.

Figure 3.4: Architecture of YOLO (reprint from [6]).

YOLO divides resized images (of fixed size 448 × 448) into S × S grid cells. Each grid cell is responsible for the bounding boxes whose centers fall at the location of the grid cell, and predicts B bounding boxes as well as confidence levels and class probabilities. For a dataset with C class labels, the output tensor is S × S × (C + B × 5). In the original paper, S, C and B are set to 7, 20 and 2 respectively, so that the output tensor is 7 × 7 × 30. As a result, the network can produce at most S × S × B predictions (98 if S = 7 and B = 2), which are filtered by a threshold on confidence levels followed by non-maximum suppression. The interpretation of the YOLO output vector is further discussed in Section 4.3.1.
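Non-maximum suppression is not specified further in this chapter, so as a reference the following sketch shows the standard greedy variant commonly used with detectors such as YOLO (a generic illustration, not the thesis implementation): keep the highest-scoring box and drop every remaining box that overlaps it beyond an IoU threshold.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on (N, 4) [xmin, ymin, xmax, ymax] boxes."""
    order = np.argsort(scores)[::-1]       # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the best box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]    # suppress heavily overlapping boxes
    return keep
```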


The loss function of YOLO is defined as:

$$
\begin{aligned}
\text{Loss} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2\right] && \text{(coordinate loss)} \\
&+ \lambda_{obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left(C_i - \hat{C}_i\right)^2 && \text{(confidence loss)} \\
&+ \lambda_{cls} \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2 && \text{(class loss)}
\end{aligned}
\tag{3.1}
$$

The training of YOLO adopts a winner-take-all strategy. As seen in the loss function, $\mathbb{1}_{ij}^{obj}$ denotes that the $i$th grid cell is responsible for the $j$th ground truth. In other words, for each object the network only penalizes the localization and confidence losses of one out of the B predictors, the winner. The predicted location is based on the center of the bounding box and its width and height, referred to as $[x_i, y_i, w_i, h_i]$. As small objects, with smaller widths and heights, should get a similar penalty to large objects, the two values in the loss function are normalized by taking the square roots of the width and height. As for confidence, the penalty has different scales for the "responsible" grid cells and the non-object cells. The different parts of the loss are weighted by the scales $\lambda_{coord}$, $\lambda_{obj}$, $\lambda_{noobj}$ and $\lambda_{cls}$, valued 5, 1, 0.5 and 1 respectively in the original paper.

It is notable that the output tensor is produced by a fully-connected layer instead of a convolutional layer. The output is a vector which is later reshaped manually according to the definition of the grid cells. In this way, information about feature position and object scale is not naturally implied in the output vector; instead, the network has to learn it by itself. Also, the network has no manual design for different object scales, in contrast to some other detection models such as SSD. These factors to some extent weaken the performance of the network.

The YOLO network was further inspired by SSD and improved as YOLO9000 [9]. However, this improved method is not included in the experiments of our thesis; some discussion is included in Section 5.2.2.

3.3.2 Single Shot MultiBox Detector (SSD)

YOLO realized online testing by using direct regression for predictions in the detection problem, at the cost of accuracy. SSD[7] follows this idea; furthermore, inspired by [45], SSD adopts a fully-convolutional structure, which ensures that the output layers naturally imply location information. Compared to YOLO, another major advantage of SSD is a structure designed to handle multi-scale prediction, shown in the following aspects:

1. Multi-layer feature maps: predictions are produced from feature maps at several depths of the network, so that different layers handle objects of different scales.


2. Offset prediction

As stated before, similar to YOLO, SSD also divides images to grid cells on feature and output maps. However, instead of predicting bounding boxes directly, SSD predictions are based on “default boxes” (similar to anchors in Faster R-CNN[10]).

Each grid cell is assigned multiple default boxes with different scales and aspect ratios, which provide rough information about the possible shapes and scales of objects, so that predicting offsets is more accurate than predicting the bounding boxes themselves.

One scale is assigned to each feature layer. As the scales are manually designed, the network does not require them to be identical to the actual receptive field of each layer; only the center positions of the grid cells should correspond to the centers of the receptive fields. The scales are evenly spaced between the layers. For a network using m feature maps with scales within the range [smin, smax] ([0.2, 0.9] in the original paper), the scale of each feature map is calculated by:

$$ s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m] \tag{3.2} $$

On the other hand, each layer is assigned several aspect ratios, designed as $a_r \in \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$. Among these, the aspect ratio 1 is applied to every feature map with two different scales, $s_k$ and $\sqrt{s_k s_{k+1}}$, while $\{2, \frac{1}{2}\}$ and $\{3, \frac{1}{3}\}$ are given in pairs to different layers. The width and height of a default box are calculated as $s_k\sqrt{a_r}$ and $s_k/\sqrt{a_r}$ respectively. As a result, the number of default boxes at each scale equals the number of aspect ratios plus one. An example of default boxes at a feature layer with aspect ratio set $\{1, 2, \frac{1}{2}\}$ is shown in Figure 3.5.

Figure 3.5: An example of SSD default boxes.
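Putting Equation (3.2) and the aspect ratio rules together, the following sketch (an illustration under the stated [0.2, 0.9] defaults, not the SSD reference code) computes the relative widths and heights of the default boxes on a given feature layer, complementing Figure 3.5.

```python
import math

def default_box_shapes(k, m=6, s_min=0.2, s_max=0.9, ratios=(1, 2, 0.5)):
    """Relative (width, height) of each default box on feature layer k of m."""
    s_k  = s_min + (s_max - s_min) / (m - 1) * (k - 1)   # Eq. (3.2), layer k
    s_k1 = s_min + (s_max - s_min) / (m - 1) * k         # scale of layer k+1
    # Extra box: aspect ratio 1 at the intermediate scale sqrt(s_k * s_{k+1})
    shapes = [(math.sqrt(s_k * s_k1), math.sqrt(s_k * s_k1))]
    for ar in ratios:
        shapes.append((s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
    return shapes  # len(ratios) + 1 boxes per grid cell

for k in range(1, 7):
    print(k, [(round(w, 2), round(h, 2)) for w, h in default_box_shapes(k)])
```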

With the design of default boxes, SSD gives more accurate results. However, as the centers of the default boxes have to coincide with the original receptive fields, the input image size of SSD has to be fixed according to the network design. The SSD paper proposes two VGG-based networks whose input sizes are 300 × 300 (SSD300) and 512 × 512 (SSD512). The two networks integrate 6 and 7 feature layers respectively.

SSD also adopts a different strategy for training and loss calculation. As a result of the default-box-based prediction, the labels also record the offsets from the default boxes. On the other hand, rather than using a binary "winner" sign per grid cell as YOLO does, SSD assigns the Intersection over Union (IoU) between ground truth and default boxes as both class probability and confidence level. In this way, multiple hypotheses are kept for each object.


The SSD loss is designed as a multi-task objective. Different from YOLO, in which all losses are calculated with the L2 norm, SSD uses different types of losses for the classification part and the localization part: a softmax loss and a smooth L1 loss respectively:

$$ L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) $$

$$ \text{Softmax loss:} \quad L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \quad \text{where } \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)} $$

$$ \text{Smooth L1 loss:} \quad L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right) \tag{3.3} $$

The softmax loss gives a better classification result when the classes are independent of each other, which is exactly what we assume in this detection problem. Further explanation of the loss functions is included in [7]. It is also worth mentioning that SSD adds a "background" class to the target classes; in this way, redundant bounding boxes can be suppressed into the extra class with no harm to accuracy. Also, SSD adopts a "hard negative mining" strategy, in which the negative-to-positive ratio of the training samples is kept at 3:1 (see the sketch after Figure 3.6). A comparison of the architectures of SSD and YOLO is shown in Figure 3.6.

Figure 3.6: Comparison of architectures between SSD and YOLO (reprint from [7]).

In the YOLO structure, the output layer is a 7 × 7 grid cell map with 30 channels, where 30 is related to the class size.
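As a concrete reading of the hard negative mining strategy mentioned above, the following sketch (a simplified per-image illustration, not the SSD reference code) keeps all positive default boxes and only the highest-loss negatives, at a 3:1 negative-to-positive ratio.

```python
import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    """Select training samples: all positives plus the hardest negatives.

    conf_loss   -- (N,) confidence loss of each default box
    is_positive -- (N,) boolean mask of boxes matched to a ground truth
    """
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    neg_loss = np.where(is_positive, -np.inf, conf_loss)  # exclude positives
    hardest = np.argsort(neg_loss)[::-1][:num_neg]        # top-loss negatives
    keep = is_positive.copy()
    keep[hardest] = True
    return keep  # boolean mask of boxes contributing to the loss
```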

3.4 Transfer learning and Fine-tuning

Transfer learning refers to reusing knowledge learned on one task for a different but related task. A pre-trained network can either be used directly as a feature extractor, or be partially re-trained. In the latter case, fine-tuning restores certain weights from a pre-trained model as initialization, and trains the network with different data or for different tasks. Fine-tuning is believed to be beneficial for shortening the training time, speeding up convergence and increasing generalization[47].

Detection CNNs are to some extent a transfer from classification models, as detection models are usually developed on top of pre-trained classification models. The classification models are trained on large datasets, such as ImageNet, and the trained weights are a good initialization for detection models; only the modified layers need to be initialized with random values. Many detection models adopt this fine-tuning strategy: for example, Overfeat[3], DetectorNet[37] and Fast R-CNN use AlexNet[1] as the pre-trained model, YOLO uses GoogLeNet[34], and SSD uses VGG16[33] as backbone.

In our work, two datasets are used for training: Pascal VOC 2012 and the Shopping dataset. As introduced in Section 3.1, Pascal VOC is the larger dataset with more complete annotation. It can be beneficial to our model, but our final target is detection in video streams. As a result, we choose a two-step fine-tuning strategy: first we fine-tune our models on the human data in Pascal VOC, and then fine-tune on the Shopping dataset. In the first step, the last layer of the pre-trained YOLO and SSD models is modified to fit the "person" class design and the last-layer weights are replaced with random initialization, as sketched below. In the second step, the structures are kept from the first step and only the dataset is changed to the Shopping dataset. A detailed introduction to the experiments is given in Section 4.1.
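A minimal sketch of this weight-restoring step in TensorFlow 1.x (the framework used in this thesis) is given below; the checkpoint path and the "output/" scope name for the re-designed last layer are assumptions for illustration, as the real names depend on the model definition.

```python
import tensorflow as tf  # TensorFlow 1.x API

# Restore every pre-trained weight except the re-designed last layer,
# which keeps its random initialization. "output/" is a hypothetical scope.
restore_vars = [v for v in tf.global_variables()
                if not v.name.startswith("output/")]
saver = tf.train.Saver(var_list=restore_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())         # random init everywhere
    saver.restore(sess, "checkpoints/pretrained.ckpt")  # overwrite all but output/
```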

3.5 Evaluation of accuracy

The performance is analyzed on both the Pascal VOC 2007 test data and the video datasets. Different CNN models are implemented, and experiments on accuracy and computational complexity are carried out. As stated in Section 3.1.1, the accuracy evaluation criteria are identical to the criteria of the Pascal VOC Challenge[8], which are commonly used in recent research.

The performance of a detection model is estimated by whether the detected bounding boxes match the corresponding ground truths, using precision and recall. In the detection problem, the definitions of precision and recall are similar to those in classification. Precision evaluates the fraction of true positive bounding boxes among all retrieved predictions, calculated by:

$$ \text{precision} = \frac{\text{true positive predictions}}{\text{all predictions}} \tag{3.4} $$

Recall evaluates the fraction of true positive detected bounding boxes among all relevant ground truths, calculated by:

$$ \text{recall} = \frac{\text{true positive predictions}}{\text{all ground truths}} \tag{3.5} $$

To provide information about which threshold to select in the models, which is of high importance in practical applications, we also show a threshold curve for each model. From a threshold curve it can easily be read what precision and recall the model reaches at a given threshold level. An example of a precision-recall curve, as well as the relationship between the evaluated accuracy and the threshold, is shown in Figure 3.7. The threshold curves are mostly included in Appendix A.

Figure 3.7: An example of evaluation results: (a) a precision-recall curve; (b) a threshold curve. In the threshold curve, the x-axis indicates different threshold levels, and the y-axis shows what precision and recall rate the model reaches at a particular threshold. The red dot marks the threshold where precision = recall.

In this method, the correctness of a detection is defined by the overlapping area of a prediction and a ground truth, calculated using the intersection over union (IoU):

$$ IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \tag{3.6} $$
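The following sketch ties Equations (3.4)-(3.6) together (a simplified illustration rather than the full Pascal VOC protocol, which also ranks predictions by confidence and handles "difficult" objects): a prediction counts as a true positive if it claims an unmatched ground truth with IoU at or above 0.5.

```python
def iou(bp, bgt):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes, Eq. (3.6)."""
    iw = max(0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))
    ih = max(0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))
    inter = iw * ih
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(bp) + area(bgt) - inter)

def precision_recall(predictions, ground_truths, iou_thresh=0.5):
    """Eqs. (3.4) and (3.5) with greedy one-to-one matching of boxes."""
    matched, tp = set(), 0
    for bp in predictions:                  # ideally sorted by confidence
        for j, bgt in enumerate(ground_truths):
            if j not in matched and iou(bp, bgt) >= iou_thresh:
                matched.add(j)              # each ground truth matched once
                tp += 1
                break
    return tp / len(predictions), tp / len(ground_truths)
```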


Chapter 4

Experiments and results

4.1 Brief introduction to experiments

As stated in Section 3.1.2, unlike the Pascal VOC data, the Shopping dataset is manually annotated with only one class, "person", and lacks information about other objects. Meanwhile, due to the motion blur caused by fast head movement during recording, it is also valuable to improve the performance of the models on blurred data. Therefore, we design the experiments around the following aspects:

1. The generalization ability of the models on video data with strong motion blur. As human detection is included in the object detection challenge, it is possible that the models can to some extent handle data from Tobii Pro Glasses 2 videos, despite the differences between the two datasets. Therefore, we first evaluate the performance of the pre-trained models on both datasets. The evaluated models are YOLO, SSD300 (SSD with input size 300 × 300) and SSD512 (SSD with input size 512 × 512).

2. Fine-tuning models with Pascal VOC 2012 dataset.

As can be seen in Section 3.1, Pascal VOC has a larger set of images with more complete annotation. In contrast, although the Shopping dataset is extracted from 12 videos with 2,914 frames, this amount can still be insufficient for a deep learning network, not to mention that the annotations are limited to human objects. Taking advantage of the fine-tuning strategy introduced in Section 3.4, we first fine-tune our models on the "person" related data in Pascal VOC.

We also use data augmentation during the training phase. Data augmentation is largely used to increase the size of a dataset; typical methods include image cropping, flipping, rotating, etc. Data augmentation has also been reported to increase the performance of models and help generalization [48][1]. In this thesis, data augmentation not only enlarges the dataset, but also helps narrow the differences between Pascal VOC and the video data. Aside from image cropping and flipping, we also simulate motion blur in the images. As the videos are recorded by a head-mounted device that moves in all directions, we adopt 4 motion blur directions: up-down, left-right, and the two diagonals. The ratio of original : flipped : blurred : cropped images is approximately 2 : 2 : 2 : 1.

Furthermore, experiments are designed to test several detailed factors of the networks, according to the architectures and observed results.


3. Fine-tuning on Shopping dataset.

The last step we adopt is fine-tuning on the Shopping dataset. In this way, we can find out how the different data influence the networks. In particular, the generalization abilities of the models to the other dataset are evaluated.

The work is implemented with TensorFlow 1.0.1; NumPy and OpenCV2 are also required for the training and inference phases. The evaluation and further experiments, including the K-means implementation, are based on SciPy 0.19.0. The experiments are conducted on Windows 10, with an Intel(R) Xeon(R) CPU E5-1650 v3, 16.0 GB RAM and a GeForce GTX 1080 GPU.

4.2 Generalization ability of YOLO and SSD on Shopping dataset

In Section 3.1 we discussed the datasets used in our experiments, where it became clear that the data from video streams differ in resolution, aspect ratio, illumination, motion blur, etc. However, as human detection is a common sub-field of the detection problem, and Pascal VOC is also annotated with the class "person", it is interesting to know how detection models trained on Pascal VOC generalize to data extracted from videos with motion blur. In this section, we evaluate the generalization ability on the Shopping dataset.

Figure 4.1 shows the evaluation results of 3 pre-trained models–SSD300, SSD512 and YOLO–on Pascal VOC 2007 test data and Shopping dataset respectively.

Figure 4.1: Precision-recall curves of the 3 pre-trained models: (a) on the Pascal VOC 2007 test set; (b) on the Shopping dataset. All models are evaluated on the class "person". The results on Pascal VOC 2007 differ slightly from the accuracy reported in the original papers (79.4% for SSD300, 82.5% for SSD512 and 63.5% for YOLO). This may be caused by differences in implementation; the underlying reasons require further discussion.


During the evaluation of YOLO, it is easily noticed that predictions of the human class are quite rare; the maximum recall rate is only 55.2%.

4.3 Fine-tuning on Pascal VOC 2012 human data

4.3.1 Fine-tuning YOLO

The idea of YOLO is to extract features with a deep network trained with a regression output layer. As stated in Section 3.3.1, the output of the YOLO network is a vector of size S × S × (C + B × 5), initially 7 × 7 × (20 + 2 × 5) in [6], which amounts to a vector of length 1,470. The layer is designed as a regression for the bounding box prediction. As the output layer comes after a fully-connected layer, the interpretation of location has to be learned through training. In the code, the first S × S × C values are reshaped into C class probability maps; they are followed by S × S × B confidence values, and the last S × S × B × 4 values predict the object locations, where the 4 maps hold {xcenter, ycenter, w, h} respectively. The values are normalized to the range [0, 1]. The interpretation of the output layer is shown in Figure 4.2.

Figure 4.2: YOLO output layer.
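The slicing described above can be written down directly; the sketch below (a hypothetical NumPy illustration with a random stand-in for the real network output) splits the 1,470-dimensional vector into class maps, confidences and box parameters.

```python
import numpy as np

S, B, C = 7, 2, 20                            # grid size, boxes per cell, classes
out = np.random.rand(S * S * (C + B * 5))     # stand-in for the 1,470-d output

class_probs = out[:S * S * C].reshape(S, S, C)                 # class maps
confidences = out[S * S * C:S * S * (C + B)].reshape(S, S, B)  # per-box confidence
boxes = out[S * S * (C + B):].reshape(S, S, B, 4)  # (x_center, y_center, w, h)

# Class-specific score per box, to be thresholded and fed to NMS
scores = confidences[:, :, :, None] * class_probs[:, :, None, :]  # (S, S, B, C)
print(scores.shape)
```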

Figure 4.3: Evaluation of the fine-tuned YOLO models: (a) on the Pascal VOC 2007 test set; (b) on the Shopping dataset.

The training is a fine-tuning of the pre-trained 20-class YOLO network. We use stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer, and the dropout rate during training is set to 0.5. It turns out that not only the generalization to the Shopping dataset but also the detection accuracy on the Pascal VOC 2007 test set is worse than for the original model.

We also notice during YOLO training that it is hard to converge to a reasonably small loss level, and that the difference in loss between training and validation is large. A plot of the loss over the iterations is shown in Figure 4.4.

Figure 4.4: An example of YOLO training process.

As a result, we conclude that YOLO does not perform well when the number of classes is small. Moreover, as the difference between training and validation loss is huge, there could be potential overfitting on the training data.

4.3.2 Fine-tuning SSD

Due to the unsatisfactory results of YOLO, most of the following analyses in this thesis focus on the SSD models.

SSD predicts bounding boxes based on an integration of several feature layers. For each m × m × p feature layer (where m is the grid size and p is the number of channels), small convolutional kernels of size 3 × 3 × p produce the bounding box predictions, with (c + 4) × k kernels per location, where k is the number of default boxes of the layer and c is the number of target classes. Different from YOLO, the target classes include an extra background class, which accounts for locations where there is no object. An example of a prediction layer of SSD is shown in Figure 4.5.

Figure 4.5: An example of the convolutional operations on an SSD feature layer.

The design of the "background" class makes the prediction of class probability more robust when there is only one class in the training data. On the other hand, SSD also adopts the hard negative mining strategy in training: SSD selects the top negative bounding boxes as opponents to the positive samples, keeping the ratio at around 3:1, as negative samples form the vast majority in the detection problem. To see how the design of the "background" class and hard negative mining help to improve the robustness of the model, two types of data are used in our experiments. First we simply use only the "person" annotations in Pascal VOC 2012 (with data augmentation) to train the first SSD300. Later, we train another model which not only includes the "person" class, but also merges the annotations of all other classes into a class "other"; in other words, we use opponent-class training so that the model is able to keep the information from the ground truths of other objects.


Figure 4.6: SSD models using single class/opponent classes: (a) evaluation on Pascal VOC 2007; (b) evaluation on the Shopping dataset.

On the other hand, the training process shown in Figure 4.7 indicates that SSD is able to fine-tune a model in far fewer iterations than YOLO, benefiting from the location information preserved in the pre-trained models. Meanwhile, comparing the two SSD models, using opponent classes for training means several extra class layers compared to using only one class, so single-class training generally has a lower loss during training and takes only half the time of the opponent-class training to converge.

Figure 4.7: The training process of SSD models.


Figure 4.8: Precision and recall rate at different threshold levels on Shopping dataset.

From this figure, it is clear how the single-class model differs from the opponent-class model. Both models outperform the original model in recall rate but are less precise than the original model. Additionally, the opponent-class model has an advantage in precision while the single-class model has a better recall rate. This figure also shows the trade-off between precision and recall.

Some of the predictions are visualized and compared in Figure 4.9, in which the single-class model tends to produce more false detections than the opponent-class model at threshold level 0.4.

Figure 4.9: Samples of results from SSD single-class (first row) / opponent-class models (second row).

4.4 Further experiments on SSD


4.4.1 Simulating motion blur on Pascal VOC 2012

As stated in the previous experiment, both SSD300 models, trained with a single class and with opponent classes, generalize better on the Shopping dataset. However, the modifications to SSD300 include not only the change of classes in the output layers but also “blurring” in the data augmentation, so it is hard to tell which has the greater influence. In particular, as introduced in Section 3.1, the largest difference between the data from the video streams and Pascal VOC is that the former is more challenging, with difficult head poses, motion blur, etc. It is therefore reasonable to assume that simulating “blur” during training can help a model generalize to video data from a head-mounted camera.

In order to evaluate how “blur simulation” affects performance, we compare two SSD models, using the same fine-tuning strategy and opponent-class annotations. In the first model, we use the data augmentation from Section 4.1: one of four motion blur directions (up-down, left-right, and the two diagonals) is randomly selected and simulated. The ratio of original : flipped : blurred : cropped images is approximately 2 : 2 : 2 : 1. In the second model, the training strategy is the same, except that no blur simulation is included. Figure 4.10 shows the simulation of motion blur on the training dataset. During training, we first resize the images to fit the SSD model, and then apply motion blur with a kernel size of $(width_{image} + height_{image})/120$. In future work, the size and direction of the blurring could be varied more widely.

Figure 4.10: An example of simulated motion blur on a Pascal VOC 2012 image. (a) Original image; (b) horizontal; (c) vertical; (d) diagonal 1; (e) diagonal 2.
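A sketch of this blur simulation, assuming OpenCV is available: a normalized line kernel in one of the four directions is convolved with the resized image, with the kernel size derived from the image dimensions as described above.

```python
import cv2
import numpy as np

def simulate_motion_blur(image, direction="horizontal"):
    """Apply directional motion blur with kernel size (w + h) / 120."""
    h, w = image.shape[:2]
    size = max(3, int(round((w + h) / 120)))

    kernel = np.zeros((size, size), dtype=np.float32)
    if direction == "horizontal":
        kernel[size // 2, :] = 1.0
    elif direction == "vertical":
        kernel[:, size // 2] = 1.0
    elif direction == "diagonal1":
        np.fill_diagonal(kernel, 1.0)
    else:  # diagonal2: the anti-diagonal (fliplr returns a view, so this
        np.fill_diagonal(np.fliplr(kernel), 1.0)  # writes into kernel)
    kernel /= kernel.sum()

    return cv2.filter2D(image, -1, kernel)
```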


As shown in Figure 4.11a, the model fine-tuned with opponent classes achieves a similar mAP on the Pascal VOC 2007 test data, and tends to have a higher recall than precision. Fine-tuning with simulated blurring lowers the performance here, as also stated in the previous section. However, Figure 4.11b clearly shows that the simulation of motion blur helps the model generalize to the Shopping dataset, outperforming the model trained without motion blur by 4% mAP. Meanwhile, training with opponent classes also helps on the detection of the specific target, with a 2% mAP increase compared to the original model. As a result, simulating motion blur can be highly important for human detection on video data.

Figure 4.11: Simulating blurring on Pascal VOC 2012. (a) Evaluation on Pascal VOC 2007; (b) evaluation on Shopping dataset.

4.4.2 K-means on SSD default boxes

SSD clearly has a more complex design that provides prior knowledge about objects, and its better performance may result from such design choices. One of the important choices in SSD is the default boxes, which encode prior scale and aspect ratio information for objects. In this section we study the patterns of bounding boxes: how the SSD default boxes express the patterns of the target objects, and whether they may be improved by adjusting the parameters to the target datasets.

SSD defines different scales and aspect ratios for different feature maps. SSD300 uses 6 feature maps, and in the pre-trained model we found that only layers 2, 3 and 4 use the aspect ratio set {1, 2, 3, 0.5, 0.33}, while layers 1, 5 and 6 use only the set {1, 2, 0.5}.

As SSD manually defines default boxes as the basis of prediction, it is reasonable to assume that the aspect ratios at different scales affect the performance. Fu et al. [49] apply K-means clustering to the training boxes. K-means is an unsupervised learning method which iteratively clusters data into K sets according to similarity [50]. In [49], the aspect ratios of the ground truths are clustered into K = 7 sets, and the aspect ratio resulting from K-means, 1.6, is added to the default box design. The aspect ratios are clustered regardless of bounding box scale, and the aspect ratio set {1, 2, 3, 0.5, 0.33, 1.6} is used in every layer.

However, as the default boxes at different feature layers are independent, it is interesting to know whether aspect ratios show different patterns at different scales. Meanwhile, as stated in Section 3.1, the characteristics of our dataset differ in various ways from those of the dataset used in the pre-trained model. The aspects that affect the aspect ratio include:


1. SSD resizes every input image to a square, which changes the ratio of width to height. In the Pascal VOC dataset, the aspect ratios of the images are not identical, while those of the images in our videos are fixed at 1,920 × 1,080.

2. As humans are the only targets we are interested in, the bounding boxes should usually be taller than they are wide, despite exceptions caused by different viewpoints and poses.

As scales in SSD models are evenly spaced among layers, we first assign the scale of a bounding box to the closest scale defined by Equation 3.2. The scale of a bounding box is calculated as

$$\mathrm{scale} = \sqrt{\frac{width_{bbox} \times height_{bbox}}{width_{image} \times height_{image}}}. \tag{4.1}$$

In our work, the K-means algorithm is run for each scale, and the cluster size K for each scale is decided by trial and error.
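A sketch of this procedure, assuming the ground-truth box and image sizes are available as (width, height) arrays and using scikit-learn's KMeans for the clustering step; the scale set and cluster sizes are those quoted below:

```python
import numpy as np
from sklearn.cluster import KMeans

LAYER_SCALES = [0.1, 0.27, 0.44, 0.61, 0.78, 0.95]   # SSD300 scales (Eq. 3.2)
CLUSTERS_PER_SCALE = [6, 6, 6, 4, 3, 1]              # chosen by trial and error

def kmeans_aspect_ratios(bbox_wh, image_wh):
    """Cluster box aspect ratios separately for each SSD layer scale."""
    bbox_wh, image_wh = np.asarray(bbox_wh), np.asarray(image_wh)

    # Equation 4.1: relative scale of each ground-truth box.
    scales = np.sqrt((bbox_wh[:, 0] * bbox_wh[:, 1]) /
                     (image_wh[:, 0] * image_wh[:, 1]))
    aspect_ratios = bbox_wh[:, 0] / bbox_wh[:, 1]

    # Assign each box to the closest layer scale.
    layer = np.argmin(np.abs(scales[:, None] - np.array(LAYER_SCALES)), axis=1)

    centers = {}
    for i, s in enumerate(LAYER_SCALES):
        ratios = aspect_ratios[layer == i]
        if len(ratios) >= CLUSTERS_PER_SCALE[i]:
            km = KMeans(n_clusters=CLUSTERS_PER_SCALE[i], n_init=10)
            km.fit(ratios.reshape(-1, 1))
            centers[s] = np.sort(km.cluster_centers_.ravel())
    return centers
```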

K-means on the original scale design in [7]

The original scale design is stated in Section 3.2: the scale set with bounds [0.1, 0.95] in SSD300 is [0.1, 0.27, 0.44, 0.61, 0.78, 0.95]. The method is first run on the Pascal VOC 2012 trainval dataset. Based on our experiments, the cluster sizes for the 6 feature layers of SSD300 are set to [6, 6, 6, 4, 3, 1]. The result is shown in Figure 4.12.

Figure 4.12: K-means result of aspect ratios at different scales for the Pascal VOC 2012 trainval dataset. (a) All bounding boxes; (b) bounding boxes for “person”.


Scale   Aspect ratios (percentage of data)
0.09    0.32 (10.76%)   0.66 (8.13%)   1.09 (5.14%)   1.71 (2.70%)   2.63 (0.98%)   4.42 (0.19%)
0.27    0.37 (11.12%)   0.77 (7.54%)   1.31 (3.53%)   2.15 (1.39%)   3.79 (0.37%)   7.67 (0.05%)
0.44    0.41 (7.53%)    0.75 (5.22%)   1.19 (3.08%)   1.88 (1.49%)   2.96 (0.52%)   4.66 (0.18%)
0.61    0.58 (6.85%)    1.07 (4.00%)   1.76 (2.08%)   2.60 (0.91%)
0.78    0.73 (4.64%)    1.10 (3.52%)   1.56 (2.01%)
0.95    1.01 (6.08%)

Table 4.1: K-means result of aspect ratios at different scales for the Pascal VOC 2012 trainval dataset, run for all bounding boxes.

Scale   Aspect ratios (percentage of data)
0.09    0.23 (8.73%)    0.35 (7.42%)   0.52 (6.37%)   0.71 (3.62%)   0.99 (1.45%)   1.51 (0.41%)
0.27    0.26 (10.54%)   0.45 (9.08%)   0.71 (4.41%)   1.16 (1.28%)   2.06 (0.22%)   4.28 (0.04%)
0.44    0.33 (8.43%)    0.50 (7.90%)   0.73 (3.85%)   1.07 (1.19%)   1.66 (0.32%)   2.66 (0.09%)
0.61    0.48 (8.11%)    0.74 (4.80%)   1.20 (1.00%)   2.15 (0.18%)
0.78    0.68 (4.33%)    0.98 (2.02%)   1.41 (0.55%)
0.95    0.98 (3.64%)

Table 4.2: K-means result of aspect ratios at different scales for the Pascal VOC 2012 trainval dataset, run for “person” bounding boxes. Numbers in bold show K-means results with aspect ratios within 0.71 ± 0.03.

As can be seen in Tables 4.1 and 4.2, besides the aspect ratios pre-defined in SSD, i.e. {1, 2, 0.5, 3, 0.33}, the aspect ratio 0.71 ± 0.03 is also common: it appears in 5 of the 6 layers in the “person” data, accounting for 21.01% of the bounding boxes.

Turning to the Shopping dataset, the K-means clustering result is shown in Figure 4.13 and Table 4.3. Since the Shopping dataset is smaller and less diverse, the cluster sizes are changed to [4, 4, 3, 2, 2, 1].


Scale   Aspect ratios (percentage of data)
0.09    0.23 (11.66%)   0.37 (6.59%)   0.54 (4.66%)   0.81 (0.93%)
0.26    0.20 (12.99%)   0.33 (7.83%)   0.49 (3.90%)   0.87 (0.24%)
0.42    0.23 (16.39%)   0.35 (8.04%)   0.54 (3.50%)
0.58    0.32 (13.04%)   0.45 (6.33%)
0.74    0.48 (3.21%)    0.61 (0.67%)
0.91    0.70 (0.05%)

Table 4.3: K-means result of aspect ratios at different scales for the Shopping dataset. Numbers in bold show K-means results with aspect ratios within 0.2 ± 0.03.

It is obvious that most of the aspect ratios are smaller than 1, i.e. the bounding boxes are taller than they are wide. This phenomenon has two main causes:

1. Similar to the Pascal VOC data, a person as an object mostly has a small aspect ratio.
2. The resolution of our videos is fixed at 1,920 × 1,080. As SSD resizes the images to a square, the aspect ratios of the bounding boxes become even lower.

On the other hand, the results show that large-scale objects are very rare in the Shopping dataset. Also, according to Table 4.3, 0.2 is another candidate default-box aspect ratio for the lower layers.

From the K-means results it is also notable that smaller scales may require more aspect ratios to cover the diversity, while at higher layers aspect ratios become less important. As a result, we suggest that it may be helpful to use {1, 2, 0.5, 3, 0.33, 0.2, 0.7} at the first 3 layers, and {1, 2, 0.5, 0.33, 0.7, 0.2}, {1, 0.2, 0.7} and {1, 0.7} at the last 3 layers, as sketched below.
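Expressed as a per-layer configuration, the suggestion would look as follows; this is a sketch only, as how these ratios enter the default-box generator depends on the SSD implementation at hand:

```python
# Suggested per-layer aspect ratio sets derived from the K-means analysis.
SUGGESTED_ASPECT_RATIOS = [
    [1, 2, 0.5, 3, 0.33, 0.2, 0.7],   # layer 1
    [1, 2, 0.5, 3, 0.33, 0.2, 0.7],   # layer 2
    [1, 2, 0.5, 3, 0.33, 0.2, 0.7],   # layer 3
    [1, 2, 0.5, 0.33, 0.7, 0.2],      # layer 4
    [1, 0.2, 0.7],                    # layer 5
    [1, 0.7],                         # layer 6
]
```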

K-means on the scale set used in our experiments

In line with the core code implemented by [51], this thesis uses the scale set [0.07, 0.15, 0.33, 0.51, 0.69, 0.87]. These scales are not evenly spaced; the lower layers are given denser scales. This is believed to perform better, as Table 4.1 also shows that bounding boxes concentrate at smaller scales, with 27.9% of the bounding boxes at scale 0.09 and 24% at scale 0.27. The effect is even more obvious in the Shopping dataset, where less than 5% of the bounding boxes have a scale larger than 0.63, as shown in Table 4.3.

Therefore, we carry out the same experiment on the new scale set. The K-means results on the Pascal VOC 2012 trainval dataset and the Shopping dataset are shown in Figures 4.14 and 4.15 and Tables 4.4 to 4.6.


Figure 4.14: K-means result of aspect ratios at different scales for the Pascal VOC 2012 trainval dataset. (a) All bounding boxes; (b) bounding boxes for “person”.

Scale   Aspect ratios (percentage of data)
0.07    0.28 (7.41%)    0.56 (4.52%)   0.98 (2.38%)   1.66 (1.00%)   2.96 (0.29%)   6.51 (0.02%)
0.15    0.30 (10.80%)   0.60 (7.31%)   1.00 (3.67%)   1.67 (1.59%)   2.68 (0.42%)   4.43 (0.09%)
0.33    0.35 (12.66%)   0.70 (7.94%)   1.29 (2.95%)   2.39 (0.79%)   4.55 (0.22%)   10.20 (0.01%)
0.51    0.49 (11.85%)   0.98 (4.92%)   1.83 (1.36%)   3.41 (0.52%)
0.69    0.63 (6.17%)    1.11 (3.17%)   1.84 (1.35%)
0.87    1.00 (6.59%)

Table 4.4: K-means result of aspect ratios at different scales for the Pascal VOC 2012 trainval dataset, run for all bounding boxes.

Scale   Aspect ratios (percentage of data)
0.07    0.23 (4.62%)    0.35 (3.97%)   0.52 (3.55%)   0.73 (1.80%)   1.03 (0.65%)   1.54 (0.15%)
0.15    0.23 (7.12%)    0.37 (6.11%)   0.54 (4.53%)   0.72 (2.73%)   1.04 (1.12%)   1.68 (0.29%)
0.33    0.28 (11.41%)   0.48 (8.62%)   0.73 (4.02%)   1.19 (1.28%)   2.25 (0.23%)   4.28 (0.04%)
0.51    0.41 (11.74%)   0.67 (7.35%)   1.11 (1.44%)   2.25 (0.19%)
0.69    0.58 (7.00%)    0.91 (2.80%)   1.50 (0.67%)
0.87    0.94 (6.57%)

Table 4.5: K-means result of aspect ratios at different scales for the Pascal VOC 2012 trainval dataset, run for “person” bounding boxes. Numbers in bold show K-means results with aspect ratios within 0.7 ± 0.03.


Figure 4.15: K-means result of aspect ratios at different scales for Shopping dataset.

Scale   Aspect ratios (percentage of data)
0.07    0.24 (3.53%)    0.37 (2.72%)   0.51 (1.33%)   0.80 (0.10%)
0.15    0.22 (12.71%)   0.36 (6.78%)   0.53 (3.31%)   0.83 (0.71%)
0.33    0.21 (17.80%)   0.33 (10.44%)  0.54 (4.15%)
0.51    0.30 (23.12%)   0.50 (4.02%)
0.69    0.42 (6.13%)    0.53 (2.98%)
0.87    0.67 (0.16%)

Table 4.6: K-means result of aspect ratios at different scales for the Shopping dataset. Numbers in bold show K-means results with aspect ratios within 0.2 ± 0.04.

Aspect ratio 0.7 ± 0.03 is still frequent in the “person” data of Pascal VOC 2012, as shown in Table 4.5, accounting for 15.9% of the data at scales from 0.07 to 0.51. Similarly, in the Shopping dataset, aspect ratio 0.2 remains a strong candidate in the first 3 feature layers. As a result, we suggest that adding these two aspect ratios to the corresponding feature maps has the potential to improve SSD.

4.4.3 SSD performance on different scales and locations of objects

SSD has explicit definitions of scales and aspect ratios, which give it better performance across object scales than YOLO. However, its ability still varies with scale and location. In this experiment, we plot the centers of the ground truth bounding boxes as well as their scales, normalized to [0, 1]. We then evaluate the SSD models and record the positive detections, i.e. the predictions matching a ground truth with IoU > 0.5, together with their confidence levels. Ground truths that are not detected are assigned confidence level 0.0.
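A sketch of this matching step, assuming boxes are given as (xmin, ymin, xmax, ymax) tuples; all names are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def confidence_per_ground_truth(gt_boxes, pred_boxes, pred_confs, thresh=0.5):
    """Best matching confidence for each ground truth; 0.0 if undetected."""
    confs = []
    for gt in gt_boxes:
        matched = [c for p, c in zip(pred_boxes, pred_confs)
                   if iou(gt, p) > thresh]
        confs.append(max(matched, default=0.0))
    return confs
```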


Figure 4.16: SSD prediction confidences at different locations in the images. (a) On Pascal VOC 2007 test; (b) on Shopping dataset.
