Machine Learning can Reduce False Alarms when Detecting Humans in Surveillance Systems

NICLAS SVÄRDLING

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science
Date: June 18, 2019
Supervisor: Mats Nordahl
Examiner: Olof Bälter
School of Electrical Engineering and Computer Science
Host company: Stanley Security


Abstract

Video surveillance is becoming increasingly common, but with the increased number of cameras comes an increased demand for human operators to watch and issue alarms when needed. If these video feeds are left unsupervised they will be of little use in a moment of need and would only be used to find out what has happened long after it already happened. For a surveillance company simply hiring more operators to watch video feeds to meet the growing demands is not a sustainable approach.

Instead, in this thesis machine learning approaches to perform human detection in a surveillance feed are investigated and two different implementations are developed, one based on convolutional neural networks and one based on the less complicated support vector machines. Since this human detection needs to take place in real time, speed is an important concern, and so is the accuracy of the detections. If the models cannot be relied upon to provide accurate alarms, there could be serious consequences if they were ever put into use.

It was found that the implementation based on convolutional neural networks had a better accuracy in terms of classifying people as people and also not falsely classifying other objects as people. The convolutional neural network in this thesis had an accuracy of 88.4% on positive video examples while the support vector machine only had an accuracy of 63.2% on the same test samples.


Sammanfattning

Video surveillance has become increasingly common, and with the growing number of cameras comes a growing demand for human operators who can watch the cameras and raise alarms when needed. If the surveillance cameras were left unmonitored they would not be useful in emergencies, but could only be used to find out what has happened long after it already happened. Trying to meet the growing demands by hiring more and more operators is not sustainable for the security companies.

Therefore, in this degree project machine learning methods for performing human detection have been investigated and two different models have been developed. One implementation is based on support vector machines and the other on convolutional neural networks. Since the human detection must be able to take place in real time, speed is an important factor, as is the accuracy. If the models cannot give accurate alarms it can lead to serious consequences when the system is put to use.

The result was that the implementation based on convolutional neural networks had better accuracy, both at classifying humans as humans and at not classifying non-humans as humans. It had an accuracy of 88.4% on positive examples, while the support vector machine had an accuracy of 63.2% on the same test video clips.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Goal
  1.3 Limitations
  1.4 Ethical and Societal Aspects

2 Background
  2.1 Machine Learning
  2.2 Computer Vision
  2.3 Support Vector Machines
  2.4 Neural Networks
    2.4.1 Convolutional Neural Networks
    2.4.2 Batch Normalization
    2.4.3 Activation Functions
    2.4.4 Pooling
    2.4.5 Fully Connected layer
    2.4.6 Loss Functions
    2.4.7 Residual Neural Networks
    2.4.8 Single Stage and Two Stage Detectors
  2.5 Object databases for Detection
  2.6 Data Augmentation
  2.7 Related Work

3 Methods
  3.1 SVM Implementation
  3.2 Training Data for the CNN
    3.2.1 Annotation Method
    3.2.2 Anchor Box Generation
  3.3 Technical Implementation
  3.4 Network Architecture
    3.4.1 Implemented Loss Function
  3.5 Evaluation of the Models

4 Results
  4.1 CNN Results
    4.1.1 Training the Model
    4.1.2 CNN Accuracy
    4.1.3 CNN Speed
  4.2 SVM Results
    4.2.1 SVM Accuracy
    4.2.2 SVM Speed
  4.3 YOLOv3 Results
    4.3.1 YOLOv3 Accuracy
    4.3.2 YOLOv3 Speed

5 Discussion
  5.1 Accuracy of the Developed Models
  5.2 Speed of the Developed Models
  5.3 The Developed Models Compared to YOLOv3

6 Conclusions

1 Introduction

Safety and security systems in the form of surveillance cameras are commonplace today and are only increasing in number as time goes on. As an example, in 2017, 176 million surveillance cameras were in operation in China, and by 2020 an estimated 626 million cameras are expected to be in use [1]. That corresponds to more than a threefold increase over three years. These systems are used by individuals for private security, by companies and by the police. For these cameras to be useful in real time, there usually needs to be a human observing the video stream to flag any abnormal activity that gives cause for concern. This approach has limitations, as humans have limited abilities when it comes to monitoring multiple surveillance streams. There are also limits to how many employees a company can hire to watch video feeds, so this solution is not viable in the long run.

This degree project is an attempt to investigate how video analysis powered by machine learning might be used in surveillance services to make them easier and cheaper to manage for a security company. If a more accurate filter than one based on simple motion detection can be developed, not only will camera operators' time be freed up for real alarms, but it will also be possible to handle more clients that need video monitoring.

The work in this thesis has been done in collaboration with Stanley Security, a company that, among other things, provides both cameras and monitoring services to customers. The company has its own security operations center (SOC) where operators can monitor incoming alarms.

1.1 Problem statement

Surveillance cameras can be an important tool in aiding law enforcement in criminal investigations, in detecting an attempted break-in in real time or in detecting humans in disaster areas. However, with a large number of surveillance cameras it becomes increasingly difficult for operators to monitor all the security feeds for activity. The likelihood of missing a few important frames of a video stream due to human error is a problem in most security systems utilizing human observers today. Another complication is the fact that some surveillance cameras are old and provide low resolution images and low frame rates, as well as weak hardware to run software on.

Research Questions

Therefore, the research questions posed in this thesis are:

• How can a system capable of recognizing humans in surveillance videos be implemented to run on weak hardware and how accurate and fast is it compared to standard algorithms for real time object detection?

• Given the hardware limitations would a simple SVM (support vector machine) or a more complicated CNN (convolutional neural network) be better suited for the task, in terms of accuracy and speed?

1.2 Goal

1.3 Limitations

The project is limited to detecting and classifying people only, meaning it will only be able to say "person" or "not person" in any given frame. Another delimitation is that the system will only be used in areas where not many people are expected to go, so it will not have to deal with picking out many individual people from larger crowds. Another limiting factor is that the implementation in the thesis does not take advantage of the temporal information that is available in video feeds. One could likely incorporate movement speed and characteristics of objects in a video to make a more accurate classifier, but due to the wildly differing frame rates of the surveillance cameras available for testing in the project this was not implemented.

1.4 Ethical and Societal Aspects

It is important to discuss the ethical and societal aspects of any project one undertakes, but it is especially crucial in work such as that of this thesis. Surveillance is a hotly debated topic where many discuss the merits of giving up privacy in return for security [2]. An example of the ongoing debates is a newly proposed law in Washington State in the US that would require notifying people of the use of facial recognition technology in public places [3].

A fully functioning automatic people detector would undoubtedly lead to surveillance becoming easier both to implement and to scale in order to meet a growing demand. The result of making surveillance more effective and automated could make places with surveillance cameras safer, as there would always technically be someone watching for people that could alert the authorities in case of break-ins or similar occurrences. One could even argue that since no person is watching the video feeds anymore, peoples' privacy is better preserved.

On the other hand, a parallel can be drawn to the Panopticon, a prison design where the inmates never know whether they are being watched, so they have to behave as if they were at all times. The Panopticon was at the time described as "a new mode of obtaining power of mind over mind", and one can hope that the people detector is looked back on in a more favorable light. The hope is that people feel more safe than paranoid because an algorithm is watching them.

A people detector like the one in this thesis does not only detect humans but can in fact determine where in an image they are located. It is easy to see how research into this kind of technology could lead to advancements in automated weaponry designed to target humans. It is also possible to make the naive argument that the technology might be used to avoid targeting people.

The perhaps least dramatic but nonetheless important ethical consideration is the one for camera operators who are being replaced by algorithms in the workforce. There will, at least initially, be a need for human operators to validate that the detections the system makes are accurate and to decide what the appropriate action is. So there will still be a need for human camera operators.

2 Background

This chapter presents the underlying theories used when developing the system.

2.1 Machine Learning

Machine learning is classically defined as a field that allows computers to learn without being explicitly programmed; see [5] for a more thorough description. Instead of being painstakingly programmed, the computer can use its own experiences to make predictions or improve performance. The experience, or the training data, is information about past events. The information could be labeled, meaning that it has some known value or category associated with it. This is called supervised learning. There are other kinds of learning, namely unsupervised and reinforcement learning. In unsupervised learning unlabeled training data is used. The goal in these systems is often to place data into groups that are not labeled, only separated from other types of data. Cluster analysis is an example of unsupervised learning in action. Reinforcement learning on the other hand allows a system to determine the best course of action to maximise its performance by using rewards to incentivize the desired behavior. The system is never explicitly told what is wished of it. It is simply punished for doing the wrong thing and rewarded for doing the right thing. The system itself has to figure out what it did to deserve the reward and try to repeat it. This thesis utilizes supervised learning, since labels are required for the training data to describe what is a person and what is not.

In all kinds of machine learning the quality and quantity of the available training data have a large impact on the accuracy of any predictions made. If the model or the algorithm is not suitable for the data it cannot generalize well enough to allow accurate predictions to be made. This problem is referred to as underfitting and can generally be said to occur when the model used is too simple. On the other side of the spectrum lies overfitting, which occurs when the underlying model used is very complex and the algorithm learns and incorporates the noise in the training data to make its predictions. This results in an algorithm that can predict outputs on its training data but will be inaccurate on any data never before seen. Figure 2.1 demonstrates these concepts.

Figure 2.1: What underfitting and overfitting could look like.

Machine learning can generally be used anytime a pattern needs to be recognized. This ability means that research problems such as speech recognition[6], natural language processing[7] and spam detection[8] have become practical applications where machine learning is used today.

2.2 Computer Vision

Recognition is one of the most difficult visual tasks for a computer to handle since the real world is hard to predict and an object a computer is tasked to find might appear in different poses, shapes, sizes and locations. Even within the same class of objects, say houses, one can find large variation between one house and another. Try telling a computer that it should classify a castle and skyscraper under the same label and it would shake its head at you if it could.

Recognition problems come in a few different flavors. Two of these are object detection and instance recognition. Object detection is used when trying to detect distinct objects like cars or tennis rackets. If a specific instance of an object should be recognized it is said to be instance recognition. This can for instance be facial recognition, since a specific face is being looked for and not just any face, or the act of identifying a specific tennis racket as opposed to just any racket. The kind of recognition used in this project is object detection, since the system should only be able to classify a human but does not need to differentiate between different humans.

2.3 Support Vector Machines

Support vector machines (SVMs) [10] are learning algorithms that work by separating data using a hyperplane. A hyperplane can look very different depending on which dimensional space one is working in. In essence, a hyperplane is a shape which can separate two classes of data points in an n-dimensional space, R^n, so in 2D the hyperplane is just a line while in 3D it is a plane. The purpose of the hyperplane is not only to separate data points of different classes but also to do so with the maximum possible distance to the data points of each class. During the training of an SVM, labeled data is put into the SVM and an optimal hyperplane is then generated to attempt to separate the two kinds of labels on either side of the hyperplane. When the hyperplane has been generated from the training data the training is done and the SVM can be used to classify new data. Classification of new data is done by investigating which side of the hyperplane the data lands on, where the sides correspond to different classes.

If the data is not linearly separable in its original form, a so-called kernel function can be used to transform the original feature space to a higher dimensional one where a separating hyperplane can be found.

A general linear hyperplane that separates data points into classes can be defined by the equation

f(\vec{x}) = \vec{w} \cdot \vec{x} - b = 0    (2.1)

where \vec{w} is the normal of the hyperplane, \vec{x} a feature vector and b a bias term such that b / \|\vec{w}\| is the perpendicular distance from the origin to the plane.

Since they can only separate two classes, SVMs are somewhat limited in their general usefulness. They are however quick and fairly accurate if trained properly, which makes them particularly attractive for single object detection in images. When we want to find an object in an image, one side of the hyperplane becomes "object" and the other side becomes "no object". A real life practical example is the face detection used in some cameras, which if not powered by deep learning is more often than not powered by SVMs. In practice, when using SVMs there are a few input parameters that can be adjusted during the training. One of the most impactful is the regularization parameter, often denoted by C, which controls how heavily misclassifications on the training data are penalized. A large value of C results in the chosen optimal hyperplane having a smaller margin, and vice versa for a small value of C. The smaller margin between the hyperplane and the training data means it is more likely that the training data is placed correctly.
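As a concrete illustration of these ideas, the following is a minimal sketch of a linear SVM with an explicit C parameter. It uses scikit-learn for brevity, which is an assumption made purely for this example (the thesis implementation itself uses dlib, see section 3.1); the toy data and the value C = 5 are illustrative.

```python
# Minimal linear SVM sketch: train on two toy classes and classify new points
# by which side of the hyperplane they fall on.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # class 1 ("object")
              [4.0, 4.5], [4.2, 4.0], [3.8, 4.8]])  # class 0 ("no object")
y = np.array([1, 1, 1, 0, 0, 0])

# C controls how heavily training misclassifications are penalized:
# a large C gives a smaller margin that fits the training data more tightly.
clf = SVC(kernel="linear", C=5.0)
clf.fit(X, y)

print(clf.decision_function([[1.0, 2.0], [4.0, 4.0]]))  # signed distance to the hyperplane
print(clf.predict([[1.0, 2.0], [4.0, 4.0]]))            # predicted classes
```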

2.4 Neural Networks

A neural network consists of an input layer, one or more hidden layers and an output layer. The input layer receives information from the outside world, and the output layer transfers information from the network to the outside world. That is the way the network can produce results for a given input. The hidden layer is called hidden since it has no access to the outside, unlike the input or output layers. The hidden layers are the most important part of a neural network; they are where most computations are performed and they can consist of any number of layers. The most basic building block in a neural network is the neuron, also known as a node. The layers mentioned earlier are simply a collection of specialized nodes, so the input layer has input nodes and the output layer has output nodes and so on. A hidden node's basic purpose is to receive an input, make some computation and then send an output. Every input to a node, of which there could be many, has an associated weight. In essence, the larger the weight the more important the input, meaning that the neuron should put more stock in the inputs with higher weight than those with a lower weight. The way a neuron decides whether it should fire or not is through a function called an activation function. There are many kinds of activation functions but in general their purpose is to make the output from the neuron nonlinear. The activation function returns a result that signals if the input exceeded some pre-defined threshold or not. A few of the more popular activation functions are Sigmoid and Rectified Linear Unit, ReLU for short. See section 2.4.3 below for more on these activation functions.

There is another kind of input a neuron can take, which is an input that is not affected by any outside factors or other neurons. It is called a bias and will be explained in the discussion on how the neural network is trained below, but in short a bias can shift the activation function to the left or right so that we have finer control over when the activation function is triggered. Mathematically a node can be described in the following way

y_k = \varphi\left( b + \sum_{i=1}^{n} w_{ki} x_i \right)    (2.2)

where y_k is the output of the k-th node, \varphi is an activation function, b is the bias, x_i is the i-th input, w_{ki} is the weight of that input and n is the number of inputs going into the node.
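A minimal sketch of equation (2.2) is shown below: a single neuron computing its output from weighted inputs, a bias and an activation function. NumPy and the toy values are assumptions used only for illustration.

```python
# One neuron: y_k = activation(b + sum_i w_ki * x_i)
import numpy as np

def neuron_output(x, w, b, activation):
    return activation(b + np.dot(w, x))

relu = lambda z: np.maximum(0.0, z)   # a simple activation function

x = np.array([0.5, -1.0, 2.0])        # inputs to the node
w = np.array([0.8, 0.2, -0.5])        # one weight per input
b = 0.1                               # bias shifts when the neuron activates

print(neuron_output(x, w, b, relu))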

As mentioned earlier, all outputs from neurons that lead into other neurons as inputs have weights associated with them. Initially these weights are completely randomized. The goal is to adjust the weights and also the bias for every neuron in the network so that the error is minimized. For every data point in the training set the data is pushed through the network and the output is observed; since the data is labeled, the network knows the answer and can calculate how far off it was using what is known as a loss function. The error is then propagated back the other way through the network, gradients are calculated, and the weights and biases are adjusted to account for this loss. This process is repeated until the loss falls below some defined acceptable range. So in essence backpropagation works by sending data into the network so that it can learn from its mistakes.

2.4.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) [15][16] are a subset of ordinary feedforward neural networks with one important distinction: they have at least one so-called convolution layer. They are very effective when working with image classification or object detection.

The convolution layer is responsible for reducing the dimension of the input image and extracting certain features from images by applying filters. Each neuron in a convolutional layer receives a portion of the entire input, performs the convolution of its input with the filter and sends the result to an activation function. The filter is applied along the input, moving with defined strides in each direction. The convolution operation has three parameters: the stride, the padding and the filter dimensions. The stride dictates how big or small each shift of the filter should be while moving across the input. The padding pads the input volume with zeros around the border; since a convolution reduces the dimensions of the input image, the padding is needed to maintain consistent dimensions.
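As a small illustration of these parameters, the sketch below defines a single convolution layer with an explicit filter count, stride and padding using tf.keras, the library the thesis implementation is built on. The input size, filter count and kernel size here are illustrative assumptions, not the thesis's actual layer settings.

```python
# One convolution layer with explicit stride and padding.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(filters=32, kernel_size=3,   # 32 learned 3x3 filters
                  strides=2,                   # shift the filter 2 pixels per step
                  padding="same",              # zero-pad the border to control output size
                  activation="relu",
                  input_shape=(224, 224, 3)),  # an RGB input image
])
model.summary()  # shows how the stride reduces the spatial dimensions
```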

2.4.2 Batch Normalization

In traditional convolutional neural networks, so-called batch normalization, proposed by Ioffe and Szegedy [17], is performed after a convolution layer. A batch in the context of CNNs refers to a smaller portion of the training data. The reason the training data needs to be split into batches is that usually there is so much data that not all of it can be loaded into a computer's memory at one time, so it has to be loaded in smaller, more manageable batches. Batch normalization consists of normalizing the output of the previous layer in the network by subtracting the batch mean and dividing by the batch standard deviation. The problem addressed by batch normalization is known as internal covariate shift, which occurs when the distribution of the inputs to a layer shifts during training, forcing the network to make large changes from layer to layer and significantly slowing down the training process by making convergence slower. So the main reason to use batch normalization is to speed up the training process of the neural network, but it also makes the network slightly better equipped to resist overfitting. For example, imagine that the inputs to one layer range from 0 to 100 and the inputs to another range between 0 and 1; normalizing these inputs brings the values closer together.
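The sketch below shows the batch normalization computation just described on a toy batch. The learned scale (gamma) and shift (beta) parameters are included for completeness; their values here, like the NumPy implementation itself, are illustrative assumptions.

```python
# Normalize each feature of a batch by the batch mean and standard deviation.
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[0.0, 100.0],
                  [1.0,  50.0],
                  [0.5,  75.0]])           # two features on very different scales
print(batch_norm(batch))                   # both features now roughly zero-mean, unit variance
```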

2.4.3 Activation Functions

Traditionally after every convolution an activation function is used, often Rectified Linear Unit (ReLU).

\mathrm{ReLU}(x) = \max\{0, x\}    (2.3)

A drawback of ReLU is that its derivative is 0 for all negative inputs, which means that neurons receiving only negative inputs stop learning and no longer contribute to the network as a whole. Leaky ReLU, or LReLU, solves this by not having any part of the function with a derivative of 0: when a negative value goes into LReLU it is simply multiplied with a small constant value to give that part of the graph a small but nonzero slope.

f(x) = \begin{cases} x & x > 0 \\ \alpha x & \text{otherwise} \end{cases}    (2.4)

Parametric ReLU is similar to leaky ReLU, but the small constant α is no longer a constant; instead it is a variable that can be trained and optimized along with the rest of the network during backpropagation. Another kind of activation function is the Sigmoid function.

S(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}    (2.5)

It is a function that has found use at the very end of a network, right before the output. The purpose of this function is to transform each output into a value between 0 and 1, so that the predictions the network makes can be seen as percentages showing how confident the network is of the prediction it made.
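The following is a small sketch of the three activation functions discussed above, implemented in NumPy purely for illustration.

```python
# ReLU (2.3), leaky ReLU (2.4) and sigmoid (2.5).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):          # alpha is the small constant in (2.4)
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes any value into (0, 1)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # negative inputs become 0
print(leaky_relu(z))  # negative inputs keep a small slope
print(sigmoid(z))     # outputs usable as confidences
```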

2.4.4 Pooling

Figure 2.2: An example of max pooling

Average pooling sums the values in the window and divides the sum by the number of values in the window to get the average. The purpose of pooling is not only to reduce the computational load; it also brings other benefits. It makes the network less prone to react to small distortions in the image, since small distortions will be filtered out by whatever kind of pooling is used. Pooling also makes the network more resilient to changes of scale, meaning that an object the network is tasked to detect can appear at different sizes or distances and the network will be able to detect it more easily than it would have without pooling.
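A minimal sketch of 2x2 max pooling with stride 2 (the common case illustrated by figure 2.2) is shown below; average pooling would simply replace max with a mean. The NumPy implementation is an illustration only.

```python
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape                                   # assumes h and w are divisible by 2
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 2, 8]])
print(max_pool_2x2(feature_map))                     # [[6 4], [7 9]]
```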

2.4.5 Fully Connected layer

2.4.6 Loss Functions

In the case of object detection, an image goes through the network along with the correct answer to where an object is in the image. This answer is called the ground truth. When the image data passes through the network a prediction is made, and a function of this prediction and the ground truth forms the loss. The loss value is used to update parameters during the training phase. There are many kinds of loss functions, but one of the most popular is called cross entropy, and is defined as

H(p, q) = \mathbb{E}_p[-\log q] = -\sum_{x \in \mathcal{X}} p(x) \log q(x)    (2.6)

where p and q are two probability distributions. More specifically, the inputs are the true distribution of the output and the predicted output distribution that the model has generated.

Another popular loss function is the mean squared error (MSE) function.

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2    (2.7)

where n is the number of classes the network can predict, Y_i is the true label and \hat{Y}_i is the confidence the network has predicted during training that an object belongs to class i.
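The two loss functions above can be sketched in a few lines of code; the NumPy implementation and the toy distributions below are illustrative assumptions.

```python
# Cross entropy (2.6) between a true distribution p and a predicted distribution q,
# and mean squared error (2.7) between true labels and predicted confidences.
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(q + eps))    # eps guards against log(0)

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

p = np.array([0.0, 1.0, 0.0])              # ground truth: the object is class 1
q = np.array([0.1, 0.8, 0.1])              # the network's predicted distribution
print(cross_entropy(p, q))                 # small loss for a confident, correct prediction
print(mean_squared_error(p, q))
```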

2.4.7 Residual Neural Networks

In very deep networks the gradients that reach the early layers during backpropagation can become vanishingly small, which makes such networks harder to train since the weights and biases are trained less efficiently with every pass through.

One way to solve the issue on deeper networks is through residual neural networks [22]. Intuitively, the way they work is by skipping some layers during training, making the network seem shallower. Instead of having every layer feed into the next, connections are added between one layer and one further away, effectively skipping the layers in between.

2.4.8 Single Stage and Two Stage Detectors

In modern object detection there are mainly two kinds of detectors being used, the single stage detector and the two stage detector. The two stage detectors are more traditional approaches and work by first generating regions of interest where detections are likely to occur and then, in the second stage, performing the detection on those regions. These kinds of detectors are generally more accurate than single stage detectors but also quite a bit slower. Single stage detectors are a newer idea which only in the last few years has started to become popular. The two most used single stage detectors today are YOLO [23] and SSD [24]. Single stage detectors only pass images through the network once and the fully connected layers simultaneously predict bounding boxes and class scores. Bounding boxes are the areas where objects have been detected in an image. These are usually in the shape of rectangles.

Single stage detectors make use of so-called anchor boxes, predefined box shapes that the predicted bounding boxes are based on. For the detector to work well, the anchor boxes should match the proportions of the objects that are to be detected; for people detection, normally that is the proportions of a human standing upright.

For this thesis single stage detectors are of interest since speed is the most desired property of the network.

K-means Clustering to Generate Anchor Boxes

K-means clustering [26] is very useful in the context of generating anchor boxes for single stage detectors. The basic idea of k-means clustering is to take n observations and group them into k different clusters, where each observation is assigned to the cluster with the nearest mean, meaning that each observation in a cluster should have something in common with the other observations within that cluster. The input to the k-means algorithm is in this case all the bounding boxes in the training data as well as a value for k, which will be the number of anchor boxes required. The output consists of dimensions for k bounding boxes of varying sizes, where at least one should be well suited for any new data as long as it resembles something from the training data.
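A minimal sketch of this idea is given below: cluster the (width, height) pairs of the training bounding boxes and use the cluster centres as anchor box dimensions. Plain Euclidean k-means from scikit-learn and the toy box sizes are assumptions made for illustration; YOLO-style implementations often use an IoU-based distance instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# (width, height) of every ground-truth box in the training data (toy values).
box_dims = np.array([[30, 80], [35, 90], [28, 75],
                     [60, 160], [55, 150], [65, 170],
                     [100, 240], [110, 260], [95, 230]])

k = 3                                         # the thesis generates 12 anchor boxes
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_dims)
anchors = kmeans.cluster_centers_             # k anchor box (width, height) pairs
print(np.round(anchors))
```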

2.5 Object databases for Detection

In order for machine learning to work a system requires training data to learn from. In this project the system has to learn what a human looks like, so it will need picture examples of humans. In general, how many data points a system will need to train on depends on several factors, mainly the complexity of what it has to learn and the complexity of the learning algorithm. An SVM, for instance, will require far less training data than a neural network to reach its full potential. But other than these factors the quality of the training data also plays a large part, so it is nearly impossible to estimate how much training data a given project will require. Usually an experimental approach is required.

Another common dataset is the Pascal VOC dataset [28]. The latest iteration, from 2012, has 20 classes and about 27,450 annotations on those classes.

2.6 Data Augmentation

An essential part of any machine learning endeavor is the acquisition of training data. Usually huge amounts of data are required for a machine learning model to become general enough that any new data can be classified correctly, especially for computer vision and object detection. In order to generate more images from a set of previously collected images, something called data augmentation can be used. The idea behind data augmentation is to trick the network into believing that it has more data to train on than it actually does. There are multiple ways of performing data augmentation, but they all work by altering existing images in some way to make the network think it is dealing with images it has not seen before. Some common methods of transforming images are to mirror, crop or flip them, but naturally there are other approaches.
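The simple transformations just mentioned can be sketched as below; the NumPy implementation is an illustration and not the thesis's actual augmentation pipeline.

```python
# Mirroring, vertical flipping and random cropping of an image stored as a
# (height, width, channels) array.
import numpy as np

rng = np.random.default_rng(0)

def mirror(img):
    return img[:, ::-1, :]                    # flip left-right

def flip_vertical(img):
    return img[::-1, :, :]                    # flip top-bottom

def random_crop(img, crop_h, crop_w):
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder image
augmented = [mirror(frame), flip_vertical(frame), random_crop(frame, 400, 500)]
```

Note that for object detection the bounding box annotations have to be transformed in the same way as the images.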

A newer approach is called style transfer [29] and involves transferring the style of one image onto another image. For example, when a style transfer is performed from a snowy winter picture onto an image of a desert, the resulting image should ideally look like a desert covered in snow. It is easy to see how style transfer is a great way of introducing variation into an image, which makes it suitable for data augmentation. The downside of style transfer is that it requires a neural network to perform the transfer, which makes it a slow process, something that is not desirable when potentially many thousands of images have to be transformed.

2.7 Related Work

YOLOv3 is used as a baseline for comparison in this thesis; see [30] for additional reading on YOLOv3.

In a paper by Wu and Liao [31], YOLO was compared to SSD in terms of accuracy and computational load. The objects that were to be detected during their tests were pedestrians and cars. They found that when both algorithms were trained on the same dataset, SSD performed slightly better than YOLO in both accuracy and computational load. One can view these results critically since there is no mention of what version of YOLO was used. The references point to the first version of YOLO and not the current third version, so it is possible that these results are misleading for up to date versions of YOLO.

In another study, by Lo Presti and La Cascia [32], real time object detection in surveillance videos is investigated and a system is developed. A main focus in that study, much like in this one, is to limit the computational load since the hardware used is weak by modern standards. The study presents a way of distinguishing, at the pixel level, between the background and moving objects in the foreground. Despite not using deep learning to detect anything specific, the work in [32] might still be of use, since it should be possible to combine some motion detection with deep learning to lessen the computational load on the hardware.

In a study by Zhao et al. [33], the authors test convolutional neural networks that differ from the popular method of construction, namely having a nonlinear activation layer, usually a ReLU, after every convolution layer. They found, after extensive experiments, that in certain situations there is an increased prediction accuracy when not using a ReLU after every convolution layer. This is something to consider when building the network in this thesis. However, in their experiments the networks that saw the most gain from the reduced number of ReLU operations were the networks with many layers. The network developed in this thesis will likely have to have a smaller number of layers due to the emphasis on speed in the project.

In a paper by Venieris, Kouris, and Bouganis [34], the authors present several methodologies one could use to deploy a deep network in an embedded system. However, most of the suggestions either require specialized hardware or software, or sacrifice an unacceptable amount of precision in the classifications, making them unviable alternatives in this thesis.

3 Methods

This chapter describes the models that were developed in this thesis and explains why certain choices were made in the development process.

3.1 SVM Implementation

The SVM was implemented using a software library called dlib [36]. The library contains functions for computer vision algorithms, including SVMs. These functions made it possible to generate an .svm file containing an SVM model generated from the training data in only a few lines of code. Similarly, after the SVM had been generated, getting predictions on new data could be done with a call to a simple function in the dlib library, making the SVM a very quick implementation.

It was found that the maximum number of images that could be used as training data before the speed of classification dropped significantly was about 200. So 200 images containing people clearly in frame were selected from the COCO dataset and used in the SVM generation. The C and epsilon values used were 5 and 7 respectively, chosen after some trial and error.
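As a hedged sketch of how such a detector can be trained and used with dlib's Python API, the snippet below follows the C and epsilon values stated above; the file names and the training XML listing the 200 images are illustrative assumptions, not the thesis's actual files.

```python
import dlib

options = dlib.simple_object_detector_training_options()
options.C = 5                        # regularization parameter from the thesis
options.epsilon = 7                  # stopping epsilon from the thesis
options.add_left_right_image_flips = True
options.num_threads = 4

# "people_train.xml" is assumed to list the training images and their person
# bounding boxes in dlib's XML format.
dlib.train_simple_object_detector("people_train.xml", "person_detector.svm", options)

# Detection on a new frame: returns rectangles around detected people.
detector = dlib.simple_object_detector("person_detector.svm")
frame = dlib.load_rgb_image("frame.jpg")
detections = detector(frame)
print(len(detections), "people detected")
```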

However, additional effort was put in to make the detection even faster. For the SVM, the bigger the image is, the slower the detection becomes, and with this in mind an improvement was made: before applying the SVM model, the image is analyzed to find the areas where motion has occurred since the previous frame in the video, the frame is cropped to only include these areas, and only then are these smaller images sent into the SVM to perform the detection. The motion detection works by investigating whether there was a large enough change in pixel values in an area from one frame to the next to exceed a threshold. This optimization allows the model to analyze at worst a few smaller areas rather than the whole image, and in the best case, where there is no motion at all, the model does not need to be used and the system can simply prepare for the next frame. The optimization is successful since the surveillance cameras are normally placed in fairly isolated places with little or no motion, so generally there will only be one or two moving objects in a given frame at a time. If the cameras had been set up on a crowded street the optimization would likely not be an optimization at all, since the constant motion of multiple people would cause the overhead of finding motion and cropping smaller areas to exceed the time spent on applying the model to the entire frame. See figure 3.1 for a visual demonstration of the cropping optimization.
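A minimal OpenCV sketch of this cropping optimization is shown below: difference two consecutive grayscale frames, threshold the change, and crop out the regions where motion occurred. The threshold and minimum area values are illustrative assumptions, and the two-value return of findContours assumes OpenCV 4.

```python
import cv2

def motion_crops(prev_frame, frame, thresh=25, min_area=500):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, gray)                  # pixel-wise change between frames
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)          # join nearby changed pixels
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    crops = []
    for c in contours:
        if cv2.contourArea(c) < min_area:                # ignore tiny changes (noise)
            continue
        x, y, w, h = cv2.boundingRect(c)
        crops.append(frame[y:y + h, x:x + w])            # region to run the SVM on
    return crops                                          # empty list: skip this frame
```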

3.2 Training Data for the CNN

For the convolutional neural network many more images than the 200 used for the SVM needed to be obtained. These images were drawn from the Pascal VOC and COCO datasets but also from Stanley Security's own SOC. The images had to be handpicked from the online datasets and could not just be fed into the network en masse, since not all images in these datasets were suitable for the project. The unsuitable images were mainly ones where a person was too close to the camera so that their entire body was not in frame. This is an unrealistic scenario for the network, since when the system is fully implemented it will only be running on cameras placed higher than ground level, so a person is unlikely to be close enough for these training examples to be of use.

The images collected from the datasets were all of very different people in many different poses and backgrounds. The ones where human-like features could be seen in the background were of particular interest. These were mostly things with legs, like cats, dogs or chairs. The images collected at Stanley were however quite homogeneous, essentially all of them catching people from the same angle, slightly above the person. Lighting conditions were also mostly the same for all the Stanley images, since most of them were collected during night time with the person lit by similar street lights.

The total number of images collected for the training was about 18000 before data augmentation.

Style Transfer

3.2.1 Annotation Method

The images collected from datasets available online came with annotation files. These are files that convey how many people are in an image and where they are. The location is often represented as four numbers: the minimum x-coordinate, the maximum y-coordinate, the width and the height. With this information a rectangle can be constructed around the person within the image and be sent into the network for training. The network in this thesis has been configured to accept XML annotation files in the Pascal VOC format. In order to annotate the new data collected from Stanley's SOC, an annotation program called LabelImg was used, since it has an option to output annotations in the Pascal VOC format, making an additional transformation step unnecessary.
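A minimal sketch of reading person boxes from a Pascal VOC style XML file is shown below. It assumes the standard VOC fields xmin/ymin/xmax/ymax, from which width and height can be derived; the file name is an illustrative placeholder.

```python
import xml.etree.ElementTree as ET

def load_person_boxes(xml_path):
    boxes = []
    root = ET.parse(xml_path).getroot()
    for obj in root.findall("object"):
        if obj.find("name").text != "person":    # only the "person" class is used
            continue
        bb = obj.find("bndbox")
        xmin, ymin = int(bb.find("xmin").text), int(bb.find("ymin").text)
        xmax, ymax = int(bb.find("xmax").text), int(bb.find("ymax").text)
        boxes.append((xmin, ymin, xmax - xmin, ymax - ymin))  # (x, y, width, height)
    return boxes

print(load_person_boxes("frame_0001.xml"))
```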

3.2.2 Anchor Box Generation

As stated in chapter 2, in order to be able to place bounding boxes around a detected object using a single stage detector, properly generated anchors are needed. For the best performance, catch-all default values should not be used; the anchors need to be tuned using the training dataset, since it hopefully contains objects of the same size as what the detector is intended to detect. In this thesis 12 anchor boxes were generated using k-means clustering.

3.3 Technical Implementation

The CNN constructed in this thesis was implemented using Keras [37] with Tensorflow [38] as a backend, using Python. Tensorflow is an open source machine learning library that provides many functions useful for building and deploying convolutional neural networks. Some of the functions from Tensorflow used in this project were ones for activation functions, convolutional layers and batch normalization. Having many of the functions required for the thesis available in the library made the implementation significantly quicker, not to mention a lot more readable.

One reason to use Keras is to address one of Tensorflow's greatest weaknesses, namely that its syntax is not pretty to look at and is not particularly user friendly. Keras solves this problem by providing wrappers for Tensorflow functions to make them more readable and make the code look more like Python than it would have using only Tensorflow. Another benefit of using Keras is that if another backend is desired in the future, it can simply be swapped in with no changes to the code, since Keras keeps all function names and return values consistent across many different libraries.

Hardware

The hardware used for training the model was a powerful computer sporting an RTX 2080 Ti graphics card, an Intel Core i7-8700K processor and 16 GB of RAM. For testing, an old laptop was used to simulate the weak hardware that the models might have to run on; that laptop ran on an i3-5005U processor with a base frequency of 2.00 GHz and 4 GB of RAM.

3.4 Network Architecture

A single stage detector was the choice of model as speed is a necessity in real time systems. The finished model takes inspiration from the YOLOv3 network architecture but differs in a few key places in order to make this network faster and more suited for detecting a single object as opposed to the multiple classes in YOLO.

It was decided, due to the findings of Zhao et al. [33], that not every layer should be followed by a ReLU as convention states; instead they are used more sparingly, with one ReLU after every other layer, as their paper suggests.

The number of filters in the initial convolutional layers was initially set higher, but it was found that these initial layers required a great deal of processing, so in the interest of computational speed this number was decreased.

After the first convolutional layer comes the most important part of the entire network, which consists of residual blocks. These are smaller building blocks of an entire residual network. The blocks are required as the network is relatively deep and thus is at risk of the vanishing gradient problem. With the residual network in place the initial layers will not receive a gradient close to zero and should not be exempt from the training process during backpropagation.

Following the large chunk of residual blocks there are yet another few convolutional layers. The purpose of these layers is different from that of the initial layers, but they share the fact that each layer is followed by a batch normalization. These layers only have to reshape the data into a certain shape so it can be sent out as output. The layers have to make sure that the data from the residual part of the network is in the shape of 7x7, as this is the grid size the images are divided into. A residual layer has the configuration that can be seen in table 3.1. Notice in the "Residual block" portion of the table that for the two convolutional layers only a single leaky ReLU is used.

Table 3.1: Implemented residual layer structure

  Initial:
    Convolutional layer (filters)
    Batch Normalization
    Leaky ReLU
  Residual block:
    Convolutional layer (filters/2)
    Batch Normalization
    Convolutional layer (filters)
    Batch Normalization
    Leaky ReLU
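A hedged tf.keras sketch of the residual layer in table 3.1 is given below: an initial convolution followed by a residual block in which two convolutions share a single leaky ReLU and a skip connection is added around the block. Kernel sizes, strides and the LeakyReLU slope are assumptions, as the thesis does not state them explicitly.

```python
from tensorflow.keras import layers

def residual_layer(x, filters):
    # Initial part: convolution, batch normalization, leaky ReLU.
    x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(alpha=0.1)(x)

    # Residual block: two convolutions with batch normalization but only one
    # leaky ReLU, plus a skip connection from the block's input.
    shortcut = x
    y = layers.Conv2D(filters // 2, 1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(alpha=0.1)(y)
    return layers.Add()([shortcut, y])
```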

Table 3.2: Network architecture

  Layer            Number of filters   Number of blocks
  Input layer      48                  -
  Residual layer   64                  1
  Residual layer   128                 3
  Residual layer   160                 3
  Residual layer   256                 1
  Residual layer   480                 1

The Output Layer

After the last residual layer in table 3.2 comes the output layer. The last layers in the model act a little differently from the other layers. Using upsampling, three detections are made at different scales to better accommodate both objects that are far away and those that are closer. A detection is performed right after the last residual layer, which is meant to detect the largest objects, in this case people that are closest to the cameras. After this detection an upsampling is performed and the output goes through more convolutional layers, after which it can be used to detect medium sized objects. The same process is then repeated and that output can finally be used to detect smaller objects.

3.4.1 Implemented Loss Function

The loss function for the network is similar to the loss in YOLO, where the loss function is composed of three terms: the classification loss, the localization loss and the confidence loss. The key difference between the loss function in this thesis and the one in the YOLO model is the way confidences for bounding boxes and classification are calculated. The model developed in this thesis uses binary cross-entropy while YOLO uses sum-squared error for those terms. Binary cross-entropy is simply the name given to a cross-entropy function used with two classes. The reason for the change of functions is that binary cross-entropy usually performs better when the output is a probability distribution.

The localization loss measures errors in the predicted bounding box size and location when compared to the ground truth contained within the annotation files. It is defined as:

\mathrm{Loc\_loss} = \alpha \left( \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \right)    (3.1)

where \mathbb{1}^{obj}_{ij} = 1 if the j-th bounding box in cell i has detected an object and \mathbb{1}^{obj}_{ij} = 0 if not. A cell in this context refers to a single square in the grid the image has been divided into. \hat{x}_i and \hat{y}_i are the x- and y-coordinates of the predicted bounding box, \hat{w}_i is its width and \hat{h}_i is its height, and the versions of these variables without a caret are the ground truth equivalents of their caret counterparts.

A problem would occur if the errors of small boxes and larger boxes were weighed equally, meaning that a small box with an error of a few pixels compared to the ground truth would generate an error that is equal to that of a large box with the same pixel error. This is an undesired behaviour, as the differences are more noticeable and thus more severe the smaller the box is. To combat this problem the square roots of the height and the width are used instead of the actual height and width themselves.

For the confidence loss the first sum below gives a loss if an object is detected within a box and the second is the converse, meaning it gives a loss if an object is not detected in the box

where \hat{C}_i is the confidence score of the j-th box in cell i.

When an object is detected, the loss at each cell is the classification loss. It is calculated as:

\mathrm{Class\_loss} = \eta \sum_{i=1}^{S^2} \mathbb{E}_{p_i(c) \in p(c)}\left[ -\log \hat{p}_i(c) \right]    (3.3)

In this thesis' implementation, as in YOLO's loss function, each term is multiplied by a different constant in order to reduce the errors from boxes that do not contain an object that can be detected and increase the error obtained from placing a bounding box incorrectly. In the formulas above, α, β, γ and η are these constants. The technique of adding the weights was a good idea in YOLO, but the way the weights are distributed did not seem well suited for this project. Having misplaced bounding boxes give a larger error than misclassifying an object was not an ideal approach, so the weights were shifted to a more equal distribution and have been given the values 2, 1.3, 0.6 and 2 respectively. The final loss is obtained by adding the localization loss, confidence loss and classification loss together. The full loss function thus becomes:

\mathrm{Loss} = \mathrm{Loc\_loss} + \mathrm{Conf\_loss} + \mathrm{Class\_loss}    (3.4)

3.5 Evaluation of the Models

The negative videos do not contain a person but contain motion of some sort that would trigger a commonly used motion based system and waste an operator’s time.

True positives refer to frames containing a person that were correctly classified, false positives to frames without a person that were incorrectly classified as containing one, false negatives to frames containing a person where no person was detected, and true negatives to negative frames that were correctly not classified. Precision is the ratio between true positives and all positive classifications. It is defined as:

\mathrm{Precision} = \frac{TP}{TP + FP}    (3.5)

Recall only measures correctly classified positive frames. It is the ratio between correctly classified positive frames and the total number of positive frames. It is defined as:

\mathrm{Recall} = \frac{TP}{TP + FN}    (3.6)

F1-score tries to find a middle ground between precision and recall, and thereby takes both false positives and false negatives into account. It is defined as:

\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}    (3.7)

The final way to evaluate the accuracy of the model is to measure a slightly more realistic scenario. Since only a single frame needs to be classified as containing a person in order for an alarm to be raised, having a metric for how many videos gave an alarm at all when they were supposed to provides another view into the effectiveness of the model. So if, say, people are detected in at least one frame in 8 out of 10 videos with people in them, the model is 80% accurate according to this way of measuring.

This method works similarly with the negative videos. A video with only a single frame of false positive classification is labeled as a false positive, meaning that if in 6 out of 10 negative videos there is at least one false positive detection, then the model is 40% accurate on negative examples.
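A small sketch of these evaluation metrics is shown below: frame-level precision, recall and F1-score from TP/FP/FN counts, and the video-level accuracy where a single detected frame is enough to count a whole video as an alarm. The example counts are the CNN's frame counts reported in chapter 4.

```python
def frame_metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

def video_accuracy(videos_with_alarm, total_videos):
    # e.g. 8 of 10 positive videos triggered at least one detection -> 80%
    return videos_with_alarm / total_videos

print(frame_metrics(tp=4717, fp=66, fn=619))   # roughly (0.986, 0.884, 0.932)
print(video_accuracy(8, 10))
```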

For speed, the average detection time per frame was measured using the time function in the time module available for Python.
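A minimal sketch of this timing measurement is given below; `detect` and `frames` stand in for the actual model call and the test video frames, which are assumptions for illustration.

```python
import time

def average_detection_time(detect, frames):
    start = time.time()
    for frame in frames:
        detect(frame)                      # run the model on one frame
    return (time.time() - start) / len(frames)

# Example usage: avg_seconds = average_detection_time(model_predict, video_frames)
```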

4 Results

This chapter presents the results obtained in the thesis. The first section presents the results of the convolutional neural network, the second the results of the support vector machine and the third the results of YOLOv3. All tests have been run on an old low-end laptop with an i3-5005U processor to simulate the weak hardware that the implementation is intended to run on.

4.1 CNN Results

This section details the results the developed CNN managed to attain.

4.1.1 Training the Model

The model was trained for 160 epochs (160 passes through all training data), which took about 50 hours to complete. See figure 4.1 below for a graph plotting the training loss as well as the validation loss. The training loss is the loss generated from the training data that epoch and the validation loss is the loss generated from the validation data. The validation data was created by taking 10% of the training data and using it for validation. The stopping condition used for the training was when 20 epochs had passed without any improvement in the validation loss. As one can see, the validation loss is consistently higher and more volatile than the training loss.

Figure 4.1: The training loss and validation loss during the training of the CNN

4.1.2 CNN Accuracy

For these tests the detection threshold was set to 0.5. Table 4.1 below shows the accuracy of the CNN on 12 different test videos as well as how many frames with people these videos had and in how many of those a person was detected by the CNN. Table 4.2 shows how many frames each negative video contained, how many frames that were classified falsely in the videos and the calculated false detection rate. See figure 4.2 for an example frame that was detected from video 2.


Table 4.1: Positive video accuracy table

  Video   Resolution   Frames with People   Frames Detected   Accuracy
  1       320x240      83                   61                73.6%
  2       640x480      365                  317               77.7%
  3       704x576      470                  433               92.1%
  4       704x576      152                  152               100%
  5       704x576      101                  101               100%
  6       640x480      832                  744               89.4%
  7       640x480      154                  136               88.3%
  8       640x480      532                  468               87.9%
  9       1280x720     259                  215               83.0%
  10      1280x720     330                  321               96.9%
  11      1280x720     770                  701               91.0%
  12      1280x720     1288                 1068              82.9%
  Total                5336                 4717              88.4%


Table 4.2: Negative video accuracy table

  Video   Resolution   Frame Length   Frames Falsely Detected   False Detection Percentage
  1       704x576      262            0                         0%
  2       704x576      155            0                         0%
  3       704x576      321            15                        4.6%
  4       704x576      450            0                         0%
  5       704x576      275            0                         0%
  6       1280x720     490            20                        4.1%
  7       1280x720     327            24                        7.3%
  8       1280x720     498            7                         1.4%
  Total                2778           66                        2.37%

In four of the eight negative videos at least one frame was classified incorrectly, meaning that in a realistic scenario the CNN would have raised a false alarm for 50% of the negative videos. In table 4.3 the precision of the CNN is presented.

Table 4.3: Total CNN Precision, Recall and F1-score

  True Positive Frames    4717
  False Positive Frames   66
  False Negative Frames   619
  Precision               0.986
  Recall                  0.884
  F1-score                0.932

4.1.3 CNN Speed

4.2 SVM Results

This section details the results the developed SVM managed to attain.

4.2.1 SVM Accuracy

Table 4.4 below shows the accuracy of the SVM on 12 different test videos, as well as how many frames with people these videos had and in how many of those a person was detected by the SVM. Table 4.5 shows how many frames each negative video contained, how many frames were classified falsely in the videos and the calculated false detection rate.

Table 4.4: Positive Video Accuracy table

  Video   Resolution   Frames with People   Frames Detected   Accuracy
  1       320x240      83                   35                42.1%
  2       640x480      365                  245               67.1%
  3       704x576      470                  408               86.8%
  4       704x576      152                  67                44.0%
  5       704x576      101                  56                55.4%
  6       640x480      832                  548               87.7%
  7       640x480      154                  86                79.2%
  8       640x480      532                  429               80.6%
  9       1280x720     259                  111               42.8%
  10      1280x720     330                  231               70.0%
  11      1280x720     770                  395               51.3%
  12      1280x720     1288                 790               61.3%
  Total                5336                 3401              63.2%


Table 4.5: Negative Video Accuracy table

  Video   Resolution   Frame Length   Frames Falsely Detected   False Detection Percentage
  1       704x576      262            15                        5.7%
  2       704x576      155            0                         0%
  3       704x576      321            3                         0.9%
  4       704x576      450            417                       92.6%
  5       704x576      275            100                       36.4%
  6       1280x720     490            89                        18.1%
  7       1280x720     327            66                        20.1%
  8       1280x720     498            123                       24.6%
  Total                2778           813                       29.2%

In seven of the eight negative videos at least one frame was classified incorrectly, meaning that in a realistic scenario the SVM would have raised a false alarm for 87.5% of the negative videos. In table 4.6 the precision of the SVM is presented.

Table 4.6: Total SVM Precision, Recall and F1-score

  True Positive Frames    3401
  False Positive Frames   813
  False Negative Frames   1935
  Precision               0.637
  Recall                  0.632
  F1-score                0.634

4.2.2 SVM Speed

It was found that for every moving object it took on average 0.0367 seconds to classify a frame, so the simple formula for the classification time becomes \mathrm{Time} = X \times 0.0367, where X is the number of distinct moving objects. In the testing videos the maximum number of distinct moving objects was 4 and the minimum was 1, so using the formula the time for classifying a frame was found to be somewhere between 0.0367 and 0.1412 seconds. This interval can be more easily understood as 7-27 frames per second.

4.3 YOLOv3 Results

This section details the results YOLOv3 managed to attain.

4.3.1 YOLOv3 Accuracy

Table 4.7: Positive video accuracy table

  Video   Resolution   Frames with People   Frames Detected   Accuracy
  1       320x240      83                   36                43.3%
  2       640x480      365                  319               87.4%
  3       704x576      470                  288               61.3%
  4       704x576      152                  152               100%
  5       704x576      101                  101               100%
  6       640x480      832                  730               87.7%
  7       640x480      154                  122               79.2%
  8       640x480      532                  481               90.3%
  9       1280x720     259                  198               76.4%
  10      1280x720     330                  311               94.2%
  11      1280x720     770                  654               84.9%
  12      1280x720     1288                 1126              87.4%
  Total                5336                 4518              84.6%

In all the positive videos at least one frame was classified correctly, meaning that in a realistic scenario YOLOv3 would have raised an alarm for 100% of the positive videos.

Table 4.8: Negative video accuracy table

  Video   Resolution   Frame Length   Frames Falsely Detected   False Detection Percentage
  1       704x576      262            0                         0%
  2       704x576      155            0                         0%
  3       704x576      321            0                         0%
  4       704x576      450            5                         1.1%
  5       704x576      275            0                         0%
  6       1280x720     490            10                        2.0%
  7       1280x720     327            24                        7.3%
  8       1280x720     498            17                        3.4%
  Total                2778           56                        2.01%

Table 4.9: Total YOLOv3 Precision, Recall and F1-score

  True Positive Frames    4518
  False Positive Frames   56
  False Negative Frames   818
  Precision               0.987
  Recall                  0.846
  F1-score                0.911

4.3.2 YOLOv3 Speed

5 Discussion

In this chapter the results of the testing are discussed, and the different models are compared to each other.

5.1 Accuracy of the Developed Models

Overall the accuracy of the convolutional neural network model that was developed was fairly good, at 88.4%, which was a surprising result since the network was relatively shallow. However, the good performance might be a result of the testing data not being too dissimilar to the training data, in that it consisted of video feeds taken from cameras placed at a high elevation pointing down on people below. The camera positioning means that the people the model had trained on and the ones presented during testing were captured from similar angles. A significant chunk of the training data was however not collected from Stanley's cameras but instead collected from datasets available online, so not every person in the training had a similar pose. The examples in the online datasets were quite diverse, however many images did share similar qualities, which likely improved the results on the test data.

Looking more at the results of the positive test videos, in all 12 videos the model would have alerted an operator of a person's presence, even if the accurate classification could not be maintained during the entire duration of the video. In the video with the least accuracy, which scored only 73.4%, the video is taken from one of the worst cameras still operational today. The feed is extremely blurry at a resolution of 320x240 and, to add to the difficulty, the person in the video is at times covered by a roof as they walk across the frame. But even during these less than ideal conditions well over half of the frames were classified correctly.

The results of the negative testing videos were also fairly good: of the 2778 frames of video not containing a person, only 66 were mislabelled as containing one. Of the eight negative test videos, four would have caused an alarm in a real scenario where the model was running on actual surveillance cameras. That is, 50% of the videos would have resulted in a false positive alarm being raised. This is a number that could have been lowered, at the cost of some accuracy in the positive tests, by increasing the hit threshold for a detection. This adaptation would force the system to filter out classifications it was less sure of. A decision was made not to make this trade-off, as from a security perspective it is far more important not to miss a person than to avoid mislabelling something else as a person: one option only costs an operator's time while the other can have more serious and costly consequences.

One can see a clear correlation between resolution and the number of false positives in the videos. The higher the resolution, the higher the chance of false positive alarms. This increased number is likely due to much of the training data having a lower resolution than 1280x720.

In terms of accuracy the implemented support vector machine was worse in almost all metrics: it detected fewer positive frames and mislabelled more negative frames during the testing overall. The SVM was however not bad by any means, and scored an overall accuracy of 63.2%. In a real scenario it would not have mattered which model was running, since both of them would have alerted an operator of the detections, even though the SVM was worse at following the person across the frame.


For the SVM, negative test video 4 alone resulted in a 92.6% false positive rate, which inflated this statistic for the overall test. Without video 4 the false positive rate would still have been higher for the SVM, but much less so than was ultimately the case. False positives occurred in seven of the eight test videos for the SVM, meaning that in a real implementation 87.5% of the videos would have caused a false positive alarm to be raised.

5.2 Speed of the Developed Models

The average time to perform a classification on a frame was calculated to be about 1.72 seconds using an old Intel i3-5005U processor. This is approximately 0.58 frames per second, making it slightly too slow to be deployed in a real-time system on that processor. Ideally the model would run on hardware powerful enough not to limit the native frame rates of the cameras, the slowest of which is about one frame per second while others go as high as 15. Hopefully, if the model is used, it will have access to more powerful hardware, which should bring the speed up to a more usable level.

The SVM was generally much quicker than the CNN, partly thanks to the nature of the algorithm and partly due to an optimization where smaller areas in which motion had occurred could be cropped out and analysed on their own, thereby avoiding needlessly examining the whole image for every frame. The cameras the models are intended to run on are mostly located in quiet areas where not much motion is expected, which means that the SVM rarely has to deal with more than two or three distinct objects moving in the frame, making the optimization very efficient. The time per frame naturally grows as more moving objects appear. On average, about 0.0367 seconds are required for each cropped region the SVM has to classify. This means that for the SVM to become slower than the CNN, 1.72/0.0367 ≈ 46 distinct moving objects would be required in a single frame, something that is unlikely to occur in any of the camera locations.
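As a rough illustration of this motion-crop optimization, the sketch below assumes OpenCV 4 and its MOG2 background subtractor; svm_classify_crop is a hypothetical placeholder for the trained SVM, and none of this is code taken from the thesis implementation.

import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

def detect_people_in_frame(frame, svm_classify_crop, min_area=500):
    """Run the SVM only on regions of the frame where motion occurred."""
    mask = subtractor.apply(frame)
    # MOG2 marks shadow pixels as 127; keep only confident foreground (255).
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    hits = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # Ignore tiny moving regions such as noise.
        x, y, w, h = cv2.boundingRect(contour)
        crop = frame[y:y + h, x:x + w]
        # Each crop costs roughly 0.0367 s to classify, so the time per frame
        # grows with the number of distinct moving objects, not the frame size.
        if svm_classify_crop(crop):
            hits.append((x, y, w, h))
    return hits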


5.3 The Developed Models Compared to YOLOv3

All three models investigated are interesting to compare since each excels at one metric over the other two: the SVM is the fastest, the developed CNN is the most accurate on positive data, and YOLOv3 is the best at not misclassifying negative data.

But even though YOLOv3 performed slightly better than the CNN developed in this thesis on negative data in terms of frames, 2.01% versus 2.37%, in terms of actual alarms both would have caused the same number of false positive alarms, in 50% of the videos. All the models also managed to find people in all the positive videos and so would have raised alarms, as they should have, in 100% of the videos. YOLOv3's strength at avoiding false positive detections is also visible in the precision metric: YOLOv3 had, by a slight margin, the highest precision.

The likely reason the CNN developed in this thesis had better accuracy than YOLOv3, despite being a smaller and less powerful model overall, is that YOLOv3 was not designed to detect smaller objects. The smaller number of filters in its initial layers makes it difficult for YOLOv3 to extract small features. So even though YOLOv3 has more capacity to analyse the input frames, the CNN in this thesis beats it by being able to "see" the input frames more clearly once its less powerful analysis begins.

The time difference between the two deep networks is quite significant: 3.15 seconds per frame compared to 1.72. That makes the CNN developed in this thesis 83.14% faster than YOLOv3 (it handles 3.15/1.72 ≈ 1.83 times as many frames per second), and more suitable for use in systems with weak available hardware or strict time constraints.


Conclusions

In this thesis two different implementations of a people-detecting model have been developed and evaluated, one based on a CNN and the other on an SVM. The CNN outperformed the SVM in accuracy, both in terms of the hit rate on positive examples and in its ability not to falsely classify negative examples. In one vitally important metric, however, the SVM performed better than the CNN: it performed much faster detections, making it potentially a better choice in situations where real-time detection is needed and the available hardware is very weak. Though if the SVM were used in a real system, and the decreased accuracy were an acceptable trade-off for increased speed, users would still have to deal with a fairly high false positive detection rate, which would require more human operators to check incoming classifications and flag the correct ones.

This is not to say that the CNN was slow by any means. The CNN developed in this thesis was relatively small in comparison to other, more traditional networks tasked with object detection. For instance, it was 83.14% faster than YOLOv3, a widely used network that influenced the design of the network in this thesis. So it is possible that with slightly better hardware than that used for testing here, the CNN could become an option for real-time detection. Not only was the developed model faster than YOLOv3, it also managed to attain a higher F1-score on the test data, meaning that it is, in general, the better model to use for people detection.


