Use of Thermal Imagery for Robust Moving Object Detection


LiU-ITN-TEK-A--21/041-SE

Use of Thermal Imagery for Robust Moving Object Detection

Hannah Bergenroth

Thesis work carried out in Media Technology (Medieteknik) at Tekniska högskolan, Linköpings universitet.

Norrköping, 2021-06-18

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden
(Institutionen för teknik och naturvetenskap, Linköpings universitet, 601 74 Norrköping)

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, to print out single copies for personal use, and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security, and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/

© Hannah Bergenroth

Abstract

In recent years, object detection and classification have received a lot of attention, with applications such as surveillance and pedestrian detection among them. Traditional detection methods use cameras that operate in the visible band as the conventional imaging device due to good image contrast and clear silhouettes. Compared to visual cameras, thermal cameras are less sensitive to illumination changes and can operate in darkness, which makes them useful in outdoor environments and for round-the-clock monitoring. This work proposes a system that utilizes both infrared and visual imagery to create a more robust object detection and classification system. The system consists of two main parts: a moving object detector and a target classifier. The first stage detects moving objects in the visible and infrared spectrum using background subtraction based on Gaussian Mixture Models. Low-level fusion is performed to combine the foreground regions from the respective domains. For the second stage, a Convolutional Neural Network (CNN), pre-trained on the ImageNet dataset, is used to classify the detected targets into one of the pre-defined classes: human and vehicle. The performance of the proposed object detector is evaluated using multiple video streams recorded in different areas and under various weather conditions, which form a broad basis for testing the suggested method. The accuracy of the classifier is evaluated on images generated experimentally by the moving object detection stage, supplemented with the publicly available CIFAR-10 and CIFAR-100 datasets. The low-level fusion method proves more effective than using either domain separately in terms of detection results. Insights into the problems and challenges that exist for real-time surveillance applications in the visible and infrared spectrum, as well as how using multiple sensors and performing low-level fusion affects the result of moving object detection, are among the contributions of this work.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations

2 Theory
  2.1 Thermal imaging
    2.1.1 Sensor fusion
  2.2 Machine learning
    2.2.1 Artificial neural network
    2.2.2 Activation function
    2.2.3 Training a neural network
  2.3 Convolutional neural network
    2.3.1 Convolutional layer
    2.3.2 Pooling layer
    2.3.3 Fully connected layer
  2.4 Transfer learning
  2.5 Dataset augmentation
  2.6 Image recognition with CNNs
    2.6.1 Base networks
  2.7 Moving object detection
    2.7.1 Background subtraction
    2.7.2 Gaussian Mixture Models
    2.7.3 Background model
  2.8 Measurements
    2.8.1 Confusion matrix
    2.8.2 Accuracy
    2.8.3 Precision and Recall
    2.8.4 F1 measure
    2.8.5 Intersection over union

3 Method
  3.1 Pre-study
  3.2 Tools and frameworks
  3.3 System overview
  3.4 Data collection
  3.5 Moving object detector
    3.5.1 Detector evaluation
  3.6 Target classifier
    3.6.1 Network architecture
    3.6.2 Training network
    3.6.3 Data augmentation
    3.6.4 Model evaluation

4 Results
  4.1 Moving object detector
  4.2 Target classifier

5 Discussion
  5.1 Results
  5.2 Method
    5.2.1 Data collection
    5.2.2 Moving object detector
    5.2.3 Target classifier
  5.3 The work in a wider context

6 Conclusion
  6.1 Research questions
  6.2 Future work

Bibliography

List of Figures

2.1 The electromagnetic spectrum and the sub-divided infrared spectrum.
2.2 A neural network with multiple hidden layers.
2.3 Max pooling performed on a 4x4 matrix.
2.4 The confusion matrix for a binary classifier.
2.5 Graphical illustration of Intersection over Union.
3.1 Example frames from the sequences in the dataset, including visual, infrared, and ground truth images.
3.2 Overview of fusion of visual and infrared spectrum for moving object detection.
3.3 Convolutional neural network architecture used for image classification.
3.4 Example of image augmentation. An image is subject to different transformations generating 8 new slightly altered images.
4.1 Detection result in visible, infrared, and fused spectrum for the Close person (CP) sequence.
4.2 Visual comparison of detection results in visible, infrared, and fused spectrum, for frame 201 in CP.
4.3 Visual comparison of ground truth with detection result in visible and fused spectrum.
4.4 Confusion matrix of the class-wise performance of the proposed neural network for classification of the test dataset.
4.5 Training and validation accuracy and loss during training of the proposed neural network, pre-trained on the ImageNet dataset.

List of Tables

3.1 Details of the used sequences from INO's Video Analytics Dataset.
4.1 Visual comparison of detection results in visible, infrared, and fused spectrum, for frame 156 in CP.
4.2 Cumulative proportion of extracted images of various sizes.
4.3 Class-wise and overall performance of the proposed neural network for classification of the test dataset.

1 Introduction

In computer vision, object detection is the procedure of searching an image or a video to identify one or multiple objects. Typically this is carried out by drawing a bounding box around the identified target, which helps to locate the object in a scene and track its movements. The ability to sort the detected regions into various classes and assign them labels is known as image classification. It is relatively easy for humans to identify objects present in an image or video sequence and differentiate them from each other. For machines, this amounts to the automated recognition of patterns and regularities in data. With the increased availability of large amounts of data and a new surplus of computing resources, it has become relatively easy to create a system capable of identifying and classifying different objects within a scene with high accuracy. Object detection and classification have gained much attention in recent years, with applications such as surveillance and security, self-driving cars, pedestrian detection, and face recognition. The exploitation of these automated vision systems has been an active research area and has formed the basis for recent development. Today, regular cameras capturing visible light are the standard imaging device for traditional detection methods. However, interest in thermal cameras has increased in recent years following the decrease in price and size, opening up new areas and applications [14]. Compared to visual cameras, thermal cameras are less sensitive to illumination changes and can operate in complete darkness. They are instrumental in measuring temperature differences, making them advantageous in outdoor environments and round-the-clock monitoring. Following the success of deep convolutional neural networks (DCNN) for image classification [20], deep learning methods have become state-of-the-art in image recognition and have made considerable improvements in the area [35]. Neural networks are an extremely robust way for machines to find patterns in images and classify them accordingly. Computer systems have become more useful thanks to neural networks, since their mission is to emulate what humans do well and then scale it up to take advantage of the speed of modern computers.

1.1 Motivation

A trade-off between processing speed and accuracy is required in applications for video surveillance. Since a monitoring system only needs to focus on deviations within an environment, not all objects need to be identified. Detection and classification of moving objects in video is a field of research that has received much attention in the past and where extensive research is still being conducted. Cameras operating in the visible band are today the primary sensors in object detection methods due to their often fair resolution, which makes it easy to extract features in an image, and their high contrast, which contributes to clear contours. The biggest challenge with visual cameras, however, is their sensitivity to illumination changes. A sudden change in intensity, color balance, or direction can complicate the object detection process. In addition, it is impossible to capture objects entirely in the dark. An ordinary visual camera is thus not sufficient for a system that must work in all conditions. It is here thermal cameras come into play.

1.2 Aim

This thesis aims to improve the detection of moving objects by utilizing information from both the visual and infrared spectrum. The use of two sensors can create a more robust object detection and classification system, which should give better performance than either sensor provides individually.

1.3 Research questions

1. How can object detection in infrared imagery utilize a pre-trained CNN for classification?
2. How can the number of false alarms be reduced during object detection using both a visual and a thermal camera?
3. What are the benefits of first extracting a region of interest before feeding it into a CNN for classification?
4. How should visual and infrared information be combined to improve the accuracy of moving object detection?

1.4 Delimitations

The master thesis project is centered around combining information from different sensors. However, no focus is placed on aligning the two camera systems, as this is assumed to be done already. In addition, the detection of moving objects is based on a static camera system with fixed, non-moving cameras. Finally, this work aims to evaluate the quantitative performance of the proposed method rather than to implement and provide a real-time system.

2 Theory

This chapter presents the theory of the thesis. Related research and other information of relevance for the research questions is included. Topics such as thermal imaging and sensor fusion, convolutional neural networks for image recognition, and object detection and its challenges are all covered. Finally, measurements used to evaluate the performance of object detection and classification methods are presented.

2.1 Thermal imaging

Infrared radiation (IR) is a form of radiant energy that can be interpreted as heat but is not apparent to the human eye. All objects with a temperature above absolute zero emit IR, which is often also referred to as thermal radiation. Within the electromagnetic spectrum, IR is found at 0.7-1000 µm and extends from visible red light to microwaves, hence the name. Several sub-division schemes exist in the literature, dividing the IR spectrum into multiple regions. One commonly used scheme divides the spectrum into near-infrared (NIR), short-wavelength infrared (SWIR), mid-wavelength infrared (MWIR), long-wavelength infrared (LWIR), and far-infrared (FIR) [7]. An overview of the electromagnetic spectrum is shown in Figure 2.1. Thermal cameras are typically sensitive to long wavelengths and thus commonly operate in the LWIR spectrum. This region is called the "thermal imaging" or "thermal infrared" region, where sensors can obtain a passive image of objects emitting thermal radiation without the need for illumination. In addition, it is possible to visually display the infrared energy emitted, transmitted, and reflected by an object with a thermal image, a so-called thermogram. The ability to "see" temperatures in an image can be beneficial in many applications. Gade et al. present different applications where thermal cameras are currently used [14], detection and tracking of humans being one of them. The authors claim that a thermal camera can often be a better choice than a visual camera in the context of surveillance. Since thermal cameras are not dependent on illumination for an object's visibility, they are advantageous in outdoor applications. It is also possible to get information about temperature differences when a scene is captured with a thermal camera.

Figure 2.1: The electromagnetic spectrum and the sub-divided infrared spectrum.

Image processing on images taken with a thermal camera differs from that on images taken with a standard visual camera. Object detection in IR images is a challenging task since they generally have lower contrast, lower resolution, and contain more noise than visible-band images [12], making pre-processing an often required step.

2.1.1 Sensor fusion

Several fusion techniques using both the visual and the infrared spectrum have been proposed in the literature to benefit from both camera types: low-level fusion, medium-level fusion, and high-level fusion [4, 12, 27]. In low-level fusion, the data is merged before any algorithm to extract information from the fused data is applied. In medium-level fusion, also referred to as feature-level fusion, it is the extracted features in each image that are combined. The fusion can take place between feature extraction and feature selection, or after both of them [1]. In high-level fusion, the fusion is implemented at the decision level, which involves using several classifiers to make the final decision. Fendri et al. present an overview of state-of-the-art low-level fusion methods [12], which is the most common fusion approach in the visual and infrared spectrum. Low-level fusion is typically divided into pixel-based and region-based fusion techniques. To demonstrate the efficacy of various fusion methods, the authors of [12] performed a number of experiments to evaluate low-level fusion techniques in the infrared and visual spectrum. The experiments showed that pixel-based fusion caused a noticeable loss of information compared to the information received by the two sensors individually. Fusion of visual and infrared information has been used in various applications in the literature. In [4], different fusion schemes were proposed for an obstacle recognition system. The fusion-based system had the ability to adapt to the illumination conditions with a weighting parameter responsible for the system's final decision. In [2], fusion of visual and infrared images was carried out for robust face recognition. The fusion-based recognition turned out to outperform the individual recognizers under illumination variations. The authors of [46] constructed a medium-level fusion for an automated driving system. The fusion of two separate neural network models, trained on color only and thermal only, performed significantly better than the baseline models.

2.2 Machine learning

Artificial intelligence differs significantly from traditional computer programming. Instead of providing a computer with explicit instructions for a defined set of scenarios, the core idea of machine learning is that the computer learns how to solve the problem itself, without being explicitly programmed to do so. Machine learning algorithms are designed to seek out patterns in the data, and neural networks are a highly robust way for machines to discover these patterns.

2.2.1 Artificial neural network

In recent years, researchers have made enormous breakthroughs in image recognition thanks to neural networks. An artificial neural network (hereinafter referred to as a neural network) is composed of a large number of artificial neurons, also called nodes, aggregated into layers. The nodes are individually trained to perform simple mathematical calculations and then pass the result to the connected nodes. When many layers are connected and data flows through the entire network, neural networks can model complex operations and provide answers to complicated problems within natural language processing, computer vision, and artificial intelligence. The nodes of a neural network are organized into a series of groups called layers. A neural network consists of an input layer, a set of hidden layers, and an output layer. When all nodes in two subsequent layers are connected to each other, the layer is said to be fully connected. A network consisting of only fully connected layers is called a fully connected neural network. A feedforward neural network does not contain any cycles between the connections of nodes, and the information flows from the input layer of neurons, through the hidden layers, to an output layer of neurons. An overview of such a simple neural network is illustrated in Figure 2.2.

Figure 2.2: A neural network with multiple hidden layers.

Input neurons make up the input layer of a neural network, which carries the initial data into the system for further processing. The data is then passed through the following hidden layers, which process the input and pass it forward, each tweaking the values slightly. The layers are called hidden as they do not contain any input or output data, and the output of a hidden layer generally remains unknown to the system [13]. A neural network can have many hidden layers; the more layers, the more processing can be done to find complex patterns. This is why neural networks with many hidden layers often are referred to as deep learning. The final outputs of the output neurons fulfill the task of the neural network, such as identifying an object in an image. The number of neurons in the output layer corresponds to the number of classes the network has been trained for.

2.2.2 Activation function

Activation functions are crucial in deep learning. Before values flow from the nodes in one layer to the next, they pass through an activation function. The performance of a deep learning algorithm, its precision, and the computational efficiency of training it are all affected by the choice of activation functions. The functions are mathematical equations determining the output of a node as a function of its input. The activation function in a neural network adds non-linearity to the output of each layer, which helps the network learn complex data. The most popular non-linearity is the Rectified Linear Unit (ReLU) function given in Equation 2.1 below.

ReLU(x) = max(0, x)    (2.1)

When dealing with a multi-class classification problem, the softmax function is often used as the activation function in the output layer of a neural network. The softmax activation function is given in Equation 2.2, taking an input vector z with elements z_j:

s(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}    (2.2)

The softmax function outputs a value for each node in the final layer of the network. This value can be interpreted as the probability of belonging to a specific class, and the values sum up to 1.

2.2.3 Training a neural network

Deep learning models learn to map inputs to appropriate outputs through the training process. All nodes in a given layer of the network generate outputs of varied importance determined by their weights. A large weight means that the input is important, while a low (or even negative) value means that it should be ignored. The output of each node is calculated by applying an activation function determined by a vector of weights and biases. The weights are, in turn, adjusted during the training phase to meet the objectives of the specific problem. For a feedforward neural network, backpropagation is a widely used method for training the network and updating the model weights; it computes the gradient of the loss function with respect to the weights. The training is an iterative process performed step by step with small updates to the weights, giving a change in performance of the model in each iteration. The training solves an optimization problem that results in minimum loss given the evaluation of the training dataset. Stochastic gradient descent (SGD) is a common algorithm to optimize the weights.

2.3 Convolutional neural network

When analyzing data that has a grid-based topology, such as images, convolutional neural networks (CNN) are commonly used [45]. The basic building blocks of a CNN are the convolutional layer, the pooling layer, and the fully connected layer, all of which are described next. These layers each have their own set of parameters that can be tweaked, and they each perform a different task on the input data.
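Before moving on to convolutional networks, the sketch below ties sections 2.2.1-2.2.3 together by computing one forward pass through a small fully connected network, using the ReLU and softmax functions from Equations 2.1 and 2.2. It is an illustrative sketch only (assuming NumPy); the layer sizes and weights are arbitrary.

```python
import numpy as np

def relu(x):
    # Equation 2.1: element-wise max(0, x)
    return np.maximum(0.0, x)

def softmax(z):
    # Equation 2.2: exponentiate and normalize so the outputs sum to 1.
    # Subtracting max(z) avoids numerical overflow without changing the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)

# A toy network: 4 inputs -> 8 hidden nodes (ReLU) -> 2 output classes (softmax).
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

x = rng.normal(size=4)             # one input sample
hidden = relu(W1 @ x + b1)         # hidden layer activations
probs = softmax(W2 @ hidden + b2)  # class probabilities, summing to 1
print(probs, probs.sum())
```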

2.3.1 Convolutional layer

The convolutional layer adds translation invariance, improving the neural network's ability to recognize objects in any position in an image. The name comes from the fact that the layer performs the linear mathematical operation of convolution, where a set of weights is multiplied with the input, instead of general matrix multiplication. The input to a CNN is a multidimensional array of data that is convolved with a smaller, two-dimensional array of weights called a filter or kernel, using a sliding-window approach. The feature map, or activation map, is the term for the resulting output after passing through a convolutional layer. The input is convolved with the array of parameters, which are adjusted by the learning algorithm. The hyperparameters of the convolutional layer are the filter dimensions, the stride, and the padding.

2.3.2 Pooling layer

The pooling layer is a downsampling operation along the spatial dimensions and is often found in connection with the convolutional layer in a CNN. The layer controls overfitting by reducing the number of training parameters and the computational expense. Max and average pooling are two commonly used types of pooling operations, where each pooling takes the maximum value or the average of the values within the kernel, or sliding window. An illustration of max pooling is given in Figure 2.3.

Figure 2.3: Max pooling performed on a 4x4 matrix.

2.3.3 Fully connected layer

The fully connected layers are the CNN's final layers. They are placed before the classification output of the CNN architecture and operate on a flattened input from the final pooling or convolutional layer.

2.4 Transfer learning

Overfitting is a problem in machine learning where the model learns the training data too well and fails to achieve the same accuracy on the test data or any other new dataset. It has been shown that deep neural networks easily tend to overfit when the available data is limited [33]. Neural networks rely on large volumes of high-quality data in order to find new patterns. Without a sufficient supply of data, the neural network will be unable to learn new information. With extensive training data available, it is possible to train a neural network from scratch. However, it would be a very expensive process, both in time and resources, to collect sufficient data and train the final model from the beginning. To overcome this problem, an existing neural network model can be re-used and adapted to a new problem, which is what is referred to as transfer learning.
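As a rough sketch of this idea (assuming Keras with the TensorFlow backend, which is used later in this work), a pre-trained network can be loaded without its fully connected top and extended with a new, task-specific classifier. The layer sizes below are illustrative; the exact classifier head and hyperparameters used in this thesis are described in section 3.6.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Load a model pre-trained on ImageNet, excluding its fully connected top.
base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))
base.trainable = False  # keep the pre-trained feature extractor frozen

# Add a small task-specific classifier on top of the frozen base.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # illustrative size
    layers.Dropout(0.2),
    layers.Dense(2, activation="softmax"),  # e.g. person vs. vehicle
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```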

Several models trained on large and challenging datasets have been released and are available to choose from. The idea is to use a model trained on one set of data and use its knowledge as a starting point in a new, related problem. Depending on the modeling methodology used, this can include using all or parts of the model. A CNN processes an image layer by layer, where the network learns to detect basic patterns in the initial layers. Each layer then uses the preceding information to look for slightly more complex patterns. The layers closer to the output interpret the extracted features for a task-specific classification, which is why the earlier layers are effective at feature extraction in the context of generalization. Moreover, it is possible to fine-tune the pre-trained model's weights by continuing the backpropagation algorithm. The fine-tuning can be applied either to all layers in the network or only to the higher-level layers, which are more specific to the given problem, while the initial layers are kept frozen.

2.5 Dataset augmentation

In computer vision, data augmentation is a method of increasing the amount of training data by including marginally changed versions of the stored data. By modifying the data with different simple transformations such as translations, rotations, and flips, the generalization of a classifier can be improved [18].

2.6 Image recognition with CNNs

Common applications for CNNs are object detection, classification, and segmentation. In image segmentation, each pixel is classified and labeled with a category, whereas classification intends to classify an image (or a region of it) and assign it a label. The use of CNNs for image recognition has proven to perform well when a large dataset is available [26]. State-of-the-art object detectors are usually divided into two categories: single-stage detectors and two-stage detectors [23]. Region-based Convolutional Neural Networks (R-CNN), proposed by Girshick et al. in 2014 [16], was a first attempt to solve the object detection problem. The detector consists of a first stage that extracts regions of interest (ROI) using a selective search approach, generating up to 2000 region proposals. A CNN is then used as a feature extractor to produce output features for each region, which is finally classified with a Support Vector Machine (SVM) classifier. The authors behind R-CNN solved some of the previous work's drawbacks by introducing a faster object detection algorithm called Fast R-CNN [15]. The algorithm runs the neural network once for the whole image, extracting the features, instead of running it once for each ROI. Fast R-CNN was found to improve training and testing speed in addition to improving the overall detection accuracy. A further improvement to the speed of the algorithm was made with the introduction of Faster R-CNN [35]. Unlike its predecessors, Faster R-CNN does not use the selective search approach but rather introduces the ROI generation into the network itself, making it faster. The first stage of Faster R-CNN is called region proposal and is the step in the process that takes the longest time to execute. It is the region proposal stage that prevents the two-stage detector from working in real-time applications [35]. Despite this, Faster R-CNN is today one of the most used and representative two-stage detectors [23].

Single-stage detectors such as You Only Look Once (YOLO) [34] and the Single Shot MultiBox Detector (SSD) [28] predict objects directly from the input images, obviating the need for a separate region proposal stage. YOLO, initially released in 2016 with a number of subsequent versions, is one of the most effective object detection algorithms. The algorithm uses a single CNN to predict bounding boxes and their respective class labels in one run. The one-stage detector achieves a high inference speed as a consequence of its simple architecture, making it suitable for real-time object detection; it does, however, reach lower accuracy rates [41]. The two-stage detector, on the other hand, has excellent localization and object detection accuracy but does not have the same inference speed as the one-stage detector.

2.6.1 Base networks

A network's configuration is determined by its intended use. The size of a network is determined by the amount of calculation needed, with many variations and modifications that can be made. Some models are built with the goal of being efficient rather than accurate. An object detector consists of a base network and a classifier. By excluding the final fully connected layers, a model can be used for feature extraction. The part of the network between the input layer and the last pooling or convolutional layer is regarded as the feature extraction part of a model, whereas the rest of the network, which constitutes the fully connected layers, is regarded as the classification part of the model. The VGG architecture [38] comes with either 16 or 19 weight layers and was state-of-the-art in 2014 when it was first introduced. The model achieved 92.7% top-5 test accuracy on ImageNet [11], an image database with over 14 million hand-annotated images belonging to 1000 classes. The VGG model is a standard CNN design and is still widely used today as a basis for other models because of its ease of comprehension. The backbone network originally used in SSD for feature extraction is the VGG architecture. The fully connected layers have to be removed from VGG before it can be used as the backbone in an object detector such as SSD or Faster R-CNN. The deep learning model is made available alongside weights pre-trained on the ImageNet dataset and can be used for feature extraction, prediction, and fine-tuning.

2.7 Moving object detection

Surveillance is the process of closely monitoring or supervising an environment, typically from a distance. Monitoring of objects, humans, or processes is often used for security, safety, or regulatory purposes. Moving object detection is an important area for automatic detection systems and monitoring applications. In the literature, detecting and tracking moving objects in a video sequence is a widely studied area in both visual and infrared imaging. Akula et al. [3] present a real-time framework for forest monitoring using thermal cameras, utilizing moving object detection and classifying targets with a DCNN. Their model showed a test accuracy of 95% given constrained illumination conditions. In [31], moving object detection is used for vehicle detection on a highway. The authors of [36] took on the challenge of moving object detection for a moving camera, accounting for the change in camera pose using deep learning. In computer vision, image analysis is often used for the detection of objects. The process of detecting and tracking moving objects includes searching for objects of interest in each frame of a video sequence. To speed up calculations, the more interesting parts of an image can be selected. By extracting the desired ROI, the analysis area can be decreased, accelerating the calculation. The distinction of objects is, of course, dependent on the scope of use. An advantage when it comes to surveillance applications is that moving objects are of particular interest, hereinafter referred to as targets. As a consequence, environmental changes can be used to differentiate targets from other non-moving objects in a scene. A simple approach is to segment the image into the foreground, representing the objects of interest, and the background, representing the static objects.
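As a minimal illustration of this idea (a sketch only, using OpenCV with a hypothetical video file path; the GMM-based method actually used in this work is described in the following sections), the foreground can be approximated by thresholding the absolute difference between the current frame and a static reference frame:

```python
import cv2

cap = cv2.VideoCapture("sequence.avi")     # hypothetical input video
ok, reference = cap.read()                 # first frame used as static background
reference = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, reference)    # per-pixel difference to the reference
    # Pixels that differ by more than a threshold are marked as foreground (255).
    _, foreground = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    cv2.imshow("foreground mask", foreground)
    if cv2.waitKey(1) == 27:               # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```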

2.7.1 Background subtraction

Several methods for background modeling have been proposed in the literature. One of the most commonly used techniques for motion segmentation in a static scene is background subtraction [36, 39], which aims to detect moving objects and classify pixels as foreground or background from a sequence of video frames. An initial reference image is taken with a static camera, followed by a set of frames taken at regular intervals. A background-foreground segmentation can thereafter be created by calculating the difference between consecutive frames. The many approaches to the background subtraction problem vary in the background model used and in the method used to update this model between frames. Background modeling algorithms are commonly divided into non-recursive and recursive techniques. Non-recursive algorithms use a sliding-window approach for background estimation and use a buffer to store previous frames [8, 32]. A commonly used non-recursive algorithm is median filtering [9, 47], where the background approximation is defined as the median of all the frames in the buffer at each pixel position. In [44], the Wiener filter is used to make probabilistic predictions of the expected background based on a recent history of values. Due to the buffer of previous frames, the non-recursive technique consumes a lot of memory in order to estimate the background model based on statistical properties of these frames [32]. Recursive algorithms, on the other hand, do not use a frame buffer to store previous frames but rather recursively update a single background model. Some of the simplest recursive algorithms in the literature are the approximated median filter [30] and the Kalman filter [5, 40]. The Kalman filter is a widely used technique for tracking linear dynamic systems under Gaussian noise. The recursive algorithms use a single background model that is updated with each new frame. As a result, this technique needs less memory than non-recursive techniques and is more computationally efficient [32].

2.7.2 Gaussian Mixture Models

Methods based on Gaussian Mixture Models (GMM) [6, 17, 48, 49] are improvements of the original GMM introduced by Stauffer and Grimson in 1999 [42], addressing a variety of issues that arise in video surveillance systems. The proposed methods have proven efficient in the literature [3, 8, 32]. These recursive, per-pixel background subtraction methods model each background pixel as a mixture of K Gaussian distributions. An extension to the original GMM was proposed in [48, 49], which automatically selects a variable number of Gaussian distributions to model each pixel. The algorithm has additional parameters that influence the outcome of the procedure: the history refers to the number of frames that affect the background model, and the threshold weight is the distance between the pixel in the current frame and the background model, where a smaller value tends to generate false objects. The improved method decreases the memory requirements, improves the computational efficiency, and has proven to increase the performance on multimodal backgrounds.

2.7.3 Background model

Toyama et al. [44] and Shaikh et al. [37] introduce some challenges in the maintenance of the background model related to the background subtraction procedure.
To have a robust system, it is necessary to be able to handle the following problems:

Camouflage: There might be objects that appear identical to the modeled background and its surroundings. This is a possibility in both thermal and visual images. In visual images, an object with a color or texture similar to its surroundings can be hidden in the background. In thermal images, camouflage is a problem when it is not possible to distinguish temperature changes.

Illumination changes: In visual imagery, a sudden change in illumination can drastically change the background's appearance. Since thermal cameras create an image using infrared radiation, illumination changes are less of an issue in thermal imagery.

Dynamic background: Some parts of the scene may move occasionally, for instance swaying trees in an outdoor environment, even if they are to be considered background.

Walking person: When an initially static object moves, both the object and the revealed parts of the background will be considered foreground.

Shadows: Shadows cast by targets are a potential problem that complicates the background subtraction procedure if they are detected as foreground objects. Further, there is a possibility that multiple shadows merge, making it difficult to identify separate objects and classify them correctly. In thermal imagery, shadows do not exist.

2.8 Measurements

To be able to evaluate the effectiveness of an object detection and classification model, certain performance metrics can be calculated. The metrics accuracy, precision, and recall, all used to measure classification performance, can be obtained from the confusion matrix presented below.

2.8.1 Confusion matrix

The classifier's output is summarized by a confusion matrix. A confusion matrix is represented as a table and is often used to summarize the results of a classification model on a collection of data where the true values are known. The matrix does not only provide insight into the errors made by the classification model but also visualizes the type of errors made. The confusion matrix for a binary classifier is illustrated in Figure 2.4, where the various classification outcomes are:

True Positives (TP): Correct predictions of the positive class.
True Negatives (TN): Correct predictions of the negative class.
False Positives (FP): Incorrect predictions of the positive class.
False Negatives (FN): Incorrect predictions of the negative class.

2.8.2 Accuracy

Accuracy is a performance metric that describes how well a classifier works across all classes. The accuracy is calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made:

A = (TP + TN) / (TP + TN + FP + FN)    (2.3)

2.8.3 Precision and Recall

Precision and recall [10] are two important model evaluation metrics. Precision refers to the percentage of relevant results, in other words, how often the model is right when predicting a label.

Figure 2.4: The confusion matrix for a binary classifier.

Precision is calculated as the ratio of predictions of a class that actually belong to that class:

P = TP / (TP + FP)    (2.4)

Recall, or sensitivity, refers to the percentage of the total relevant results that the model correctly classifies, that is, the ratio of the actual class that is correctly predicted by the model, and is calculated as:

R = TP / (TP + FN)    (2.5)

2.8.4 F1 measure

The F1-score is a metric for determining how accurate a model is on a given dataset. The measure combines the model's precision and recall and can be interpreted as a weighted average of the two. The F1-score can assume values between 0 and 1, where a good value, closer to 1, is obtained if the number of false positives and false negatives is low. A high F1-score means that the model accurately detects objects and is largely unaffected by false alarms. The measure is calculated as:

F1 = 2 · (precision · recall) / (precision + recall)    (2.6)

2.8.5 Intersection over union

Intersection over union (IoU) is often used to evaluate how well an object detector performs and is able to locate objects. The IoU is illustrated in Figure 2.5 and is the ratio between the intersection and the union of the bounding boxes predicted by the model and the ground truth boxes, and is defined as follows:

IoU = area of intersection / area of union    (2.7)

Figure 2.5: Graphical illustration of Intersection over Union.

To be able to calculate the precision, recall, and F1 measure for an object detector, it is necessary to first identify TP, FP, and FN. IoU is used to determine whether a detection (Positive) is right (True) or not (False) by comparing the IoU value with a threshold, usually set to 0.5. If the value is above the threshold, it is a True Positive, while values below the threshold are considered a False Positive. The objects that the model failed to locate, False Negatives, must also be taken into account and measured.
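To make these definitions concrete, the following sketch (plain Python, with hypothetical box coordinates in (x1, y1, x2, y2) form and hypothetical TP/FP/FN counts) computes the IoU of two axis-aligned boxes and derives precision, recall, and the F1-score:

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). Compute the intersection rectangle first.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)        # Equation 2.7

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                            # Equation 2.4
    recall = tp / (tp + fn)                               # Equation 2.5
    f1 = 2 * precision * recall / (precision + recall)    # Equation 2.6
    return precision, recall, f1

# A detection counts as a true positive if its IoU with a ground-truth box is >= 0.5.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))   # partial overlap
print(precision_recall_f1(tp=80, fp=10, fn=20))
```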

3 Method

This chapter describes the procedure of the work. This includes reviewing related work and setting up system requirements, collecting data for quantitative performance evaluation, implementing an object detector for moving object recognition, utilizing multiple sensors by combining visual and infrared information, and finally, building and using a convolutional neural network for object classification. The workflow is described in more detail below.

3.1 Pre-study

A literature study was conducted to review related work, to read up on what has been done in the area, what challenges there are, and which methods are best suited for the purpose of identifying objects in a surveillance system. System requirements were gradually identified as it became clear what was important to the system and what could be ignored. Target classes were identified from a monitoring system perspective. Since it is of interest to monitor prohibited movements, the predefined classes were identified as person and vehicle, as the latter can be used by a person in the monitored area. The most time-consuming part of today's state-of-the-art object detection methods was identified as the region proposal stage, which scans whole images to detect possible ROIs. A moving object detection procedure based on background subtraction makes it possible to detect deviations in an area based on differences between consecutive frames, while only a limited number of proposals are passed to the recognition stage. A Gaussian Mixture Model was used to generate the background model and extract moving objects in a scene. The method was chosen due to its proven efficiency in real-time applications. With this architecture, the processing time related to object detection is limited to the CNN's inference time. Furthermore, it was important that the system could detect and classify objects of lower resolution, as monitoring of an area is often done from a distance. Training of the CNN was thus done with an image resolution similar to that used in later processing in the system.

3.2 Tools and frameworks

The implementation of the different components, both the detection and the classification part, was written in Python. Python ranks first in the IEEE Spectrum annual ranking of the most popular programming languages [21]. Python comes with built-in libraries and packages that simplify the implementation of deep learning algorithms and image processing and make it easy to visualize data clearly, which makes the language very popular in computer vision. OpenCV is a free, open-source software library that provides real-time optimized computer vision tools suitable for image processing. The library is used to simplify the implementation and analysis of images in the system. The neural network model was implemented in Keras [24] with Google TensorFlow [43] as backend. Keras is a high-level Application Programming Interface (API) built on top of TensorFlow, a software library for machine learning. Keras offers a straightforward and consistent interface designed for typical use cases. Since Keras is written in Python, it is easy to use and comprehend for experienced Python programmers. Google Colaboratory [19] was used to edit and run the project files since it is a free, online, cloud-based Jupyter notebook environment that allows training and using deep learning models on GPUs. Since it uses Google resources, it is possible to accelerate neural network operations without being limited by, or using, one's own processors. Further, the environment does not require any setup since it comes with many pre-installed machine learning libraries, which makes it easy to get started.

3.3 System overview

The proposed system consists of two main parts: a moving object detector and a target classifier, both of which are briefly described below and in more detail in their respective sections 3.5 and 3.6.

1. Moving object detector: The first stage of the system detects moving objects in both visual and infrared frames and segments these into foreground pixels using background subtraction. The two resulting binary masks are then combined by fusing the foreground regions and are thereafter post-processed to finally obtain ROIs in the visual frame.

2. Target classifier: The ROIs from the first stage are used as input to the second stage. The images are resized and fed into a pre-trained CNN which is used for feature extraction and classification of the moving objects.

3.4 Data collection

To evaluate the proposed moving target detector and the result of the fusion of visual and infrared information, the popular INO Video Analytics Dataset [22] was used. The dataset consists of a number of video sequences of varying length and resolution and comes with between 10 and 21 ground-truth motion segmentation masks used for evaluation. From this dataset, three sequences recorded in both thermal and color were chosen, namely INO Close person (CP), INO Parking snow (PS), and INO Main entrance (ME). An overview of the used sequences, along with their details, is shown in Table 3.1. The sequences are captured in a variety of locations and under varying environmental conditions and form a broad basis for testing the suggested method. Example frames from the sequences, along with their ground truth motion segmentation, are shown in Figure 3.1.

Table 3.1: Details of the used sequences from INO's Video Analytics Dataset.

Sequence   Length (frames)   Resolution   Moving object classes   Weather conditions
CP         240               512x184      Person                  Sunny day
PS         2941              448x324      Person, Car             Cloudy day
ME         551               328x254      Person, Car             Daytime

All three video sequences then underwent the moving object detection stage described in the upcoming section 3.5 to extract ROIs representing moving objects. These detected regions were then resized to 32x32 pixels, saved, and manually sorted into one of the predefined target classes, person and vehicle, and labeled accordingly. Any misregistered regions were disregarded in order to build the dataset for the target classifier stage. In addition to the data collected by the moving object detection stage, images belonging to the person and vehicle classes were collected from the publicly available datasets CIFAR-10 and CIFAR-100, respectively [25]. The CIFAR datasets consist of 32x32 color images in 10 and 100 classes, respectively. From the perspective of a monitoring system with two identified target classes, only images belonging to either the person or the vehicle class were collected in order to build the final dataset for the target classifier stage. The external collection of data was carried out in order to have a sufficiently large dataset and to ensure its variability. A total of 2056 color images, evenly distributed between the person and vehicle classes, were collected and labeled and constituted the final dataset on which the CNN was trained.

Figure 3.1: Example frames from the sequences in the dataset, including visual, infrared, and ground truth images.
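The following sketch outlines this dataset preparation step. The helper function, paths, and file naming are hypothetical and only illustrate the idea (OpenCV is assumed): each detected region is cropped from the frame, resized to 32x32 pixels, and saved into a folder named after its class label.

```python
import os
import cv2

def save_roi(frame, box, label, index, out_dir="dataset"):
    """Crop a detected region, resize it to 32x32, and store it under its class label.

    `box` is (x, y, w, h) in pixel coordinates; `label` is e.g. "person" or "vehicle".
    """
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    roi = cv2.resize(roi, (32, 32), interpolation=cv2.INTER_AREA)
    class_dir = os.path.join(out_dir, label)
    os.makedirs(class_dir, exist_ok=True)
    cv2.imwrite(os.path.join(class_dir, f"{label}_{index:05d}.png"), roi)
```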

3.5 Moving object detector

The first step in detecting moving objects is to create the background model used to distinguish foreground objects from background objects. The foreground corresponds to the pixels that are in motion, while the background consists of the remaining non-moving pixels. A background subtraction model based on Gaussian Mixture Models, as proposed in [48, 49], was used to detect moving objects in a video sequence. The background subtraction scheme is a background/foreground segmentation algorithm that assigns the appropriate Gaussian distribution to each pixel. The background modeling consists of a background initialization step and a background update. For each new video frame, the background model is updated and the foreground mask is calculated. The history of the background model, the number of recent frames that have an effect on the model, was set to 200. The history parameter sets the learning rate of the algorithm, which exponentially decays the weights of older frames. This value was set experimentally and can be optimized for a specific course of events; a higher value is good for very small movements, while a small value is suitable for fast movements in the scene. For each new incoming frame, the background model is subtracted from the current frame, generating a binary image with ones representing the foreground and zeros representing the background. The threshold value of the algorithm decides whether a pixel is considered foreground or background. This threshold is based on the squared Mahalanobis distance [29] between a pixel and the background model; this value was set to 16. A smaller value requires less certainty and thus generates more components.

Pre-processing is an essential step in image processing since it reduces false alarms and improves detection rates by decreasing noise and background clutter, while it can also enhance target silhouettes. Before any object detection was performed in the individual frames, pre-processing was executed. For this purpose, a flood fill algorithm was implemented to extract the background from the binary mask. The flood fill operation was performed from a pixel known to belong to the background; pixels not affected by this operation are assumed to lie within an object. The flood-filled image was then inverted and combined with the binary mask to extract the foreground. This procedure resulted in solitary pixels being erased. Finally, a dilation operation with a 3x3 kernel was performed to enlarge the boundaries of foreground regions. This was done to avoid removing minor objects at a later stage, while holes were filled and connected objects were joined.

The fusion was conducted by combining the detection results of the two sensors with logical conjunction. In this way, the intersection of the foreground regions of the binary masks obtained from the background subtraction in the infrared and color images was computed. Morphological closing, dilation followed by erosion, was performed on the combined binary mask to fill holes inside foreground objects obtained from the fusion process. Finally, foreground regions smaller than 100 pixels were excluded from the classification process, as these were assumed to be too small to be classified. However, such regions can still give signs of movement in the system. Figure 3.2 shows an illustration of the process of fusing the visual and infrared spectrum for moving object detection.

Figure 3.2: Overview of fusion of visual and infrared spectrum for moving object detection.
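A condensed sketch of this stage is given below, assuming OpenCV and pre-aligned visual/infrared sequences. The history and threshold parameters follow the description above, while the video paths are hypothetical and the flood-fill and dilation pre-processing described above is omitted for brevity.

```python
import cv2
import numpy as np

# One GMM background model per sensor, with the parameters described above.
bg_vis = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16, detectShadows=False)
bg_ir = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16, detectShadows=False)

cap_vis = cv2.VideoCapture("visual.avi")    # hypothetical, pre-aligned sequences
cap_ir = cv2.VideoCapture("infrared.avi")
kernel = np.ones((3, 3), np.uint8)

while True:
    ok_v, frame_vis = cap_vis.read()
    ok_i, frame_ir = cap_ir.read()
    if not (ok_v and ok_i):
        break

    # Per-sensor foreground masks from the GMM background models.
    mask_vis = bg_vis.apply(frame_vis)
    mask_ir = bg_ir.apply(frame_ir)

    # Low-level fusion: logical conjunction (intersection) of the two masks.
    fused = cv2.bitwise_and(mask_vis, mask_ir)

    # Morphological closing (dilation followed by erosion) fills holes in the fused mask.
    fused = cv2.morphologyEx(fused, cv2.MORPH_CLOSE, kernel)

    # Keep only foreground regions of at least 100 pixels as candidate ROIs.
    contours, _ = cv2.findContours(fused, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    rois = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= 100]
```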

3.5.1. Detector evaluation

The evaluation of the object detector was carried out by calculating the intersection over union (IoU) between the binary mask obtained from the background subtraction procedure and the corresponding ground-truth binary mask. The IoU measures the overlap of a predicted bounding box against the actual bounding box of an object. To evaluate the performance of the fusion stage, the IoU was calculated for the visual and infrared images separately as well as for the fused image. To calculate the IoU, a Python script was implemented which first found the contours of the (white) foreground pixels in the binary masks, drew approximate rectangles around these connected pixels, and finally tested each predicted rectangle against all ground-truth rectangles to find a possible match. An IoU threshold of 0.5 was used to decide whether a predicted rectangle had a large enough overlap to be considered a true positive detection; otherwise it was counted as a false positive detection. All ground-truth rectangles that the background subtraction procedure was unable to detect were annotated as false negatives. In the ground-truth masks provided with the CP video sequence, shadows are marked in gray. These masks were thresholded so that the shadows were excluded (set to black), as shadows are not of interest in this work. The numbers of true positives, false positives, and false negatives were in this way calculated for each ground-truth frame of the three sequences and used to compute the precision, recall, and F1-score for the visual, infrared, and fused methods.
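The evaluation script itself is not reproduced in the thesis; the sketch below is an assumed reconstruction of its core logic: extract bounding rectangles from a predicted mask and a ground-truth mask, match each predicted box to its best unmatched ground-truth box, and count TP, FP, and FN using the 0.5 IoU threshold.

```python
import cv2

IOU_THRESHOLD = 0.5

def boxes_from_mask(mask):
    """Bounding rectangles (x, y, w, h) around connected white foreground pixels."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def evaluate_frame(pred_mask, gt_mask):
    """Return (TP, FP, FN) for one frame by matching predicted boxes to ground truth."""
    preds, gts = boxes_from_mask(pred_mask), boxes_from_mask(gt_mask)
    matched = set()
    tp = fp = 0
    for p in preds:
        best_iou, best_idx = 0.0, None
        for i, g in enumerate(gts):
            if i in matched:
                continue
            score = iou(p, g)
            if score > best_iou:
                best_iou, best_idx = score, i
        if best_iou >= IOU_THRESHOLD:
            tp += 1
            matched.add(best_idx)
        else:
            fp += 1
    fn = len(gts) - len(matched)   # ground-truth boxes that were never detected
    return tp, fp, fn
```

Summing these counts over all ground-truth frames of a sequence gives the per-sequence precision, recall, and F1-score reported in the results chapter.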
3.6. Target classifier

The second stage of the system is the target classifier, whose task is to classify the objects extracted in the moving object detection stage. CNNs have proven their efficiency in classification tasks and are therefore used.

3.6.1. Network architecture

The VGG16 architecture is chosen as the feature extractor in the project due to its proven efficiency and simplicity. The pre-trained VGG16 has been trained on the ImageNet dataset, consisting of over 14 million color images belonging to 1000 classes. The network has a default input image resolution of 224x224 pixels, which is also the size of the images the network has been trained on, and layers are stacked on top of one another to create the network. There are 16 weight layers, 13 of which are convolutional layers and the remaining three fully connected layers. The backbone of the model uses architecture elements repeated multiple times, consisting of convolutional layers with 3x3 filters and ReLU activation functions, followed by a max-pooling layer.

3.6.2. Training network

For this project, a pre-trained VGG16 model is adapted and used as a feature extractor. The model is loaded and initialized with the weights from training on the ImageNet dataset, excluding the last fully connected layers. On top of the pre-trained layers, custom fully connected layers are added. Two dense layers and one dropout layer are used as the classifier. The first dense layer uses the ReLU activation function to introduce non-linearity, and the second uses softmax activation to output the class probability scores. The dropout layer, with a dropout rate of 0.2, is added to prevent overfitting. The first 16 layers of the VGG16 model were frozen, making only the final convolutional layer trainable. The weights of this layer are thus updated during training via the backpropagation algorithm. An illustration of the network architecture used is shown in Figure 3.3. Furthermore, the following hyperparameters were used to train the model:

• Input image resolution: 32x32
• Optimizer: Adam with a learning rate of 0.00001
• Batch size: 32
• Classification loss: softmax loss (softmax activation with cross-entropy loss)

Figure 3.3: Convolutional neural network architecture used for image classification.

The maximum number of epochs, that is, the number of times the entire dataset is passed through the neural network, is initially set to 100. Through the Keras API, early stopping is added to stop training once the model performance stops improving, in order to avoid overfitting. The performance measure to be monitored is set to the validation loss, with a patience of 8 epochs. Checkpointing is another callback provided by the API and is used to save and restore the best model after early stopping has ended the training. The best model can thereafter be loaded and used for prediction and classification purposes. The dataset for the target classification stage, collected as described in section 3.4, was partitioned into training and testing splits with a ratio of 5:1. An even distribution of the classes over the training and testing splits was achieved through stratification, which preserves the proportion of class labels in each split. Both the training and testing datasets were normalized by subtracting the ImageNet image mean from them.

3.6.3. Data augmentation

To increase the size of the dataset and add a degree of variation to it, allowing the model to generalize better to unseen data, data augmentation was applied during the training phase. The augmentation techniques used were random rotations, translations, horizontal flips, changes in scale, and shearing. An example of image augmentation is shown in Figure 3.4, where an image is exposed to various augmentation techniques. The transformations and the exact values used during training are given below:

• rotation_range = 30
• zoom_range = 0.15
• width_shift_range = 0.2
• height_shift_range = 0.2

• shear_range = 0.15
• horizontal_flip = True

These settings, together with the model and callbacks described in section 3.6.2, are illustrated in the training sketch at the end of this chapter.

Figure 3.4: Example of image augmentation. An image is subject to different transformations, generating 8 new, slightly altered images. (a) Original image. (b) Generated images after different augmentation techniques.

3.6.4. Model evaluation

During the training process, the progress of the training and its performance were visualized. The history object of the training made it possible to visualize the progression of accuracy and loss across epochs. The model's accuracy, precision, recall, and F1-score, both overall and for the individual classes, were calculated by evaluating the classification model's predictions on the test data. Based on these values, a confusion matrix could also be created. The training process continued until the maximum number of epochs was reached or the model performance stopped improving. The hyperparameters of the model were fine-tuned to achieve a model with at least 90% accuracy; the final values are those reported above. To demonstrate the effect of transfer learning, the same model and hyperparameters were used to train the network without any pre-trained weights. This results in random initialization of the weights, which are then updated by the backpropagation algorithm.
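The following is a minimal, hedged sketch of how the training setup described in sections 3.6.2 to 3.6.4 could look in Keras. The thesis does not list its code, so the hidden layer size of the custom dense head, the checkpoint file name, and the generator wiring are assumptions; the pre-trained backbone, frozen layers, dropout rate, optimizer, batch size, callbacks, and augmentation values follow the text.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# VGG16 backbone pre-trained on ImageNet, without the fully connected top layers.
base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))
for layer in base.layers[:16]:           # freeze the early layers, as described in section 3.6.2
    layer.trainable = False

# Custom classifier head: two dense layers and one dropout layer.
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)     # hidden size 256 is an assumption
x = Dropout(0.2)(x)
out = Dense(2, activation="softmax")(x)  # two classes: person and vehicle
model = Model(base.input, out)

model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Data augmentation with the values listed in section 3.6.3.
augment = ImageDataGenerator(rotation_range=30,
                             zoom_range=0.15,
                             width_shift_range=0.2,
                             height_shift_range=0.2,
                             shear_range=0.15,
                             horizontal_flip=True)

# Early stopping on validation loss (patience 8) and checkpointing of the best model.
callbacks = [EarlyStopping(monitor="val_loss", patience=8, restore_best_weights=True),
             ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True)]

# x_train, y_train, x_test, y_test are assumed to hold the 32x32 images and one-hot
# labels from the stratified 5:1 split described in section 3.6.2.
# model.fit(augment.flow(x_train, y_train, batch_size=32),
#           validation_data=(x_test, y_test),
#           epochs=100, callbacks=callbacks)
```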

4. Results

This chapter presents the results of the work. This includes the quantitative performance evaluation of the proposed moving object detector for the individual sensors as well as for the low-level region-based fusion. The results of the classification model's predictions for a set of still images are also presented in this chapter.

4.1. Moving object detector

The purpose of this thesis is to see how the low-level fusion affects the outcome of the moving object detector in comparison with the individual sensors. Figure 4.1 shows an example of the qualitative results of the proposed detector in the visible, infrared, and fused spectrum, along with the ground-truth masks for motion segmentation, for a set of frames in the Close person (CP) video sequence. The compiled results of the moving object detector, including its quantitative data, are presented in Table 4.1. For each video stream and sensor, the number of true positives, false positives, and false negatives, along with the calculated precision, recall, and F1-score, is shown for the visual, infrared, and fused methods.

Table 4.1: Detection results of the moving object detector in the visible, infrared, and fused spectrum for the three video sequences.

Dataset         Sensor   TP   FP   FN   Precision   Recall   F1
Close person    Visual    9   20    9   0.310       0.500    0.383
                IR       11    1    8   0.917       0.579    0.710
                Fusion   17    5    2   0.773       0.895    0.829
Parking snow    Visual   45   14   15   0.763       0.750    0.756
                IR       46   11    7   0.807       0.868    0.836
                Fusion   42    4   13   0.913       0.764    0.832
Main entrance   Visual   14    1    2   0.933       0.875    0.903
                IR       15    2    2   0.882       0.882    0.882
                Fusion   15    0    2   1.000       0.882    0.938
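As a check on Table 4.1, the reported precision, recall, and F1-score follow directly from the TP, FP, and FN counts. For the Close person sequence in the visible spectrum, for example:

\[
\text{Precision} = \frac{TP}{TP+FP} = \frac{9}{9+20} \approx 0.310, \qquad
\text{Recall} = \frac{TP}{TP+FN} = \frac{9}{9+9} = 0.500,
\]
\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot 0.310 \cdot 0.500}{0.310 + 0.500} \approx 0.383.
\]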

Figure 4.1: Detection result in visible, infrared, and fused spectrum for the Close person (CP) sequence.

From Table 4.1, it can be deduced that the recall value is lowest in the visible spectrum for all three sequences. The F1-score, the harmonic mean of precision and recall, assumes the highest value for the low-level fusion in the CP and ME sequences, while the infrared sensor gives the highest value in PS, although only marginally. The detection results for frame 156 of the CP video stream, together with the generated bounding boxes, are shown in Figure 4.2. The results of all three detections, in the infrared, visible, and fused spectrum, are reproduced in color frames to clarify and simplify the comparison. In Figure 4.2a, the resulting bounding box in the visible spectrum after motion detection also includes the person's shadow. The predicted bounding box gives an IoU value of 0.21, resulting in one FP and one FN detection. The resulting bounding box in the infrared spectrum, Figure 4.2b, has an IoU value of 0.74, which is above the set threshold and is thus counted as a TP detection. The detection result in the fused spectrum, Figure 4.2c, gives an even higher IoU value of 0.90 and is therefore also counted as a TP.

(a) Detection in visible spectrum, IoU=0.21. (b) Detection in IR spectrum, IoU=0.74. (c) Detection in fused spectrum, IoU=0.90.
Figure 4.2: Visual comparison of detection results in visible, infrared, and fused spectrum, for frame 156 in CP.

The results of the detection in frame 201, also from the CP stream, are shown in Figure 4.3. The result in the visible spectrum, Figure 4.3a, shows a detection resulting in two predicted bounding boxes. One box has an IoU value of 0.72, while the second one has no overlap with the ground truth, resulting in a value of 0.0. The moving object detector hence produces one TP and one FP detection. The detection in the infrared spectrum also results in two bounding boxes. Figure 4.3b shows two predicted bounding boxes with IoU values of 0.14 and 0.43, which results in two FP and one FN detection. In the fused spectrum, shown in Figure 4.3c, the object detector is able to predict a bounding box with an IoU value of 0.78, which is therefore counted as a TP.

(a) Detection in visible spectrum, IoU=0.0, 0.72. (b) Detection in IR spectrum, IoU=0.14, 0.43. (c) Detection in fused spectrum, IoU=0.78.
Figure 4.3: Visual comparison of the ground truth with the detection results in visible, infrared, and fused spectrum, for frame 201 in CP.

The regions detected and extracted by the moving object detector stage result in images of the sizes shown in Table 4.2. A variety of image sizes is represented, with the shorter side of the images ranging from 24 to 136 pixels.

Table 4.2: Cumulative proportion of extracted images of various sizes.

Image size (smaller than)   35   50   60   100   140
Proportion (%)              21   54   82    97   100

4.2. Target classifier

The implemented CNN trained for a total of 35 epochs before the performance stopped improving and training was ended by early stopping. The evaluation metrics accuracy, precision, recall, and F1-score are calculated for the overall as well as the class-wise performance of the classifier. The classification is based on the test dataset, which is unseen by the network during the training and validation process. The resulting metrics are shown in Table 4.3.

Table 4.3: Class-wise and overall performance of the proposed neural network for classification of the test dataset.

Class     Accuracy   Precision   Recall   F1
People    92.0%      0.96        0.87     0.92
Vehicle   92.0%      0.88        0.97     0.92
Overall   92.0%      0.92        0.92     0.92

The model demonstrates an accuracy of 92%. The overall precision, recall, and F1-score are 0.92. The class-wise performance of the network, that is, the number of TP, TN, FP, and FN on the test dataset, can be read from the confusion matrix illustrated in Figure 4.4.

Figure 4.4: Confusion matrix of the class-wise performance of the proposed neural network for classification of the test dataset.

The training and validation accuracy and loss during training of the neural network for 35 epochs are shown in Figure 4.5. Figure 4.5a shows the model accuracy for the training data in red and the validation data in blue. Similarly, Figure 4.5b shows the model loss for the training data in red and the validation data in blue. The value of a model that converges faster towards its final accuracy and loss becomes clear when the same network is trained with randomly initialized weights instead of the ImageNet pre-trained weights. After 35 epochs, this randomly initialized model has a validation accuracy slightly below 80%, whereas the pre-trained model reaches a validation accuracy of 80% after only a few epochs and has converged by epoch 35.

(a) Model accuracy. (b) Model loss.
Figure 4.5: Training and validation accuracy and loss during training of the proposed neural network, pre-trained on the ImageNet dataset.

5. Discussion

This chapter contains discussions of the results presented in the previous chapter. In addition, the method choices made and the approach used in the work are discussed and reflected upon.

5.1. Results

When analyzing the results of the moving object detector, it becomes relatively clear where the respective sensors lose accuracy and what their difficulties are. In the visible spectrum, shadows are a major problem. The effects of detecting shadows are most evident in the Close person video stream. When the shadow of a moving object is attached to the object, the detection result gives a low intersection over union overlap, which generates a false positive detection (and consequently also a false negative detection). Although the detection itself is not incorrect and the object can still be correctly classified in the next stage, it poses problems. When a shadow is instead separated from its moving object, two objects are detected. As a result, the system will unavoidably identify false objects.

For all three sequences in the dataset, the evaluation of the moving object detector in color images yielded the lowest recall value. Recall is directly related to false negatives, which means the system fails to detect objects. For the Close person and Parking snow sequences, the precision values are also low. A low precision corresponds to a system generating false alarms and incorrectly detecting objects. The Main entrance video sequence gives overall good results in all three spectra, with few false positive and false negative detections.

In the infrared spectrum, one noticeable problem that affects the accuracy of object detection is the poor image contrast and the lack of clear silhouettes. As shown in Figure 4.3b, the detection in the infrared spectrum generates two smaller predicted bounding boxes as a result of poor contrast: the infrared sensor is unable to distinguish the whole object from the background. It is also visible in the detection result in Figure 4.1 that the resulting binary mask is not well segmented but is divided into smaller, unconnected foreground areas. The inability to detect the object correctly thus results in the system creating false alarms, because the entire object is not enclosed and cannot be correctly classified.

The proposed neural network used for classification in the system achieved an accuracy of 92.0%. The evaluation considered the precision and recall metrics for the two classes, all of which assumed high values, so the model achieved satisfactory results. The model can distinguish between the two classes with high certainty. Even though the model achieves acceptable results, improvements can still be made. Fine-tuning of the model's hyperparameters, as well as its structure, could be explored further to optimize the model, even though improvements were made along the iterative process. Another notable aspect, though not surprising given the theory section, is the impact of transfer learning. Although the dataset for the specific classification task of the project did not need to be particularly large to achieve its purpose, a noticeable difference is still seen in the time required to train the model. For a system that requires a longer training time, possibly one that should be able to distinguish between multiple classes, this can be assumed to be an even more important aspect.

Something worth mentioning is that the results of both the detection and the classification stage are strongly dependent on the quality of the detection model. The extracted foreground regions from the first stage are fed into the CNN for classification, which means that the accuracy of the recognition depends largely on how good the background subtraction procedure turns out to be. The thesis aimed to improve the detection of moving objects by utilizing information from the two spectra, visual and infrared. A background subtraction method based on simple Gaussian mixture models could thus be used to show improvements in the final detection of the fused data without necessarily having to optimize the background subtraction algorithm. Contributions from this work are insights into the problems and challenges that exist for monitoring applications, both in the visible and infrared spectrum, and into how utilizing multiple sensors and performing low-level fusion can affect the result of moving object detection.

5.2. Method

The discussion of the methodological choices covers the sub-areas data collection, implementation of the moving object detector, and finally the target classifier.

5.2.1. Data collection

In this project, all data, images as well as video streams, was collected from publicly available datasets. The annotations, as well as the motion ground-truth masks, were thus assumed to be correct; no in-depth verification was carried out. A delimitation was made to evaluate the proposed method rather than to implement and evaluate the system itself. At the beginning of the project, the intention was to implement a system that could be used and tested, but due to logistical reasons and delays in the delivery of the equipment to be used, the delimitation had to be made. The data collected from the publicly available datasets was therefore considered sufficient for a general evaluation of the system. For a system that is to be optimized for specific objectives and environments, the quality and relevance of the data are of course of greater importance. In that case it becomes important to collect data, optimize the object detector, and train the neural network classifier on images from the same context, to enhance the final results and efficiency of the implemented system.

5.2.2. Moving object detector
As previously mentioned, the object detector used in this project is based on background subtraction and Gaussian mixture models. After reviewing possible methods and approaches, background subtraction and a Gaussian mixture-based background/foreground segmenta-

References
