Towards Robust Localization

Deep Feature Extraction with Convolutional Neural Networks

Erik Carlbaum Ekholm

Engineering Physics and Electrical Engineering, bachelor's level 2020

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Declaration

I hereby declare that this bachelor's thesis has been written solely by me. Any material used from the literature and books has been referenced with the corresponding annotations.

Luleå, August 13, 2020 Erik Carlbaum Ekholm


Acknowledgments

I want to thank my supervisors Sina and Christophoros for their excellent patience and their ability to point me in a legitimate and promising direction. Their experience and knowledge within the field have been of great help and inspiration throughout the process.

I also wish to thank Sina for the patience to entertain my ideas and questions as a curious first-year student at LTU.


Contents

1 Introduction
1.1 Thesis objective and problem definition
1.2 Delimitations
1.3 Methodology

2 Related works

3 Theory
3.1 Sensors
3.1.1 RGB Camera
3.1.2 Thermal Imaging Camera
3.1.3 LiDAR
3.2 Overview of Convolutional Neural Networks
3.2.1 Feature maps from activation layers
3.2.2 Structure of AlexNet
3.2.3 Structure of VGG-16
3.3 Traditional feature extraction

4 Method of deep point feature extraction
4.1 Feature point extraction
4.2 Feature matching
4.2.1 Image registration
4.2.2 Point alignment and matching

5 Measurement of channel quality
5.1 Tracking strength
5.1.1 Vector field interpolation
5.1.2 Performance evaluation with vector fields
5.2 Choice of evaluation footage
5.3 Noise tolerance
5.4 Compilation of measurement data
5.4.1 Selection of channels

6 Data Fusion
6.1 Thermal fusion
6.2 LiDAR fusion

7 Implementation, evaluation and results
7.1 Structure
7.1.1 Computational performance
7.2 Extraction with and without noise
7.3 Extraction and tracking compared to other methods
7.4 Conclusions and Future work

Bibliography


List of Figures

1.1 Frames from autonomous Micro Aerial Vehicle (MAV) footage from Mjölkudden mine [1, 3]; (a) dust in illuminated air; (b) dynamic range issues; (c) sudden movements

3.1 A generated example of an Artificial Neural Network (ANN) structure [10]

3.2 A generated example of a Convolutional Neural Network (CNN) structure [10]

3.3 Feature maps from pre-trained AlexNet. (a) Source image [11]; (b) relu1, channel 13; (c) relu2, channel 65; (d) relu3, channel 93

4.1 (a) MAV footage frame from Mjölkudden mine; (b) feature map of source image from (a) extracted from VGG-16, layer relu2_2, channel 127

4.2 (a) Binarized feature map from figure 4.1(a); (b) binarized feature map from (a) with extracted feature points (green) overlaid

4.3 (a) Two consecutive feature maps, current (magenta) and previous (green), overlaid for display purposes; (b) extracted point lists from feature maps in (a)

4.4 (a) Feature maps from figure 4.3(a) partitioned into a 3x3 array; (b) aligned feature map partition array from (a)

4.5 (a) Aligned points from figure 4.3(b); (b) vectors between matched points overlaid on figure 4.3(a)

5.1 Interpolation of vector field from figure 4.5(b) with boundary of confidence

5.2 (a) Feature map from VGG-16 relu3_1, channel 120 along with matching vectors; (b) discrete vector field from (a) overlaid on interpolated vector field from figure 5.1

5.3 (a) Frame 2/20; (b) frame 12/20; (c) frame 20/20

5.4 Feature maps of figure 5.3(c) extracted from reference channels. VGG-16 (a) relu2_2 channel 104; (b) relu2_2 channel 4; (c) relu3_2 channel 90; (d) relu3_1 channel 120

5.5 (a) Frame from mine [1, 3]; (b) frame with disturbances succeeding (a); (c) feature map (green) of (a) overlaid on the frame (magenta); (d) feature map (green) of (b) overlaid on the frame (magenta). Extracted from VGG-16, layer relu3_2, channels 90, 84, 107

5.6 (a) Source image; (b) source image with Gaussian noise; (c) feature map of (a); (d) feature map of (b). Extracted from VGG-16, layer relu3_2, channel 144

5.7 Example feature maps from excluded channels. (a) High activation (21%), VGG-16 relu3_1 channel 138; (b) low activation (1.48%), VGG-16 relu2_1 channel 2; (c) high vector comparisons (324), AlexNet relu1 channel 57; (d) low vector comparisons (28), AlexNet relu2 channel 98

6.1 Frame 4005, rgb and fir from [6]. (a) Raw Red-Green-Blue (RGB) image; (b) raw thermal image; (c) cropped RGB and warped thermal image overlapped

6.2 (a) Cropped RGB frame; (b) contrast adjusted and color encoded thermal frame; (c) fusion between (a) and (b); (d)-(f) feature map of (a)-(c) respectively from VGG-16, relu3_3, channel 77

6.3 (a) Projected down-sampled 3D point cloud onto a virtual 120 degree FOV sensor at the origin, pts; (b) interpolation of points in (a). Color encoding of depth using JET color map and adjusted to between [1, 10] meters

7.1 (a) Image frame from MAV; (b) Gaussian noise with mean of 0.01 and variance of 0.05 added to (a); (c) feature map of (a) extracted at AlexNet relu1 channel 90; (d) feature map of (b) extracted at same location as (c)

7.2 Discrete vector field showing context of movement from AlexNet relu1, channel 95

7.3 (a) Image frame of MAV; (b) Gaussian noise with mean of 0.01 and variance of 0.05 added to (a); (c) feature map of (a) extracted at AlexNet relu1 channel 90; (d) feature map of (b) extracted at same location as (c)


List of Tables

5.1 Deviation (Dev) and noise tolerance (PSNR) of channels in VGG-16, layers relu1_1 to relu3_3, and AlexNet relu1. Lowest 10 deviations presented, sorted ascending

7.1 Extraction time of feature maps from layers in AlexNet and VGG-16

7.2 Typical computation time for Point Feature Detection (featureDetect), Coarse Image Registration (imageReg) and Closest Point Matching (pair), lines 3, 8 and 9 of algorithm 1, respectively. Total: the typical time from line 3 to 9, including all intermediate steps

7.3 Comparison between our method and KAZE with and without addition of Gaussian noise


Acronyms

ANN Artificial Neural Network.

CNN Convolutional Neural Network.

FFT Fast Fourier Transform.

FoV Field of View.

FPS Frame Per Second.

ICP Iterative Closest Point.

IMU Inertial Measurement Unit.

LiDAR Light Detection and Ranging.

MAV Micro Aerial Vehicle.

MSF Multi Sensor Fusion.

PSNR Peak Signal to Noise Ratio.

Radar Radio Detection and Ranging.

ReLU Rectified Linear Unit.

RGB Red-Green-Blue.

SLAM Simultaneous Localization and Mapping.

Sonar Sound Detection and Ranging.

TIC Thermal Imaging Camera.

UAV Unmanned Aerial Vehicle.

VO Visual Odometry.


Abstract

The ability of autonomous robots to localize themselves in their environment is crucial, and tracking the change of features in the environment is key for vision-based odometry and localization. However, when shifting into rough environments of dust, smoke and poor illumination, as well as the erratic movements common in MAVs, that task becomes substantially more difficult. This thesis explores the ability of deep classifier CNN architectures to retain detailed and noise-tolerant feature maps of sensor-fused images for feature tracking in the context of localization. The proposed method enriches the RGB image with data from thermal images; the fused image is fed into AlexNet or VGG-16 and a feature map is extracted at a specific layer. This feature map is used to detect feature points and to pair feature points between frames, resulting in a discrete vector field of feature change. Preliminary complementary methods for the selection of channels are also developed.


Chapter 1

Introduction

Localization in low-visibility and harsh environments is a complicated issue in the area of autonomous vehicles and is of great importance if one wishes to expand their area of operation. Interest in Micro Aerial Vehicles (MAVs) for such tasks has grown due to their ability to reach dangerous and inaccessible areas [1], and they can help improve safety for human workers in mines, construction sites and other high-risk environments such as firefighting and disaster management. Streamlining of autonomous systems also reduces the resources required to perform a desired task compared to a human-based alternative and may replace or aid vehicles in some parts of industry such as logistics, agriculture and health care, which is favorable from a sustainability perspective [2]. Still, the production of MAVs and their components has an environmental impact and must not be disregarded.

For optimal resource use, efficient recycling schemes for these types of components should preferably be established. However, the need for this technology is clear, and a prerequisite for the advancement and safe integration of robotics and autonomous systems in society and industry is robust localization of the agent in its surrounding environment, a task that is not trivial.

Systems based on sensors that work properly in a clean setting have a difficult time transitioning into atmospheres of dust, smoke or fog. One problem lies in the attenuation and scattering properties of such conditions for light frequencies overlapping with the visual spectrum used by common Red-Green-Blue (RGB) camera sensors, resulting in a grainy and crude image that is poor in useful information. Figure 1.1(a) shows illumination in a dusty environment which obscures the surrounding contours. Illumination can also cause blown-out exposure, resulting in loss of information in the dynamic range of the sensor, which leads to a grainy image lacking useful contrast, see figure 1.1(b). Another issue occurs when the camera moves significantly during the exposure time. Figure 1.1(c) shows the type of blurred and smeared frame that results, common in MAV footage where sudden balancing movements are often performed.


Figure 1.1: Frames from autonomous MAV footage from Mjölkudden mine [1, 3]; (a) dust in illuminated air; (b) dynamic range issues; (c) sudden movements

One method to minimize the impact of a harsh atmosphere is to fuse information from different types of sensors that complement each other. Both the Thermal Imaging Camera (TIC) and Light Detection and Ranging (LiDAR) operate in the infrared spectrum, which penetrates dust and haze better than visible light. Radio Detection and Ranging (Radar) technology has proved itself within the nautical and aerospace environment, and Sound Detection and Ranging (Sonar) technology underwater. Multi Sensor Fusion (MSF) is an active area of research and there are a multitude of ways to combine different geometric descriptions into a single coherent and useful multi-modal format.

The area of classification of objects using deep Convolutional Neural Network (CNN) architectures has developed robust methods of extracting the geometric essence of multi-modal (RGB) images for the purpose of accurate classification. With the introduction of architectures such as AlexNet [4] and VGGnet [5] in the last decade, the area of classification has vastly improved. These CNN structures have shown great success at the classification of objects from the ImageNet dataset of 1000 categories containing a total of 1.4 million images. Despite the resolution of the images as well as the shapes and angles of the objects varying greatly, along with the presence of noise and lens distortions, the CNNs are nonetheless able to extract the essence of the geometric structure corresponding to the class of the object.

For localization on images and Visual Odometry (VO), the detection and tracking of features is key. There is no strict definition of a feature, but for the purpose of this thesis, a feature is a point on an image relating to a detail at a stable geometric point in the environment that can be traced through frames. This information is useful as it gives context to the change of the environment relative to the sensor and can be used to calculate the orientation and movement of the autonomous vehicle in relation to the shape of its surroundings. With dense and accurately tracked feature points, one can describe a vector field of change between the images, which transitions into the area of optical flow.

There are a multitude of ways to extract features, ranging from pixel intensity analysis to blob detection. During the process of classification of an image, a classifier CNN calculates feature maps of different levels of abstraction. These levels range from pixel-by-pixel analysis to the shape of large geometric structures. Besides the ability for high-level abstraction, the feature maps are also extracted on multi-modal (RGB) images and may thus be useful in the area of robust localization.

1.1 Thesis objective and problem definition

How MAVs and other autonomous vehicles localize themselves in harsh environments where sensor inputs are suppressed is a core problem if one desires to expand their areas of operation. Can multi-sensor based visual odometry methods be made robust and fast enough to be used for the purpose of localization?

This thesis wishes to explore the multi-modal geometric abstraction property of deep classifier CNN architectures in order to improve the robustness of feature extraction and tracking for the purpose of localization in low-visibility environments which may be poor in features. The sensor types composing the different modes will be RGB camera, TIC, and LiDAR.

1.2 Delimitations

No training of CNNs will occur; this thesis will use the pre-trained versions of AlexNet and VGG-16 provided by MATLAB and the built-in tools for working with CNNs. Implementation of the method will be made in MATLAB for convenience's sake. Sensor fusion will be evaluated on individual frames, and manual assessment of the feature map of the fused frames will be done without full tracking of features. No localization can be performed because no ground truth data is available for the dataset used; only the feature tracking for the purpose of localization will be explored.

1.3 Methodology

The dataset we will be working with is autonomous MAV footage from Mjölkudden mine [1, 3], which is a good representative of the difficulties of localization in rough environments, along with an RGB and TIC dataset from Japan [6].

For the extraction of feature maps, the state-of-the-art AlexNet [4] and VGG-16 [5] will be used and explored. Besides being well established and thoroughly examined, these CNNs produce feature maps of different characteristics and processing times and will complement each other.

The multi-modal point feature extraction and tracking method and its complementary processes will be implemented in MATLAB R2019b, but the method itself is general.

Feature extraction and tracking will be evaluated on RGB footage and compared with other state-of-the-art feature extraction and tracking methods. Evaluation of RGB-TIC fusion for feature detection will be done on individual frames and will not be a full evaluation.


Chapter 2

Related works

The authors in [7] developed a fast method for depth and velocity estimation, called FLIVVER, by combining acceleration data from an Inertial Measurement Unit (IMU) sensor with a strengthening of the temporal and spatial analysis of the optical flow vector fields. The authors used multiple optical flows from alternating input images along with spatial pooling to obtain velocity estimates of different sub-regions. This data is processed and reinforced with acceleration data to build an estimated depth map. Their method proved to work well and efficiently, with most of the computation time spent in the CNN; however, the footage used for evaluation was noise free and the relative movement between images large and articulate, leaving the performance in a non-controlled and noisy environment unclear.

To improve perception in hazardous environments of smoke and dust for firefighters and robots, [8] fused Radar, LiDAR and TIC for Simultaneous Localization and Mapping (SLAM) of the environment and localization of hotspots. LiDAR and Radar data were fused for localization and mapping, and LiDAR data was projected onto the TIC to map the hot spots. The method performs the sought-after task well and was able to navigate indoor environments with poor visibility and map hot spots.

For improved pedestrian identification for autonomous automotive systems, [9] trained a network for fusion between TIC and RGB images to output natural images with enhanced contours around pedestrians. This fused image, with enhanced detail in regions with pedestrians, was fed into VGG-16 for identification, and the authors found great improvement in night images.


Chapter 3

Theory

3.1 Sensors

3.1.1 RGB Camera

The RGB camera is a well-established technology that detects the ambient light in the visual spectrum with a CMOS sensor and encodes it into a three-modal image, representing a red, green and blue light map respectively. The wide availability and high quality of this type of sensor makes it a basis for localization. In the context of localization in harsh conditions, this type of sensor has the problem of being dependent on the illumination of its surroundings. In the presence of dust, haze, smoke or other low-visibility atmospheres, the illumination of the surrounding environment experiences problems of light scattering and attenuation. This reduces the range of visibility substantially and makes the problem of localization an even more difficult one.

3.1.2 Thermal Imaging Camera

The TIC operates in the low infrared spectrum, detecting heat radiation in the environment. This information is useful in the context of robust localization as it persists in low light and penetrates haze better than light in the visual spectrum. When thermal imagery is used for localization, the intensity of the heat signal is used to outline the contours of the environment, along with features in the heat signal which can be traced to indicate movement. Compared to RGB, TIC usually suffers from lower resolution along with more grain.

3.1.3 LiDAR

LiDAR measures the distance from the sensor to specific points with infrared lasers, establishing fixed points in $\mathbb{R}^3$ which constitute a point cloud. The density of points is dependent on the longitudinal and latitudinal angle of change and decreases with distance. Due to the laser operating in the infrared spectrum, as well as the properties of an ideal Gaussian wave beam, it is not as badly affected by scattering and attenuation as diffuse light in the visual spectrum. LiDAR also produces the light which it detects, as opposed to an RGB camera, eliminating the need for ambient illumination, marking this sensor type as ideal for harsh environments.

3.2 Overview of Convolutional Neural Networks

An Artificial Neural Network (ANN) is a versatile data structure which allows the outputs of the ANN to be fitted to a range of inputs by the method of iterative training. Figure 3.1 shows an example structure of a traditional ANN. The input is typically an array represented as nodes fed into the network. Each consecutive layer is a weighted sum of the nodes from the previous layer, where the weights are iteratively trained. Each layer results in a higher order of processing and abstraction of the input, culminating in an output trained according to one's desire. Depending on the nature of the problem, the number of layers as well as nodes per layer are design choices resulting in differences in flexibility as well as computation and training time.

Figure 3.1: A generated example of an ANN structure [10]

The Convolutional Neural Network (CNN) is an artificial neural network structure designed for the input of arrays and matrices such as images. Instead of nodes, the layers consist of an array of images fed through various activation functions. The connectivity present in an ANN is replaced with trained kernels in a CNN, which act as filters that are applied to the previous layer. This type of activation layer is called a convolutional layer and is the source of the versatility seen in CNNs. The structures of CNNs vary greatly, but the simplest structure consists of a series of pre-processing, interconnected convolutions, and post-processing, see figure 3.2.

Convolutional layers are often pre-processed by a pooling layer, which is a linear down-sampling activation function that reduces the image size and is used to compress the spatial information. A common pooling variant is max pooling, which works by partitioning the input into smaller sections; its output is an image where each pixel represents the respective partition and attains the maximum pixel value of that partition.

Post-processing is often done by a Rectified Linear Unit (ReLU) layer. The ReLU activation function is a rectifier removing the negative pixels by assigning them a value of zero.

Further activation functions are used for the purpose of classification, such as fully connected layers and soft-max layers.

Figure 3.2: A generated example of a CNN structure [10]

3.2.1 Feature maps from activation layers

The activation layers in a CNN are the outputs of activation functions and consist of an array of images; a specific image in the array is referred to as a channel of the activation layer. With the Deep Learning Toolbox, MATLAB provides pre-trained CNNs and the ability to interact with the network. When extracted from the CNN, these channels output a feature map corresponding to a series of filterings of the input image. The feature maps vary widely, and the properties of the filters include enhancing, selecting, removing or accentuating features or visual information from the previous layer. Figure 3.3 shows three feature maps extracted at different locations in a pre-trained version of AlexNet. In each consecutive layer of a classifier CNN's structure, the number of channels increases and the size of the feature map decreases, which means that the deeper you extract a feature map, the lower the resolution but the higher the level of abstraction.

When retrieving feature maps from an activation function, every channel in the activation is retrieved, leaving you with the ability to use all feature maps from that activation function at no extra computational cost; the consequence of this will be discussed in chapter 5.4.1.


Figure 3.3: Feature maps from pre-trained AlexNet. (a) Source image [11]; (b) relu1, channel 13; (c) relu2, channel 65; (d) relu3, channel 93
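A minimal MATLAB sketch of this extraction step with the Deep Learning Toolbox is shown below, assuming the pre-trained AlexNet support package is installed; the file name, layer and channel are purely illustrative:

```matlab
% Extract one channel of an activation layer as a gray-scale feature map.
net = alexnet;                                      % pre-trained classifier CNN
I0  = imread('frame.png');                          % hypothetical input frame
I0  = imresize(I0, net.Layers(1).InputSize(1:2));   % match the 227x227x3 input size
act = activations(net, I0, 'relu3');                % all channels of the chosen layer
map = mat2gray(act(:,:,93));                        % channel 93 as a feature map in [0, 1]
imshow(imresize(map, size(I0, [1 2])))              % up-sample for display
```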

3.2.2 Structure of AlexNet

AlexNet [4] takes an input image of 227 × 227 × 3, which is fed through a series of convolutional and max-pooling down-sampling layers.

The first convolutional layer has 96 kernels of 11 × 11 and stride 4, so the resulting image size is 55 × 55 × 96. The image is down-sampled in a max-pooling of 3 × 3 and stride 2 and fed through a convolutional layer of 256 kernels of 5 × 5 and padding 2, so the next resulting image size is 27 × 27 × 256. The image is then fed through max-pooling of 3 × 3 and stride 2 and a convolutional layer of 384 kernels of 3 × 3 and padding 1; the resulting image size is 13 × 13 × 384. The image is then fed through two convolutional layers, first of 384 kernels of 3 × 3 and padding 1, then 256 kernels of 3 × 3 and padding 1, and the resulting image size is thus 13 × 13 × 256. Here the image is down-sampled in a max-pooling of 3 × 3 and stride 2, resulting in a 6 × 6 × 256 image, followed by two fully connected layers of size 1 × 1 × 4096 and one soft-max of size 1 × 1 × 1000.

3.2.3 Structure of VGG-16

VGG-16 [5] has a higher classification success rate and is more GPU demanding than AlexNet, which can partly be attributed to the introduction of more convolutional layers before each down-sampling. The kernel size, stride and padding of each convolutional layer are fixed at 3 × 3 with a stride of 1 and padding 1, which does not change the image size. The reduction in image size occurs at the max-pooling layers, which are 2 × 2 with stride 2 and also fixed.

The CNN takes a fixed input image of size 224 × 224 × 3, which is fed through two convolutional layers with 64 kernels, so the resulting image size at the respective layers is 224 × 224 × 64. Next the image is fed through max-pooling and two convolutional layers with 128 kernels, so the image size is 112 × 112 × 128. The image is then fed through max-pooling and three convolutional layers with 256 kernels, so the image size is 56 × 56 × 256. The image is down-sampled and fed through another set of three convolutional layers with 512 kernels, resulting in an image size of 28 × 28 × 512. Lastly the image is down-sampled and fed through one more set of three convolutional layers with 512 kernels, resulting in an image size of 14 × 14 × 512, followed by down-sampling to 7 × 7 × 512, fully connected layers of 1 × 1 × 4096 and a soft-max of size 1 × 1 × 1000.
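A quick way to inspect these layer names and per-layer activation sizes in MATLAB (assuming the VGG-16 and AlexNet support packages are installed) is sketched below; the names printed by this listing are the ones referred to throughout this thesis (relu1_1, relu2_2, relu3_3, and so on):

```matlab
% List the layer names of the pre-trained networks used in this thesis.
net = vgg16;                  % or: net = alexnet;
disp({net.Layers.Name}')      % e.g. 'relu1_1', 'relu2_2', 'relu3_3', ...
analyzeNetwork(net)           % interactive overview of per-layer output sizes
```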

3.3 Traditional feature extraction

The Harris corner detection method [12] uses gray-scale images and obtains structure tensors from the spatial derivatives of the image to find corners. Shi–Tomasi [13] scores corners of gray-scale images by the minimal eigenvalue of the structure tensor. MSER [14, 15] (maximally stable extremal regions) detects extremal regions of gray-scale images to find features. FAST [16] (Features from Accelerated Segment Test) uses a circle of 16 pixels to decide whether a certain area of a gray-scale image is a corner. KAZE [17] operates in a nonlinear scale space to detect features in gray-scale images.


Chapter 4

Method of deep point feature extraction

In this method, feature detection is applied to feature maps extracted from various pre-trained classifier CNNs rather than to the source images themselves. More specifically, feature maps from the ReLU layers are used, as they are more feature rich than the preceding convolutional layers. An appropriate channel at a specific layer provides a detailed and noise-tolerant feature map that activates on specific geometric structures which may be difficult to obtain with regular methods of filtering.

4.1 Feature point extraction

The gray-scale feature maps extracted from the CNN have a binary characteristic, where areas of activation are often concentrated and bright. Figure 4.1 shows (a) a frame of autonomous MAV footage and (b) a feature map of the frame. This opens up the possibility of extracting points around the areas of activation by means of image binarization. This allows a single well-defined point to be extracted around a feature, as opposed to clusters, which is useful as the data will be clear and will make the upcoming task of matching easier.


Figure 4.1: (a) MAV footage frame from Mjölkudden mine; (b) feature map of source image from (a) extracted from VGG-16, layer relu2_2, channel 127

A feature map is binarized with a threshold chosen according to the Otsu method [18], a dynamic thresholding method that maximizes the inter-class variance between the background (black) and foreground (white) classes. The result is a set of connected components of white pixels, each representing a feature. The connected components are detected using a flood-fill algorithm that finds and labels the connected white areas,

$$\mathrm{cmp}_i : \left\{ PI_{i,1},\ PI_{i,2},\ \ldots,\ PI_{i,n_i} \right\}, \qquad (4.1)$$

where $\mathrm{cmp}_i$ is component $i$ of $n$ components with $n_i$ pixel indices. The set of pixel indices is converted to a set of $x$ and $y$ coordinates, and the feature point is defined to be the center of those points,

$$R_i = \frac{1}{n_i} \sum_{j=1}^{n_i} r_{i,j}, \qquad (4.2)$$

where $r_{i,j}$ is the respective $x$ and $y$ coordinate of pixel index $j$ for component $i$, and $R_i$ is the $x$ and $y$ coordinate of component $i$. The feature points can thus be described as a set of coordinates that represent areas of interest,

$$\mathrm{pts} : \left\{ R_1,\ R_2,\ \ldots,\ R_n \right\}. \qquad (4.3)$$

Figure 4.2 shows (a) the binarized feature map and (b) the feature points overlaid on the binarized feature map. A problem with this method is that larger connected structures of specific shapes, such as L-shaped features, have a center of mass that does not overlap with the location of the feature. Long thin features also have a proclivity to split and join, resulting in uncertainty in the position and number of feature points around that area. Additionally, to minimize the extraction of artifacts and insignificant features, an arbitrary area threshold of 20 connected pixels per feature is chosen.
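A minimal MATLAB sketch of this detection step is given below, assuming a gray-scale feature map `map` scaled to [0, 1]. The function name featureDetect matches the label used in algorithm 1, but the implementation details here are only one possible realization:

```matlab
function pts = featureDetect(map)
% featureDetect  Extract point features from a gray-scale feature map in [0, 1].
%   Binarizes with an Otsu threshold, labels connected components (flood-fill
%   based), discards components smaller than 20 pixels, and returns the
%   centroid of each remaining component as one [x y] row per feature point.
    bw    = imbinarize(map, graythresh(map));   % Otsu thresholding
    cc    = bwconncomp(bw);                     % connected white components
    stats = regionprops(cc, 'Centroid', 'Area');
    keep  = [stats.Area] >= 20;                 % arbitrary 20-pixel area threshold
    pts   = vertcat(stats(keep).Centroid);      % n-by-2 list of feature points
end
```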


Figure 4.2: (a) Binarized feature map from figure 4.1(a); (b) Binarized feature map from (a) with extracted feature points (green) overlaid

4.2 Feature matching

In order to get a perception of the relative movement of the environment, a matching between feature points in two consecutive images has to be done. This yields a discrete vector field of movement which is useful for localization.

The task of matching feature points from two consecutive images is not trivial. Calculating the transformation between two patterns of points is difficult, especially when including noise in the point positions as well as points being introduced and removed at different locations. Figure 4.3 shows (a) two consecutive feature maps and (b) the respective sets of points. Algorithms for such tasks exist, like Iterative Closest Point (ICP), which iteratively morphs the points to find the transformation matrix that minimizes the difference between the two sets. However, in this case we have access to the feature maps, which are a much richer source of information as they provide the whole context of the transformation, as seen in figure 4.3(a); the problem thus shifts into the area of image registration.


Figure 4.3: (a) Two consecutive feature maps, current (magenta) and previous (green) overlaid for display purposes; (b) Extracted point lists from feature maps in (a)

4.2.1 Image registration

When overlapping two consecutive feature map images, a relative translation of features occurs, as can be seen in figure 4.3(a). Given stable features in the environment, their projected positions onto a 2-D sensor such as an RGB camera or TIC will change due to translation and rotation of the sensor's coordinate system relative to that of the environment. Image registration within computer vision is the process of approximating the transformation matrix between two images, and there are many methods to achieve a finer or coarser approximation with differences in computation time and tolerance for uncertainty. Compensating for such a transformation is a good start in order to achieve accurate matching between the two sets of feature points.

Image registration of two consecutive feature maps extracted from a deep CNN has proven to be difficult. Problems arise from the morphing of the structure of the feature image: aside from noise, features might move, split, shift, disappear and appear.

Methods provided by MATLAB, such as imregcorr() from the Image Processing Toolbox, have been too slow and unreliable for these types of inputs. A coarse image registration that is fast and can tolerate the differences and uncertainty which can arise between two feature frames has to be implemented.

The method proposed is a partitioning of the images into equally sized sections where uniform translation is assumed,

$$\left\{ P_{1,1},\ P_{1,2},\ \ldots,\ P_{1,m} \right\} = \mathrm{partition}(I_1, m), \qquad (4.4)$$

$$\left\{ P_{2,1},\ P_{2,2},\ \ldots,\ P_{2,m} \right\} = \mathrm{partition}(I_2, m), \qquad (4.5)$$

where $I_1$ and $I_2$ are the current and previous feature maps and $P_{im,k}$ is sub-image $k$ from feature map $im$. The translation offset vector $v_k$ in each partition $k$ is determined by calculating the phase correlation between the images $P_{1,k}$ and $P_{2,k}$ using the Fast Fourier Transform (FFT) [19]. Figure 4.4 shows (a) two consecutive feature frames overlaid and partitioned into sub-images and (b) the partitions aligned using the alignment vectors. As can be observed, the assumption of uniform translation appears to be an adequate approximation. The offsets are composed into a list of vectors describing the respective translation offset for each partition and can be used for alignment, see figure 4.4(b).


Figure 4.4: (a) Feature maps from figure 4.3(a) partitioned into a 3x3 array; (b) Aligned feature map partition array from (a)

The appropriate number of partitions is not initially clear. As can be seen in figure 4.4(b), the simplification of the transformation into sections of uniform translation appears to be an adequate approximation; however, misalignment can still be observed. This misalignment would decrease with a greater number of partitions, but testing reveals an uncertainty originating in the number of features present in a section. When the size of the partitions decreases, the number of features present in each partition also decreases.

This, in combination with a sudden change in direction of the sensor, can cause a situation where most of the remaining features leave the partition while new features enter from a different side. This causes a misrepresentation of the change in the feature structure, leading the phase correlation to assume a translation in a completely different direction and of a much greater magnitude, hence failing to accurately portray the true movement. There is thus a compromise between the resolution of the coarse image registration and the confidence in the calculation. The partitioning was chosen to be fixed at 3 × 3, or m = 9, as the misalignment was considered tolerable and the occurrence of an alignment failure infrequent.
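A sketch of this coarse registration step under the same assumptions (m × m partitioning with pure translation per partition, phase correlation via the FFT) is given below. The function name imageReg matches the label in algorithm 1, while the details are one possible realization:

```matlab
function M = imageReg(I, J, m)
% imageReg  Coarse registration of the current feature map I against the
%   previous feature map J on an m-by-m grid of partitions. Returns an
%   m*m-by-2 list of [dx dy] translation offsets, one row per partition.
    if nargin < 3, m = 3; end
    [rows, cols] = size(I);
    rEdges = round(linspace(1, rows + 1, m + 1));
    cEdges = round(linspace(1, cols + 1, m + 1));
    M = zeros(m*m, 2);
    k = 0;
    for r = 1:m
        for c = 1:m
            k  = k + 1;
            P1 = I(rEdges(r):rEdges(r+1)-1, cEdges(c):cEdges(c+1)-1);
            P2 = J(rEdges(r):rEdges(r+1)-1, cEdges(c):cEdges(c+1)-1);
            % Phase correlation: the peak of the inverse FFT of the normalized
            % cross-power spectrum gives the dominant translation offset.
            R    = fft2(P1) .* conj(fft2(P2));
            R    = R ./ max(abs(R), eps);
            corr = abs(ifft2(R));
            [~, idx] = max(corr(:));
            [dy, dx] = ind2sub(size(corr), idx);
            dy = dy - 1;  dx = dx - 1;
            % Wrap offsets beyond half the partition size into negative shifts.
            sz = size(corr);
            if dy > sz(1)/2, dy = dy - sz(1); end
            if dx > sz(2)/2, dx = dx - sz(2); end
            M(k, :) = [dx, dy];
        end
    end
end
```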

4.2.2 Point alignment and matching

With the array of vectors from the coarse image registration, each point can be associated with an offset, which roughly aligns the two sets of points and makes a closest-point method of matching possible, see figure 4.5(a). Given that the points are approximately aligned, points in the other set further away than an arbitrary distance threshold of 5 pixels are assumed not to be appropriate candidates. This assumption does sometimes fail, for two major reasons. The first is that the sections for alignment assume uniform translation, which is not always the case. The second is that the structure of the activation of a feature may change, causing its mass center to be calculated differently or the feature to split in two. An increase of the distance threshold increases the tolerance of such phenomena but also decreases the confidence that the right points have been matched.

A difference in the number of feature points between the sets may cause an over-determined matching where two points are closest to the same candidate point. This is avoided by always matching the set with the fewest points against the larger set, leaving the surplus points of the larger set unmatched. Figure 4.5(b) shows the two feature maps overlaid along with the vector field of matched points.
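A sketch of this pairing step is shown below, assuming the point sets have already been shifted by their partition's alignment vector; knnsearch from the Statistics and Machine Learning Toolbox is one convenient way to do the closest-point lookup, and the 5-pixel threshold follows the text above:

```matlab
function vectors = pairPoints(ptsPrev, ptsCur, maxDist)
% pairPoints  Match aligned feature points between the previous and current
%   frame by nearest neighbour, rejecting pairs further apart than maxDist.
%   ptsPrev and ptsCur are n-by-2 [x y] lists (ptsCur already alignment-shifted).
%   Returns rows [x y u v]: origin in the previous frame plus motion vector.
%   If ptsCur was shifted for alignment, the partition's alignment vector must
%   be added back to [u v] to recover the full frame-to-frame motion.
    if nargin < 3, maxDist = 5; end
    % Query from the smaller set into the larger one, as described in the
    % text, leaving the surplus points of the larger set unmatched.
    if size(ptsPrev, 1) <= size(ptsCur, 1)
        [idx, d] = knnsearch(ptsCur, ptsPrev);   % nearest current point per previous point
        keep = d <= maxDist;
        from = ptsPrev(keep, :);
        to   = ptsCur(idx(keep), :);
    else
        [idx, d] = knnsearch(ptsPrev, ptsCur);   % nearest previous point per current point
        keep = d <= maxDist;
        from = ptsPrev(idx(keep), :);
        to   = ptsCur(keep, :);
    end
    vectors = [from, to - from];                 % [x y u v]
end
```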


Figure 4.5: (a) Aligned points from figure 4.3(b); (b) vectors between matched points overlaid on figure 4.3(a)

Limitation of the assumption of uniform translation

The assumption of uniform translation within a partition is limiting but works well in many environments. With the introduction of movements disconnected from the stationary environment, such as people, animals and vehicles, the assumption fails, and features will not be matched accurately in a partition experiencing this type of movement. Another issue with the assumption of uniform translation occurs with parallax, where something in the foreground moves faster relative to the background. However, this type of failure only occurs if the relative movement between the frames is of sufficient magnitude. The assumption works well in stationary environments without the presence of intruding foreground objects, like mines, caves and possibly indoor environments devoid of people, and outdoor environments without a great amount of parallax, but this has not been tested.


Chapter 5

Measurement of channel quality

Each channel is trained for a specific behavior to minimize the error of the CNN when classifying objects, resulting in widely varying behavior of channels at different layers inside a classifier CNN. A few of these channels also happen to be good descriptors of the features of the input image and are useful for localization, but far from all. Given the large span of possible channels for each layer, across a multitude of different classification CNNs, manual review of each channel is impractical and time consuming, and this screening process has to be automated.

In order to assess the quality of a channel, the dimensions which constitute quality have to be defined and condensed into a metric so that performance can be compared. There were two major dimensions along which a channel was perceived to be useful: the ability of the channel to tolerate noise, and the ability of the extracted features to be tracked and to accurately represent the movement between frames.

5.1 Tracking strength

When two features are correctly matched they constitute a vector of movement originating at the point from the older set. Expanding to all matched points, one can observe that an approximate sparse discrete vector field of change emerges, as seen in figure 4.5(b).

One way to theoretically assess the accuracy of the tracking would be to calculate the offset from the sparse vector field to a true theoretical vector field of change. Obtaining a true vector field for the footage used for evaluation would not be feasible outside simulation. However, an approximate vector field that describes the movement accurately enough can be used as a reference for the evaluation process by the method of interpolation, reducing the manual screening process substantially. During development and testing, multiple accurate channels were stumbled upon and noted, and these can be used as a basis to attain a vector field representation of the footage used for evaluation. With this method, the performance of one channel is only measured in reference to another, one which is manually chosen to be a good representative of the feature change. A variety of different channels which differ in the types of features extracted can be chosen as references, gaining a more comprehensive description of a channel's performance.

5.1.1 Vector field interpolation

The discrete vector field of change between two subsequent frames can be represented as a list of pixel coordinates along with the vector components in the x and y direction,

$$v_{\mathrm{discrete}} : \left\{ (x_1, y_1, u_1, v_1),\ (x_2, y_2, u_2, v_2),\ \ldots,\ (x_n, y_n, u_n, v_n) \right\}, \qquad (5.1)$$

where $n$ is the number of discrete vectors. This list can be separated into two sublists of three-dimensional scattered points, representing the component of the vectors in the x and y direction at the corresponding coordinate,

$$v_u : \left\{ (x_1, y_1, u_1),\ (x_2, y_2, u_2),\ \ldots,\ (x_n, y_n, u_n) \right\}, \qquad (5.2)$$

$$v_v : \left\{ (x_1, y_1, v_1),\ (x_2, y_2, v_2),\ \ldots,\ (x_n, y_n, v_n) \right\}. \qquad (5.3)$$

A continuous function can be interpolated from each scattered point list using scatteredInterpolant() from MATLAB with the natural neighbor interpolation method,

$$f_u(x, y) = \mathrm{interpolation}(v_u), \qquad (5.4)$$

$$f_v(x, y) = \mathrm{interpolation}(v_v). \qquad (5.5)$$

These results can be composed into a continuous vector field describing the movement between two specific consecutive frames,

$$f(x, y) = \begin{bmatrix} f_u(x, y) \\ f_v(x, y) \end{bmatrix}. \qquad (5.6)$$

Figure 5.1 shows the discrete field from figure 4.5(b) along with its interpolated counterpart. The interpolated vector field's utility only stretches as far as the data it is interpolated upon, so a boundary of confidence has to be determined. This is done by calculating the border that the outermost positions of the vectors from the discrete field constitute and restricting the use of the vector field to within this boundary. In figure 5.1 this boundary would encapsulate the endpoints of the cyan vectors.
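A sketch of this interpolation with scatteredInterpolant is given below, assuming `V` is the n-by-4 list of [x y u v] rows from the matching step; realizing the boundary of confidence as the convex hull of the vector positions is an assumption of this sketch:

```matlab
% Build a continuous vector field f(x,y) = [fu(x,y); fv(x,y)] from the
% discrete matches V = [x y u v], using natural neighbour interpolation.
fu = scatteredInterpolant(V(:,1), V(:,2), V(:,3), 'natural');
fv = scatteredInterpolant(V(:,1), V(:,2), V(:,4), 'natural');

% Boundary of confidence: queries are only trusted inside the hull of the
% discrete vector positions.
hull     = convhull(V(:,1), V(:,2));
inBounds = @(x, y) inpolygon(x, y, V(hull,1), V(hull,2));

% Example query at a pixel coordinate (xq, yq).
xq = 100; yq = 80;
if inBounds(xq, yq)
    vec = [fu(xq, yq); fv(xq, yq)];   % interpolated motion vector at (xq, yq)
end
```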


Figure 5.1: Interpolation of vector field from figure 4.5(b) with boundary of confidence

5.1.2 Performance evaluation with vector fields

The chosen method for calculating the difference between a discrete vector field and a continuous one is to look at the magnitude of the difference between the two vectors.

At a specific coordinate with a vector, $(x_i, y_i, u_i, v_i)$, a complementary vector from the continuous vector field, see equation (5.6), can be subtracted from the discrete one and an error can be calculated,

$$E_i = \left\| \begin{bmatrix} u_i - f_u(x_i, y_i) \\ v_i - f_v(x_i, y_i) \end{bmatrix} \right\|. \qquad (5.7)$$

Note that this calculation is only performed if (xi, yi) lies within the boundary associated with the vector field.

To get a metric for the performance of a channel, this error is calculated for each of the discrete vectors within the boundary of a frame and averaged; this average is then taken over each frame in the evaluation footage,

$$E_{ch} = \frac{1}{m} \sum_{j=1}^{m} \left( \frac{1}{n_j} \sum_{i=1}^{n_j} E_i \right), \qquad (5.8)$$

where $m$ is the number of frames in the evaluation footage, and $n_j$ is the number of discrete vectors within the associated boundary at frame $j$.

Figure 5.2 shows (a) the discrete vector field from another channel and layer and (b) the vector field from (a) overlaid on the interpolated vector field from figure 5.1, along with the boundary of confidence.
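A sketch of the per-channel deviation metric in equations (5.7)-(5.8), reusing the interpolants and boundary test from the earlier sketch, is given below; the variable names (Vframes, fu_ref, fv_ref, inBounds_ref) are illustrative:

```matlab
% Deviation of one channel over the evaluation footage. Vframes{j} holds the
% [x y u v] matches of frame j for the evaluated channel; fu_ref{j}, fv_ref{j}
% and inBounds_ref{j} describe the reference channel's interpolated field and
% boundary of confidence for the same frame pair.
frameErr = zeros(numel(Vframes), 1);
for j = 1:numel(Vframes)
    V  = Vframes{j};
    ok = inBounds_ref{j}(V(:,1), V(:,2));           % inside boundary of confidence
    du = V(ok,3) - fu_ref{j}(V(ok,1), V(ok,2));     % u_i - f_u(x_i, y_i)
    dv = V(ok,4) - fv_ref{j}(V(ok,1), V(ok,2));     % v_i - f_v(x_i, y_i)
    frameErr(j) = mean(hypot(du, dv));              % eq. (5.7), averaged over frame j
end
Ech = mean(frameErr);                               % eq. (5.8)
```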


Figure 5.2: (a) Feature map from VGG-16 relu3_1, channel 120, along with matching vectors; (b) discrete vector field from (a) overlaid on the interpolated vector field from figure 5.1

5.2 Choice of evaluation footage

As noted in chapter 5.1, this type of tracking strength metric is relative to a reference channel deemed to be a good feature extractor, and to a specific sequence of frames. Does this measurement only tell us how well a channel performs on a specific piece of footage, or is it universal?

The first part of the attempt to achieve a universal measurement is to use the interpolation method to measure only the ability to track features; hence, this type of evaluation is done on clear footage with well-defined features that are easy to follow. The second part is to pair this measurement with a measurement of noise tolerance, to gauge how well a channel might handle disturbances.

The evaluation footage was chosen to be 20 frames from MAV footage in Mjölkudden mine [1, 3], which is clear and has many contours which can be followed, see figure 5.3.

Four channels from VGG-16 were chosen as references: relu2_2 channel 104, relu2_2 channel 4, relu3_2 channel 90 and relu3_1 channel 120. Figure 5.4 shows the feature maps from the reference channels.


Figure 5.3: (a) Frame 2/20; (b) frame 12/20; (c) frame 20/20


Figure 5.4: Feature maps of figure 5.3(c) extracted from reference channels. VGG-16 (a) relu2_2 channel 104; (b) relu2_2 channel 4; (c) relu3_2 channel 90; (d) relu3_1 channel 120

5.3 Noise tolerance

Particularly good tolerance of noise has been observed in feature maps retrieved from some channels. For those channels, the ability to extract the geometric shape and contours of images seems to be relatively unaffected by the presence of noise, which is a notably good characteristic for localization in harsh environments that weaken the sensor's signal-to-noise ratio. Figure 5.5 (a) and (b) shows two frames from Mjölkudden mine where (b) has significantly weaker perception of the environment due to dust. Figure 5.5 (c) and (d) shows an example of seemingly noise-tolerant feature maps of (a) and (b) as the combination of three channels from the same layer. Note that combining feature maps from the same layer incurs no extra extraction cost, see section 3.2.1. These channels have a very high tolerance for the presence of these types of visual disturbance, and a method for the screening of this behavior has been implemented.


Figure 5.5: (a) Frame from mine [1, 3]; (b) frame with disturbances succeeding (a); (c) feature map (green) of (a) overlaid on the frame (magenta); (d) feature map (green) of (b) overlaid on the frame (magenta). Extracted from VGG-16, layer relu3_2, channels 90, 84, 107

To measure the noise tolerance of a channel, a clear image containing a variety of different shapes and contours is chosen as the source, $I_0$, to which Gaussian noise is applied, giving $J_0$. The clear image, $I_0$, and the noisy image, $J_0$, are extracted at a desired layer and channel, yielding a noise-free feature map, $I$, and a noisy feature map, $J$. Figure 5.6 shows (a) the source image without noise, $I_0$, (b) the source image with noise, $J_0$, (c) the feature map of the noise-free image, $I$, and (d) the feature map of the noisy image, $J$.

The tolerance of noise is measured by the Peak Signal to Noise Ratio (PSNR) between the noise-free and noisy feature maps,

$$\mathrm{PSNR}(I, J) = 10 \cdot \log_{10}\!\left( \frac{\mathrm{MAX}(I)^2}{\mathrm{MSE}(I, J)} \right), \qquad (5.9)$$

where $\mathrm{MAX}(I)$ is the maximum possible pixel colour intensity of the image and $\mathrm{MSE}(I, J)$ is the mean squared error between the images.


Figure 5.6: (a) Source image; (b) source image with Gaussian noise; (c) feature map of (a); (d) feature map of (b). Extracted from VGG-16, layer relu3_2, channel 144

To attain an accurate measurement, the PSNR was calculated for multiple different applications of random Gaussian noise and averaged, see equation (5.10); n = 10 was deemed to be adequate.

$$NT = \frac{1}{n} \sum_{p=1}^{n} \mathrm{PSNR}\!\left(I, J_p\right). \qquad (5.10)$$

Some channels have a proclivity to activate along the edges of images. To avoid misleading data and retain only that which is useful, 5% was trimmed off along each dimension of the feature map before the comparison.
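A sketch of this noise-tolerance measurement is shown below. The helper extractMap is hypothetical shorthand for the feature-map extraction shown in section 3.2.1, while imnoise and psnr are Image Processing Toolbox routines:

```matlab
% Noise tolerance NT of one channel: average PSNR between the feature map of
% a clean source image I0 and the feature maps of n noisy copies (eq. (5.10)).
n    = 10;
I    = trimEdges(extractMap(net, layer, channel, I0), 0.05);  % noise-free feature map
vals = zeros(n, 1);
for p = 1:n
    J0      = imnoise(I0, 'gaussian', 0, 0.01);               % zero-mean, variance 0.01
    J       = trimEdges(extractMap(net, layer, channel, J0), 0.05);
    vals(p) = psnr(J, I);                                     % peak signal-to-noise ratio
end
NT = mean(vals);

function T = trimEdges(M, frac)
% Remove a fraction frac of the rows and columns at each border of M.
    [r, c] = size(M);
    dr = round(frac * r);  dc = round(frac * c);
    T  = M(dr+1:end-dr, dc+1:end-dc);
end
```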

5.4 Compilation of measurement data

The following data about channels is dependent on the specific training the various CNNs have undergone; thus the content of the data is not as important as the method for generating it.

The tracking strength metric Dev (Deviation) for a channel is the average of equation (5.8) calculated against each of the reference channels, see chapter 5.1. To filter out channels which perform well on the Dev metric but might not perform well on the actual task of tracking, multiple other statistics of the channel are recorded: the number of vectors which are compared to a vector field and the average intensity of the feature map, all averaged over the frames of the evaluation footage. Thresholds for these complementary metrics were determined by manual examination of sample channels.

For pixel intensity, an average between 2% and 9.5% was observed to be optimal. Above 9.5%, the channels began activating on larger areas which have less clearly defined feature points, see figure 5.7(a). Below 2%, the features became small and lost connection to the structure of the environment, see figure 5.7(b). For vector comparisons, an average between 90 and 300 per vector field was observed to be optimal. Above 300, the feature points get cluttered, see figure 5.7(c). Below 90, the vectors do not seem to capture the relative change in the environment, see figure 5.7(d).
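A sketch of this screening filter over the recorded per-channel statistics is shown below, assuming they have been collected into vectors meanIntensity (average fraction of activated pixels), meanVectors (average number of compared vectors per frame) and Dev (deviation per channel); these variable names are illustrative:

```matlab
% Keep only channels whose average activation intensity and average number
% of compared vectors fall inside the empirically chosen windows, then rank
% the survivors by their deviation metric (eq. (5.8)).
keep = meanIntensity >= 0.02 & meanIntensity <= 0.095 & ...
       meanVectors   >= 90   & meanVectors   <= 300;
candidates   = find(keep);
[~, order]   = sort(Dev(candidates));
bestChannels = candidates(order(1:min(10, numel(order))));   % e.g. the 10 best, as in table 5.1
```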


Figure 5.7: Example feature maps from excluded channels. (a) High activation (21%), VGG-16 relu3_1 channel 138; (b) low activation (1.48%), VGG-16 relu2_1 channel 2; (c) high vector comparisons (324), AlexNet relu1 channel 57; (d) low vector comparisons (28), AlexNet relu2 channel 98.

Table 5.1 shows the channels with the 10 lowest deviation scores for VGG-16 and AlexNet when accounting for the thresholds, along with the respective noise tolerance (PSNR) calculated with zero-mean Gaussian noise of variance 0.01. For AlexNet, the only layer that contained channels which extracted more than 90 comparable vectors is relu1, and thus it is the only layer presented in the table. However, layers as deep as relu3 extract fewer features of larger structure and are still useful, but they are excluded in this coarse screening. This indicates that the same thresholds for this type of screening may not be appropriate for different CNNs.


Table 5.1: Deviation (Dev) and noise tolerance (PSNR) of channels in VGG-16, layers relu1_1 to relu3_3, and AlexNet relu1. The lowest 10 deviations are presented, sorted in ascending order.

VGG-16, relu1_1      | VGG-16, relu1_2      | VGG-16, relu2_1      | VGG-16, relu2_2
Ch    PSNR   Dev     | Ch    PSNR   Dev     | Ch    PSNR   Dev     | Ch    PSNR   Dev
64    22.33  2.82    | 10    25.69  2.88    | 119   29.93  2.41    | 4     27.15  1.97
52    24.04  3.03    | 33    26.28  2.90    | 53    25.70  2.45    | 104   17.80  2.26
49    21.70  3.15    | 59    24.14  2.90    | 3     21.55  2.46    | 95    27.74  2.28
48    23.86  3.16    | 35    24.42  2.92    | 75    36.91  2.53    | 18    24.54  2.32
7     23.69  3.16    | 47    23.28  2.95    | 91    26.01  2.62    | 107   34.92  2.41
63    21.63  3.21    | 42    25.14  2.95    | 101   28.44  2.67    | 70    23.24  2.48
36    24.28  3.21    | 49    28.24  2.99    | 10    22.58  2.74    | 127   26.99  2.55
59    23.62  3.23    | 13    21.05  3.03    | 107   19.27  2.76    | 63    24.65  2.55
38    24.03  3.24    | 27    25.79  3.05    | 69    18.54  2.85    | 5     23.71  2.59
27    23.39  3.26    | 45    24.22  3.11    | 95    24.32  2.86    | 36    22.91  2.60

VGG-16, relu3_1      | VGG-16, relu3_2      | VGG-16, relu3_3      | AlexNet, relu1
Ch    PSNR   Dev     | Ch    PSNR   Dev     | Ch    PSNR   Dev     | Ch    PSNR   Dev
120   27.42  1.80    | 90    32.65  1.82    | 14    24.14  2.55    | 85    28.92  1.85
206   24.03  2.18    | 130   24.65  2.38    | 227   26.58  2.63    | 90    33.55  2.02
238   20.01  2.33    | 162   22.85  2.59    | 77    25.34  2.73    | 95    32.79  2.03
31    27.58  2.45    | 249   29.92  2.59    | 169   26.54  2.78    | 96    27.39  2.09
131   23.03  2.51    | 38    23.80  2.60    | 51    22.49  2.81    | 67    32.65  2.14
248   27.46  2.56    | 145   19.85  2.67    | 136   28.74  2.81    | 8     25.34  2.29
186   22.75  2.63    | 142   20.94  2.74    | 254   23.83  2.83    | 86    27.29  2.54
73    25.73  2.63    | 144   19.71  2.75    | 158   20.71  2.86    | 2     27.08  2.58
213   20.97  2.64    | 250   22.36  2.89    | 81    20.70  2.87    | 35    23.37  2.68
128   21.68  2.69    | 170   24.98  2.91    | 163   23.52  2.89    | 65    30.01  2.74

5.4.1 Selection of channels

Every channel has different strengths and excels at tracking in different environments; channels vary in the shapes that trigger activations, the resolution of points and the tolerance of noise. Additionally, the entire layer is always extracted. This lack of extra computational cost opens up the possibility of combining different feature maps that complement each other's weaknesses or strengthen weak signals, see figure 5.5.


Chapter 6

Data Fusion

The CNNs take RGB images as input, which means that all the information from the sensors has to be condensed into the modes corresponding to red, green and blue. Given that the CNNs are also trained on RGB images, as well as the wide availability and high quality of RGB camera sensors, this approach to fusion will use the RGB images as a basis and try to enrich them with information from other sensors.

6.1 Thermal fusion

If an RGB camera and a TIC are mounted in close physical proximity and facing the same direction, a transformation matrix can morph the thermal image onto the RGB image or vice versa. The common way to calculate such a matrix is by camera calibration, where a grid is present in both sensor images and the relative distortion can be calculated. If the images can be accurately overlapped, fusion can ensue.

No suitable footage of TIC and RGB was available for a complete evaluation of a fusion method; however, a dataset from Japan [6] was used to evaluate feature extraction on single frames of fused images. Due to the lack of a calibration matrix, a rough alignment was performed manually, see figure 6.1.


Figure 6.1: Frame 4005, rgb and fir from [6]. (a) Raw RGB image; (b) raw thermal image; (c) cropped RGB and warped thermal image overlapped


The gray-scale thermal image was first adjusted to increase contrast; pixel intensities within [0.3, 0.6] were stretched to [0, 1]. The image was then color encoded with an HSV color map. The fused image is a blend between the RGB image and the color-encoded thermal image, $I_0 = \alpha I_{rgb} + (1 - \alpha) I_{tic}$, with $\alpha = 0.85$.
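A sketch of this fusion step is given below, assuming Irgb is the cropped RGB frame and Itic the aligned gray-scale thermal frame as a double image in [0, 1] of the same size; imadjust, gray2ind and ind2rgb are standard MATLAB / Image Processing Toolbox functions:

```matlab
% Contrast-stretch the thermal image, colour-encode it with an HSV colour
% map and alpha-blend it with the RGB frame (alpha = 0.85 as in the text).
alpha = 0.85;
Iadj  = imadjust(Itic, [0.3 0.6], [0 1]);                % stretch [0.3, 0.6] to [0, 1]
IticC = ind2rgb(gray2ind(Iadj, 256), hsv(256));          % gray -> HSV colour encoding
I0    = alpha * im2double(Irgb) + (1 - alpha) * IticC;   % fused image
imshow(I0)
```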

Figure 6.2 (a)-(c) shows the fusion between the thermal and RGB images, and (d)-(f) shows the feature map corresponding to the image above it. In (d) and (e) one can see that both RGB and TIC are able to provide feature maps with complementary features to one another. As can be seen in (f), the fused image retains the feature content from both component images, providing a richer description of the features in the environment at less computational cost than separate extraction.


Figure 6.2: (a) Cropped RGB frame; (b) Contrast adjusted and color encoded thermal frame; (c) Fusion between (a) and (b); (d)-(f) feature map of (a)-(c) respectively from VGG16, relu3_3, channel 77

Due to the lack of adequate video footage with TIC and RGB cameras and the associated transformation matrix, no thorough examination of the fusion could be done. The specific values of the contrast adjustment and the blend between RGB and TIC worked well on a select few frames, but further analysis has to be done for greater confidence in this method of fusion.

6.2 LiDAR fusion

Methods of LiDAR fusion were explored but not implemented. The initial idea of LiDAR fusion was to retrieve a depth map of the environment by interpolation of a projected 3D point cloud from a LiDAR. This depth map was believed to contain information about the features in the environment and would be fused with the thermal and RGB images. The assumption of the presence of features turned out to be false.

The data used was LiDAR data from a mine [20]. The 3D point cloud was projected onto a virtual 120 degree Field of View (FoV) sensor at the origin of the point cloud, which yielded a list of points with coordinates corresponding to non-discrete pixel locations and distances to the origin, see equation (6.1) and figure 6.3(a),

$$\mathrm{pts} : \left\{ (r_1, c_1, d_1),\ (r_2, c_2, d_2),\ \ldots,\ (r_n, c_n, d_n) \right\}. \qquad (6.1)$$

The points were interpolated into a continuous function using scatteredInterpolant() in MATLAB, which returns a function describing depth at non-discrete pixel values, see equation (6.2). The function was plotted to obtain a continuous depth map, see figure 6.3(b),

$$f(r, c) = \mathrm{interpolation}(\mathrm{pts}). \qquad (6.2)$$
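A sketch of this projection and interpolation is given below, assuming P is an n-by-3 point cloud in the sensor frame with x forward, y left and z up; the axis convention and the 640 × 480 virtual image size are assumptions of this sketch:

```matlab
% Project a 3D point cloud onto a virtual pinhole sensor with a 120 degree
% horizontal FoV at the origin, then interpolate a dense depth map.
W = 640; H = 480;                              % assumed virtual image size
f = (W/2) / tand(60);                          % focal length for a 120 degree FoV
front = P(P(:,1) > 0, :);                      % keep points in front of the sensor
d  = sqrt(sum(front.^2, 2));                   % distance to the origin
c  = W/2 - f * front(:,2) ./ front(:,1);       % column (non-discrete pixel location)
r  = H/2 - f * front(:,3) ./ front(:,1);       % row    (non-discrete pixel location)
ok = c >= 1 & c <= W & r >= 1 & r <= H;        % inside the virtual image
F  = scatteredInterpolant(r(ok), c(ok), d(ok), 'natural');
[Rq, Cq] = ndgrid(1:H, 1:W);
depth = F(Rq, Cq);                             % continuous depth map
imagesc(depth, [1 10]); colormap(jet); axis image   % JET colour map, [1, 10] m
```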


Figure 6.3: (a) Projected down sampled 3D point cloud onto a virtual 120 degree FOV sensor at the origin, pts; (b) interpolation of points in (a). Color encoding of depth using JET color map and adjusted to between [1, 10] meters

As can be seen in figure 6.3, the depth map does not contain enough information about the contours in the environment to retain stable feature points. The artifacts from the point cloud were a problem, as the interpolation method resulted in those artifacts being enhanced and becoming very prominent in the depth image.

This method of retaining features from point clouds was not successful and was prone to enhancing inaccuracies in the point cloud. No further methods were explored within the scope of the thesis, and LiDAR fusion was abandoned. Projection of the LiDAR points onto an RGB image can still provide useful information, however. If feature extraction and tracking is performed on RGB-thermal images, LiDAR data can provide physical coordinates for the feature points, which gives useful context to the environment.


Chapter 7

Implementation, evaluation and results

7.1 Structure

Algorithm 1 shows the structure of the method for feature extraction and tracking on an image I0 that is the product of fusion between multiple sensor forms, see chapter 6. The algorithm assumes a constant feed of sensor images and contains a buffer that, when full, holds a feature map and a point feature list for the sensor images at t = 0 and t = −1, where t = 0 is the most recent frame and t = −1 is the previous one. When the buffer is full, the algorithm calculates the alignment vectors from the two feature maps, from which the two sets of feature points can be paired. Finally, the algorithm returns a list of vectors that describes the change of features between the sensor image I at t = 0 and the previous sensor image J at t = −1.

Algorithm 1 Feature extraction and tracking

1:  procedure Tracking(I0)                           ▷ Sensor fused image I0
2:      map ← extraction(net, layer, channel, I0)
3:      list ← featureDetect(map)                    ▷ List of feature points
4:      buffer.add(map, list)                        ▷ Global buffer of size 2
5:      if buffer.full() then                        ▷ If it contains 2 sets of maps and lists
6:          I ← buffer.getMap(1)                     ▷ Retrieves map at t = 0
7:          J ← buffer.getMap(2)                     ▷ Retrieves map at t = −1
8:          M ← imageReg(I, J)                       ▷ Alignment vectors
9:          vectors ← buffer.pair(M)                 ▷ Pairs aligned feature points
10:         return vectors
11:     end if
12:     return null
13: end procedure


7.1.1 Computational performance

The computational performance is analyzed in a MATLAB implementation of algorithm 1. No parallelization of the algorithm is done, hence MATLAB only performs CPU computations; parallelization and distributing the load to a GPU would greatly improve the performance. The CPU of the hardware on which the evaluation is done is an Intel(R) Core(TM) i5-8250U.

A big variable in the performance of the algorithm is line 2 of algorithm 1: which CNN, and where inside the CNN, one chooses to extract the feature map gives a wide range of computational cost. Table 7.1 shows the extraction times for relu1 to relu7 of AlexNet and relu1_1 to relu3_3 of VGG-16. AlexNet is significantly faster than VGG-16, but the two CNNs have different characteristics, pros and cons in their feature maps. Table 7.2 shows the computational cost of the performance-heavy processes in the rest of algorithm 1. Note that this is dependent on the specific input image and the number of features extracted; the presented times are typical for adequate channels.

Table 7.1: Extraction time of feature maps from layers in AlexNet and VGG-16

AlexNet     relu1    relu2    relu3    relu4    relu5    relu6    relu7
Time [ms]   31       55       57       69       73       273      303

VGG-16      relu1_1  relu1_2  relu2_1  relu2_2  relu3_1  relu3_2  relu3_3
Time [ms]   136      283      336      408      448      525      607

Table 7.2: Typical computation time for Point Feature Detection (featureDetect), Coarse Image Registration (imageReg) and Closest Point Matching (pair), lines 3, 8 and 9 of algorithm 1, respectively. Total is the typical time from line 3 to 9, including all intermediate steps.

Process     featureDetect   imageReg   pair   total
Time [ms]   35              30         80     150

A way to make the extraction of feature maps more efficient is to modify the extraction process so that not every channel is calculated at the layer where you wish to extract; the full network is run only up to the preceding layer, after which only the desired channel is computed. This requires more specialized tools, such as those provided by TensorFlow or Keras. Moreover, further optimization of the closest point matching may be possible.


7.2 Extraction with and without noise

The ability of channels to extract the contours and shape of objects is key for robust feature tracking. The addition of noise obscures the contours, but the features can still be extracted using the higher order of abstraction of the feature maps. Figure 7.1 shows (a) an MAV frame from the mine and (b) the same frame with heavy Gaussian noise; (c) shows a feature map of the noise-free frame and (d) a feature map from the same channel of the noisy frame. Despite the obscuring noise, the feature map still retains the shape and contours of the environment, with a PSNR of 24.66.


Figure 7.1: (a) Image frame from MAV; (b) Gaussian noise with mean of 0.01 and variance of 0.05 added to (a); (c) Feature map of (a) extracted at AlexNet relu1 channel 90; (d) Feature map of (b) extracted at same location as (c)

7.3 Extraction and tracking compared to other methods

A visual comparison was done between this method and a variety of feature detection algorithms, FAST [16], Harris [12], Shi–Tomasi [13], MSER [14, 15] and KAZE [17], on consecutive frames using the feature matching algorithm of [21]. The comparison was done on MAV footage collected from an autonomous flight in Mjölkudden mine [1].

FAST, Harris, Shi–Tomasi and MSER all failed on this type of footage. They tracked well on the few frames where the difference between frames was minimal but failed to
