

IT 21 010

Degree project 30 credits, January 2021

Deep Learning for Iceberg Detection in Satellite Images

SHUZHI DONG

Department of Information Technology


Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Deep Learning for Iceberg Detection in Satellite Images

SHUZHI DONG

The application of satellite images for ship and iceberg monitoring is essential in many ways in Arctic waters. Even though the detection of ships and icebergs in images is well established using geoscience techniques, the discrimination between these two target classes still represents a challenge for operational scenarios. This thesis project proposes the application of Support Vector Machines (SVM), Convolutional Neural Networks (CNN), and the Single Shot Detector (SSD) for ship-iceberg detection in satellite images. The CNN model is compared with SVM and SSD; the final results indicate not only the superior classification performance of the proposed methods but also demonstrate the object detection capability of SSD.

Printed by: Reprocentralen ITC, IT 21 010

Examiner: Mats Daniels. Subject reviewer: Olle Gällmo. Supervisor: Maria Erman.


I would like to thank Olle Gällmo, Maria Erman, Cubo Rubén, Henrik Södergren, and Xinrui (Jerry) Ge.


Contents

1 Introduction
  1.1 Background to the Research
  1.2 Research Problem and Hypotheses
  1.3 Method
2 Literature Review
  2.1 Support Vector Machine (SVM)
    2.1.1 Classifier
    2.1.2 Support Vector Machine (SVM)
  2.2 Convolutional Neural Network (CNN)
    2.2.1 Perceptron
    2.2.2 Neural Network
    2.2.3 Architecture of CNN
  2.3 Single Shot MultiBox Detector (SSD)
    2.3.1 SSD Network Architecture
    2.3.2 Training
3 Implementation
  3.1 Data
  3.2 Related work
    3.2.1 ResNet
  3.3 Platform
    3.3.1 Software
    3.3.2 Hardware
  3.4 Model Architecture and Application
    3.4.1 SVM
    3.4.2 CNN
    3.4.3 SSD
4 Results and Analysis
  4.1 Introduction
  4.2 SVM
  4.3 CNN
  4.4 SSD
  4.5 Classification Results for all Three Models
5 Conclusion and Further work
References


List of Figures

1.1 Workflow of comparing three models of image object detection

2.1 Division of 2D space
2.2 Division of 3D space
2.3 Artificial neuron used by the perceptron [13]
2.4 A regular 3-layer Neural Network
2.5 The Rectified Linear Unit (ReLU) activation function produces 0 as output when x < 0, and a linear output with slope 1 when x > 0 [?]
2.6 How convolution works
2.7 An example of both Max-Pooling and Average-Pooling
2.8 An architectural overview of CNN
2.9 Lower resolution feature maps (right) detect larger scale objects
2.10 SSD framework [45]
2.11 SSD Network Architecture [45]

3.1 json format data
3.2 Visualizations of an iceberg and a ship signature: (a) Iceberg structure, (b) Ship structure, (c) 3D visualization of the iceberg structure by band_1, (d) 3D visualization of the ship structure by band_1, (e) 3D visualization of the iceberg structure by band_2, (f) 3D visualization of the ship structure by band_2. Note that both the iceberg and the ship are similar in size and intensity, but differ significantly in their spatial signature
3.3 Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer 'plain' networks. The deeper network has a higher training error, and thus test error [35]
3.4 Residual learning: a building block [35]
3.5 Properties of different instances
3.6 CNN model
3.7 Network architecture
3.8 Iceberg and ship annotation
3.9 Estimator setting for SSD in AWS SageMaker
3.10 Hyperparameters setting for SSD in AWS SageMaker

4.1 Confusion matrix
4.2 Confusion matrix for SVM results
4.3 CNN classification results, illustrating the normalized target sub-image (on the left) and the classification probability plot (on the right): (a)(b) Examples of a correctly classified ship. (c)(d) Examples of a correctly classified iceberg. (e)(f) Examples of a correctly classified iceberg. (g)(h) Examples of a misclassified ship. (i)(j) Examples of a misclassified iceberg
4.4 Confusion matrix for CNN results
4.5 Confusion matrix for SSD results
4.6 SSD detection results: (a) Example of a correctly classified ship. (b) Example of a correctly classified iceberg. (c) Example of a misclassified ship. (d) Example of a misclassified iceberg
4.7 SSD detection results: (a) Example of a correctly classified iceberg in the correctly detected location with a confidence score of 0.178. (b) Example of a correctly classified iceberg with wrongly detected location with a confidence score of 0.152


1. Introduction

1.1 Background to the Research

Climate change is one of the greatest challenges facing humanity, and its effects are increasingly visible [2]. The 2018 intergovernmental report on climate change (IPCC: Intergovernmental Panel on Climate Change) estimates that the world will face catastrophic consequences unless global greenhouse gas emissions are eliminated within thirty years [28]. Such a diversity of challenging problems can be seen as an opportunity: there are many ways to have an impact.

In recent years, Machine Learning (ML) [46] has been recognized as a generally beneficial tool for many technological advances. There still remains the demand to identify how these tools can best be applied in the fight against climate change. Machine learning, a subdivision of Artificial Intelligence (AI), deals with designing algorithms that learn from machine-readable data [42]. Some examples of commonly used ML algorithms are artificial neural networks (ANN) [23], support vector machines (SVM) [25], self-organizing maps (SOM) [40], decision trees (DT) [49], ensemble methods such as random forests, case-based reasoning, neuro-fuzzy systems (NF) [38], and multivariate adaptive regression splines (MARS) [30].

Owing to many climate prediction problems being data-limited, current climate models deal with this limitation by relying on physical laws. These models are structured in terms of coupled partial differential equations that represent physical processes like cloud formation, ice sheet flow, and permafrost melt [31]. ML models may provide new techniques for solving such systems efficiently in a data-driven way [2]. Besides prediction, ML can also be used to identify relationships between climate variables and their impacts.

Against the backdrop of global climate change, environmental change in the Arctic is extremely complex and diverse [26]. Although academia has made great progress in documenting change, it is now accepted that 'there are many ways of knowing'. Because of that, INTERACT [5] was proposed by the existing SCANNET network of field stations situated in all eight Arctic countries: Canada, Denmark, Finland, Iceland, Norway, Russia, Sweden, and the United States.

INTERACT's main objective is to build capacity for identifying, understanding, predicting and responding to diverse environmental changes throughout the wide environmental and land-use envelopes of the Arctic. This project falls under INTERACT Phase III, in the context of tackling Arctic environmental change with machine learning.

1.2 Research Problem and Hypotheses

The main objective of this thesis is to develop and compare three methods for object detection, namely: Support Vector Machines (SVM) [25], Convolutional Neural Networks (CNN) [32], and the Single Shot MultiBox Detector (SSD) [45].

To achieve this, the following objectives should be met:

- Implementing three models (SVM, CNN, SSD) for object detection in Python.

- Comparing these three methods on the SAR (Synthetic-aperture radar) imaging dataset from Kaggle (an online community of data scientists and machine learning practitioners, owned by Google) as a case study, to see how different algorithms perform in iceberg detection, which can help transport in polar areas and hence cut down air pollution.

1.3 Method

The following research is focused on the two main outputs mentioned in section 1.2: the first is to implement the three approaches to object detection; the second is to perform a comparative analysis of the results.

Figure 1.1 shows the process describing the activities conducted. An extensive review of previous studies and literature about using SVM, CNN and SSD for image object detection has been undertaken. After that, the models have been implemented in Python and tested in a case study. The results generated from the case study have then been analyzed based on the discrimination and calibration of each of these models. Finally, generalizations and conclusions regarding the advantages and drawbacks of each model have been made.

Figure 1.1. Workflow of comparing three models of image object detection


2. Literature Review

2.1 Support Vector Machine (SVM)

2.1.1 Classifier

A classifier is an algorithm that, given a sample of data, determines which category the sample belongs to; for instance, a spam filter scans incoming emails and classifies them as either "spam" or "not spam". This is a concrete implementation of pattern recognition, which is one of many forms of machine learning.

In classification problems, the data input to the classifier is called features. A feature is an individual measurable property or characteristic of a phenomenon being observed [22]. Following the spam example, in spam detection algorithms, features can be the presence or absence of certain email headers, the structure of the email, the grammatical correctness of the text, the language, etc. A set of numeric features can be conveniently characterized by a feature vector. With regards to the data output, the output of the model after training is called a label.

Linear Classifier

A linear classifier is a classifier that uses a linear combination of the features, rather than a non-linear combination, to determine the label. In contrast to nonlinear classifiers such as kernel methods, which map data to a higher-dimensional space, linear classifiers work directly on the data in the original input space [56].

In practical applications, we often encounter the following problem: given some data points belonging to two different classes, one needs to find a linear classifier to divide these data into the two classes. Take two-dimensional space as an example, as shown below (Figure 2.1): the space is cut by a straight blue line. The points on the left side of the line belong to category 1 (represented by squares), and the points on the right side belong to category 2 (represented by triangles). In this case, the space is a two-dimensional space composed of X1 and X2, and the equation of the straight line is X1 + X2 = 1.

If one considers a three-dimensional space, a plane cuts the space in half.


In the example depicted in Figure 2.2, the corresponding equation would be X1 + X2 + X3 = 1.

Figure 2.1. Division of 2D space. Figure 2.2. Division of 3D space.

In high-dimensional spaces, an (n − 1)-dimensional hyper-plane is used to cut the space. Here, x is used to represent data points and y to represent categories or labels (y takes 1 or −1, representing the two classes). The learning goal of a linear classifier is to find a hyper-plane in the n-dimensional data space that cuts the space. The equation of this hyper-plane can be expressed as [56]:

$$W^T x + b = 0 \quad (2.1)$$

According to the above discussion, it is known that in a multi-dimensional space, a hyper-plane can divide the data into two categories. This hyper-plane is called a separating hyper-plane [34]. However, there can be many separating hyper-planes, so which one should be selected? Theoretical research shows that each element of the training data has a distance to the separating hyper-plane. When choosing a hyper-plane, the distance between the hyper-plane and the element closest to it should be as large as possible; in this way, when new data are added, the probability of correct separation is maximized [55]. The Support Vector Machine, introduced in the next section, is an algorithm that maximizes that distance.

2.1.2 Support Vector Machine (SVM)

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection [25]. Given a set of training examples, each marked as belonging to one of two categories, an SVM algorithm builds a model that assigns new examples to one category or the other, making it a deterministic binary linear classifier [25].

First of all, SVM normalizes the functional margin, $|W^T x + b|$, the absolute value obtained by substituting the feature values into the separating hyper-plane equation, in order to remove the influence of the value scale. Secondly, it finds the distance to the hyper-plane for all elements, which is the geometric margin $\frac{|W^T x + b|}{\|W\|}$.

Given a hyper-plane P, the distance of each sample from the hyper-plane can be written as $d_{ij} = \frac{|W^T x + b|}{\|W\|}$. The smallest such distance is recorded as $D_p$, also known as the margin, and the role of SVM is to find the hyper-plane with the largest $D_p$ [48].

In cases where a separating hyper-plane cannot be found in the original low-dimensional space, kernel methods offer an efficient and inexpensive way to transform the data into higher dimensions. This approach is called the 'kernel trick' [54]. It maps the original features to another, high-dimensional feature space to deal with problems that are linearly inseparable in the original space. In addition to performing the feature mapping, the kernel directly returns the inner product of the mapped features, which simplifies the computation.

Sometimes the features are still not linearly separable after the mapping, or the data are mixed with abnormal points, so that the optimal separating hyper-plane shifts or the data even become linearly inseparable. To solve this problem, SVM introduces slack variables, which allow some data points to violate the requirement of lying on the correct side of the hyper-plane, so that strictly linearly inseparable data sets can also be classified using SVM.

2.2 Convolutional Neural Network (CNN)

2.2.1 Perceptron

In machine learning, the Perceptron is a neuron that can be trained in many ways, usually used for binary classification. A binary classifier is a function which can determine whether the input represented by a vector of numbers belongs to a specific class or not [29].

The perceptron was invented at the Cornell Aeronautical Laboratory by Frank Rosenblatt [50]. In mathematical terms, it is a binary classifier (threshold function) which maps its input x (a real-valued vector) to an output value f(x) (a single binary value):


$$f(x) = \begin{cases} 1, & \text{if } w \cdot x + b > 0, \\ 0, & \text{otherwise.} \end{cases} \quad (2.2)$$

where w is a vector of real-valued weights, and $w \cdot x$ is the dot product $\sum_{i=1}^{n} w_i x_i$, where n is the number of inputs to the perceptron, and b is the bias, a constant term that does not depend on any input value. No matter the formulation, the decision boundary of the perceptron is thus [50]:

$$b + \sum_{i=1}^{n} w_i x_i = 0 \quad (2.3)$$

Equation 2.3 is that of a hyper-plane, with the vector w of synaptic weights being the normal to the plane and the bias b being the offset from the origin.

In order to get a better comprehension of the perceptron’s capability to tackle binary classification problems, the artificial neuron model it relies on is shown in Figure 2.3:

Figure 2.3. Artificial neuron used by the perceptron [13]
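To make Equation 2.2 concrete, the following minimal sketch implements a perceptron forward pass in Python with NumPy. The weights, bias, and inputs are illustrative values chosen for this example only, not taken from the experiments in this thesis.

    import numpy as np

    def perceptron(x, w, b):
        # Threshold unit from Equation 2.2: 1 if w . x + b > 0, else 0
        return 1 if np.dot(w, x) + b > 0 else 0

    # Illustrative 2D case: decision boundary x1 + x2 = 1 (cf. Figure 2.1)
    w = np.array([1.0, 1.0])   # synaptic weights (normal to the hyper-plane)
    b = -1.0                   # bias (offset from the origin)
    print(perceptron(np.array([0.2, 0.3]), w, b))  # 0: below the line
    print(perceptron(np.array([0.8, 0.9]), w, b))  # 1: above the line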

2.2.2 Neural Network

In machine learning, an Artificial Neural Network (ANN) is a system of interconnected neurons that pass messages to each other. Neural Networks are used to model complex functions and, in particular, as frameworks for classification [23]. Before diving into Convolutional Neural Networks, let's first revisit the concepts of vanilla Neural Networks. In general, a Neural Network has three types of layers, see Figure 2.4 [4].

- Input Layer: the layer to which we feed the input of our model.

- Hidden Layers: the output of the input layer is fed into a hidden layer; there can be multiple hidden layers.

- Output Layer: the output of the last hidden layer is fed into the output function.


Figure 2.4. A regular 3-layer Neural Network.

Among neural networks, Convolutional Neural Networks (CNN) are popular for image recognition and image classification; object detection, facial recognition, etc., are further areas where CNNs are widely used. CNNs have been shown to perform amazingly well in these tasks. The key idea behind CNNs is to automatically learn a complex model that is able to extract visual features from pixel-level content and its surroundings, exploiting a sequence of simple operations such as filtering, local contrast normalization, non-linear activation, and local pooling [32]. See section 2.2.3 below for an explanation of these concepts.

2.2.3 Architecture of CNN

"A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of activations to another through a differentiable function" [39]. It is true for almost all neural networks. This means that each layer is associated with converting the information from the values, available in the previous lay- ers, into some more convoluted information and pass on to the next layers for further generalization.

Rectified Linear Unit (ReLU) Activation Function

In an artificial neural network, the activation function is the function used to compute the output of a node.

The activation function used inside this project is the Rectified Linear Unit (ReLU) activation function [33], which is the most commonly used activation function in deep learning models. The function returns 0 if it receives any negative input, but for a positive input value it returns that value back. Hence it is defined as:


$$f(x) = \max(0, x) \quad (2.4)$$

Graphically it looks like Figure 2.5:

Figure 2.5. The Rectified Linear Unit (ReLU) activation function produces 0 as output when x < 0, and a linear output with slope 1 when x > 0 [?]

Compared to other activation functions, ReLU is fast and the calculation cost is much smaller.

Convolution

All CNN models follow a similar architecture, i.e., they perform a series of convolution + pooling operations (explained in more detail in the following parts), followed by fully connected layers [41]. The main building block of a CNN is the convolution layer. Convolution is a mathematical operation for merging two sets of information. The convolution layer is the first layer to extract features from the input data: it computes a dot product between its weights and a small region it is connected to in the input volume, and then passes the results to the next layer.

For instance, let A (part of the image matrix) = [1, 5, 3] and B (part of the filter matrix) = [1, 0, 1]. Then A convolved with B = [1 × 1 + 5 × 0 + 3 × 1] = [1 + 3] = [4]. This is just an example; generally both the A and B matrices are 2D, but the sense of the operation remains the same. See Figure 2.6 for how the filter matrix spans the whole input matrix, performing the operation and generating the desired convolved feature map.
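As a hedged illustration of this sliding dot product (a sketch for this text, not code from the thesis implementation), the following NumPy function convolves a 2D input with a 2D filter in 'valid' mode; as in CNN layers, the kernel is not flipped (strictly, cross-correlation):

    import numpy as np

    def convolve2d(image, kernel):
        # Slide the kernel over the image and take a dot product at each location
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.array([[1, 5, 3, 0],
                      [2, 1, 0, 4],
                      [0, 3, 2, 1]])
    kernel = np.array([[1, 0, 1]])    # the example filter from the text
    print(convolve2d(image, kernel))  # top-left entry: 1*1 + 5*0 + 3*1 = 4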

Figure 2.6. How convolution works.

Stride and Padding

Stride is the number of pixels shifted over the input matrix; it specifies how much we move the convolution filter at each step. Sometimes the filter does not fit the input. Since the convolution filter needs to be contained in the input, we can pad the image with zeros (zero-padding) to maintain the same dimensionality; otherwise, we drop the part of the image where the filter does not fit.

Padding is used to preserve the boundary information. As the filter matrix moves over the whole image, the size of the feature map becomes smaller than the input. To overcome this and maintain the same dimensionality, padding is used to surround the input with zeros. Even though these zeros supply no extra information, without padding the feature maps would shrink at each layer and the filter matrix would have limited traverses.

Pooling

After the convolution operation, pooling layers perform a down-sampling operation along the width and height, reducing the spatial dimensions. The sole purpose of pooling is to reduce dimensionality; it reduces the number of parameters, hence shortening training time and combating overfitting [41].

There are various types of pooling; the most common is max-pooling, which takes the maximum element from each window. In addition to max-pooling there is average-pooling which, as its name implies, takes the average of the elements in each window. See Figure 2.7:

Figure 2.7. An example of both Max-Pooling and Average-Pooling
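A minimal sketch of both operations with non-overlapping 2 × 2 windows (illustrative values; the stride equals the window size, as in Figure 2.7):

    import numpy as np

    def pool2d(x, size=2, mode="max"):
        # Reshape into (rows, size, cols, size) windows, then reduce each window
        h, w = x.shape[0] // size, x.shape[1] // size
        windows = x[:h * size, :w * size].reshape(h, size, w, size)
        return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 1, 0],
                  [7, 2, 9, 8],
                  [4, 1, 3, 5]])
    print(pool2d(x, mode="max"))      # [[6 4]
                                      #  [7 9]]
    print(pool2d(x, mode="average"))  # [[3.75 1.75]
                                      #  [3.5  6.25]]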

Fully Connected Layer

After the convolution + pooling layers, several fully connected layers are added to wrap up the CNN architecture. These layers form the last block of the CNN. This part is actually a regular multilayer perceptron, with the previous layers acting as pre-processing for it. Fully connected layers have connections to all activations in the previous layer. What needs to be explained here is that the outputs of both the convolution and pooling layers are 3D volumes, whereas a fully connected layer expects a 1D vector of numbers. Hence we finally "flatten" the multi-dimensional data, that is, compress the data of (height, width, channel) into a one-dimensional array of length height × width × channel, and then connect it with the fully connected layer [44].

Training

The CNN is trained over a large amount of input during the training phase, and each time, the error generated is fed back into the CNN to adjust the matrices' values in each layer. This works the same way as in an ANN; the concept is called backpropagation.

In summary, a CNN architecture generally consists of an input layer, followed by a convolution layer whose dimensions depend on the data and the actual problem. The nodes in the convolutional layer have an activation function, usually ReLU, which gives better results. After the combination of convolution and ReLU, a pooling layer is adopted to reduce the size of the network. Afterwards, a flattening layer is used to flatten the input to the fully connected layer. The last layer is the output layer. Figure 2.8 below shows an architecture with a Softmax function. The Softmax function is used in neural networks to map the non-normalized output of a network to a probability distribution over the predicted output classes; it is generally placed just before the output layer (Figure 2.8).

Figure 2.8. An architectural overview of CNN

2.3 Single Shot MultiBox Detector (SSD)

The Single Shot Detector is a method for detecting objects in images using a single deep neural network. This approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. Furthermore, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes [45].

Multi-scale feature maps for detection

SSD uses one CNN network for detection, but it uses multi-scale feature maps, i.e., feature maps of different sizes. The front layers of a CNN generally produce relatively large feature maps; later, convolutions with stride 2, or pooling, are gradually adopted to reduce the feature map size.

Both relatively large and relatively small feature maps are used for detection: a relatively large feature map is used to detect relatively small targets, while a small feature map is responsible for detecting large targets.

Each unit sets prior boxes with different scales. For example, the 4 × 4 feature map is used for larger-scale objects; the 8 × 8 feature map can be divided into more units, but the prior box scale of each unit is relatively small (Figure 2.9).

Figure 2.9. Lower resolution feature maps (right) detect larger scale objects.

Convolution for detection

SSD directly uses convolution to extract detection results from the different feature maps. For a feature map of shape m × n × p (m × n locations with p channels), a relatively small 3 × 3 × p convolution kernel suffices to obtain the detection values.

Default box

In SSD, each unit sets default boxes with different scales or aspect ratios. The predicted bounding boxes are based on these default boxes, which reduces the difficulty of training to a certain extent.

In general, each unit sets multiple default boxes with different scales and aspect ratios. As shown in Figure 2.10 [45], each unit uses 4 different default boxes. The cat and the dog in the picture are trained with the default boxes that best suit their shapes. For each default box of each unit, the network outputs a set of independent detection values corresponding to a bounding box, divided into two parts: the first part is the confidence or score of each category; the second part is the location of the bounding box, which contains 4 values (cx, cy, w, h), representing the center coordinates and the width and height of the bounding box.

SSD only needs an input image and ground truth boxes for each object during training. In machine learning, the term "ground truth" refers to the true classification of the training set for supervised learning techniques; here, the ground truth boxes are the reference boxes of highest accuracy. In a convolutional fashion, SSD evaluates a small set of default boxes of different aspect ratios at each location in several feature maps with different scales. For each default box, it predicts both the shape offsets and the confidences for all object categories (c1, c2, ..., cp). At training time, it first matches these default boxes to the ground truth boxes [45].

Figure 2.10. SSD framework [45].

2.3.1 SSD Network Architecture

The SSD detector is composed of two parts: a network to extract feature maps, and convolution filters applied to detect objects. The original SSD uses VGG16 [52], a convolutional neural network model, as the base model to extract feature maps, and then adds new convolutional layers on top of VGG16 to obtain more feature maps for detection; other base networks should also produce good results. In this project we use another network; more details are clarified in the next chapter.

The network structure of SSD is shown in Figure 2.11 [45], where it can be clearly seen that SSD uses multi-scale feature maps for detection. For more accurate detection, the feature maps of different layers go through a small 3 × 3 convolution for object detection. The input image size of the model is 300 × 300; it can also be 512 × 512 (there is no difference from the former network structure, except that a new convolutional layer is added at the end).

Figure 2.11. SSD Network Architecture [45].

2.3.2 Training

Default box matching

In the training process, we must first determine which default box the ground truth (real target) in the training image matches with; the bounding box corresponding to that default box will then be responsible for predicting it.

There are two main points in the matching principle between SSD's default boxes and the ground truth. First, for each ground truth in the picture, find the default box with the largest IoU (IoU, the intersection over union, is the ratio of the intersection area to the union area of two regions [24]) and match it with that default box. In this way, it is guaranteed that each ground truth matches some default box. However, there are very few ground truths in a picture but many default boxes, so a second principle is needed: for the remaining unmatched default boxes, if the IoU with a certain ground truth is greater than a certain threshold (usually 0.5), then that default box is also matched with this ground truth. An IoU of 0 means that there is no overlap between the boxes, and an IoU of 1 means that there is 100% overlap.

Let us simplify this discussion to 3 default boxes as an example, where only default boxes 2 and 3 (but not 1) have an IoU greater than 0.5 with the ground truth. Hence only boxes 2 and 3 are positive matches, and box 1 is a negative match.
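A minimal sketch of the IoU computation for two axis-aligned boxes, assuming the (xmin, ymin, xmax, ymax) corner convention (illustrative; not the AWS implementation used later):

    def iou(box_a, box_b):
        # Corners of the intersection rectangle
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 = 0.14 -> below 0.5, no match
    print(iou((0, 0, 10, 10), (1, 1, 10, 10)))  # 81/100 = 0.81 -> above 0.5, match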

Once we identify the positive matches, the corresponding predicted boundary boxes are used to calculate the cost. This matching strategy encourages each prediction to predict shapes closer to the corresponding default box, so the predictions are more diverse and more stable during training. Although a ground truth can be matched with multiple default boxes, the ground truths are still far fewer than the prior boxes, so there will be many negative samples relative to positive samples. In order to keep the positive and negative samples as balanced as possible, SSD uses hard negative mining: instead of using all the negative examples, they are sorted by the highest confidence loss (detailed in the following section) for each default box, and the top ones are picked so that the ratio between positives and negatives is at most 1:3.

Loss function

First the training samples are determined, and then the loss function. The loss function is defined as the weighted sum of the localization loss (loc) and the confidence loss (conf) [45]:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \quad (2.5)$$

The localization loss is the mismatch between the ground truth box and the predicted boundary box. SSD only penalizes predictions from positive matches: to drive the predictions from the positive matches closer to the ground truth, the negative matches can be ignored. The confidence loss is the loss in making a class prediction. For every positive match prediction, the loss is penalized according to the confidence score of the corresponding class. For negative match predictions, the loss is penalized according to the confidence score of class "0" (meaning no object is detected).

Inside the loss function, N is the number of positive matches, α is the weight for the localization loss, c is the predicted value of the category confidence, l is the position prediction of the bounding box corresponding to the prior box, and g is the location parameter of the ground truth.

Data Augmentation

Using data augmentation can improve SSD performance. Data augmentation acts as a regularizer and helps reduce overfitting when training a machine learning model [51]. The main techniques used are horizontal flips, random crops and color distortion, and randomly sampling a patch (to obtain small-target training samples). To handle variation in object sizes and shapes, each training image is randomly sampled by one of the following options [45]:

• Use the original input image.

• Sample a patch with IoU of 0.1, 0.3, 0.5, 0.7 or 0.9.

• Randomly sample a patch.


The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio (the ratio of image width to image height) is between 1/2 and 2.

The prediction process is relatively simple. For each prediction box, first determine its category (the one with the highest confidence) and its confidence value according to the category confidence, and filter out the prediction boxes belonging to the background. Then filter out the prediction boxes whose confidence falls below the confidence threshold.


3. Implementation

3.1 Data

Background

The remote sensing systems used to detect icebergs are housed on satellites over 600 kilometers above the Earth. The one used here to monitor land and ocean is the Sentinel-1 satellite. In SAR (Synthetic-aperture radar) images, ships and icebergs typically have a stronger backscatter (the energy reflected back to the radar) response than the surrounding open water, and are therefore detectable using adaptive threshold techniques [21]. In general, the surrounding open water will be darker at a higher incidence angle, and thus it is also necessary to consider the radar polarization, i.e., how the radar transmits and receives the energy. More advanced radars like Sentinel-1 can transmit and receive in both the horizontal and the vertical planes, yielding a dual-polarization image. The data used in this project have two channels: HH (transmit/receive horizontally) and HV (transmit horizontally and receive vertically), which play an important role in the object characteristics, since objects tend to reflect energy differently.

Description

The data used for this study come from the Kaggle "Statoil/C-CORE Iceberg Classifier Challenge" competition [16]. The labels were provided by human experts and geographic knowledge of the target. All images are 75 × 75 pixels with two bands. The data are presented in json format and consist of a list of images, each including the following fields:

• id: the id of the image.

• band_1, band_2: the flattened image data; each band has 75 × 75 pixel values (5625 elements) in the list. These values have physical meaning: they are floating-point numbers in decibels. Band 1 and Band 2 are signals characterized by the radar backscatter produced from different polarizations at a particular incidence angle; the polarizations correspond to HH and HV.

• inc_angle: the incidence angle at which the image was taken.

• is_iceberg: the target variable; 1 represents iceberg, 0 represents ship.

An excerpt of the initial json format data is shown in Figure 3.1, and the visualized data are shown in Figure 3.2; the visualization details are illustrated in the coming sections. The 1604 rows of the json file mean that there are 1604 images in total. We select 10% of the data as test data, 90% × 75% as training data, and 90% × 25% as validation data. Briefly: training data are the samples used to build the model; validation data are used to qualify performance; test data are used to test the model, usually by letting the generated model predict their results.

Figure 3.1. json format data
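A minimal loading sketch with pandas and NumPy, assuming a local copy of the Kaggle train.json; the field names follow the description above, and the string 'na' incidence angles are coerced to NaN (see section 3.4.1). The third channel, (band_1 + band_2)/2, anticipates the feature construction described in section 3.4.1.

    import numpy as np
    import pandas as pd

    train = pd.read_json("train.json")  # 1604 rows
    train["inc_angle"] = pd.to_numeric(train["inc_angle"], errors="coerce")  # 'na' -> NaN

    band_1 = np.stack(train["band_1"].apply(lambda b: np.array(b).reshape(75, 75)))
    band_2 = np.stack(train["band_2"].apply(lambda b: np.array(b).reshape(75, 75)))
    X = np.stack([band_1, band_2, (band_1 + band_2) / 2], axis=-1)  # (1604, 75, 75, 3)
    y = train["is_iceberg"].values  # 1 = iceberg, 0 = ship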

3.2 Related work

3.2.1 ResNet

ResNet is an abbreviation of Residual Neural Network, which makes it possible to train up to hundreds or even thousands of layers and still achieve compelling performance. It was created by researchers at Microsoft and published in "Deep Residual Learning for Image Recognition" [35].

Degradation of deep networks

From experience, the depth of the network is crucial to the performance of the model: when the number of layers is increased, the network can extract more complex feature patterns, so better results can be expected from deeper models. But will the performance of a deeper network always be better? Experiments found that deep networks have a degradation problem: when the network depth increases, the accuracy of the network saturates or even decreases [35]. This phenomenon can be seen directly in Figure 3.3: the 56-layer network is worse than the 20-layer network. This is not an overfitting problem, because the training error of the 56-layer network is also high. In a neural network with n hidden layers, n derivatives are multiplied together. If the derivatives are large, the gradient increases exponentially as we propagate down the model until it eventually explodes; this is called an exploding gradient. Alternatively, if the derivatives are small, the gradient decreases exponentially as we propagate through the model until it eventually vanishes; this is called a vanishing gradient [37]. Deep networks thus suffer from vanishing or exploding gradients, which makes deep learning models difficult to train.

Figure 3.2. Visualizations of an iceberg and a ship signature: (a) Iceberg structure, (b) Ship structure, (c) 3D visualization of the iceberg structure by band_1, (d) 3D visualization of the ship structure by band_1, (e) 3D visualization of the iceberg structure by band_2, (f) 3D visualization of the ship structure by band_2. Note that both the iceberg and the ship are similar in size and intensity, but differ significantly in their spatial signature.

Figure 3.3. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer 'plain' networks. The deeper network has a higher training error, and thus test error [35].

Residual learning

The degradation problem shows that deep networks are not easy to train. However, suppose you have a shallow network and want to build a deep network by stacking new layers on top. In an extreme case, these added layers learn nothing but to copy the features of the shallow network, i.e., the new layers perform identity mapping. Identity mapping ensures that the output of a stack of layers is equal to its input. In this case, the deep network is expected to perform at least as well as the shallow network, and there should be no degradation.

ResNet proposes residual learning to solve the degradation problem. For a stacked layer structure (several layers stacked), with input x, denote the learned feature by H(x). We instead let the stack learn the residual F(x) = H(x) − x, so that the originally learned feature becomes F(x) + x. The reason for this is that residual learning is easier than learning the original features directly. When the residual is 0, the stacked layers only perform identity mapping, so at the very least the network performance will not decrease; in practice the residual will not be 0, which lets the stacked layers learn new features on top of the input features, and hence achieve better performance.

The structure of residual learning is shown in Figure 3.4. It resembles a "short circuit" in an electrical circuit, so it is called a shortcut connection.


Figure 3.4. Residual learning: a building block [35].
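As a hedged sketch of such a building block in tf.keras (the library this project uses for its CNN; the filter count and input shape are illustrative, not taken from the thesis):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters=64):
        # F(x) + x: two stacked convolutions whose output is added to the input
        shortcut = x
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)  # F(x)
        y = layers.Add()([y, shortcut])                   # F(x) + x (identity shortcut)
        return layers.ReLU()(y)

    inputs = tf.keras.Input(shape=(75, 75, 64))  # channel count must match `filters`
    model = tf.keras.Model(inputs, residual_block(inputs))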

3.3 Platform

3.3.1 Software

All models created during this thesis project were implemented in the programming language Python 3.

• The SVM was developed with the Scikit-Learn library [14].

• The CNN was developed with the tf.keras module, the TensorFlow core implementation of Keras [18]. Keras is an open source, high-level API that simplifies the development of neural networks [8]; TensorFlow is an open source library for developing and training ML models [17].

• The SSD was developed with the Amazon SageMaker object detection function.

3.3.2 Hardware

The training of the SVM model was performed on an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz with 6 CPU cores and 8 GB of RAM. The training of the CNN and the SSD was performed on Amazon. The CNN used the Amazon EC2 instance class ml.t3.medium. The SSD model was trained on the Amazon EC2 instance class ml.p2.xlarge and hosted on the Amazon EC2 instance class ml.m4.xlarge. Instance types comprise varying combinations of CPU, GPU, memory, and networking capacity, as shown in Figure 3.5.

3.4 Model Architecture and Application

The goal of this section is to provide a basic guideline for how to apply SVM, CNN, and SSD to this iceberg detection task.


Figure 3.5. Properties of different instances

3.4.1 SVM

One of the most common classifiers used for image processing is the Support Vector Machine (SVM). In this project, the sklearn SVM class for Python was used; the NumPy [11] and Pandas [12] libraries were also used for data preparation.

Based on the data background in section 3.1, the three main features extracted for both the SVM and CNN models are:

• band_1: The flattened 75 × 75 horizontal radar frequency information.

• band_2: The flattened 75 × 75 vertical radar frequency information.

• (band_1 + band_2)/2: The average of band_1 and band_2 as the third channel to create a 3-channel RGB equivalent.

Inside the dataset, there exist 'na' values in the "inc_angle" column. Note that 'na' in this column is in string format, which cannot be detected using numpy.isnan, so 'na' should first be replaced with numpy.nan.

SVM only accepts numerical values as a numpy matrix, so a numpy matrix including all features was generated: first, 3 different matrices were created, and then their combination was built as the input variable for the model. SVM algorithms are not scale invariant, so MaxAbsScaler is used to scale each variable to the range [-1, +1], centered at 0. For the kernel function, the sigmoid function performs best in this case. For the parameters, we set the following hyper-parameter ranges to tune the SVM classifier:

• C_range = [0.1, 1, 5, 10, 50, 100]

• gamma_range = [0.00001, 0.0001, 0.001, 0.01, 0.1]

• param_grid_SVM = dict(gamma = gamma_range, C = C_range)

• grid = GridSearchCV(clf, param_grid = param_grid_SVM, cv = 3, scoring = 'f1')

The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors. The gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'; it can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors.

GridSearchCV [6] was called to select the best parameters from the listed hyper-parameters. GridSearchCV is a library function in sklearn's model_selection package; it loops through the predefined hyper-parameters and fits the estimator (model) on the training set. clf is an estimator object, assumed to implement the scikit-learn estimator interface; cv determines the cross-validation splitting strategy; scoring evaluates the predictions on the held-out set. In this case, we set the grid search to use 3-fold cross validation with the f1 score as the cross-validation score. The best parameters found in this case were C = 100 and gamma = 1e-5, with a score of 0.71.
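A minimal sketch of this setup with scikit-learn, assuming X_train holds the stacked band features flattened to one row per image and y_train the labels (the variable names are illustrative; the C and gamma ranges are those listed above):

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MaxAbsScaler
    from sklearn.svm import SVC

    # Scale each feature to [-1, +1] (SVMs are not scale invariant), then fit
    clf = make_pipeline(MaxAbsScaler(), SVC(kernel="sigmoid"))
    param_grid_SVM = {
        "svc__C": [0.1, 1, 5, 10, 50, 100],
        "svc__gamma": [0.00001, 0.0001, 0.001, 0.01, 0.1],
    }
    grid = GridSearchCV(clf, param_grid=param_grid_SVM, cv=3, scoring="f1")
    grid.fit(X_train.reshape(len(X_train), -1), y_train)
    print(grid.best_params_)  # best found here: C = 100, gamma = 1e-5 (f1 = 0.71)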

3.4.2 CNN

The CNN network architecture in Figure 3.6 (and Figure 3.7) was used for these experiments. It is composed of four convolutional layers, two dense layers and one sigmoid layer to generate the classification probabilities. The CNN is optimized using Adam with lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8, and decay = 0.0. When training the model, batch_size was set to 32 and epochs to 50. tf.keras.callbacks.EarlyStopping [9] was used to stop training when a monitored metric had stopped improving. The input is a set of 75 × 75 × 3 images, and the output is binary 0/1, where 1 denotes an iceberg and 0 a ship. The features extracted for the CNN are the same as for the SVM (section 3.4.1).

Figure 3.6. CNN model


Figure 3.7. Network architecture
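A hedged tf.keras sketch consistent with the description above (four convolutional layers, two dense layers, a sigmoid output, and the stated Adam settings). The filter and unit counts and the EarlyStopping patience are illustrative assumptions, since the exact values appear only in Figures 3.6 and 3.7.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(75, 75, 3)),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(3),
        layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(iceberg)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                           beta_2=0.999, epsilon=1e-8),
        loss="binary_crossentropy", metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=32, epochs=50, callbacks=[early_stop])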


3.4.3 SSD

The SSD model in this experiment is implemented on the AWS Object Detection algorithm platform, which uses the SSD framework and supports the base network ResNet [36]. For input files, the Amazon SageMaker Object Detection algorithm supports both RecordIO and image content types for training in file mode [1]. Because of that, we converted our data from the json file to image files with Python, using the methodology for obtaining color composites from [3]. In this way, 1604 images were drawn from the 1604 rows of the json file, forming the initial data for this SSD experiment.

From section 2.3, it is known that before training the SSD model, we have to add ground truth bounding boxes to the training and validation data, which in this case means feeding the train_annotation and validation_annotation channels with bounding-box annotations for the image files [1]. For the bounding-box annotation, the open source tool 'labelImg' was applied; the source can be found on GitHub [10]. The bounding-box annotations are saved in XML files and then loaded into the train_annotation and validation_annotation channels in the AWS SageMaker notebook instance. Examples are shown in Figure 3.8: a ship image with its bounding box and XML file, and an iceberg image with its bounding box and XML file.

As in SVM and CNN, 10% of the data were selected as test data, 90% × 75% as training data, and 90% × 25% as validation data; the split of the 1604 images is hence 161:1082:361.

Since the data were separated by the "train_test_split" function with random_state = 1 from the sklearn library [15], the 161 test images should be exactly the same as the 161 test rows used by SVM and CNN. To verify this, we defined an indices variable, indices = range(len(y_train)), covering the length of y_train (the 'is_iceberg' column's binary values, either 1 or 0), and passed it to the 'train_test_split' function. The resulting 'indices_test' values are the same as the row ids of the test data in CNN and SVM, which confirms that the test data set used for all three models in this project is the same.
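A minimal sketch of this index-tracking split (assuming the X and y arrays from section 3.1; random_state = 1 as stated above):

    import numpy as np
    from sklearn.model_selection import train_test_split

    indices = np.arange(len(y))  # row ids of the 1604 samples
    X_rest, X_test, y_rest, y_test, idx_rest, idx_test = train_test_split(
        X, y, indices, test_size=0.10, random_state=1)   # 161 test images
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=1)  # 1082 train / 361 validation
    # idx_test identifies the same 161 test images for SVM, CNN, and SSD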

The implementation pipeline used for SSD is based on [19]. First, the annotations are extracted from the XML files and converted to json files, the annotation input format required by the AWS SSD function. Secondly, the data are uploaded to four channels, "train", "validation", "train_annotation" and "validation_annotation", in an AWS Simple Storage Service (S3) bucket, a public cloud storage resource available in Amazon Web Services. Lastly, the model is trained with different hyper-parameter settings. The training estimator and the hyper-parameter settings for this project are shown in Figures 3.9 and 3.10.

Figure 3.8. Iceberg and ship annotation: (a)(b) annotated images; (c) XML file for iceberg annotation; (d) XML file for ship annotation.

After the best model has been trained, we fit this model with the test image data set to do iceberg detection.

Figure 3.9. Estimator setting for SSD in AWS SageMaker

Figure 3.10. Hyperparameter settings for SSD in AWS SageMaker


4. Results and Analysis

4.1 Introduction

In this section, all the results presented are based on the data introduced previously. The object detection results from the three models are demonstrated separately. The model architectures introduced in section 3 were tested with different parameters; the best model over the training dataset was selected and is shown directly in its respective section. The final architectures were then validated on the test dataset. The Support Vector Machine (SVM) was additionally trained with different kernel functions, and its hyper-parameters were chosen using grid search. For the CNN model, the exact same test data were used to test the model; for SSD, the same dataset, but in image format, was used.

The evaluation metrics chosen for the models in this project are the confusion matrix, the F1-score, and the prediction accuracy.

Confusion Matrix

A confusion matrix [53] is a performance measurement for machine learning classification problems, where the output can be two or more classes. The rows of the matrix represent the true values, and the columns represent the predicted values. Let us take binary classification as an example and look at the matrix form (Figure 4.1):

Here, let’s understand what TP, FP, FN, TN represent. They are extremely useful for measuring Recall, Precision, and Specificity.

• TP (True Positive): the positive class predicted as the positive class; the true value is 1, and the prediction is also 1.

• TN (True Negative): the negative class predicted as the negative class; the true value is 0, and the prediction is 0.

• FP (False Positive): the negative class predicted as the positive class; the true value is 0, but the prediction is 1.

• FN (False Negative): the positive class predicted as the negative class; the true value is 1, but the prediction is 0.


Figure 4.1. Confusion matrix

F1-score

Precision is a good measure when the cost of a False Positive is high. It is calculated as [47]:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (4.1)$$

Recall measures how many of the actual positives the model captures by labeling them as true positives. It is calculated as [47]:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (4.2)$$

The F1-score measures Recall and Precision at the same time. It uses the Harmonic Mean in place of the Arithmetic Mean, punishing extreme values [47]:

$$F_1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \quad (4.3)$$

Prediction Accuracy

Compared to the F1-score, accuracy can be largely driven by a large number of True Negatives, which is common in most business scenarios. It is calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (4.4)$$


ROC_AUC score

The ROC_AUC score is the area under the receiver operating characteristic curve, computed from prediction scores. ROC is the abbreviation of Receiver Operating Characteristic; the ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR and FPR are calculated as in 4.5 and 4.6 [27]:

$$TPR = \frac{TP}{TP + FN} \quad (4.5)$$

$$FPR = \frac{FP}{FP + TN} \quad (4.6)$$

For a certain classifier, we can obtain a (TPR, FPR) pair based on its performance on the test sample; in this way, the classifier can be mapped to a point on the ROC plane. By adjusting the classification threshold of the classifier, we obtain a curve that passes through (0, 0) and (1, 1), which is the ROC curve of this classifier. In general, this curve should lie above the line connecting (0, 0) and (1, 1). Using the ROC curve to represent the performance of a classifier is intuitive and easy to use, but a single scalar measure is often desired, so another measure is used to indicate the quality of the classifier: the Area Under the ROC Curve (AUC). As the name implies, the value of the AUC is the size of the area below the ROC curve. Generally, the AUC is between 0.5 and 1.0, and a larger AUC represents better performance.
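All of these metrics are available in scikit-learn, which this project already uses. A minimal sketch, assuming y_test holds the true labels, y_pred the hard 0/1 predictions, and y_score the predicted iceberg probabilities:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 roc_auc_score)

    print(confusion_matrix(y_test, y_pred))  # rows: true values; columns: predictions
    print(f1_score(y_test, y_pred))          # harmonic mean of precision and recall
    print(accuracy_score(y_test, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
    print(roc_auc_score(y_test, y_score))    # area under the ROC curve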

4.2 SVM

In this project, the best SVM model was selected during the implementation period and then validated using the test data. The output of this SVM model is the binary classification result, and the confusion matrix of this model is shown in Figure 4.2. Of the 161 images: 48 iceberg images were classified as iceberg and 38 iceberg images as ship; 9 ship images were classified as iceberg and 66 ship images as ship.


Figure 4.2. Confusion matrix for SVM results

4.3 CNN

The output of the CNN is the probability of belonging to the iceberg category. In this experiment, we set the threshold to 0.5: if the predicted probability is less than 0.5, the sample is labeled as a ship; if it is more than 0.5, the sample is labeled as an iceberg. Figure 4.3 illustrates and visualizes the CNN classification output. Figures 4.3(a) and (b) show a ship correctly classified with a high probability output. Figures 4.3(c) and (d) show an iceberg correctly classified with a high probability output. Figures 4.3(e) and (f) show an elongated iceberg structure which is correctly classified as an iceberg, although the network appears to have also considered the elongated structure as a ship with lower probability. Figures 4.3(g) and (h) visualize a misclassified ship; the visible signature of the ship in that image is small, misguiding the model to select the iceberg class. Figures 4.3(i) and (j) visualize a misclassified iceberg; the visible signature of the iceberg in that image is not clear and has too much astral, misguiding the model to select the ship class.


Figure 4.3. CNN classification results, illustrating the normalized target sub-image (on the left) and the classification probability plot (on the right): (a)(b) Examples of a correctly classified ship. (c)(d) Examples of a correctly classified iceberg. (e)(f) Examples of a correctly classified iceberg. (g)(h) Examples of a misclassified ship. (i)(j) Examples of a misclassified iceberg.


Figure 4.4 shows the confusion matrix of the CNN model results. Of the 161 images: 77 iceberg images were classified as iceberg and 9 iceberg images as ship; 11 ship images were classified as iceberg and 64 ship images as ship.

Figure 4.4. Confusion matrix for CNN results

4.4 SSD

From section 3.4.3, it is known that the outputs of SSD differ from those of SVM and CNN: for the 161 test images, the output is 161 images with a bounding box added around the detected object. Here, we first draw the confusion matrix of the SSD results, shown in Figure 4.5. Of the 161 images: 65 iceberg images were classified as iceberg and 21 iceberg images as ship; 9 ship images were classified as iceberg and 66 ship images as ship.
