
Master of Science Thesis in Computer Science

Department of Electrical Engineering, Linköping University, 2019

Object Detection in Domain Specific Stereo-Analysed Satellite Images


Fredrik Grahn and Kristian Nilsson

LiTH-ISY-EX–19/5254–SE

Supervisors: Felix Järemo-Lawin, ISY, Linköping University
             Bertil Grelsson, Saab Dynamics AB
Examiner: Maria Magnusson, ISY, Linköping University

Computer Vision

Department of Electrical Engineering, Linköping University

SE-581 83 Linköping, Sweden


Abstract

Given satellite images with accompanying pixel classifications and elevation data, we propose different solutions to object detection. The first method uses hierarchical clustering for segmentation and then employs different methods of classification. One of these classification methods used domain knowledge to classify objects while the other used Support Vector Machines. Additionally, a combination of three Support Vector Machines was used in a hierarchical structure, which outperformed the regular Support Vector Machine method in most of the evaluation metrics. The second approach is more conventional, with different types of Convolutional Neural Networks: a segmentation network was used as well as a few detection networks and different fusions between these. The Convolutional Neural Network approach proved to be the better of the two in terms of precision and recall, but the clustering approach was not far behind. This work was done using a relatively small amount of data, which potentially could have impacted the results of the Machine Learning models negatively.


Acknowledgments

We would like to thank our supervisors, Bertil and Felix, as well as our examiner, Maria, for their engagement in this work. We also thank Saab Dynamics for this opportunity and our fellow thesis workers at Saab for the company during this time. We would also like to thank the coffee machine on our floor for the energy it provided.

Additional acknowledgements go to Overleaf for their excellent LaTeX editor, which helped tremendously in writing the report.

Linköping, May 2019 Fredrik Grahn and Kristian Nilsson


Contents

Notation

1 Introduction
  1.1 Input data
  1.2 Motivation
  1.3 Aim
  1.4 Research Questions
  1.5 Delimitations
  1.6 Limitations
  1.7 Division of labour

2 Theory
  2.1 Evaluation Metrics
  2.2 Clustering
  2.3 Linear Discriminant Analysis
  2.4 Support Vector Machines
  2.5 Neural Networks
    2.5.1 Training
    2.5.2 Fully Convolutional Network
    2.5.3 Faster Region-based Convolutional Neural Network
    2.5.4 You Only Look Once
    2.5.5 Network Fusion
  2.6 Hough Transform

3 Related Work
  3.1 Domain Knowledge
    3.1.1 Specific Techniques
  3.2 Clustering
  3.3 Support Vector Machines
  3.4 Segmentation and Detection Networks

4 Method
  4.1 Objects
  4.2 Input Data
    4.2.1 Image Data
    4.2.2 Domain Knowledge
    4.2.3 Annotation
  4.3 Clustering Approach
    4.3.1 Clustering
    4.3.2 Pre-Processing
    4.3.3 Region Proposal
    4.3.4 Region Classification
    4.3.5 Support Functions
  4.4 Deep-learning Approach
    4.4.1 Image Augmentation
    4.4.2 Image-types
    4.4.3 Inference Before Testing
    4.4.4 Implementation Details
  4.5 Post-processing
  4.6 Evaluation
    4.6.1 Segmentation Evaluation
    4.6.2 Detection Evaluation
    4.6.3 Partial Results

5 Result
  5.1 Clustering Approach
    5.1.1 Domain Knowledge
    5.1.2 Support Vector Machine
    5.1.3 Split Support Vector Machines
  5.2 Deep-learning Approach
    5.2.1 FCN
    5.2.2 Faster R-CNN
    5.2.3 Multiple-Model Fusion
    5.2.4 YOLO

6 Discussion
  6.1 Annotation
  6.2 Result
  6.3 Clustering Approach
  6.4 Deep-learning Approach
  6.5 Future Work
    6.5.1 Clustering Approach
    6.5.2 Deep-learning Approach

7 Conclusion

A Algorithms

References


Notation

Abbreviations

Abbreviation  Description
CNN           Convolutional Neural Network
IoU           Intersection over Union
NN            Neural Network
FCN           Fully Convolutional Network
SVM           Support Vector Machine
HSV           Hue, Saturation, Value
LDA           Linear Discriminant Analysis
R-CNN         Region-based CNN
YOLO          You Only Look Once


1 Introduction

Many areas of work require human operators for the task of detecting objects and analysing their interactions. For example, rescue operators may rely on quickly locating a person in aerial images produced by unmanned aerial vehicles. At the same time, active fields of research, such as computer vision, explore the use of computers and artificial intelligence to quickly find objects within images. With growing computational power, some techniques are able to reach human-level performance and, in some cases, even surpass it [26].

In recent years, developments have led to new methods that not only can classify images, but also identify and separate objects in the image [22]. Convolutional Neural Networks (CNN) that made significant advancements in image classification [11, 26, 55] are now being used for image segmentation and object detection as well [10, 50]. These networks have also been successfully applied to other computer vision problems such as autonomous driving, where they can be used to detect objects using other kinds of data such as sensor data [20].

In addition to these supervised learning techniques, there are other techniques that have proven useful. For image segmentation, clustering using either hierarchical algorithms or k-means is a popular alternative to CNNs [8, 12, 25, 52, 60]. These methods have the advantage of not requiring the large amounts of training data that many supervised methods require.

For this work we used the methods mentioned above within the domain of airports, in which there is a great deal of prior domain knowledge. For example, it is known that there is at least one runway, often some hangars and maybe a terminal. The runway and the hangars are both connected to taxiways that the aircraft use before takeoff and after landing. This information could be used to make assumptions about unknown objects or help with the object classification.

The thesis was done at Saab Dynamics (https://saab.com/) and their image-processing department. Saab is an aerospace and defence company that is, among other things, known for their military aircraft. At their image-processing department, they do object detection, tracking and more.

1.1 Input data

The input data given for this thesis consist of satellite ortho-photo images (Figure 1.1). An ortho-photo is geometrically corrected so that it lacks perspective view. With these images, there are also pixel-by-pixel classifications (Figure 1.2) and topographical data (Figure 1.3). The pixel classifications consist of broad classifications such as "Ground" and "Building", and not all of them are of interest for this thesis. These images are supplied by Vricon, who have used stereo-images to generate the elevation data and a CNN for the classifications.

1.2 Motivation

The problem of object detection in remotely sensed images comes with many challenges such as availability of data and choice of method. Yet, the problem is very interesting from a technical point of view. This is because the state-of-the-art techniques are constantly evolving, becoming more and more complex as the hardware capabilities increase.

Many of the tasks performed by an object detector require great speed and accuracy. For example, locating a person in one of a few thousand images, or locating other vehicles in traffic from a constant stream of images to avoid collisions. To achieve the required accuracy when detecting objects, large amounts of data are used to train complex models. With our research questions, we want to explore whether a satisfactory performance can be achieved when the amount of data is limited. In addition to our available data, we want to find a way to incorporate human knowledge about the domain into our models with the hope of increased performance.

1.3 Aim

The aim of this thesis is to locate and classify different objects in satellite images of airports. The objects to be detected are within the boundaries of the airport and other objects outside of the airport are of no interest. Furthermore, we aim to find a way of combining the prior knowledge about the domain with the data to improve detections. The resulting objects should then be presented to the user.


Figure 1.1: Satellite ortho-photo of Linköping airport with surroundings.

Figure 1.2: Classified pixels of Linköping airport with surroundings; blue is water, green is trees, yellow is ground, red is buildings, pink is roads and dark pink is runway.

Figure 1.3: (a) Surface model. (b) Terrain model. (c) Height difference between the surface model and the terrain model. This gives the height of objects such as buildings.


1.4 Research Questions

The research questions for this project are listed below.

1. How can prior knowledge about the domain be combined with clustering to detect objects using the given data?

Prior knowledge of the domain means that it is known that some objects should be present and that these are connected. In this work we investigate whether this knowledge can be used to detect objects in images. Clustering will hopefully be useful to separate some objects from each other and improve upon the initial classifications. The objects can then be extracted and the prior domain knowledge can be used to reason about the type of objects.

2. Can a CNN be trained using our limited data to detect objects of the given domain?

Our data is composed of a small number of airports. In this work we investigate if this is enough to train a model. CNNs have been shown to be good for this task when enough training data is given [50].

3. How does clustering combined with domain knowledge or Support Vector Machines compare to a CNN in terms of precision and recall?

1.5 Delimitations

The domain that is explored in this thesis is limited to airports and no other domains are considered.

No other dataset with annotated pixel classification and elevation data exists. Therefore, there are no pre-trained models which use this type of data.

1.6 Limitations

There is a limited number of airports in the world, which naturally limits the amount of data that is available. Additionally, the data-provider does not have data for every airport. Every airport that is used needs to be annotated by hand.

1.7 Division of labour

The work was divided into two methodologies, the clustering approach (Section 4.3 Clustering Approach) and the deep-learning approach (Section 4.4 Deep-learning Approach). The work on the clustering approach was done by Fredrik Grahn while the work on the deep-learning approach was done by Kristian Nilsson. All the material related to one of those approaches was therefore written by the respective author. This includes the theory and related work pertaining to each of the approaches.


2 Theory

Object classification is a typical classification problem, that is, labelling objects within the data. In the case of images, objects are entities such as structures, trees and vehicles. As this can be seen as a general classification problem, there are many machine learning algorithms that can be applied. For images that cover a large area such as satellite images, the images may contain multiple objects and these have to be extracted before being classified one by one.

Object localisation refers to the task of finding the location of objects in images. This is closely related to image segmentation, where the individual pixels of an image are partitioned into a number of clusters. There exist various methods for image segmentation; one commonly used method is clustering [8, 12, 25]. These segments can be used as a basis for localisation. Semantic segmentation is the task of classifying each pixel in an image.

The task of finding and classifying one object within an image is called classification with localisation, while object detection is the task of localising and classifying many objects within an image [1, 2]. A simple example of the differences between classification, localisation and detection is shown in Figure 2.1.


Figure 2.1: A simple example of the differences between object localisation (left), classification (middle) and detection (right). (a) Object localisation: finding the object within the image. (b) Object classification: classifying the object, here labelling the object found in (a) as a star. (c) Object detection: finding and classifying all the stars within the image.

2.1 Evaluation Metrics

To measure the accuracy of different object detection algorithms, different metrics are used. A common metric is the intersection over union (IoU), also known as the Jaccard index, which is calculated as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad (2.1)$$

where A and B are sets (or, in the case of object localisation, bounding boxes), see for example [3, 40]. IoU gives a value between 0 and 1 of how well the produced bounding box matches the ground truth. This is often used in combination with precision or recall, see for example [21, 36, 50]. Precision p is calculated as

$$p = \frac{TP}{TP + FP} \qquad (2.2)$$

and recall r is calculated as

$$r = \frac{TP}{TP + FN}, \qquad (2.3)$$

where TP denotes true positives, FP denotes false positives and FN denotes false negatives [36]. IoU, precision and recall are visualised in Figure 2.2.


Figure 2.2: Illustration of the three different evaluation metrics used (IoU, precision and recall), where green is FN, red is FP and blue is TP. The coloured parts are used in the calculation. Red and blue combined are the prediction, green and blue combined are the ground truth and blue is the intersection.

Sometimes a high precision may come at the cost of lower recall, and a high recall may come at the cost of lower precision. To compare results when one model has better precision and the other has better recall, the F1-score is often used [24]. The F1-score is calculated as

$$F_1 = \frac{2pr}{p + r}, \qquad (2.4)$$

where p is the precision and r is the recall. This often gives a good indication of how good a model is if there are no preferences towards having a high precision or a high recall.
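These metrics are simple to compute; as an illustration (not code from the thesis), a minimal Python sketch for axis-aligned bounding boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union (Equation 2.1) for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score (Equations 2.2-2.4)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # 25 / 175 ≈ 0.143
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```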

2.2 Clustering

Clustering is a form of exploratory data analysis, where hidden structures in unlabelled data are found. The unlabelled data is partitioned into discrete sets, where each set represents a cluster [62]. Hierarchical clustering is a group of popular techniques for, among other problems, image segmentation [8, 25, 52, 60]. These algorithms partition the data using hierarchical treelike structures [62].

In the case of agglomerative hierarchical clustering, the algorithm starts with each data point having its own cluster [62]. These are then merged together to form larger clusters. The clusters are merged based on a distance criterion such as the single linkage criterion, where the distance between two clusters is defined by the distance between the two closest points. Another criterion, which minimises the variance between clusters, is Ward's minimum variance method [39]. Most linkage methods use a dissimilarity or distance function d(i, j), which denotes the distance or dissimilarity between clusters i and j. Furthermore, the distance between two merged clusters i and j and a third cluster k can be generalised to the formula

$$d(i + j, k) = a_i \cdot d(i, k) + a_j \cdot d(j, k) + b \cdot d(i, j) + c \cdot |d(i, k) - d(j, k)|, \qquad (2.5)$$

where $a_i$, $a_j$, b and c are coefficients that depend on the method used. More details on the different values used by different methods can be found in [39].
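Equation (2.5) is the Lance–Williams recurrence; as a small worked example, the single linkage criterion corresponds to the coefficients $a_i = a_j = \frac{1}{2}$, $b = 0$, $c = -\frac{1}{2}$, which reduces the update to the minimum of the two distances (the distance values below are made up):

```python
def single_linkage_update(d_ik, d_jk, d_ij):
    """Lance-Williams update (Equation 2.5) with single linkage coefficients."""
    ai = aj = 0.5
    b, c = 0.0, -0.5
    return ai * d_ik + aj * d_jk + b * d_ij + c * abs(d_ik - d_jk)

# Equals min(d_ik, d_jk): the distance between the two closest points.
print(single_linkage_update(d_ik=2.0, d_jk=5.0, d_ij=3.0))  # 2.0
```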

2.3 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a form of feature extraction that aims to find the linear combination of features which is best able to separate the different classes [38]. In its simplest form, LDA finds a direction to project two-dimensional data in a way that maximises the separation between the two classes. If we have two sets of data points $X_1 = \{x_1^1, x_2^1, \ldots, x_{l_1}^1\}$ and $X_2 = \{x_1^2, x_2^2, \ldots, x_{l_2}^2\}$ corresponding to two different classes, this direction is given by maximising

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad (2.6)$$

where

$$S_B = (m_1 - m_2)(m_1 - m_2)^T, \qquad (2.7)$$

$$S_W = \sum_{i=1,2} \sum_{x \in X_i} (x - m_i)(x - m_i)^T \qquad (2.8)$$

and

$$m_i = \frac{1}{l_i} \sum_{j=1}^{l_i} x_j^i. \qquad (2.9)$$

Equation (2.6) is the well-known Fisher's Linear Discriminant. Equation (2.7) is the squared mean difference between the two sets of data points and (2.8) is the variance of the two sets. By maximising J(w), a direction w is found which ensures that the distance between the means of the classes is maximised while the variance of each class is minimised. The data points of both classes are then projected in this direction, as shown in Figure 2.3.
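Under the assumption that $S_W$ is invertible, maximising (2.6) has the closed-form solution $w \propto S_W^{-1}(m_1 - m_2)$; a minimal NumPy sketch with randomly generated toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))  # class 1 samples
X2 = rng.normal([4, 2], 1.0, size=(50, 2))  # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter S_W (Equation 2.8).
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Direction maximising Fisher's criterion (Equation 2.6).
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)

# Project both classes onto w, as illustrated in Figure 2.3.
proj1, proj2 = X1 @ w, X2 @ w
```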


Figure 2.3: The data points of two classes being projected on a vector w. The vectors $m_1$ and $m_2$ are the means of each class. This is what is done by LDA.

The kernel trick, which is commonly applied to SVMs, can also be applied to LDA for nonlinear feature extraction [38]. Furthermore, a generalisation of LDA exists that makes use of the singular value decomposition to avoid some numerical problems while being more computationally demanding [42].

2.4 Support Vector Machines

A Support Vector Machine (SVM) is a binary classifier that is used to separate classes from each other in data composed of any number of dimensions [6]. This binary classifier is also easily adaptable to multi-class problems such as object classification [13, 27].

A linear SVM separates groups of points with a hyperplane, as shown in Figure 2.4. Below is a description of the linear SVM, mainly based on [64]. The SVM maximises the separation between the different groups by finding the hyperplane with the largest margin. In this case, the margin is defined as the shortest distance between the hyperplane and the closest point. Figure 2.4 shows a linear decision boundary that is a hyperplane in a two-dimensional space. This decision boundary can, however, be made nonlinear by defining a mapping from the input space to the feature space using, for example, the kernel trick.

In the case that the data are linearly separable and free from noise, we can define the hard-margin SVM. The data points are of the form

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}, \qquad (2.10)$$

where $x_i$ is an n-dimensional real vector and $y_i$ is either 1 or −1 depending on the class. We can then define the separating hyperplane as

$$w \cdot x - b = 0, \qquad (2.11)$$

where w is the weight vector and b is the bias. To classify a data point, it must be on the correct side of the hyperplane, which corresponds to the conditions

$$\begin{cases} w \cdot x_i - b > 0 & \text{if } y_i = 1 \\ w \cdot x_i - b < 0 & \text{if } y_i = -1 \end{cases} \qquad (2.12)$$

This can then be rewritten as

$$y_i(w \cdot x_i - b) > 0, \quad \forall (x_i, y_i) \in D. \qquad (2.13)$$

To maximise the margin (that is, the separability of the groups), (2.13) becomes

$$y_i(w \cdot x_i - b) \geq 1, \quad \forall (x_i, y_i) \in D. \qquad (2.14)$$

The points that satisfy (2.14) with equality are called support vectors, hence the name Support Vector Machines.


Figure 2.4: The decision boundary of a linear SVM. The hyperplane (in this case, a line) that separates the different classes is the middle line; the two dotted lines are formed by (2.14). Inspired by Yu and Kim [64].

The distance between the separating hyperplane and a support vector is the margin, which equals $\frac{1}{\|w\|}$.

Lastly, the optimisation problem that defines the SVM is:

$$\begin{aligned} \text{minimise:} \quad & Q(w) = \frac{1}{2}\|w\|^2 \\ \text{subject to:} \quad & y_i(w \cdot x_i - b) \geq 1, \; \forall (x_i, y_i) \in D \end{aligned} \qquad (2.15)$$

By introducing slack variables $\xi_i$, SVMs can be adapted to produce a soft margin. The optimisation problem then becomes:

$$\begin{aligned} \text{minimise:} \quad & Q(w, b, \xi_i) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \\ \text{subject to:} \quad & y_i(w \cdot x_i - b) \geq 1 - \xi_i, \; \forall (x_i, y_i) \in D \\ & \xi_i \geq 0 \end{aligned} \qquad (2.16)$$

This is the soft-margin SVM, which allows some data points to be on the wrong side of the hyperplane, as controlled by the C parameter. A larger C leads to less training error but probably also a narrower margin. The C parameter thus offers a tradeoff between maximising the margin and minimising the training error.

Different techniques exist to adapt the binary SVM classifier for multi-class classification [27]. The two main approaches are to use either the one-against-one or the one-against-all scheme, both of which require multiple classifiers to be trained and thus many optimisation problems to be solved. The one-against-all scheme trains one classifier for each of the classes to separate that particular class from the rest, resulting in k classifiers where k is the number of classes. The one-against-one scheme trains one classifier for each pair of classes to only separate those two classes, resulting in k(k − 1)/2 classifiers. In both cases, the different classifiers vote for the most likely class at prediction time.
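As an illustration, scikit-learn's SVC implements the one-against-one scheme; a minimal sketch on randomly generated toy data (all values hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with k = 3 classes, so the one-against-one scheme trains
# k(k - 1)/2 = 3 underlying binary classifiers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0, 4, 8)])
y = np.repeat([0, 1, 2], 20)

# C trades off margin width against training error (Equation 2.16).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.1, 0.2], [7.9, 8.1]]))  # expected: [0 2]
```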

2.5 Neural Networks

In this thesis, different types of Neural Networks (NN) were used. Exactly how a NN works will not be presented here, but is described in the book "Deep Learning" by Goodfellow et al. [23]. A Convolutional Neural Network (CNN) is a type of Neural Network that, in addition to fully connected layers, consists of convolutional and pooling layers. The output from a classical CNN is often a class prediction.

2.5.1 Training

A NN can be trained using data with an associated ground truth. The data go through the NN and in the end, it makes a prediction [23]. The prediction can be a value, a class or a segmented image depending on the type of the NN. A loss value is calculated from the difference between the prediction and the ground truth according to a loss function. This loss is passed back through the network in a process called back-propagation. The process of back-propagation updates the weights of the network using the gradient of the loss function. This gradient is then multiplied by the learning rate which adjusts the process of learning by lowering the rate at which the weights change.

When training a NN for a long time with a lot of data, the learning rate is often lowered throughout the training; this is called learning rate decay [23]. This often happens after a pre-specified amount of time. When training a NN, the training duration is measured in iterations and epochs. One epoch has passed when all training data have gone through the network and one iteration is one batch going through the network. A batch is one or more training examples that are sent through the network before back-propagation is done. For example, if the training data consist of 15000 images and a batch consists of 100 images, one epoch will consist of 150 batches going through the network.

2.5.2 Fully Convolutional Network

A Fully Convolutional Network (FCN) is a CNN without the common fully connected layers at the end [32]. The FCN can be used as a semantic segmentation network, which produces a pixel-wise classification of an image. This pixel-wise classification can be seen as a heat-map.

The heat-map can be used to predict where there are objects in the input image [32]. However, as pooling makes the output smaller than the input, the heat-map is smaller than the input image and needs to be up-sampled before its pixels correspond to pixels in the image. This up-sampling is called deconvolution and is not the exact reverse of convolution.

The FCN method has different ways to obtain the segmented image [32]. It can take the heat-maps (one for each class) directly from the end of the network and up-sample them 32 times; this is called FCN-32s. It can also use the output from a middle layer combined with the output from the end. The heat-maps from the end are up-sampled to the same size as the heat-maps from the middle layer. Then the heat-maps are summed pixel-by-pixel to create new heat-maps. Finally, the new heat-maps are up-sampled 16 times and this is thus called FCN-16s. It has better accuracy for each pixel compared to FCN-32s. The procedure can be performed in one more step by adding an even earlier layer, resulting in a segmentation called FCN-8s. A visual representation of how these three different outputs are produced from a basic FCN structure is shown in Figure 2.5.
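As a rough sketch (not the implementation used in the thesis), the FCN-16s fusion step can be written in PyTorch as below, with bilinear up-sampling standing in for the learned deconvolution described above and hypothetical tensor names:

```python
import torch.nn.functional as F

def fcn16s_fuse(end_heatmaps, middle_heatmaps):
    """Fuse the heat-maps from the network end with those from a middle layer.

    end_heatmaps:    (N, classes, H/32, W/32), from the last layer.
    middle_heatmaps: (N, classes, H/16, W/16), from a middle layer.
    """
    # Up-sample the end heat-maps by 2 to match the middle layer size.
    up2 = F.interpolate(end_heatmaps, scale_factor=2, mode="bilinear",
                        align_corners=False)
    # Sum pixel-by-pixel, then up-sample 16 times to the input image size.
    fused = up2 + middle_heatmaps
    return F.interpolate(fused, scale_factor=16, mode="bilinear",
                         align_corners=False)
```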


Figure 2.5: Representation of an FCN with three different outputs, FCN-32s, FCN-16s and FCN-8s. The image (grey square) to the left is the input image and the images (grey squares) to the right are the output heat-maps. Each up-sampling is done with a factor of two. The number in the name of each output represents the factor by which the heat-map is up-sampled. Image source: [16]

2.5.3 Faster Region-based Convolutional Neural Network

Faster Region-based Convolutional Neural Network (Faster R-CNN) is an object detection method [50]. Faster R-CNN consists of a Region Proposal Network (RPN) which proposes regions and a classification network that classifies each region proposed by the RPN. The RPN and the classification network share an FCN while having separate fully connected layers for their different tasks.

The RPN does the localisation part of the detection. It takes images as input and outputs bounding boxes with an accompanying score of how likely the region is to contain an object. The RPN uses a sliding window technique to propose boxes of various shapes and sizes at each location. Usually, three different sizes and three different shapes are proposed, i.e. nine boxes. Each of these boxes gets four values representing the coordinates of the two opposite corners and two scores representing probabilities for it being an object and not being an object. The four values for the positioning are put through a box-regression layer and the two object probabilities are put through a box-classification layer.
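For illustration, generating the nine boxes at one sliding-window position can be sketched as below; the base sizes and aspect ratios are hypothetical, as the actual values depend on the configuration:

```python
def make_anchors(cx, cy, sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return nine anchor boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    boxes = []
    for size in sizes:        # three different sizes
        for ratio in ratios:  # three different shapes
            w = size * ratio ** 0.5   # width/height = ratio,
            h = size / ratio ** 0.5   # area = size squared
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# One anchor set per sliding-window location, e.g. at pixel (400, 300).
anchors = make_anchors(400, 300)
```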

During training, proposed boxes that have an IoU over 0.7 with a ground-truth box are counted as good predictions and boxes with an IoU below 0.3 are counted as bad predictions. If no proposed box has an IoU of 0.7 or higher, the proposed box with the highest IoU is counted as a good prediction. Boxes that have an IoU between 0.3 and 0.7 are counted as neither good nor bad predictions.

The good and bad boxes are used to calculate a loss function that is described in [50]. This loss is then used to train the RPN using back-propagation. This is done using mini-batches of 256 boxes with a balanced amount of good and bad examples.

The input to the object classification part is the heat-maps generated by the FCN and the proposed regions from the RPN. Each region is associated with the corresponding part of the heat-map. The classification layers perform max pooling on the heat-map regions and classify them as one of the object classes or as background.

The RPN and the classification network are not combined from the start. Instead, they are combined after each part has been individually trained. Both networks use a pre-trained model trained on the ImageNet dataset [11]. These pre-trained models are combined with each of the models' unique layers and individually fine-tuned. After this, the unique layers of the RPN are fine-tuned with the now shared FCN from the classification part. These shared layers are now fixed, meaning that they cannot change. After this, the unique layers of the classification are again fine-tuned.

2.5.4 You Only Look Once

You Only Look Once (YOLO) is an object detection method [48]. YOLO divides every input image into an S × S grid. Each cell in this grid does a class prediction and produces different bounding boxes with a confidence score. Both the class prediction and the bounding box generation use the same CNN. Each cell utilises the information from the whole image when predicting.

The CNN is pre-trained on the ImageNet dataset [11] before being converted to the detection model by adding convolutional and fully connected layers with randomly initialised weights [47, 48]. The last of these layers does the class prediction and produces the bounding boxes.

As mentioned above, each cell produces more than one bounding box, but only one box is needed for the prediction in that cell at one time [48]. At training, the box with the highest IoU is selected to be used. This way, different boxes in a cell get adjusted to fit different shapes.

The used boxes and the cells that contain an object are used to calculate a loss function defined in [48]. This loss is used for back-propagation when training the network.

2.5.5 Network Fusion

Network fusion is the process of taking two or more networks and combining them in some way. It can be the same type of network with different inputs or it can be completely different networks [14, 57].

Late Sum Fusion (LSF) for segmentation networks, such as an FCN, sums each corresponding value of the heat-maps with the same class from the networks before the classification step [17]. A small example of LSF is shown in Figure 2.6.

Late Max Fusion (LMF) for segmentation networks, such as an FCN, takes the maximum of each corresponding value from the heat-maps and makes a new heat-map [17]. This can be thought of as taking the prediction on each pixel from the network that is most sure about its prediction. A small example of LMF is shown in Figure 2.7.

[2 3 4]   [4 6 5]   [6 9 9]
[3 0 4] + [4 1 1] = [7 1 5]
[4 2 0]   [2 1 5]   [6 3 5]

Figure 2.6: Two 3 × 3 heat-maps fused together into one using sum fusion.

[2 3 4]   [4 6 5]   [4 6 5]
[3 0 4] , [4 1 1] → [4 1 4]
[4 2 0]   [2 1 5]   [4 2 5]

Figure 2.7: Two 3 × 3 heat-maps fused together into one using max fusion.
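In code, both fusions are element-wise operations over heat-maps of the same class; a minimal NumPy sketch using the values from the figures:

```python
import numpy as np

# Two heat-maps for the same class from two different networks.
a = np.array([[2, 3, 4], [3, 0, 4], [4, 2, 0]])
b = np.array([[4, 6, 5], [4, 1, 1], [2, 1, 5]])

lsf = a + b             # Late Sum Fusion (Figure 2.6)
lmf = np.maximum(a, b)  # Late Max Fusion (Figure 2.7)
```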

It is also possible to do fusion between a detection network and a segmentation network [14]. One way of doing this is to only use a detection if the segmentation inside the detection area covers an area above a certain threshold.

2.6 Hough Transform

The Hough transform is a method for detecting straight lines in images [15]. This is done by transforming each point in an image to a curve, e.g. a sinusoid in a parameter space. Then a point (x, y) in the image is represented as a sinusoidal curve in the parameter space,

$$p = x \cos(\theta) + y \sin(\theta), \qquad (2.17)$$

where p is the distance from the origin and θ is the angle of its normal. If we have a set of points in an image $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, these are transformed into sinusoidal curves in the parameter space. It can then be shown that points on a line represent sinusoids with a common point of intersection. This point represents the line. An inverse Hough transform of the point gives a perfect line in the image.

The inverse transform can be obtained by first dividing the (p, θ) plane into a grid [15]. The grid is represented as an accumulator array in which each cell contains the number of curves in the parameter space that pass through that cell. Additionally, the (p, θ) plane is constrained such that its size is not infinite. After all points of the image have been processed, the cell with the highest count has the most curves passing through it. Based on the constraints and the size of the cells, there will be some quantisation error.
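A minimal sketch of this accumulator voting (grid resolution and image points are hypothetical):

```python
import numpy as np

def hough_accumulator(points, p_max, n_theta=180, n_p=200):
    """Vote in a (p, theta) accumulator; the peak cell encodes a line."""
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_p, n_theta), dtype=int)
    for x, y in points:
        # Each point becomes a sinusoid p(theta) in the parameter space.
        p = x * np.cos(thetas) + y * np.sin(thetas)
        # Constrain p to [-p_max, p_max] and quantise it into grid cells
        # (this quantisation causes the error mentioned above).
        rows = np.round((p + p_max) / (2 * p_max) * (n_p - 1)).astype(int)
        acc[rows, np.arange(n_theta)] += 1
    return acc

# Points on the line y = x all vote for one common (p, theta) cell.
acc = hough_accumulator([(i, i) for i in range(20)], p_max=30.0)
print(np.unravel_index(acc.argmax(), acc.shape))
```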


3 Related Work

The following chapter contains different methods for solving the object detection task, specifically detection in remotely sensed images, as well as methods that can be used to solve different sub-tasks. Different ways of using domain knowledge for object localisation and classification are presented in Section 3.1 Domain Knowledge. Section 3.2 Clustering contains different clustering techniques and ways of utilising these for the task of image segmentation. Section 3.3 Support Vector Machines contains comparisons between SVMs and other classifiers as well as different applications of SVMs that are related to this work. In Section 3.4 Segmentation and Detection Networks, the usage of different state-of-the-art techniques for semantic segmentation and object detection is presented.

3.1 Domain Knowledge

Domain knowledge can be any type of knowledge concerning the given domain. It could be information that everyone knows or information from an expert in the field. In this context, domain knowledge is used to help in the localisation or classification of objects.

Forestier et al. [19] present an approach that uses domain knowledge to classify the semantics of coastal images. First the objects are split into larger groups like vegetation and water and then they are split further into smaller subgroups like fields and wooded areas. Each group has information about neighbouring groups, distances between groups and their elevation. Some examples of this are that a beach is a neighbour of an ocean or a lake, or that a wooded area may be 100–200 m².

Luo et al. [33] use the same approach of splitting big groups into smaller subgroups. Low-rise buildings are split into subgroups according to their different roof-types and then combined after classification. Different attributes are used to calculate the texture and shapes of different objects.


According to Yu et al. [65], domain knowledge in anatomy can help in the segmentation of different organs. By using FCNs combined with knowledge of the positions of organs, they improve the results of segmenting organs. They also show that if only one side of the body is used for training, the results worsen when segmenting the other side.

3.1.1 Specific Techniques

Different techniques can be used when searching for different kinds of objects. There is a big difference between roads and buildings in an image, especially if there is accompanying elevation data.

Leninisha and Vani [31] propose a method for detecting all connected roads, by using only one road pixel. They start by finding the middle of the road and the width by searching out in every direction until two edges are reached. Then the road is followed in both directions. An intersection is found when the width of the road suddenly becomes larger. This can also be used for road-like objects like canals or dirt paths.

Ok [41] uses shadows in satellite images and information about where the sun was at the time when the image was taken to detect buildings. By knowing the position of the sun, they calculate the height by looking at how far away the shadow is cast. The height can then be used to classify if it is a building or not.

3.2 Clustering

Clustering is often used as a means of segmenting an image rather than as a way to improve an existing segmentation [8, 12, 25]. However, combining segmentation and clustering has been done in the work by Wuttke et al. [60]. They first roughly segment remote sensing images to obtain super-pixels, which are groups of related pixels. This greatly reduces the amount of data. They then apply a hierarchical clustering algorithm to refine the previous segmentation. Wuttke et al. also show that the initial segmentation step greatly increases the accuracy of the final clustering. In addition to the segmentation and clustering steps, they use a final active learning step in which the clustering is further improved.

Clustering techniques, such as hierarchical clustering, have also been used together with three-dimensional surface topography similar to the topographical data shown in Figure 1.3. In the work by Senin et al. [52], they mention that the structural similarities between three-dimensional surface topography and digital images make many techniques used in image processing easily adaptable to work with three-dimensional topography data.

Fuzzy clustering is another popular technique in combination with image segmentation [8, 25]. These techniques exploit fuzzy set theory, in which elements have a degree of membership towards a set that is defined by a membership function [25].


3.3 Support Vector Machines

Thanh Noi and Kappas [56] compare the performance of Random Forest, k-nearest Neighbours and SVM classifiers using remote sensing data. In addition to the accuracy of the classifiers, they compare how well these perform using different sub-datasets of various sizes. Thanh Noi and Kappas conclude that the SVM classifier, on average, produces the highest accuracy while being the least sensitive to the size of the dataset. SVMs are also efficient when the amount of data is less than the dimensionality of the feature space, as shown by Pontil and Verri [45]. This is often the case with image data, especially with high-resolution images, as each pixel is regarded as one feature. Foody and Mathur [18] also compare SVMs to other classifiers such as NNs using remotely sensed data. They found that for multi-class classification, SVMs performed very well and achieved high accuracy, similar to that of NNs.

Melgani and Bruzzone [37] also use SVMs in combination with remote sensing images. They describe and use a hierarchical tree-based approach in which multiple SVM classifiers are used in a hierarchical structure. This structure consists of binary classifiers that either separate two classes or two groups of classes. Melgani and Bruzzone conclude through their experiments that this type of approach showed a lower discrimination capability compared to traditional one-against-one and one-against-all schemes of SVM multi-class classification. They mention that this may be caused by the risk of error propagation through the hierarchical structure. Furthermore, their experiments show that the hierarchical approach had a lower computational time and that the choice of method may be based on a tradeoff between computational time and discriminating ability.

Chapelle et al. [7] classify images using SVMs in combination with histogram features. They transform the images to an HSV (Hue, Saturation, Value) representation, where the colour components (Hue and Saturation) are separated from the luminance component (Value). This way, images in this representation are less affected by changes to the illumination. After this transformation, they summarise each colour component in a histogram of 16 bins and use this as one of their features when classifying. With these features, Chapelle et al. achieve low error rates on their testing sets.

3.4 Segmentation and Detection Networks

Marmanis et al. [35] present an FCN model where they use both RGB images and a digital height model to segment satellite images. They combine the height model and the RGB image late in the process and also change the structure of the FCN from the original version. Their implementation of the FCN delivered a state-of-the-art performance on the dataset used.

Kaiser et al. [30] use satellite images combined with coordinates of roads and buildings to train an FCN. This FCN segments the three classes "Building", "Road" and "Other". They have images from different cities and train on some cities and evaluate on others. The results varied, but when training on more cities, the results improved. Evaluating on images from the same city as cities in the training data proved to give better results.

Marušić et al. [36] use the Faster R-CNN model for detecting humans in aerial images and achieve results close to human performance. They tested how well the RPN performed by calculating precision and recall with different ratios of prediction boxes to ground truth boxes. With a one-to-one ratio, the recall was the lowest and it increased to almost 100% when the ratio was two-to-one. The precision decreased according to the percentage of ground truth boxes compared to prediction boxes.

Sommer et al. [54] test the performance of Faster R-CNN on detection of vehicles in aerial images. Two datasets are used in this work. The first is located over Munich in Germany and contains only urban and residential areas. The second is located over Utah in the USA and contains both urban and rural areas. They compared Faster R-CNN with other state-of-the-art detectors, where Faster R-CNN performed the best.

Wu et al. [59] show that detection of airports together with a few other objects is possible with the YOLO model. Additionally, it is much faster than other detection algorithms. It has some disadvantages when it comes to localisation accuracy and when objects are in groups. However, they use the first version of the YOLO model, and the localisation accuracy has been improved in newer models [47].

Xia et al. [61] present a dataset with aerial images for object detection. These images contain small objects such as cars and boats as well as larger objects such as bridges and harbours. Different detection models were evaluated on the dataset. Two of the methods evaluated were Faster R-CNN and YOLO. The Faster R-CNN was the best model evaluated and YOLO had some object classes where it performed at a similar level to the Faster R-CNN as well as some classes where it performed significantly worse.


4 Method

The goal of this thesis was to locate and classify objects in satellite images. Different methodologies were tested to find the advantages and disadvantages of the different methods. Two main approaches were tested. The first approach used clustering for localisation and domain knowledge or SVMs for classification. The second approach employed deep-learning using CNNs. Three different CNN methods were tested.

An overview of the data-flow and methodology is shown in Figure 4.1. Elevation data, pixel classifications and image data are described in Section 4.2.1 Image Data. These three types were the given data. Domain knowledge is specified in Section 4.2.2 Domain Knowledge and was the data given by the user. The two main approaches are explained further in Sections 4.3 Clustering Approach and 4.4 Deep-learning Approach.


Figure 4.1: Flow-chart showing the two approaches and their different methods. Inputs to the program are shown to the left. The clustering approach consists of three steps: clustering, region proposal and region classification. The deep-learning approach also consists of three steps: image augmentation, model training and detection/segmentation. Lastly, some post-processing is done on the segmentation outputs in order to produce objects.

4.1 Objects

The objects at an airport that are of interest for this work are:

• Runway
• Taxiway
• Apron/Ramp
• Hangar
• Control tower
• Terminal
• Gate
• Car park
• Highway

These are the most important parts of an airport that are large enough to be seen in a satellite image. An aircraft is large enough to be observed in the images, but because the images are generated from a multi-stereo reconstruction over time, non-stationary objects such as aircraft are suppressed. For more explanations of the different parts of the airport, see


4.2 Input Data

Two different types of input data were used. The first input was the image data with its accompanying pixel classification and elevation data. The second was the domain knowledge that was given by the user. In addition, the images were annotated by hand to provide ground truth for the training and test data.

4.2.1 Image Data

There were three types of images. The first was the ortho-image, which is a parallel projection that lacks perspective view. Compared to a usual satellite image, it has been geometrically corrected. An example is shown in Figure 1.1. The second was the classified image, which was a pixel-by-pixel classification of the ortho-image with each class having a unique code. An example of this is shown in Figure 1.2. The last was the topographical images exemplified in Figure 1.3. The topographical images consisted of three different images: one surface model, one terrain model and one difference between the surface and the terrain models. The last of these three was used by our models because it shows objects, such as buildings, and does not depend on the elevation above sea level. In all these images, the resolution on the ground was 0.5 m.


4.2.2 Domain Knowledge

The user input was given in a text file. It consisted of the data listed below.

1. Name
The name of the domain object. This is not used when comparing objects but is needed in the file to have a way of giving a located object a class.

2. Aspect ratio
A rough estimate of the scale difference between the widths in different directions, calculated as the quotient between the width and the height, where the width is always the larger of the two. The width here does not need to represent a left-to-right length but can be in any direction. Input is a span of accepted values.

3. Size
Size of the object in m². Input is a minimum and a maximum size where all values in between are accepted.

4. Mean elevation
Mean elevation of the object in m. Input is a minimum and a maximum mean elevation where all values in between are accepted.

5. Adjacent
A list of other objects that could be connected to this object. If this is empty there is no restriction. If there are objects in the list, at least one of the objects must be adjacent. Adjacency is not symmetrical, meaning that if object A needs to be adjacent to B, it does not mean that B needs to be adjacent to A.

6. Not-adjacent
A list of other objects that should not be connected to this object. These objects can not be adjacent to each other. Non-adjacency is a symmetric relationship, meaning that if object A can not be adjacent to B, object B can not be adjacent to A.

7. Class
The expected pixel classification of the object. An object can have more than one allowed classification.

Aspect ratio, size and elevation are used to separate objects according to their dimensions, such as separating a taxiway from an apron or separating a tower from other buildings. Adjacent and not-adjacent are used to separate objects according to context. For example, a taxiway should be adjacent to a runway while a gate should not. Class is used for setting which classes from the classified pixels are allowed to be in a certain object. All these features can be estimated by a person. To estimate values for the aspect ratio and size of the domain objects, histograms were constructed containing the aspect ratio and size of different objects. The objects used for this were the objects located by the region proposal in Section 4.3.3 Region Proposal. Moreover, the ground truth that was used together with the SVM classifiers in Section 4.3.4 Region Classification was used to extract all objects of a particular type. By analysing these histograms, reasonable value intervals could be found by excluding some extreme outliers. The structure of this information was inspired by the work of Forestier et al. [19], who used a similar approach when classifying the semantics of coastal images.

The input file used in this work is shown in Table 4.1. For example, a runway is long, giving it possible aspect ratios between 20 and 130. It is also very large, between 50000 and 210000 m². It should be completely flat, so the highest acceptable elevation is 1 m above the ground. A runway has the classification runway and should be connected to the taxiways. As not-adjacent is symmetric, this means that the runway can not be adjacent to any class other than the taxiway.

As the classification image has a class for runways, the runways are easy to find at the beginning. This makes identifying connections between objects easier since all objects are directly or indirectly connected to the runway. For example, the taxiways are directly adjacent to the runway and a gate is directly adjacent to a taxiway.

4.2.3 Annotation

Each image was annotated by hand with quadrilateral polygons around objects. Taxiways and runways were split up into parts. This meant that large objects, such as the runway, consisted of many small annotations. Annotations of Linköping airport are shown in Figure 4.2a. A version with horizontal rectangles, shown in Figure 4.2b, was also constructed from the rectangles in Figure 4.2a. In Table 4.2, the colours of the different objects are explained. Note that the colours in the images can differ from the table because they are slightly transparent and thus depend on the background.


Table 4.1: The structure of the domain knowledge input file used in this work. Here it is separated into three tables, but in the file these are combined into one line per object. The meaning of the features is described in the list in Section 4.2.2 Domain Knowledge.

Name      Aspect   Size           Elevation
Runway    130:20   210000:50000   1:0
Taxiway   75:3     50000:30       1:0
Tower     5:1      500:25         100:10
Hangar    20:1     25000:100      20:5
Apron     60:1     30000:300      1:0
Gate      40:1     15000:100      5:0
Terminal  40:1     40000:100      15:5
CarPark   10:1     50000:1000     1:0
Highway   100:3    25000:1000     5:0

Name      Adjacent           Not adjacent
Runway    [Taxiway]          []
Taxiway   [Runway]           []
Tower     []                 [Runway]
Hangar    [Taxiway, Apron]   [Runway, Tower]
Apron     [Taxiway, Hangar]  [Runway, Tower]
Gate      [Taxiway]          [Runway, Hangar, Tower]
Terminal  [Gate]             [Runway, Taxiway, Hangar, Tower, Apron]
CarPark   [Terminal]         [Runway, Hangar, Apron]
Highway   []                 [Runway]

Name      Class
Runway    [Runway]
Taxiway   [Ground, Road]
Hangar    [Building]
Tower     [Building]
Apron     [Ground, Road]
Terminal  [Building]
Gate      [Ground, Road]
CarPark   [Ground]
Highway   [Ground, Road]


(a) Annotations for Linköping airport used when training the segmentation model and as ground truth when evaluating.

(b) Annotations for Linköping airport with horizontal rectangles used when training the detection models.

Figure 4.2: Example of annotations at Linköping airport; colours are explained in Table 4.2.

Table 4.2: Colour of different objects in the annotations and predictions.

Object          Colour
Hangar          ******
Control tower   ******
Terminal        ******
Gate            ******
Runway          ******
Taxiway         ******
Highway         ******
Apron           ******
Car park        ******

4.3 Clustering Approach

A method for object detection was proposed which includes clustering to locate objects. Figure 4.3 shows the general idea that inspired this method. First the data were clustered using an agglomerative hierarchical algorithm, described in Section 2.2 Clustering. Regions were then extracted from the clusters and filtered to remove regions of no interest. Lastly, the proposed regions were classified using a classifier. Three different classifiers were tested, one of which makes use of the domain knowledge presented in Table 4.1.


Figure 4.3:The general idea behind the clustering approach. First, the data is clustered using hierarchical clustering. The clusters are then extracted and transformed into regions which are filtered. Lastly, the regions are classified using a classifier.

4.3.1 Clustering

The first step of this approach was to cluster the data using an agglomerative hierarchical clustering algorithm (described in Section 2.2 Clustering) to produce a segmentation of the image. The pre-processing of the data is described in Section 4.3.2 Pre-Processing. A relatively large number of clusters was specified to ensure that each object of interest was found. The number of clusters k was determined by

$$k = \frac{1}{2}\sqrt{w h r^2 s^2}, \qquad (4.1)$$

where w and h are the width and height of the image before down-scaling, s is the scale used for down-scaling and r is the resolution of the original image. The original size, the scale used for down-scaling, the resulting resolution and the calculated k for each of the airports are shown in Table 4.3.

Airport     Size        Scale  Resolution (m)  k
Linköping   3649×4985   0.25   2               1066
Norrköping  2108×6388   0.25   2               917
Bromma      3871×4864   0.25   2               1085
Prague      6848×9680   0.25   2               2035
LAX         6159×11501  0.125  4               2104
Atlanta     9757×10818  0.125  4               2568

Table 4.3: The different airports used, their original size after cropping around the runway, the scale used for down-scaling, the final resolution of the down-scaled image and the calculated k used for the clustering.
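As a worked check of Equation (4.1) against Table 4.3 (assuming rounding to the nearest integer), a short Python snippet:

```python
from math import sqrt

def n_clusters(w, h, r, s):
    """Number of clusters k according to Equation (4.1)."""
    return round(0.5 * sqrt(w * h * r**2 * s**2))

print(n_clusters(3649, 4985, r=2, s=0.25))    # Linköping -> 1066
print(n_clusters(6159, 11501, r=4, s=0.125))  # LAX -> 2104
```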

The result of the clustering is shown in Figure 4.4. To calculate distances between objects, the Euclidean distance was used. For the linkage criterion, Ward's method was used, which minimises the variance between clusters. Additionally, the Python package Sklearn [44] and its function AgglomerativeClustering were used to perform the clustering.

Figure 4.4: The results of agglomerative hierarchical clustering when applied to the data discussed in Section 4.3.2 Pre-Processing. Note that the number of colours in the image does not represent the number of clusters; two clusters of the same colour that are not connected do not belong to the same cluster.


4.3.2 Pre-Processing

The RGB ortho-photo, the classified pixel image and the elevation image were combined and normalised to the range [0, 1] by dividing by the maximum pixel value in each image. The reason behind the normalisation was the difference in ranges between the different input data. To ensure that all classes in the classified pixel image were treated equally by the clustering algorithm, a one-hot encoding was used. This means that for each class, the mask of that class was added to the data as a separate feature. For example, if the classified pixel image contained three classified objects of different classes, three masks would be added to the data. The one-hot encoding ensured that the Euclidean distance between pixels was affected in the same way regardless of the class. The result of this was input data with the following features:

• Height difference image, shown in Figure 1.3c.

• RGB channels of the ortho-photo, shown in Figure 1.1.

• A mask for each of the classes found in the classified pixel image, shown in Figure 1.2.

Since the images were of very high resolution, they were down-scaled to a smaller size. This was done to reduce the amount of time and memory needed for the program to run. In particular, the clustering algorithm had a time complexity of at least O(n²) and a memory complexity of O(n²), where n is the number of pixels in the image [62]. This presented a problem when the input images were of full resolution. Additionally, to further reduce the amount of data, the runway was extracted using the classified pixels and all the images were cropped to only include the area around the runway (see Figure 4.5). The size of this area was determined by user input.

Figure 4.5: The RGB image of Linköping airport, cropped around the runway.

Lastly, each feature was transformed into a flat representation with no spatial information. In the case of the RGB ortho-photo, this meant that the three image channels were transformed into three flat vectors. To not lose the important spatial information, a connectivity matrix was constructed to accompany the flat features. The connectivity matrix represented a graph where an edge existed if two pixels were adjacent in the image. The data was flattened to fit the requirements of the Sklearn [44] Python package that was used.
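A minimal sketch of this pre-processing and the clustering step in Section 4.3.1, using the Sklearn functions mentioned (the image size, number of classes and number of clusters are hypothetical stand-ins for the real values):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Hypothetical down-scaled inputs, all normalised to [0, 1]:
h, w = 128, 96
rgb = np.random.rand(h, w, 3)          # RGB ortho-photo
height_diff = np.random.rand(h, w, 1)  # height difference image
class_masks = np.random.randint(0, 2, (h, w, 4)).astype(float)  # one-hot masks

# Flatten each pixel into one feature vector.
features = np.concatenate([height_diff, rgb, class_masks], axis=2)
X = features.reshape(h * w, -1)

# Connectivity matrix: an edge between every pair of adjacent pixels,
# preserving the spatial information lost by the flattening.
connectivity = grid_to_graph(h, w)

# Ward linkage minimises the variance between clusters (Section 2.2).
clustering = AgglomerativeClustering(
    n_clusters=500, linkage="ward", connectivity=connectivity)
labels = clustering.fit_predict(X).reshape(h, w)
```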

4.3.3 Region Proposal

The different clusters produced by the clustering were extracted and presented as a region. A region was characterised by various properties:

• A polygon which surrounds the cluster.

• The mean elevation within the cluster pixels, extracted from the elevation data.

• The area of the cluster.

• The width, length and aspect ratio of the cluster, calculated according to 4.3.5, Sub-section Aspect Ratio Calculations.

• The dissimilarity to the background colour, computed using the ortho-photo.

• How many pixels of each class there are within the cluster, extracted from the classification data.

The polygon was obtained by creating a binary mask with the same dimensions as the input images where the cluster pixels had the value one while the background had the value zero. This mask was then used as input to the Opencv [5] function findContours which produced the polygon.
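A sketch of this step, assuming the OpenCV 4.x return convention for findContours and a boolean cluster mask (names are illustrative):

import cv2
import numpy as np

def cluster_polygon(cluster_mask):
    """Extract the outer polygon(s) of a binary cluster mask."""
    mask = cluster_mask.astype(np.uint8)        # 1 inside the cluster, 0 outside
    contours, _ = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours                             # each contour is an (n, 1, 2) array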

Since the image was down-scaled, the area, width and length of the region were adjusted to account for this. In the case of the area, this meant that the down-scaled area was divided by the squared scale of the image. The other quantities were divided by just the scale. In addition, the resolution of the image was also taken into account before the adjustment. The area A was calculated by

$$A_{adjusted} = \frac{A r^2}{s^2} \quad (4.2)$$

and the width w by

$$w_{adjusted} = \frac{w r}{s} \quad (4.3)$$

where r is the resolution of the original image and s is the scale used for down-scaling. The length was calculated in the same way as the width.
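As a worked example of (4.2), assuming r is given in metres per pixel: with r = 0.5 m/pixel and a down-scaling factor s = 0.25, a region covering 400 pixels in the down-scaled image corresponds to A_adjusted = 400 · 0.5² / 0.25² = 1600 m².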

The dissimilarity to the background colour was calculated by taking the mean of the Euclidean norms between the colours in the cluster and the background colour and normalising the result, that is

$$\frac{1}{n \lVert c_{max} \rVert_2} \sum_{i=1}^{n} \lVert c_{bg} - c_i \rVert_2 \quad (4.4)$$


where $n$ is the number of pixels in the cluster, $c_{bg}$ is the background colour vector and $c_i$ is the colour vector of the i:th pixel. The maximum colour vector $c_{max}$ is defined as the longest possible distance from the origin, which is $\lVert(255, 255, 255)\rVert_2$. Figure 4.6 shows the effects of filtering using different thresholds for the background dissimilarity. The background colour vector was calculated by taking the mean of all pixels which were classified as ground in the pixel classified data.
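A direct translation of (4.4) into code, assuming the cluster colours are given as an (n, 3) array (names are illustrative):

import numpy as np

def background_dissimilarity(cluster_rgb, background_rgb):
    """Mean distance to the background colour, normalised to [0, 1] (4.4)."""
    c_max = np.linalg.norm([255.0, 255.0, 255.0])            # longest RGB vector
    dists = np.linalg.norm(cluster_rgb - background_rgb, axis=1)
    return float(dists.mean()) / c_max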

As mentioned before, a polygon was used as a form of bounding box to represent the localisation of an object. As the other properties contained no information about the location and shape of the region, the polygon was needed to provide this information. The mean elevation, area, width, length and aspect ratio were used to provide summaries of the pixels that the region represented. This was done to ensure a uniform representation of a region, such that regions could easily be compared. The background dissimilarity was introduced to distinguish objects that were classified as ground in the data but were in reality not a ground object, e.g. to distinguish some roads from the surrounding background. Lastly, the pixel class counts were used to determine the most common class of the region.

Since the pixels containing the runway(s) were already known from the classification data, these were automatically extracted and transformed into a region which was then added to the list of regions. The regions containing the runway(s) were then merged with overlapping regions in the filtering step discussed in the next section.

(a) Located regions that featured a background dissimilarity greater than 0.02. This was essentially all of the regions.

(b) Located regions that featured a background dissimilarity greater than 0.07.

Figure 4.6: The effects of filtering the located regions using different background dissimilarity thresholds.


Refinement

As the number of clusters was large, a large number of regions were produced. Many of the produced regions were of no interest and therefore needed to be removed. A region was removed if it satisfied any of the following properties (a filtering sketch in code follows the list):

• The size of the region is less than the smallest allowed size of any domain object, as defined by the domain knowledge in Table 4.1. In this case, the minimum allowed size was 25 m2.

• The most common class of the region is "Forest", "Water", "Railroad" or "Bridge".

• The most common class of the region is "Other", which indicates that the pixels within the region are of an unknown class.

• The background dissimilarity of the region is less than a specified threshold. For this work, a threshold of 0.07 was used.
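The refinement can be summarised as a predicate over the region properties; the sketch below uses an illustrative stand-in for the region representation, not the thesis code:

from dataclasses import dataclass

@dataclass
class Region:                        # illustrative stand-in for a region
    area: float                      # in m^2, scale-adjusted
    most_common_class: str
    background_dissimilarity: float

MIN_AREA = 25.0                      # smallest domain-object size, Table 4.1
THRESHOLD = 0.07                     # background dissimilarity threshold
REJECTED = {"Forest", "Water", "Railroad", "Bridge", "Other"}

def keep_region(region: Region) -> bool:
    """True if the region survives the refinement step."""
    return (region.area >= MIN_AREA
            and region.most_common_class not in REJECTED
            and region.background_dissimilarity >= THRESHOLD)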

In addition, some regions that were classified as runways were merged to produce one runway region per actual runway. This step counteracted an effect of the clustering, which tended to find smaller regions within the runway. It also ensured that the runway(s) extracted from the data did not overlap with the regions that were extracted from the clustering. The effect of refining regions is shown in Figure 4.7.

Figure 4.7: All the regions that are left after the refinement; here a background dissimilarity threshold of 0.07 was used.

4.3.4 Region Classification

Three different methods for classification were implemented, the first of which utilised the domain knowledge from Section 4.2.2 Domain Knowledge and an iterative approach that was inspired by the work of Forestier et al. [19]. This method classified regions based on a score that was derived using the domain knowledge. The second method utilised an SVM together with labelled regions as training data. The third method utilised three SVMs in a hierarchical structure, trained to separate different subsets of the classes.

Domain Knowledge

To classify regions, a score was calculated that indicated how similar that region was to one of the specified domain objects (the class in Table 4.1). The score was based on the area A, mean elevation e and aspect ratio a of the region and was calculated as

$$S_i = R_{i,a} \cdot R_{i,A} \cdot R_{i,e}, \quad (4.5)$$

where

$$R_{i,a} = \begin{cases} 1, & \text{if } \min(l_a) \le a_i \le \max(l_a) \\ \left(\frac{a_i - end}{\max(l_a) - end}\right)^2, & \text{if } \max(l_a) < a_i \le end \\ \left(\frac{a_i - start}{\min(l_a) - start}\right)^2, & \text{if } start \le a_i < \min(l_a) \\ 0, & \text{otherwise,} \end{cases} \quad (4.6)$$

$$R_{i,A} = \begin{cases} 1, & \text{if } \min(l_A) \le A_i \le \max(l_A) \\ \left(\frac{A_i - end}{\max(l_A) - end}\right)^2, & \text{if } \max(l_A) < A_i \le end \\ \left(\frac{A_i - start}{\min(l_A) - start}\right)^2, & \text{if } start \le A_i < \min(l_A) \\ 0, & \text{otherwise} \end{cases} \quad (4.7)$$

and

$$R_{i,e} = \begin{cases} 1, & \text{if } \min(l_e) \le e_i \le \max(l_e) \\ \left(\frac{e_i - end}{\max(l_e) - end}\right)^2, & \text{if } \max(l_e) < e_i \le end \\ \left(\frac{e_i - start}{\min(l_e) - start}\right)^2, & \text{if } start \le e_i < \min(l_e) \\ 0, & \text{otherwise.} \end{cases} \quad (4.8)$$

In (4.6), $l_a$ is the limit interval that exists for the aspect ratio in the specified domain object, as shown in Table 4.1, and $a_i$ is the aspect ratio of the i:th region. The start and end variables were calculated as

$$start = \min(l) - (\max(l) - \min(l)) \quad \text{and} \quad (4.9)$$

$$end = \max(l) + (\max(l) - \min(l)), \quad (4.10)$$

where $l$ is the limit interval. Equations (4.7) and (4.8) show similar calculations for the area and mean elevation respectively. This results in a score $S_i \in [0, 1]$ for the i:th region that indicates how similar the region is to the specified domain object. The R function is shown in Figure 4.8. This function was designed to return a uniform score when the input value was within the limits and then quickly drop to zero. It is similar to the membership functions used by Forestier et al. [19], which they used to check if the shape of an object was consistent with their hypothesis. The quadratic functions outside of the interval still allowed for some deviations that were determined by the length of the interval, according to (4.9) and (4.10). This meant that if the interval was large, larger deviations were allowed.

[Figure: plot of the R function, with "Value" on the horizontal axis (0 to 300) and "Score" on the vertical axis (0.0 to 1.0).]

Figure 4.8: The R function that is described in (4.6) to (4.8), when the limit interval is [150, 200].
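A direct implementation sketch of the R function, assuming a non-degenerate limit interval (names are illustrative):

def r_score(value, lim_min, lim_max):
    """Score 1 inside [lim_min, lim_max], quadratic falloff to 0 (4.6)-(4.8)."""
    width = lim_max - lim_min
    start, end = lim_min - width, lim_max + width    # (4.9) and (4.10)
    if lim_min <= value <= lim_max:
        return 1.0
    if lim_max < value <= end:
        return ((value - end) / (lim_max - end)) ** 2
    if start <= value < lim_min:
        return ((value - start) / (lim_min - start)) ** 2
    return 0.0

For the interval [150, 200] in Figure 4.8 this gives, for example, r_score(175, 150, 200) = 1.0 and r_score(225, 150, 200) = ((225 − 250)/(200 − 250))² = 0.25.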

Class restrictions were also checked when calculating the score. If the most common class of the region was not in the list of accepted classes (the "Class" column of Table 4.1), no score was calculated. For example, a region could only be classified as a "Hangar" if its most common class was "Building". If the most common class of an object was "Runway", it would be automatically classified as the "Runway" domain object. This was done since the runway was known from the classification data. In addition to this, the constraints on adjacent objects (the "Adjacent" and "Not adjacent" columns of Table 4.1) were checked when calculating the score, and a score of zero towards a particular domain object would be returned if a region did not satisfy these constraints.

The proposed algorithm for classifying regions is an iterative algorithm that calculates the score for each region and then iteratively refines the classifications by re-calculating the scores. A score is calculated for each domain object according to (4.5), and the class restrictions as well as the adjacency constraints are checked, as mentioned above. The domain object with the highest score is then selected and the region is assigned as that object. The result is then refined over a number of iterations until there is no change or a maximum number of iterations is reached. By iteratively re-calculating the scores, the classes of the adjacent regions are checked, which might cause the score to change. Each iteration starts with the runway (as its location is known from the data) and moves outwards by utilising a nearest neighbour graph, described in 4.3.5, Sub-section Distance Calculations. The runway region is put in a queue and then retrieved as the first region. Each neighbour is then retrieved using the nearest neighbour graph and added to the queue. The score of the current region is then calculated and the region is classified as the domain object with the highest score. This process repeats until the queue is empty, which marks the end of one iteration. The full algorithm is shown in Appendix A and was inspired by the work of Forestier et al. [19], where they used a similar approach.

The maximum number of iterations parameter is needed to prevent infinite iterations, which may arise if changing the class of one region in turn changes the class of the first region, a pattern that may never stop repeating. Furthermore, the choice to start with the runway was made because its location was available in the data, and starting with the runway already classified could help to classify taxiways, which are connected to the runway.

Support Vector Machine

For the second method, the previously located regions (see Section 4.3.3 Region Proposal) were labelled to be used as training data for an SVM classifier. The following features were extracted and used in the training data:

• The area of the region, adjusted for the down-scaling of the image.

• The aspect ratio of the region.

• The background dissimilarity of the region.

• A histogram with 10 bins containing the classes within the region. An example is shown in Figure 4.9a.

• A histogram with 13 bins containing the elevation within the region. An example is shown in Figure 4.9b. Note that the bins vary in width, with small widths in the beginning and larger towards the end.

• Three histograms with 16 bins containing the hue, saturation and value channels from the HSV representation of the RGB pixels within the region. Examples are shown in Figures 4.9c-e.

In total, there were 74 features when all of the above were combined. The area, aspect ratio and background dissimilarity were chosen as features because they are highly discriminative of the region. Moreover, the histograms were introduced as a way to summarise the data behind the pixels of the region, similar to what was done by Chapelle et al. [7]. Compared to mean values, the histograms offered additional information about the distribution of the pixel values.
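A sketch of assembling the 74-dimensional feature vector, assuming per-region pixel arrays and 14 pre-chosen elevation bin edges; the hue range 0-180 follows OpenCV's HSV convention and all names are illustrative:

import numpy as np

def region_features(area, aspect_ratio, dissimilarity,
                    pixel_classes, elevations, hsv_pixels, elev_bin_edges):
    """Return the 3 + 10 + 13 + 3*16 = 74 features of one region."""
    class_hist, _ = np.histogram(pixel_classes, bins=10, range=(0, 10))
    elev_hist, _ = np.histogram(elevations, bins=elev_bin_edges)  # 13 varying-width bins
    hue_hist, _ = np.histogram(hsv_pixels[:, 0], bins=16, range=(0, 180))
    sat_hist, _ = np.histogram(hsv_pixels[:, 1], bins=16, range=(0, 256))
    val_hist, _ = np.histogram(hsv_pixels[:, 2], bins=16, range=(0, 256))
    return np.concatenate([[area, aspect_ratio, dissimilarity], class_hist,
                           elev_hist, hue_hist, sat_hist, val_hist])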

The labelling of the regions was done automatically using the ground truth, described in Section 4.2.3 Annotation. A mask of the ground truth was compared to the masks of each region (see e.g. Figure 4.7) and the intersection between these was calculated. If there was no intersection, the region was labelled as "Other". Otherwise, the pixels in the intersection were used to get the most common ground-truth class for those pixels, which was then used as the label for the region. A comparison between the produced SVM ground truth and the annotated ground truth in Section 4.2.3 Annotation is shown in Figure 4.10.
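A sketch of the automatic labelling, assuming the region mask is boolean, the ground truth stores one class id per pixel, and 0 marks pixels outside any annotated object (all names are illustrative):

import numpy as np

def label_region(region_mask, ground_truth):
    """Label a region by its most common intersecting ground-truth class."""
    overlap = ground_truth[region_mask & (ground_truth > 0)]
    if overlap.size == 0:
        return "Other"                        # no intersection with ground truth
    classes, counts = np.unique(overlap, return_counts=True)
    return classes[np.argmax(counts)]         # most common class in intersection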
