
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2020

Instance Segmentation of Buildings in Satellite Images

Karin Fritz

LiTH-ISY-EX--20/5283--SE

Supervisor: Gustav Häger, ISY, Linköpings universitet
            Carl Sundelius, Vricon
Examiner:   Per-Erik Forssén, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden

Abstract

When creating a photorealistic 3d model of the world using satellite imagery, image classification is an important part of the process. In this thesis the specific part of automated building extraction is investigated. This is done by investigating the difference in performance between the methods instance segmentation and semantic segmentation for extraction of building footprints in orthorectified imagery. Semantic segmentation of the images is solved using U-net, a Fully Convolutional Network that outputs a pixel-wise segmentation of the image. Instance segmentation of the images is done by a network called Mask r-cnn. The performance of the models is measured using precision, recall and the F1 score, which is the harmonic mean of precision and recall. The resulting F1 scores of the two methods are similar: U-net achieves an F1 score of 0.684 without any post processing, and Mask r-cnn achieves an F1 score of 0.676 without post processing.


Acknowledgments

I want to thank Vricon for giving me the opportunity to write my master's thesis. I want to thank both of my supervisors, Gustav Häger and Carl Sundelius, for their help with technical support and with writing this thesis. A final thanks goes to my examiner Per-Erik Forssén for being a great advisor and for the quick feedback.

Linköping, January 2020
Karin Fritz


Contents

Notation

1 Introduction
1.1 Background
1.2 Problem description
1.3 Aim and goals
1.4 Data
1.5 Limitations

2 Theory
2.1 Convolutional Neural Networks
2.2 Instance segmentation
2.3 Related Work
2.3.1 Object detection
2.3.2 Semantic segmentation
2.3.3 Instance segmentation
2.4 Training and Evaluation
2.4.1 Loss functions
2.4.2 Optimizers
2.4.3 Evaluation metrics

3 Method
3.1 Implementation
3.1.1 Input data
3.1.2 Mask R-CNN
3.1.3 U-net
3.1.4 Post processing
3.1.5 Evaluation
3.2 Training
3.2.1 Experiments

4 Result
4.1 Experiments
4.1.1 Results using Adam optimizer
4.1.2 Results using SGD optimizer
4.1.3 Comparing with baseline

5 Discussion
5.1 Discussion
5.1.1 Data
5.1.2 Method and experiments
5.1.3 Result

6 Conclusion
6.1 Summary
6.2 Future work

Notation

Abbreviations

Abbreviation   Full text
3d             Three-dimensional
ann            Artificial Neural Network
cnn            Convolutional Neural Network
mrcnn          Mask Region Convolutional Neural Network
fcn            Fully Convolutional Network
iou            Intersection over Union
r-cnn          Region based Convolutional Neural Network


1 Introduction

1.1 Background

Automated building extraction from satellite imagery can serve as a great source of information, and the method can be useful for a range of different applications: not only is it valuable in intelligence analysis and defense, but also in the telecom business and in emergency preparedness. The company Vricon provides its clients with a range of products that enable the globe to be visualized in 3d, using commercial satellite imagery. From these images the company derives its main product: photorealistic 3d models of the globe [29].

With rendered images from their 3d model as input, Vricon currently uses a Convolutional Neural Network to perform a semantic segmentation, from which building footprints can be extracted. The output is a pixel-based classification, where each pixel is given a class label. Semantic segmentation answers both what the different classes are and where they are present in the image, but no distinction between different instances of the same class is made. To extract each separate building and eventually create a 3d representation of the individual instances, further post processing of the classified image is necessary. A problem with this method is that adjacent buildings may not be completely separated, but linked together and therefore later treated as one single object. This gives an incorrect representation of the world and motivates investigating other methods in search of improved delineations of building footprints.

Another approach to building extraction is to use instance segmentation. This is a method that combines both object detection and semantic segmentation, and makes the class labels instance aware [26].


1.2 Problem description

With pixel-wise classification, objects of the same class can be confused due to close placement relative to the resolution of the image. When using the rendered images from the 3d model, this can cause building extraction to result in an output where multiple building instances have been merged together. In order to treat each building as a separate object, a network architecture performing instance segmentation will be implemented and evaluated, with the aim of achieving a better separation of buildings compared to semantic segmentation.

1.3 Aim and goals

The goal of this master's thesis is to train a cnn for instance segmentation of buildings, using rendered images from Vricon's 3d model as input. The resulting model's output will be compared with the output from an end-to-end semantic segmentation network, without further post-processing steps to separate individual objects. Thus this thesis will investigate the following research questions:

How will an instance segmentation network perform on the task of extracting instances of building footprints?

Is there an advantage of using instance segmentation over semantic segmentation when comparing the separation of adjacent buildings in an image?

1.4 Data

The available data consists of rendered images from a true ortho view of 3d models created from multi-view satellite images. An orthophoto is a photographic map that can be used to measure true distances and is an accurate representation of the Earth's surface, since image displacements such as tilt and terrain have been removed from the image data [9]. An example image is visible in figure 1.1.


Figure 1.1: An orthographic view of an area with closely placed buildings. A typical sample from the training data set.

The images depict four different cities located in California, USA. Images from three of these (Los Angeles, Oakland and San Diego), in total 1149 images of resolution 2048 × 2048, will be used as training data. The images from the remaining city (San Francisco), 228 images, will be used for evaluation. There are manually drawn outlines of building footprints as ground truth, stored as polygons and bounding boxes. The available image data contains the RGB image, the Digital Surface Model (DSM), and a binary classification of the image.

In figure 1.2 an overview of Los Angeles is displayed. The white surrounding border depicts the extent of the annotations; inside these lines all buildings are annotated. The red markings in the image are the contours of every building instance.

1.5 Limitations

Because of the time constraint on this thesis, one network performing semantic segmentation and one network performing instance segmentation will be chosen and investigated. The main focus of this thesis is to investigate how an instance segmentation network affects the separation of adjacent buildings in an image; hence more experiments will be conducted with the instance segmentation network.


Figure 1.2: Orthophoto of Los Angeles with annotated building footprints overlaid in red.

2 Theory

This chapter presents a summary of the related work and background theory of the techniques used in this thesis.

2.1 Convolutional Neural Networks

This section will give an overview of the concept of Convolutional Neural Networks (cnn), and how they can be applied to solve problems concerning object detection and image segmentation.

A cnn is a specialized form of Artificial Neural Network (ann). An ann is initially inspired by the human brain. The brain is composed of millions of neurons, and in an ann each neuron is represented as a weighted node. The nodes each read an input, process it and generate an output, which may serve as input for another node. This enables the nodes to interact and the network to make a decision. The network is an adaptive system, meaning that it can adapt to the information passed through it by adjusting the weights of the nodes [18].

A cnn is a sub-type of ann, designed to process locally dependent data that comes in multiple arrays, typically images. The cnn takes advantage of the properties of a natural signal by using local connections, shared weights, pooling and multiple layers. As the name suggests, the cnn employs the mathematical operation of convolution in at least one of its layers [8]. cnns have been widely adopted in the computer vision community since they have proven superior to traditional methods in tasks such as image classification, object detection and semantic segmentation [19]. The architecture of a cnn can vary greatly and can be adjusted to fit the specific needs of different types of problems and tasks. Although the benchmark models change regularly and improve, making it difficult to describe an optimal architecture of a cnn, they still follow a general type of structure [8]. The architecture of a cnn is structured as a series of stages, where the first few stages are composed of two types of layers: convolutional layers and pooling layers [11].

The purpose of the convolutional layer is to detect local conjunctions of features from previous layers. A feature map is used as input to the convolutional layer. This map is filtered with a set of filter kernels to produce a set of new feature maps, connecting local patches of the input layer with the new feature maps through the filter kernel. All units in a feature map share the same filter kernel, and the different layers of a feature map use different filter kernels [11]. An illustration of the input and output of a convolutional layer is shown to the left in figure 2.1, where the filter kernel is a four layer kernel that slides over the input and outputs a four layer feature map. The feature maps are later passed through a non-linear activation function, such as a rectified linear unit, or ReLU, to make sure that the values of the feature map are within a specific interval. In the case of using ReLU as activation function, all negative values will be set to zero. The activation function introduces a non-linear property to the network, increasing the complexity of both the network and the tasks it can perform. This type of architecture with shared weights makes it possible to exploit the fact that local groups of values in images often are highly correlated: by sharing weights, a cnn can detect the same pattern in different parts of the image [11].

The role of the pooling layer is to merge semantically similar features together by subsampling the input. Since the relative positions of the features forming a motif can vary, the detection of the motif can be done by coarse-graining the position of each feature. A motif in an image is simply an element of the image, such as a pattern or an object within the image. The pooling is performed on each feature map separately, and the size of the pooling unit is always smaller than the feature map. A typical pooling unit computes the maximum of a local patch of units in a feature map. The pooling unit slides over the feature map with the given stride as step length. With a stride larger than one, the feature map will be downsampled. This contributes to making the representation invariant to small translations in the input [11]. This is illustrated as the second operation in figure 2.1.


Figure 2.1: Sequence of layers commonly used in a cnn. The input is passed through a convolutional layer with four filter kernels, producing a set of new feature maps. The smaller square represents the kernel used while filtering the input. This is followed by a pooling layer which downsamples the feature maps, reducing their spatial size. Here, the smaller square represents the pooling unit.
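As a concrete illustration of these two operations, the following is a minimal numpy sketch (not from the thesis; sizes and names are made up) of a convolution with a single filter kernel, followed by ReLU and 2 × 2 max pooling:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-d convolution of a single-channel image with one filter
    kernel (implemented as cross-correlation, as is customary in cnns)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: all negative values are set to zero."""
    return np.maximum(x, 0.0)

def max_pool(fmap, size=2, stride=2):
    """Max pooling: the pooling unit slides over the feature map with the
    given stride, downsampling it when the stride is larger than one."""
    h, w = fmap.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

image = np.random.rand(8, 8)      # a toy single-channel input
kernel = np.random.randn(3, 3)    # one learned filter kernel
fmap = max_pool(relu(conv2d(image, kernel)))
print(fmap.shape)                 # (3, 3)
```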

2.2 Instance segmentation

Instance segmentation is the combination of the two methods object detection and semantic segmentation. One of the most common approaches to solving the problem is to combine different network structures that either first perform object detection and then segment the object inside of the box, or conversely, first segment the image and then separate the instances with object detection [26].

Object detection is the task of identifying objects of interest in the image and determining their positions and sizes. This can be represented as a bounding box and a label with the predicted class of the object within the box. There should be one box per object [26]. Before deep learning, object detection was a several-step process, including techniques such as edge detection and feature extraction; the images were then compared with existing object templates in order to detect and localize the objects within the image [?]. In more recent years, deep learning has been introduced to the area [30].

Image segmentation is the process of partitioning an image into multiple segments. These segments should contribute to a simpler representation of the image that is easier to use for further analysis. This is typically used to locate objects within the image. Semantic segmentation is a pixel-level classification, where the objects that belong to the same class are clustered together. This differentiates between classes, but does not take multiple instances of the same class into account [15]. There is a range of applications for semantic segmentation, and it is commonly used in the medical field [25].


Finally, instance segmentation is the combination of both methods, producing both a bounding box with a classification and a mask of the object [26]. See figure 2.2 for a demonstration of the different methods.

(a) Original image. (b) Object detection: all objects found are annotated with a bounding box.

(c) Semantic segmentation: the different classes have each been given a distinct color; the background is gray, circles are blue and squares are green.

(d) Instance segmentation: every pixel is aware of both its class and which instance of that class it belongs to. This is represented in the image by a surrounding bounding box and a class color defining the pixel-level classification mask.

Figure 2.2: Illustration of the differences between the methods object detection, semantic segmentation and instance segmentation. The circles and squares represent objects of different classes.

2.3 Related Work

The task of instance segmentation is approached with various methods. In the following sections, different approaches to object detection and semantic segmentation will be described. Thereafter, methods that combine these approaches to perform instance segmentation will be discussed.

2.3.1 Object detection

One of the first algorithms to solve object detection using deep learning was r-cnn [7], Region proposals with cnn. The method first produces a number of candidate object regions, where each Region of Interest (RoI) is then evaluated with a cnn. Since then, a lot of different deep networks have been designed to improve the speed and accuracy of object detection. Fast r-cnn and Faster r-cnn are further developments of r-cnn which focus on lowering the time consumption of the algorithm [30]. Another popular object detector is You Only Look Once, or YOLO [23], which solves the object detection task with a fixed-grid regression.

This thesis will focus on object detection with r-cnn, since the method later chosen and investigated for instance segmentation uses Faster r-cnn as object detector.

R-CNN, Fast R-CNN and Faster R-CNN

r-cnn [7] uses the search algorithm selective search to propose regions of interest in the image. Selective search [28] is an algorithm that takes both scale and diversification into account when grouping regions together. The proposed regions are then forwarded and further processed by a cnn that acts as a feature extractor. The feature map is then fed to a Support Vector Machine (SVM) to classify the presence of the object within the given region. This architecture gives a well performing object detector with the side-effect of being very computationally expensive, mainly because each region is treated separately: the regions are passed through the cnn one by one. This is illustrated in figure 2.3a.

Fast r-cnn [6] was designed to improve the speed of r-cnn. Fast r-cnn takes an image and region proposals, generated by selective search, as input. The difference from the previous r-cnn is that instead of passing each region through the cnn, the whole image is used as input. The region proposals are then warped into a fixed shape using a method called Region of Interest pooling, or RoI pooling. RoI pooling divides the input feature map into a fixed number of regions, where each region is approximately of the same size. Max pooling is then applied to each region, which reshapes them into a fixed size. The regions are then fed to a set of fully connected layers that perform the classification. Another major difference to r-cnn is that Fast r-cnn trains the cnn, classifier and regressor in a single model instead of using different models for the different tasks. The difference in architecture between r-cnn and Fast r-cnn is visualized in figure 2.3.


(a) r-cnn. (b) Fast r-cnn. (c) Faster r-cnn.

Figure 2.3: Network architectures of the models r-cnn, Fast r-cnn and Faster r-cnn. r-cnn passes each region of interest (RoI) through a cnn, which is both time consuming and computationally expensive. Fast r-cnn is a development of r-cnn which uses the whole image as input to the cnn, instead of each RoI; the RoIs are then extracted from the feature map produced by the cnn. Faster r-cnn is a further development of Fast r-cnn which replaces the search algorithm used for finding RoIs in both r-cnn and Fast r-cnn with a Region Proposal Network (RPN), making Faster r-cnn even faster than its precursors.

Faster r-cnn [24] is a further development of Fast r-cnn. The system is composed of two modules, where the first is a fully convolutional network that proposes regions, replacing the selective search algorithm. The Region Proposal Network (RPN) uses the feature maps from the cnn as input and makes the region proposal nearly cost free. The RPN uses anchor boxes that serve as references at multiple scales and aspect ratios, in contrast to using different scales of images or filters. This contributes to the cost-efficiency of the RPN and also makes the approach invariant to translations. The predicted region proposals are reshaped using the RoI pooling layer, and then each region is classified [24].

2.3.2 Semantic segmentation

The most common way of performing semantic segmentation is to use some sort of cnn, since they have obtained remarkable results in the area [16]. A popular method is the Fully Convolutional Network, or FCN. FCN [17] is a network built with a series of convolutional layers; when these are trained end-to-end they achieve a good result on pixel-wise segmentation. A network architecture called U-net [25] has lately gained popularity because of its results in various semantic segmentation tasks. U-net is a type of FCN that originally was implemented for medical image analysis, but has since been used in many other applications.

U-net is built on an encoder-decoder structure. The encoder is usually a pre-trained classification network. The contracting path of the network is more or less symmetrical to the expansive path, which yields a u-like shaped architecture, see figure 2.4, from which the network got its name [25]. U-net extends FCN to work with fewer training images and yield more precise segmentations.

Figure 2.4: Simplified illustration of the U-net architecture, with the contracting path on the left side and the expansive path on the right. The max pooling operations decrease the sizes of the feature maps but increase the number of feature channels. In the expansive path an upsampling followed by a 2 × 2 convolution ('up-convolution') halves the number of feature channels and leaves the final output with the same dimensionality as the input.

The contracting path of the network architecture follows the typical structure of a convolutional network. It has the repeated application of convolutional layers followed by a pooling layer, which serves as a downsampling operator [25]. At each downsampling step, the number of feature maps is doubled. In the expansive path the pooling is exchanged for upsampling operators of the feature maps, which increase the resolution of the output. In order to get a higher accuracy, Ronneberger et al. [25] propose a weighted loss function: the background pixels separating two instances of the same class obtain a higher weight in the loss function, in order to improve the separation of adjacent objects.

The baseline method performing semantic segmentation is in this thesis chosen to be a U-net network with ResNet50 as backbone, since this network architecture is similar to the method currently used at Vricon as a step in the extraction of building footprints.
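To make the encoder-decoder idea concrete, below is a toy two-level U-net in Keras. This is a sketch only: the baseline network in this thesis uses a ResNet50 encoder and more levels, but the contracting path, the expansive path and the skip connection are the same in spirit.

```python
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(512, 512, 3)):
    inp = layers.Input(input_shape)
    # contracting path: convolutions followed by downsampling,
    # doubling the number of feature maps at each step
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    # expansive path: up-convolution halves the number of channels
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c2)
    m1 = layers.concatenate([u1, c1])   # skip connection
    c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(m1)
    # pixel-wise probability of the single class 'building'
    out = layers.Conv2D(1, 1, activation="sigmoid")(c3)
    return Model(inp, out)

model = tiny_unet()
model.summary()   # output has the same spatial size as the input
```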

2.3.3 Instance segmentation

Driven by the success of region proposals in object detection, many approaches to instance segmentation have relied on segment proposals. An example of such an algorithm is Deepmask [22], a network that learns to propose segment candidates, which are later classified with an object detection algorithm. Another method that uses segment proposals is proposed by Dai et al. [5], which first performs object detection, predicts segment proposals from the detections, and then classifies each segment proposal.

Another example of a network performing instance segmentation is Fully Convolutional Instance Segmentation (FCIS). The idea is to predict a set of position sensitive output channels fully convolutionally, which addresses object classes, boxes and masks simultaneously [10].

To achieve instance segmentation with U-net, Iglovikov et al. [12] propose a method that adds an extra output channel where borders between objects that are closely positioned, or even touching, are predicted. This layer later serves as a mask and improves the separation of the objects. In these cases, segmentation precedes recognition [12].

Mask r-cnn [10] is another popular method for both image segmentation and object detection. The method is designed for instance segmentation and is based on the Faster r-cnn object detection model [24]. In addition to object detection and prediction of bounding boxes, the extension from Faster r-cnn is an additional binary mask prediction of the object inside of the bounding box [10]. The model achieved state of the art results on the MSCOCO dataset [3] when published.

The choice of method for instance segmentation in this thesis is Mask r-cnn, since it performs instance segmentation in parallel rather than sequentially, which makes the method differ from using an extension of the baseline method.

Mask R-CNN

Mask r-cnn, or mrcnn, extends Faster r-cnn by adding a branch that predicts an object mask in parallel with the branch used for classification and bounding box regression. This makes mrcnn able to combine the classical parts of both object detection and segmentation [10]. A simplified illustration of the network architecture is available in figure 2.5.

Figure 2.5: Network architecture of Mask r-cnn. The architecture is similar to Faster r-cnn with the addition of the mask branch in parallel with the classifier and bounding box regressor.

The mask branch in mrcnn is a small FCN, applied to each RoI, performing a pixel level mask prediction. This is a binary classifier, since the model relies on each proposed box containing only one instance of one class. The predicted region proposals are, as in Fast and Faster r-cnn, reshaped using RoI pooling. RoIPool is an operation for extracting small feature maps from each RoI. Since each region is described by floating point numbers, RoIPool first quantizes the region to a discrete feature map. This quantized version is then divided into spatial bins, which are also quantized. The feature values are then aggregated by max pooling. These quantizations introduce misalignments between the RoI and the extracted features. Since classification is robust to small translations, this is not a problem for r-cnn, but it has a big negative impact on a pixel-level mask prediction, since the RoI features need to be well aligned in order to preserve the per-pixel spatial correspondence. The solution is a technique called RoI Align [10]. The RoI Align layer removes the harsh quantization of RoI pooling and aligns the features with the input, which has proven to improve the results. RoI Align works by avoiding any quantization of the RoI boundaries or bins. Bilinear interpolation is used to compute the values of the input features at four regularly sampled locations in each RoI bin, and the result is aggregated using max or average pooling [10].
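To make the RoI Align sampling concrete, here is a numpy sketch of one bin (an illustration, not the implementation from [10]; the sampled coordinates are assumed to lie inside the feature map):

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Value of fmap at continuous coordinates (y, x) by bilinear
    interpolation between the four surrounding grid points."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * fmap[y0, x0]
            + (1 - dy) * dx * fmap[y0, x1]
            + dy * (1 - dx) * fmap[y1, x0]
            + dy * dx * fmap[y1, x1])

def roi_align_bin(fmap, y0, x0, y1, x1):
    """One RoI Align bin: no quantization of the bin boundaries; sample
    four regularly spaced points and aggregate them by average pooling."""
    ys = [y0 + (y1 - y0) * f for f in (0.25, 0.75)]
    xs = [x0 + (x1 - x0) * f for f in (0.25, 0.75)]
    return np.mean([bilinear_sample(fmap, y, x) for y in ys for x in xs])
```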

2.4 Training and Evaluation

This section will describe some concepts used for training a cnn and how the network performance will be evaluated in this thesis.

2.4.1 Loss functions

Deep learning is an iterative process, usually with many parameters to tune. To make the tuning as efficient as possible, an optimizer and a loss function are applied to the problem. A loss function is a cost that the optimizer will try to minimize by updating the weights of the ann. This is how the ann learns and improves its performance [8].

One of the most commonly used loss functions for image segmentation tasks is pixel-wise cross entropy. The label maps are of the same size as the original image and are used as ground truth. The loss function examines each pixel individually and compares the class predictions to the ground truth. The result is then averaged over all pixels, which contributes equal learning to all of the pixels in the image [8]. Cross entropy can be derived from the method maximum likelihood. Maximum likelihood (ML) is a method for estimating the parameters of a model. A good parameter estimate is one that makes the model produce output as similar as possible to the actual output of the process that the model describes. In other words, ML tries to maximize the likelihood of the model producing the data that was observed from the process [8]. This is done by using a parametric model, described as p_model(x; θ), where x is the input data and θ are the model parameters. ML tries to fit the function that maps given input as close to the true function as possible. This is done by optimizing over the parameter θ, with the criterion:

\theta_{ML} = \operatorname{argmax}_{\theta} \prod_{i=1}^{m} p_{model}(x_i; \theta)    (2.1)

In equation 2.1, x_i is one example from the sample data, and if each sample is assumed to be independent of the others, the ML principle can be defined as the product of the probabilities [8]. In order to maximize the function, the maximum is found using the derivative, which is set to zero. However, if m is large, the derivative will be close to zero, since the probabilities range from 0 to 1, causing an underflow problem. In order to avoid this, the sum of the logarithms can be computed instead, since this does not affect the argmax. The argmax is also not affected by scaling, thus a division by m is possible, which obtains a version of the criterion that is expressed as an expectation with respect to the empirical distribution \hat{p}_{data}. This is called the log-likelihood. To be able to use this as a loss function we take the negative log-likelihood, which is also known as the cross entropy, see equation 2.2.

J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)    (2.2)

In the case of a binary classification problem, binary cross entropy can be used as loss function, where it is assumed that there are only two classes.
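Written out for the binary case, equation 2.2 becomes the familiar pixel-wise binary cross entropy. A minimal numpy sketch, where pred holds per-pixel probabilities:

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise binary cross entropy (the negative log-likelihood of
    equation 2.2), averaged over all pixels."""
    pred = np.clip(pred, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(target * np.log(pred)
                           + (1 - target) * np.log(1 - pred))))
```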

Two other common loss functions are the L_1 and L_2 loss functions. The L_1 loss is the sum of the absolute errors between the ground truth and the predicted value. The L_2 loss is the sum of the squared differences, or the mean squared error. Another loss function is the smooth L_1 loss. This is a combination of L_1 and L_2 that is more robust to outliers: the function behaves as L_1 when the absolute value of the loss is high and as L_2 otherwise [24]. Smooth L_1 loss is often used in bounding box regression and is used in both Faster r-cnn and Mask r-cnn, where it is part of a multitask loss evaluating the performance of the object detection part of the network [24]. The smooth L_1 loss used in Mask r-cnn is defined in equations 2.3 and 2.4, where t^u = (t_x^u, t_y^u, t_w^u, t_h^u) is the predicted bounding box of class u, and v = (v_x, v_y, v_w, v_h) is the tuple of the true bounding box of the same class.

L(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t_i^u - v_i)    (2.3)

\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}    (2.4)
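Equations 2.3 and 2.4 translate directly into code; a small numpy sketch:

```python
import numpy as np

def smooth_l1(x):
    """Equation 2.4: quadratic (L2-like) near zero, linear (L1-like)
    for |x| >= 1, which makes the loss robust to outliers."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def bbox_loss(t_u, v):
    """Equation 2.3: sum of smooth L1 over the four box coordinates."""
    return float(np.sum(smooth_l1(np.asarray(t_u) - np.asarray(v))))
```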

2.4.2 Optimizers

Optimization during training of a neural network is an indirect process, because the optimizer is trying to minimize a loss function while the actual performance of the model is determined by a function defined according to the test set. There is a range of different optimization algorithms frequently used in deep learning. The optimizer tries to find the parameters of the neural network that reduce the loss function significantly, and hopefully find the global minimum [8].

Gradient descent is one of the most popular ways of optimizing a neural network. Gradient descent is an optimization technique based on taking small steps in the direction of the negative gradient of the function being minimized [8].

\nabla_{\theta} J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \nabla_{\theta} \log p_{model}(y \mid x)    (2.5)

Computing the expectation in equation 2.5, which is the derivative of equation 2.2, exactly is very expensive, since it requires evaluating the model on every example in the dataset. One version of gradient descent is Stochastic Gradient Descent (SGD). SGD sees the gradient as an expectation that may be approximately estimated using a small set of samples, also known as a minibatch. The size of the minibatch is typically chosen to be small, but can range from one to a few hundred samples [8]. Using the gradient estimate, the parameters of the function are updated with a step in the direction of the negative derivative, scaled by the learning rate. A crucial parameter for SGD is therefore the learning rate: with the learning rate set too high, violent oscillations will show on the learning curve, whilst a learning rate set too low risks the learning getting stuck at a high loss value.
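One epoch of minibatch SGD can be sketched as follows; grad_fn is a hypothetical function returning the loss gradient on a minibatch, and nothing here is specific to the thesis code:

```python
import numpy as np

def sgd_epoch(theta, X, y, grad_fn, lr=1e-4, batch_size=32):
    """Estimate the gradient on small random minibatches and take a step
    in the negative gradient direction, scaled by the learning rate."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta
```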

Another popular optimizer is the algorithm called Adam. Adam is a method for efficient stochastic optimization with low memory requirements, and the method only requires first order gradients [13]. Adam is generally seen as robust to changes of hyper parameters and converges relatively fast [8]. Adam is an optimization algorithm based on an adaptive learning rate. The name Adam is derived from the phrase 'adaptive moment estimation', because Adam uses estimates of the previous gradients when calculating the individual adaptive learning rates for all weights. This can be seen as keeping the momentum from earlier steps. Adam also includes initialization bias correction terms for the estimates of both the first and the second order moments. The parameter update is defined according to equation 2.6, where t simply is the timestep.

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t    (2.6)

The parameter η is the learning rate and ε is a smoothing term to avoid division by zero. m̂_t and v̂_t are the bias corrected first and second order moments of the derivative [13].
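A single Adam update written out from equation 2.6, including the moment estimates and bias corrections described above (the default decay rates from [13] are assumed; the caller keeps m and v across steps, starting from zeros, with t counted from 1):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One parameter update of Adam (equation 2.6)."""
    m = b1 * m + (1 - b1) * grad           # first order moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # second order moment
    m_hat = m / (1 - b1 ** t)              # initialization bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```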

2.4.3 Evaluation metrics

The performance of each cnn will be evaluated using precision, P, recall, R, and the F1 score. These are defined in equations 2.7, 2.8 and 2.9.

P = \frac{T_P}{T_P + F_P}    (2.7)

R = \frac{T_P}{T_P + F_N}    (2.8)

F1 = \frac{2 \cdot P \cdot R}{P + R}    (2.9)
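As a small sketch, the three scores computed from instance counts (the counts are defined below; guards against division by zero are added):

```python
def scores(tp, fp, fn):
    """Precision, recall and F1 (equations 2.7-2.9) from instance counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```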

The true positives, T_P, represent the number of correctly identified building footprints. The false positives, F_P, are cases where the network has reported a building where there is none, whilst the false negatives, F_N, are the buildings missed by the network. Precision measures the correctly identified positive cases out of all predicted positive cases and will drop if the number of false positives is high. Recall measures the correctly identified positive cases out of all actual positive cases; this indicates whether the false negatives have a high impact on the performance of the model. The F1 score is the harmonic mean of precision and recall and is used since it penalizes extreme values. All three scores range from 0 to 1, where a higher number indicates better performance [27].

The evaluation score is calculated per instance. This means that in order to correctly identify whether a prediction is a true positive or false positive, it has to be paired with one instance from the ground truth. This pairing between predictions and ground truth is called the assignment problem. There are multiple solutions to the assignment problem, but in this thesis it is solved according to SpaceNet's evaluation algorithm [21]. The algorithm uses the intersection over union (IoU) score, defined in equation 2.10.

\text{IoU}(A, B) = \frac{\text{Area}(A \cap B)}{\text{Area}(A \cup B)}    (2.10)

The algorithm is based on a greedy pairing of the predictions and ground truth, where a predicted building mask is classified as a true positive if it has an IoU score above 0.5 with a ground truth building mask. If multiple predicted masks intersect with one of the ground truth masks, the one with the highest IoU score is paired with the ground truth mask [20].
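A sketch in the spirit of this greedy pairing, for boolean masks (an illustration only, not SpaceNet's actual code); mask_iou implements equation 2.10:

```python
import numpy as np

def mask_iou(a, b):
    """Equation 2.10 for two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def greedy_match(pred_masks, gt_masks, thr=0.5):
    """Pair each ground truth mask with the unmatched prediction of
    highest IoU; a pair with IoU above thr counts as a true positive."""
    matched, tp = set(), 0
    for gt in gt_masks:
        best_iou, best_j = 0.0, None
        for j, pred in enumerate(pred_masks):
            if j in matched:
                continue
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou > thr:
            matched.add(best_j)
            tp += 1
    fp = len(pred_masks) - tp   # unmatched predictions
    fn = len(gt_masks) - tp     # missed buildings
    return tp, fp, fn
```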


3 Method

3.1 Implementation

In this chapter details of the implementation are described.

3.1.1 Input data

The data provided by Vricon is described in section 1.4 and consists of images with annotated buildings, which are stored as polygons. To be able to use the data as training, test and validation data for Mask r-cnn, it has to be slightly adapted. The images are cropped to a size of 512 × 512, since larger images would make the GPU run out of memory during training. A training data sample used by Mask r-cnn is an RGB image with pixel-wise binary masks and bounding boxes for each object instance in the image. The masks should be stored as an array of shape (width, height, number of instances), whilst the bounding boxes should be stored as a set of coordinates describing the center position, width and height. To adapt the provided data to the required format, a file containing meta data is generated for each image. The information provided by these files is then used to generate the correct input.
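A sketch of this adaptation step; rasterize_fn is a hypothetical helper that burns one polygon into a binary (H, W) array, and at least one building per image is assumed:

```python
import numpy as np

def to_training_sample(rgb, polygons, rasterize_fn):
    """Turn per-building polygons into the (H, W, N) stack of binary
    masks plus one (cx, cy, width, height) bounding box per instance."""
    h, w = rgb.shape[:2]
    masks = np.stack([rasterize_fn(p, (h, w)) for p in polygons], axis=-1)
    boxes = []
    for i in range(masks.shape[-1]):
        ys, xs = np.nonzero(masks[..., i])
        boxes.append(((xs.min() + xs.max()) / 2,   # center x
                      (ys.min() + ys.max()) / 2,   # center y
                      xs.max() - xs.min(),         # width
                      ys.max() - ys.min()))        # height
    return rgb, masks, np.array(boxes)
```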

After cropping the images, there are a total of 14 793 cropped images available for training, 1 644 images for validation and 3 406 for evaluation. The total number of annotated building footprints is 954 871 in the training set, where the number of building footprints in one image varies from 0 to 266.

3.1.2 Mask R-CNN

An implementation of Mask r-cnn was found and downloaded from [4]. The code was cloned and altered to fit the dataset used in this thesis. The alterations made mainly concern how the data is loaded, where the existing "Dataset" class is expanded with a new instance that inherits and overrides some of its functionality. Functionality for adding input channels and changing between the SGD and Adam optimizers is implemented. In order to prevent overfitting, image augmentation was also implemented, with random horizontal and vertical flips of the images and rotations of 180 or ±90 degrees.

Some initial changes to the configuration were made compared to the default settings in [4]. First, the number of classes was set to one. The number of anchors per image used for RPN training was increased from 256 to 400, since there could be up to 266 building footprints in one image, which is considerably more objects per image than in typical object detection problems. For the same reason the maximum number of possible detections was increased from 100 to 500, and the number of RoIs fed to the classifier was increased from 200 to 512. Since most of the buildings are relatively small in the image, the RPN anchor scales were reduced from the initial setting of [32, 64, 128, 256, 512] to [8, 16, 32, 64, 128].
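If the downloaded implementation [4] is the widely used Matterport code base, these changes amount to overriding attributes of its Config class, roughly as below. The attribute names assume that code base, where NUM_CLASSES counts the background as its own class:

```python
from mrcnn.config import Config   # the downloaded implementation [4]

class BuildingConfig(Config):
    NAME = "buildings"
    NUM_CLASSES = 1 + 1                       # background + 'building'
    RPN_TRAIN_ANCHORS_PER_IMAGE = 400         # raised from 256
    DETECTION_MAX_INSTANCES = 500             # raised from 100
    TRAIN_ROIS_PER_IMAGE = 512                # raised from 200
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)  # smaller anchors for small buildings
    IMAGE_MIN_DIM = 512                       # the cropped image size
    IMAGE_MAX_DIM = 512
```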

L = w_1 L_{cls} + w_2 L_{mask} + w_3 L_{box}    (3.1)

The loss function for Mask r-cnn is a multitask loss that depends on the different branches: classification, bounding-box regression and mask segmentation, according to equation 3.1, where the vector [w_1, w_2, w_3] holds the weights. The loss function used for classification, L_cls, and for the mask segmentation, L_mask, is binary cross entropy. The loss function used for bounding box regression, L_box, is smooth L_1, explained in section 2.4.1. The weights of the different losses were altered in order to find which loss weights would yield the best model performance.
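In the Matterport-style configuration sketched earlier, the weight vector of equation 3.1 maps onto a per-branch LOSS_WEIGHTS dictionary (again an assumption about that code base); the altered vector [1, 1.5, 1] used in the experiments would then look like:

```python
class WeightedMaskConfig(BuildingConfig):
    LOSS_WEIGHTS = {
        "rpn_class_loss": 1.0,
        "rpn_bbox_loss": 1.0,
        "mrcnn_class_loss": 1.0,   # w1: classification
        "mrcnn_mask_loss": 1.5,    # w2: mask segmentation
        "mrcnn_bbox_loss": 1.0,    # w3: bounding box regression
    }
```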

In the downloaded implementation [4], there is functionality allowing different layers to be frozen during training of the network. This is useful when designing an application with a dataset similar to the COCO dataset, since there are pre-trained weights available. In that case, only the network "heads", i.e. the layers closest to the output, need to be trained in order to obtain good segmentation results. Since the dataset used in this thesis looks quite different from the COCO dataset, some additional sets of layers are grouped and made available to freeze during training. The combinations of layers available to freeze are:

• All layers but the mask head.
• All layers but the bounding box head.
• Only the mask head.


3.1.3 U-net

An implementation of U-net provided through Vricon was used as baseline during this thesis. The network is an implementation of U-net with ResNet as backbone structure. The input was the same cropped images, using the same channels as when training the Mask r-cnn network, i.e. either RGB or RGB-D images. The mask provided as input is a single binary mask with pixel-level annotations of the class 'building'.

3.1.4 Post processing

Since Mask r-cnn outputs masks in a multidimensional array with one binary mask per channel, overlapping masks can occur. This feature can be desirable for other applications, but is in this thesis considered an artifact. Because of the overlaps, some post processing steps were implemented in order to reduce the occurrence of overlapping masks. Two different approaches were implemented, one focusing on the iou and the other mainly on the intersections between the masks.

The first method, focusing on the iou, compares the confidence scores of masks that have an iou bigger than a given threshold. The mask with the lowest confidence score is removed.

The first method has its flaws, especially when one bigger mask and one smaller mask overlap: neither of these will be removed, since the union is much bigger than the intersection, making the iou score very low. This is solved by looking at the intersection area and comparing it to the areas of the intersecting masks.
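A sketch combining these two rules with illustrative thresholds; the separate treatment of masks with multiple intersections, described next, is omitted:

```python
import numpy as np

def suppress_overlaps(masks, scores, iou_thr=0.5, frac_thr=0.5):
    """Drop the lower-scoring mask of any pair that either has IoU above
    iou_thr, or whose intersection covers more than frac_thr of the
    smaller mask. masks: list of boolean (H, W) arrays."""
    removed = set()
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if i in removed or j in removed:
                continue
            inter = np.logical_and(masks[i], masks[j]).sum()
            if inter == 0:
                continue
            union = np.logical_or(masks[i], masks[j]).sum()
            smaller = min(masks[i].sum(), masks[j].sum())
            if inter / union > iou_thr or inter / smaller > frac_thr:
                removed.add(i if scores[i] < scores[j] else j)
    return [m for k, m in enumerate(masks) if k not in removed]
```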

Another method was implemented with the goal of treating situations where one bigger mask covers multiple smaller masks. Using the first method, the bigger mask would be compared multiple times in separate cases, once for each overlapping mask. This can result in the incorrect removal of more than one of the involved masks. To avoid this problem, masks with multiple intersecting masks are treated separately. The case of one bigger mask covering multiple smaller overlapping masks is visible in figure 3.1.

Figure 3.1: An example of how the post processing removes overlapping masks, most noticeable in the lower left corner: (a) initial predictions of Mask r-cnn; (b) predictions after applying post processing.

3.1.5 Evaluation

To evaluate the performance of the different models, SpaceNet's [20] evaluation algorithm is downloaded.

During evaluation of a model, the model prediction is inferred on one image at a time, where the predicted binary mask array is converted into polygons. The geometries of all polygons in one image are stored in a separate file. The same procedure has been applied to the ground truth data, and the predicted polygons are compared to the ground truth. The procedure is the same for both types of models, with some minor alterations in the conversion of masks into polygons, since the format of the output varies between the models.

The result is presented as the F1 score, along with the precision and recall. During the evaluation, some of the images are saved with their predictions, in order to make it possible to visually review the results.

3.2 Training

The networks were trained until the training loss flattened or until signs of overfitting occurred. During training, both the training and validation losses were monitored using TensorBoard [2]. The weights with the lowest validation loss were saved and used for evaluation.

3.2.1 Experiments

The main focus of this thesis is to adapt the Mask r-cnn model to the given input and make it detect building footprints as accurately as possible. In order to find the best performing configuration of the model, different parameters and training setups were tested and tuned. In the following sections, some fine tuning methods used during training are explained, along with an explanation of the learning rate schedule used in one of the experiments.

Fine tuning

Since training all weights of a deep network is a time consuming task, there is a range of different fine tuning methods available. These methods are based on the concept that if the network uses pre-trained weights, it is not necessary for all layers to update; some layers can stay frozen during training. During the experiments in this thesis, all models are initialized with pre-trained weights from [1]. Tables 3.1 to 3.3 explain the fine tuning schedules used in this thesis. The heads layers represent the RPN, classifier and mask heads of the network, while 4+ represents the layers of ResNet stage 4 and up, and the masks layers represent the mask branch.

Table 3.1: Fine tuning using initial weights and training layers 'heads'.

Iterations   Learning rate   Epochs   Layers trained
1            10^-4           50       heads

Table 3.2: Fine tuning approach tested while training Mask r-cnn using pre-trained weights.

Iterations   Learning rate   Epochs   Layers trained
1            10^-4           15       heads
2            10^-5           15       4+
3            10^-6           20       all

Table 3.3: Fine tuning approach tested while training Mask r-cnn using pre-trained weights.

Iterations   Learning rate   Epochs   Layers trained
1            10^-4           30       all
2            10^-4           30       masks

Learning rate schedules

Finding the optimal learning rate can be problematic and differs between problems. One way is to use a constant learning rate throughout the training of the network. Another option is to use a learning rate schedule. A learning rate schedule is a way to decrease the learning rate over time, since it is often useful to reduce the learning rate as the model converges. One example of a learning rate schedule is step decay. The step decay schedule drops the learning rate by a factor every few epochs; a typical example is lowering the learning rate by half every 10 epochs [14]. This is visualized in figure 3.2.
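The schedule is a one-liner; a sketch with the values from the example above:

```python
def step_decay(epoch, initial_lr=1e-5, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every epochs_per_drop epochs (figure 3.2)."""
    return initial_lr * drop ** (epoch // epochs_per_drop)
```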


Figure 3.2: The learning rate schedule step decay. The learning rate is lowered by half every 10th epoch.

4 Result

The final results of the proposed and baseline methods are presented in the following chapter.

4.1 Experiments

In order to find the best performing configuration of Mask r-cnn, different parameters and training setups were tested and tuned, such as comparing the Adam and SGD optimizers, testing different learning rates and changing the weights of the loss function. RGB images were used as input during the initial experiments, with the results presented in the sections below.

The backbone network used by Mask r-cnn was ResNet101, since it performed better in the paper by Kaiming He et al. [10] than ResNet50, which was also available as a backbone network.

4.1.1 Results using Adam optimizer

The results from trying different learning rates with the Adam optimizer are shown in figure 4.1, where the optimal learning rate was found to be 10^-5. This was the learning rate used throughout the rest of the experiments using the Adam optimizer.

Figure 4.1: Bar chart showing the resulting F1-score, precision and recall for Mask r-cnn using the Adam optimizer with learning rates 10^-4, 10^-5 and 10^-6. All layers are allowed to update during training.

By visually inspecting the model's performance, the lower scores seemed to depend mainly on badly segmented masks. An example of input and corresponding predictions can be seen in figure 4.2. The figure shows several bounding boxes without any masks and some with poor mask estimations.

Figure 4.2: Resulting predictions of three images, using Mask r-cnn trained with the Adam optimizer and learning rate 10^-5.

In order to improve the results, the weights of the different loss functions were altered to add focus to the segmentation. The weight vector, explained in section 3.1.2, was changed from [1, 1, 1] to [1, 1.5, 1], giving the cross entropy loss used for the mask segmentation a higher weight. The learning rate schedule explained in section 3.2.1 was also applied in an attempt to improve the scores; the step decay schedule had an initial learning rate of 10^-5. The results from these experiments are shown in table 4.1. Post processing was applied to the predictions from the best performing model to see if the score would improve with the removal of overlapping masks.

Experiment   Model configuration    F1 score   Precision   Recall
1            Using step decay       0.601      0.641       0.566
2            Altered loss weights   0.563      0.561       0.521
3            With post processing   0.628      0.630       0.625

Table 4.1: F1-score, precision and recall for Mask r-cnn using the Adam optimizer and learning rate 10^-5. The first experiment used the learning rate schedule step decay explained in section 3.2.1 during training. The second model setup used the altered loss weight vector explained in section 3.1.2. In the third experiment, post processing was applied to the best performing model setup with the Adam optimizer.

4.1.2 Results using SGD optimizer

The results from trying different learning rates with the SGD optimizer are shown in figure 4.3, where the optimal learning rate was found to be 10^-4. This was the learning rate used throughout the rest of the experiments using the SGD optimizer.

Figure 4.3: Bar chart showing the resulting F1-score, precision and recall for Mask r-cnn using the SGD optimizer with learning rates 10^-3, 10^-4 and 10^-5. All layers are allowed to update during training.

The results were visually inspected and some examples are displayed in figure 4.4. Since the performance increased with the SGD optimizer, based on both the F1-score and visual inspection of the predictions, more experiments were conducted using the SGD optimizer.

Figure 4.4: Resulting predictions of three images, using Mask r-cnn trained with the SGD optimizer and learning rate 10^-4.

The same experiments as with the Adam optimizer were executed with the SGD optimizer. The results are displayed in table 4.2. In addition, the different fine tuning methods explained in section 3.2.1 were tested in order to see how they affected the results. Those results are presented in table 4.3.

Experiment   Model configuration    F1 score   Precision   Recall
1            Using step decay       0.640      0.625       0.656
2            Altered loss weights   0.643      0.624       0.663
3            With post processing   0.679      0.679       0.679

Table 4.2: F1-score, precision and recall for Mask r-cnn using the SGD optimizer and learning rate 10^-4. The first experiment used the learning rate schedule step decay explained in section 3.2.1 during training. The second model used the altered loss weight vector explained in section 3.1.2. In the third experiment, post processing was applied to the best performing model setup with the SGD optimizer, which used default settings and a constant learning rate of 10^-4.

Experiment   Model configuration               F1 score   Precision   Recall
1            Training 'heads'                  0.566      0.537       0.598
2            Training 'heads' + '4+' + 'all'   0.563      0.523       0.610
3            Training 'all' + 'mask'           0.625      0.619       0.630
4            Training 'all'                    0.668      0.639       0.700

Table 4.3: F1-score, precision and recall for Mask r-cnn using the SGD optimizer and learning rate 10^-4. Experiments 1-4 use the different fine tuning methods explained in section 3.2.1.

Since the range of building sizes was quite large, an additional set of anchor scales was tested: [16, 32, 64, 128, 256], in comparison to [8, 16, 32, 64, 128], which was used in the other experiments. The sizes of the anchor scales affected the results significantly; too large scales made the performance of the RPN part of the model decrease, since the anchors were then too large to detect the smaller buildings, causing the number of false negatives to increase and the recall to drop. The results from two different trainings using the same settings, apart from the different sets of anchor scales, are visible in table 4.4.

Model configuration                    F1 score   Precision   Recall
Anchor scales [16, 32, 64, 128, 256]   0.169      0.743       0.095
Anchor scales [8, 16, 32, 64, 128]     0.668      0.639       0.700

Table 4.4: F1-score, precision and recall for Mask r-cnn using the SGD optimizer and learning rate 10^-4. The models have been trained in the same way, except for using different sets of anchor scales.

4.1.3 Comparing with baseline

The best performing Mask r-cnn was used to compare its results with the baseline model performing semantic segmentation. The results are displayed in table 4.5, accompanied by the loss graphs and some sample result images further down. Using the best performing model setup, the input was then changed from RGB images to RGB-D images, and those results are presented in table 4.6. For training the baseline model, two different learning rates were tested, 10^-4 and 10^-5, where 10^-4 performed better and was used in the experiments.

Model name                        F1 score   Precision   Recall
U-net RGB                         0.69782    0.68528     0.71082
Mask r-cnn RGB                    0.66803    0.63865     0.70025
Mask r-cnn RGB, post processing   0.67887    0.67907     0.67867

Table 4.5: F1-score, precision and recall for the best performing networks using RGB images as input.

Model name                          F1 score   Precision   Recall
U-net RGB-D                         0.68435    0.68342     0.68529
Mask r-cnn RGB-D                    0.67626    0.71956     0.63787
Mask r-cnn RGB-D, post processing   0.67711    0.67892     0.67531

Table 4.6: F1-score, precision and recall for the best performing networks using RGB-D images as input.

Note that the different methods have different loss functions, which contributes to the placement of the graphs on the vertical axis in figures 4.5 and 4.6. In figures 4.7 and 4.8, it is noticeable that the masks of the building footprints predicted by Mask r-cnn have slightly more rounded edges compared to the building footprints predicted by U-net. Mask r-cnn extracts its masks from RoIs that have been reshaped using the RoI pooling described in section 2.3.3, with the rounded edges as a result.


(a) U-net RGB. (b) U-net RGB-D.

Figure 4.5: Training and validation loss for both U-net networks.

(a) Mask r-cnn RGB. (b) Mask r-cnn RGB-D.

Figure 4.6: Training and validation loss for both Mask r-cnn networks.


(a) Original image. (b) Ground truth. (c) U-net RGB. (d) U-net RGB-D. (e) Mask r-cnn RGB. (f) Mask r-cnn RGB-D.

Figure 4.7: Resulting predictions for an image with a high F1-score for both models, where the output from the Mask r-cnn models is an instance segmentation and the output from U-net is a semantic segmentation.


(a) Original image. (b) Ground truth. (c) U-net RGB. (d) U-net RGB-D. (e) Mask r-cnn RGB. (f) Mask r-cnn RGB-D. (g) Mask r-cnn RGB with post processing. (h) Mask r-cnn RGB-D with post processing.

Figure 4.8: Resulting predictions of an image, using both U-net and Mask r-cnn. The results from Mask r-cnn are displayed both with and without post processing.


5 Discussion

5.1 Discussion

In this chapter the results of the experiments in the thesis are discussed. The data used as input is also discussed, along with how the data has affected the results and some comments on the choice of method.

5.1.1 Data

The data used as ground truth in this thesis has been manually annotated, which has resulted in slightly inconsistent labeling of buildings. In some cases each building is annotated as a separate building, while at other times multiple closely adjacent buildings are grouped as one large building. This is a consequence of the fact that there is no true ground truth in the perception of what constitutes one building. This is simply up to the person labeling the data, and the human factor influences the dataset.

Another factor that has made it harder for the model to adjust is that the sizes of the buildings vary greatly. One building can be defined by a couple of pixels or, in other cases, cover most of the image. This makes it necessary to have more anchors of different scales to be able to find all of the different sized buildings. This causes the model to produce more boxes than might be necessary, which increases the risk of overlapping masks and false positives. More anchors also require more calculation and computational power.

5.1.2 Method and experiments

The source code downloaded from [4] was not implemented by the authors of the article [10]. There is a risk in using code found online, since it might contain errors or deviate from the original article. The code and article were compared and no errors were found.

With the Adam optimizer, a lower learning rate was required to get results similar to those obtained with SGD. Although Adam is more popular and considered to give better and faster results, SGD gave better results with Mask r-cnn. Using a training schedule with different layers frozen during different phases of the training did not generate better results than allowing all layers to update throughout the training. Neither did the learning rate schedule.

5.1.3 Result

Overall, Mask r-cnn has results similar to U-net, implying that the extra complexity of an instance segmentation network does not give the desired improvement of the results. The performance is measured by the F1 score, which is a good indicator of how well the models perform. However, this does not reflect the whole complexity of the different models.

To compare the performance on the specific task of separating adjacent buildings, the precision and recall can give some hints on how this behaviour differs between semantic segmentation and instance segmentation. The lower precision score achieved with Mask r-cnn implies that more false positive detections are made, which can be an effect of over segmentation compared to the ground truth. Figure 5.1 shows an example where Mask r-cnn received a very low F1 score on account of a high number of false positives, as the worked example below shows. This can be an effect of the dataset not being consistent enough.
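As a check, the standard definitions of precision, recall and the F1 score reproduce the score reported in the caption of figure 5.1:

\[
\text{precision} = \frac{TP}{TP + FP} = \frac{5}{5 + 77} \approx 0.061, \qquad
\text{recall} = \frac{TP}{TP + FN} = \frac{5}{5 + 14} \approx 0.263,
\]
\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.099.
\]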

The occurrences of overlapping masks were often caused by two different scenarios: either a group of buildings is detected both separately and as one bigger building, or a building is assigned a bounding box that covers not only the intended building but also parts of other buildings. Using post processing on the detections managed to increase the precision, although at the cost of lowering the recall, because the number of false positives decreased while the number of false negatives increased. The overall F1 score therefore only increased slightly. The frequency and extent of the overlapping masks also decreased with higher performing models, as is visible when comparing figures 4.2 and 4.4 in the result. This made the possible improvement from post processing smaller. One way to implement such overlap suppression is sketched below.
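A minimal sketch of such post processing: detections are visited in order of decreasing confidence, and a mask is dropped if too much of its area is already covered by previously kept masks. The function name and the overlap threshold are illustrative assumptions; the exact rule used in the experiments may differ.

import numpy as np

def suppress_overlapping_masks(masks, scores, max_overlap=0.5):
    """Greedy suppression of overlapping instance masks.

    masks  -- list of boolean HxW arrays, one per detected building
    scores -- detection confidence for each mask
    Returns the indices of the masks to keep.
    """
    if not masks:
        return []
    order = np.argsort(scores)[::-1]               # highest confidence first
    covered = np.zeros(masks[0].shape, dtype=bool)  # union of kept masks
    keep = []
    for i in order:
        mask = masks[i]
        area = mask.sum()
        if area == 0:
            continue
        overlap = np.logical_and(mask, covered).sum() / area
        if overlap <= max_overlap:                  # mostly uncovered: keep it
            keep.append(i)
            covered |= mask
    return keep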

Figure 5.1: An example of a scenario where Mask r-cnn gets a very low F1 score because of over segmentation. Panels: (a) original image, (b) ground truth, (c) predictions Mask r-cnn, (d) predictions U-Net. Image results for the Mask r-cnn model (RGB input): F1 score: 0.099, true positives: 5, false positives: 77, false negatives: 14.

By visual inspection of the results, Mask r-cnn seemed to perform better on the object detection part of the network than on the mask segmentation part. This showed up as bounding boxes containing masks that cover only a fraction of the intended building. Such masks have a high risk of being removed during post processing and end up counted as false negatives, even though the network has actually provided the information that a building is present. A simple experiment where empty bounding boxes were 'colored' and used as masks increased the F1 score of the highest performing Mask r-cnn by around 1%; a sketch of this fallback is given below. Even when the network was trained specifically on the mask branch, with the other layers frozen, the problem still occurred, which suggests some ideas for future work. The problem did decrease with further training of the model, but it remains even in the highest scoring Mask r-cnn. The masks also tended to have rounded edges, most probably caused by the reshaping done by the RoI pooling. To get more accurate masks, the sizes of the RoIs could be increased to achieve a higher resolution. This was tested during the experiments but caused memory failures during training. A smaller batch size or smaller input images might have made it possible to feed larger RoIs to the classifier and mask segmentation branch. Another approach would be to use bounding regulation, which is further discussed in "Future work", section 6.2.
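A minimal sketch of the 'colored box' fallback described above, assuming each detection comes with a boolean mask and a pixel-coordinate bounding box; the function name and the min_fill threshold are illustrative assumptions.

import numpy as np

def box_as_mask_fallback(mask, box, min_fill=0.05):
    """Fall back to the filled bounding box when the mask is nearly empty.

    mask -- boolean HxW array for one detection
    box  -- (y1, x1, y2, x2) bounding box in pixel coordinates
    """
    y1, x1, y2, x2 = box
    box_area = max((y2 - y1) * (x2 - x1), 1)
    fill = mask[y1:y2, x1:x2].sum() / box_area
    if fill < min_fill:
        filled = np.zeros_like(mask)
        filled[y1:y2, x1:x2] = True   # 'color' the box and use it as the mask
        return filled
    return mask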


Adding depth as an extra input channel improved the results of both Mask r-cnn and U-net, but only by about 1%. The small improvement could depend on the fact that the optimal hyper parameters were not found for this specific input. More time was spent on optimizing hyper parameters using only RGB images; with more time and experiments with the additional input channel, a better result might have been possible. There is also the possibility that adding depth did not contribute as much useful information to the networks as hoped. Adding more or different channels might have resulted in higher performance for both models.


6 Conclusion

6.1 Summary

Both methods give very similar results when evaluated with the F1 score. The result implies that instance segmentation using Mask r-cnn will not give better results than semantic segmentation using U-net when extracting building footprints from rendered images in a true ortho view of 3d models created from multi-view satellite images. There is, however, room for improvement of Mask r-cnn, and with further investigation of the method a higher performance might be achieved. Since the object detection performed by Mask r-cnn shows that the model has the potential to increase its performance with a better mask segmentation, further investigation of specifically the mask branch could boost the model's overall performance. The following questions are answered:

How will an instance segmentation network perform on the task of extracting instances of building footprints?

Mask r-cnn is able to detect building footprints in the images and reaches an F1 score of 0.676. This shows that it is possible to train a cnn using the provided input and to achieve results similar to those of the semantic segmentation used as a baseline when extracting building footprints.

Is there an advantage of using instance segmentation over semantic segmentation when comparing the separation of adjacent buildings in an image?

Mask r-cnn performs similarly to U-net when comparing the F1 scores of the two models. The results show that Mask r-cnn tends to over segment images, separating buildings to a higher extent than U-net. This has, however, contributed to a higher rate of false positives, leading to the conclusion that the methods of semantic and instance segmentation produce similar results when separating adjacent buildings in an image.

6.2 Future work

To improve the mask predictions from Mask r-cnn, a further investigation could be done on how to improve the mask branch. Replacing or expanding that part of the network could improve the results, as could further experiments with the weights of the loss function or with freezing different layers.

To make sure that there is only one building inside each bounding box, an alternative solution would be to represent each bounding box with an additional angle, making every bounding box a minimum-area bounding box. Another idea for improving the masks would be to use building bounding regulation, i.e. smoothing the edges of the buildings. This would force the buildings to have straight edges, which they may not have in a pixel-wise segmentation, and it would also account for the rounded edges caused by the RoI pooling; both ideas are sketched below. Adding more input channels is a simple experiment that would give the model more information and might improve the result. Expanding the dataset or adding more alternatives for data augmentation could also improve the model performance.
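A minimal sketch of both ideas, assuming OpenCV 4.x is available and that each predicted mask is a binary uint8 image; the function name and the epsilon_frac parameter are illustrative assumptions.

import cv2
import numpy as np

def regularize_building(mask, epsilon_frac=0.01):
    """Derive a rotated minimum-area box and a simplified outline for a mask.

    mask -- uint8 HxW array, 1 inside the predicted building, 0 outside
    Returns (box_corners, outline), or (None, None) if the mask is empty.
    """
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, None
    contour = max(contours, key=cv2.contourArea)

    # Minimum-area rectangle: centre, size and rotation angle.
    rect = cv2.minAreaRect(contour)
    box_corners = cv2.boxPoints(rect)  # the four rotated corners

    # Douglas-Peucker simplification straightens jagged pixel edges.
    epsilon = epsilon_frac * cv2.arcLength(contour, True)
    outline = cv2.approxPolyDP(contour, epsilon, True)
    return box_corners, outline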


Bibliography

[1] Splash of color: Instance segmentation with mask r-cnn and tensorflow. https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46, last accessed on 2019-12-19. Cited on page 23.

[2] Tensorboard: Tensorflow's visualization toolkit. https://www.tensorflow.org/tensorboard, last accessed on 2019-12-19. Cited on page 22.

[3] Coco, common objects in context, 2019. http://cocodataset.org/#home, last accessed on 2020-01-08. Cited on page 12.

[4] Waleed Abdulla. Mask r-cnn for object detection and instance segmentation on keras and tensorflow. https://github.com/matterport/Mask_RCNN, 2017. Cited on pages 19, 20, and 37.

[5] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3150–3158, 06 2016. doi: 10.1109/CVPR.2016.343. Cited on page 12.

[6] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, December 2015. Cited on page 9.

[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. The IEEE International Conference on Computer Vision (ICCV), 2014. ISBN 978-1-4799-5118-5. doi: 10.1109/CVPR.2014.81. URL http://ieeexplore.ieee.org/document/6909475/. Cited on page 9.

[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. Cited on pages 5, 6, 13, 14, and 15.

[9] Lisa Haskell and Rob O'Donnell. Stand up straight—ortho perspective on downtown denver, 2001. https://www.esri.com/news/arcuser/1001/standup.html, last accessed on 2020-01-16. Cited on page 2.


[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. Cited on pages 12, 13, 25, and 37.

[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. A Nature Research Journal, 521(7553):436, 2015. https://www.nature.com/articles/nature14539.pdf. Cited on page 6.

[12] Vladimir I. Iglovikov, Selim S. Seferbekov, Alexander V. Buslaev, and Alexey A. Shvets. Ternausnetv2: Fully convolutional network for instance segmentation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 228–2284, 2018. Cited on page 12.

[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014. Cited on pages 15 and 16.

[14] Suki Lau. Learning rate schedules and adaptive learning rate methods for deep learning, 2017. https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1, last accessed on 2020-01-31. Cited on page 23.

[15] Xiaolong Liu, Zhidong Deng, and Yuhan Yang. Recent progress in semantic image segmentation. Artificial Intelligence Review, 52(2):1089–1106, Jun 2018. ISSN 1573-7462. doi: 10.1007/s10462-018-9641-3. URL http://dx.doi.org/10.1007/s10462-018-9641-3. Cited on page 7.

[16] Xiaolong Liu, Zhidong Deng, and Yuhan Yang. Recent progress in semantic image segmentation. Artificial Intelligence Review, 52(2):1089–1106, Aug 2019. ISSN 1573-7462. doi: 10.1007/s10462-018-9641-3. URL https://doi.org/10.1007/s10462-018-9641-3. Cited on page 10.

[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, June 2015. ISSN 1063-6919. doi: 10.1109/CVPR.2015.7298965. Cited on page 10.

[18] Niall O'Mahony, Trevor Murphy, Krishna Panduru, Daniel Riordan, and Joseph Walsh. The nature of code, 2016. http://natureofcode.com/book/chapter-10-neural-networks/, last accessed on 2019-12-10. Cited on page 5.

[19] Niall O'Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, and Joseph Walsh. Deep learning vs. traditional computer vision. Springer International Publishing, pages 128–144, 2020. Cited on page 5.

[20] SpaceNet on AWS. Spacenet round 2 challenge implementations, 2016. https://github.com/SpaceNetChallenge/utilities/blob/
