
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2018

Improving Photogrammetry using Semantic Segmentation

Björn Kernell


Master of Science Thesis in Electrical Engineering

Improving Photogrammetry using Semantic Segmentation

Björn Kernell LiTH-ISY-EX--18/5118--SE

Supervisor: Karl Holmqvist

isy, Linköpings universitet

Martin Svensson

Spotscale

Examiner: Per-Erik Forssén

isy, Linköpings universitet

Division of Computer Vision
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Björn Kernell


Sammanfattning

3D reconstruction is the technology behind creating 3D models from images. It is a process with many steps, where each step can introduce errors. When doing 3D reconstruction of large outdoor environments, there are certain types of image content that often cause problems. Two of these are water and sky. Water is problematic since it can fluctuate considerably from image to image and can contain reflections that give different appearances from different angles. Sky, on the other hand, should never give rise to any 3D information and might therefore just as well be masked away.

Manual masking of images is very time-consuming and expensive. This thesis investigates whether this masking can be done automatically with Convolutional Neural Networks for Semantic Segmentation, and how this could improve a 3D reconstruction process.


Abstract

3D reconstruction is the process of constructing a three-dimensional model from images. It contains multiple steps where each step can induce errors. When doing 3D reconstruction of outdoor scenes, there are some types of scene content that regularly cause problems and affect the resulting 3D model. Two of these are water, due to its fluctuating nature, and sky, because it contains no useful (3D) data. These areas cause different problems throughout the process and generally do not benefit it in any way. Therefore, masking them early in the reconstruction chain could be a useful step in an outdoor scene reconstruction pipeline.

Manual masking of images is a time-consuming and boring task and it gets very tedious for the big data sets which are often used in large scale 3D reconstructions. This master thesis explores if this can be done automatically using Convolutional Neural Networks for semantic segmentation, and to what degree the masking would benefit a 3D reconstruction pipeline.


Acknowledgments

First of all I want to thank Spotscale for giving me the opportunity to do my master thesis together with them. I want to thank my supervisor Martin Svensson for all the help and support.

I would also like to thank my examiner Per-Erik Forssén and my supervisor Karl Holmquist at Linköping University for the discussions and quick feedback throughout this thesis work.

Finally I would like to thank my family and friends for their continuous support and encouragement.

Linköping, April 2018 Björn Kernell


Contents

Notation

1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Problem formulation
  1.4 Limitations
  1.5 Related works
    1.5.1 Convolutional neural networks
    1.5.2 Semantic segmentation

2 Theory
  2.1 Artificial Neural Networks
  2.2 Activation function
  2.3 Loss Functions
    2.3.1 Score Function
    2.3.2 Multiclass Hinge Loss
    2.3.3 Cross Entropy Loss
  2.4 Regularization
    2.4.1 Dropout
    2.4.2 Batch normalization
  2.5 Optimization
    2.5.1 Gradient Descent
    2.5.2 Backpropagation
  2.6 Convolutional Neural Network
    2.6.1 Feature Maps and Filter Kernels
    2.6.2 Convolution
    2.6.3 Pooling
    2.6.4 Convolutional Neural Networks for Classification
  2.7 Semantic Segmentation
    2.7.1 Unpooling
    2.7.2 Fractionally strided convolution
    2.7.3 Fully Connected Layers
    2.7.4 Transfer Learning
  2.8 3D reconstruction

3 Method
  3.1 Fully Convolutional Network
    3.1.1 Architecture
    3.1.2 Loss function
    3.1.3 Regularization
    3.1.4 Optimization
    3.1.5 Initialization
    3.1.6 Field-of-view
  3.2 Training
    3.2.1 Training Data
    3.2.2 Data Augmentation
    3.2.3 Active Learning
    3.2.4 Class Balance
    3.2.5 Epochs
    3.2.6 Validation
  3.3 Evaluation
    3.3.1 Evaluation metrics
    3.3.2 Improving Photogrammetry
    3.3.3 Qualitative evaluation
    3.3.4 Quantitative evaluation
    3.3.5 Evaluation of Photogrammetry improvement
    3.3.6 Experiment setups

4 Results
  4.1 Semantic Segmentation
    4.1.1 Quantitative results
    4.1.2 Qualitative results
  4.2 Qualitative Photogrammetry improvements

5 Discussion
  5.1 Results
  5.2 Methodology
    5.2.1 Training
    5.2.2 Training data
    5.2.3 FCN as network of choice
  5.3 Conclusions
  5.4 Future work

Bibliography


Notation

Abbreviations

Abbreviation   Meaning
CNN            Convolutional Neural Network
ANN            Artificial Neural Network
SFM            Structure from Motion
MVS            Multi-View Stereo
FCN            Fully Convolutional Network


1

Introduction

This thesis explores if water and sky can be masked automatically from images using Semantic Segmentation with Convolutional Neural Networks and to what degree it would benefit 3D reconstruction in a Photogrammetry pipeline.

Photogrammetry is a set of techniques that extract measurements of different kinds from images[8]. Stereophotogrammetry is a subgroup of these techniques and it specifically aims to estimate 3D coordinates of objects in the images. The process when this is done from images only is often, and will be throughout this report, referred to as 3D reconstruction.

Semantic segmentation of an image is the process of dividing the image into segments of different semantics[16]. This can for instance be a self-driving car finding which pixels of the scene, captured from its front-facing camera, belong to road, people or other cars. Semantic segmentation is an old computer vision problem that has had much success in recent years with the introduction of convolutional neural networks.

1.1

Motivation

When doing 3D reconstruction of outdoor scenes, there are some types of scene content that regularly cause problems and affect the resulting 3D model. Two of these are water, due to its fluctuating and reflective nature, and sky, because it contains no useful (3D) data. These areas cause different problems throughout the process and generally do not benefit it in any way. Therefore, masking them early in the reconstruction chain could be a useful step in an outdoor scene reconstruction pipeline.

Large scale 3D reconstructions use very large amounts of data in the form of images. A reconstruction of a medium sized outdoor scene can use over 50,000 high resolution images. Manual masking of images is a time-consuming and boring task and it gets very tedious for big data sets like these. If this could be done automatically it could be a step in a photogrammetry pipeline.

There are several parts of the pipeline that could benefit from the masking. For instance, the SFM (detailed in 2.8) should only use points in the water and sky by mistake, and with the masking it can ignore these areas altogether. This could potentially speed up this part of the process. The texturing can benefit from knowing the semantics of the texture. For instance, it should never apply sky texture to the models and can use a different texturing (reflections) for water. The MVS (detailed in 2.8) can filter out points that it would potentially get from the sky. It could also benefit from knowing what is water since these areas should be approximately flat in the 3D model. The semantic information can also be projected onto the model to obtain a 3D semantic segmentation.

1.2

Goal

The goals of this thesis are

• to fine-tune a pre-trained CNN on a small set of images to segment and subsequently mask water and sky from images

• to quantitatively and qualitatively evaluate to what degree the segmenta-tion is successful

• to qualitatively evaluate how this masking improves the resulting 3D model.

1.3

Problem formulation

This thesis will attempt to answer the following questions:

• Can a pre-trained CNN be fine-tuned on a small batch of data to perform semantic segmentation?

• How well can the network perform the semantic segmentation of sky and water?

• Can masking of water and sky improve a photogrammetry pipeline?

1.4

Limitations

As with most machine learning applications, the results of the thesis will be limited by the quality and quantity of the training data. This is because the available data is limited and labelling is an expensive task.

The result of the semantic segmentation will be limited by the chosen network architecture (FCN). There is no perfect architecture. The chosen network must also have a pre-trained model available since training data is limited.

Since transfer learning will be used, the alterations in architecture are limited; many alterations make using the pre-trained weights impossible and even if the filters do fit they still risk making the pre-training obsolete.


1.5

Related works

This thesis uses the FCN[16] architecture for semantic segmentation. However, there are several other networks with different architectures, each with its own benefits and drawbacks.

1.5.1

Convolutional neural networks

Convolutional neural networks (CNNs, or ConvNets) have been the method of choice for image classification since 2012 when AlexNet[14] won the ILSVRC[19] (ImageNet Large-Scale Visual Recognition Challenge). The network used techniques that are still prevalent today, like data augmentation and the ReLU as activation function.

VGG[20] is a classification CNN made by Simonyan et al. of the Visual Geometry Group at Oxford University. There are two versions of the network, a 16 layer and a 19 layer. The novelty of this net was its smaller filter sizes and, at the time, deep architecture. It is widely used as an encoder for semantic segmentation networks[16][2].

1.5.2

Semantic segmentation

With the recent success of CNNs in classification, these new techniques were being applied to other classical computer vision tasks, one being semantic segmentation. The first real breakthrough here came when Long et al.[16], with their network FCN, proposed removing the last layers of CNNs used for classification, and adding deconvolutional (convolutional transpose[11]) layers to upsample to a full image resolution output with class labeled pixels. Since FCN, there have been a number of different networks, each successful in some way. Some of them are detailed below.

SegNet[2] is a network proposed by Badrinarayanan et al. in 2016. They use a decimated VGG16 (the 16 layer version) architecture as encoder and a decoder that performs a non-linear upsampling by using a reversed max-pooling operation, with pooling indices memorized from the encoder. SegNet is efficient in terms of memory usage and inference computation time. It has significantly fewer parameters than alternative architectures like DeepLab-LargeFOV[4] and FCN[16].

DeepLab[4] by Chen et al. tries to solve two common problems among D-CNNs (Deep Convolutional Neural Networks) for semantic segmentation. The first one is that the repeated pooling operations, while allowing the network to get a larger field-of-view, can hinder its attentiveness to detailed spatial information. The second one is the problem of handling objects at different scales. DeepLab attempts to solve the first problem with an operation they call Atrous convolution (kernel spreading) and the second one by using different scale pyramids. Their newest network, DeepLabv3, achieves state-of-the-art results on the PASCAL VOC 2012 semantic image segmentation benchmark[7].


Enet[18] by Paszke et al. is a light network with only 370 000 parameters (compared to FCN with over 100 million). Despite its small size it performs decent predictions on several benchmark data sets. The small size also makes the network very fast.


2

Theory

This chapter will explain the theory of this thesis with regards to the topics of semantic segmentation and 3D reconstruction.

2.1

Artificial Neural Networks

This thesis uses Convolutional Neural Networks (CNNs)[9] to perform Semantic Segmentation[16]. CNNs are a subclass of a broader collection of networks called Artificial Neural Networks (ANNs)[9]. To understand CNNs, it will be necessary to first understand ANNs in general.

ANNs take inspiration from the Biological Neural Networks in human brains[9]. As the name implies these are networks of neurons that are in some way connected. While a biological neuron is very complex, the function of a neuron in an ANN is condensed to a node that takes a number of inputs and only has one output. When a neural network is large enough it becomes a powerful tool for image classification and semantic segmentation. It sometimes seems almost human or magical, but the power comes from its ability to store and combine a very large number of feature representations. How this works will be explained in this chapter.


Figure 2.1: A small artificial neural network. Each neuron has multiple inputs but only one output that is then fed forward to the neurons in the next layer.

To understand ANNs better, it can be helpful to take a few steps back from classifying images and to study a minimal network on a much simpler classification task: to separate (classify) points of two classes, minuses and pluses, with a linear classifier. The network knows of only two features (characteristics) of these points, their x and y coordinates in a two-dimensional space as in figure 2.2b. The very simple network in figure 2.2a takes these features as input, multiplies them with the weights w_1 and w_2 (∈ R) and then adds them together in the output node z. The network knows that a negative output corresponds to the minus class and that a positive output corresponds to the plus class. The weights determine the orientation of the separating line, and in the examples below the weights are tuned to the specified task. How the weights are tuned is explained later in the report.


(a) Small network with no bias

(b) Resulting separating line with tuned weights

Figure 2.2: Single neuron classifier without bias and its resulting separating line

The small network above corresponds to the equation

z = w_1 x + w_2 y.    (2.1)

The problem with this separator is that it will always be a line through the origin. However, if a term b is added to the output node, any line in the space can be obtained. This term is referred to as the bias term. Imagine then that a positive output corresponds to the plus class and a negative output corresponds to the minus class. Then the weights can be tuned in a way that a linear classifier, like in figure 2.3b where the line is the boundary between the classes, is obtained.

z = w_1 x + w_2 y + b    (2.2)

(a) Linear separator with bias   (b) Resulting separating line with tuned weights


Now imagine a case where the classes are not linearly separable. Can they still be separated by combining different linear classifiers? The answer is no. Let two linear separators, h_1 and h_2, be added together in a node z.

(a) Linear classifier with one hidden layer   (b) The separating lines from the two neurons h_1 and h_2

Figure 2.4: Small neural network with one hidden layer

What will the resulting separating line for z look like in this case?

h_1 = w_{11} x + w_{12} y + b_1    (2.3)

h_2 = w_{21} x + w_{22} y + b_2    (2.4)

z = h_1 + h_2    (2.5)

z = w_{11} x + w_{12} y + b_1 + w_{21} x + w_{22} y + b_2    (2.6)

z = (w_{11} + w_{21}) x + (w_{12} + w_{22}) y + b_1 + b_2    (2.7)

which can be written as

z = w_3 x + w_4 y + b_3.    (2.8)

This is of course just another linear classifier. No matter how many linear separators are added together, the result will be another linear separator. This motivates adding a non-linearity to the network. This part is what is called the activation function.
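As a quick numerical check of equations 2.3-2.8, the following minimal sketch (with arbitrarily chosen weights, not taken from any trained network) verifies that the sum of two linear neurons collapses into a single linear classifier:

```python
import numpy as np

# Two linear neurons h1 and h2 (eqs. 2.3-2.4) summed in the output node z (eq. 2.5).
w11, w12, b1 = 0.5, -1.0, 0.2
w21, w22, b2 = 1.5, 2.0, -0.7

def z_two_neurons(x, y):
    h1 = w11 * x + w12 * y + b1
    h2 = w21 * x + w22 * y + b2
    return h1 + h2

# The equivalent single linear classifier (eqs. 2.7-2.8).
w3, w4, b3 = w11 + w21, w12 + w22, b1 + b2

def z_single(x, y):
    return w3 * x + w4 * y + b3

x, y = np.random.randn(2)
assert np.isclose(z_two_neurons(x, y), z_single(x, y))
```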


2.2

Activation function

The activation function[9] is added onto a node and takes the output of that node as input. There are many kinds of activation functions. Some just squash the inputs into a range between, for instance, 0 and 1 and others make the preceding node act more like a biological neuron, firing only for certain inputs.

The activation function g is applied to the output h of a node as g(h). The two primary desirable properties of an activation function for ANNs are non-linearity and differentiability. The non-linearity was motivated in the previous section and the differentiability is needed for the optimization of the network. This will be explained further in section 2.5 about optimization. The step function and a linear activation function are two functions that each satisfy one of these properties but fail the other.

(a) The step function   (b) The Sigmoid function

(c) The TANH function   (d) The Rectified Linear Unit function

Figure 2.5: Activation functions

Common choices for neural network activation functions in the past were the Sigmoid and the Tanh function. They are non-linear and fairly easy to differentiate. The problem with them is that their gradients easily get very small when doing backpropagation (introduced in section 2.5.2). This is evident when studying figures 2.5b and 2.5c. A loss that gets repeatedly squashed through the backwards pass will effectively lead to a dead (local) gradient. It is therefore also important to be careful when doing initialization and choosing a learning rate when using these functions. The most common activation function for modern neural networks is the Rectified Linear Unit[9] function (ReLU). This is a simple non-linear function that is easily differentiable and generally works well. One problem with it is that a negative input yields zero output and a zero gradient, which kills the backwards flow in backpropagation. This can lead to a dead neuron which never activates and whose gradients are always zero. This problem can be solved with slight alterations to the ReLU like the leaky ReLU, the ELU and the SReLU (all described in [5]), which all deal with problems with negative inputs to the ReLU. Despite this, the ReLU is the most commonly used activation function.
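As a small illustration, here is a minimal NumPy sketch of some of the activation functions in figure 2.5 together with their derivatives; the shrinking sigmoid gradient and the zero ReLU gradient for negative inputs are exactly the effects discussed above.

```python
import numpy as np

# Common activation functions and the derivatives used during backpropagation.
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x): s = sigmoid(x); return s * (1.0 - s)      # at most 0.25, so gradients shrink

def tanh(x): return np.tanh(x)
def d_tanh(x): return 1.0 - np.tanh(x) ** 2

def relu(x): return np.maximum(0.0, x)
def d_relu(x): return (x > 0).astype(float)                  # zero for negative inputs ("dead" ReLU)

def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)  # keeps a small gradient for x < 0
```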

With the non-linearity added, the network can generate a separating boundary like the one in figure 2.6.

(a) Non-linear classifier   (b) The boundary from the non-linear classifier with tuned weights

Figure 2.6: Small neural network with one hidden layer and non-linear activation functions

How does this connect to the task of classifying images? The naïve approach is to see every pixel of an image as one feature (or three for an RGB image) and then input it to a network. This actually works pretty well for small images, but if the images are larger the network becomes too complex and hard to train.

So far the network has magically obtained parameters that make these separations, which would not make it very useful in the real world. To optimize these parameters by itself, the network needs a metric of how well it is doing. This metric is obtained using a Loss Function.

2.3

Loss Functions

The point of the loss function (or cost function)[9] is to give the network the ability to punish different mistakes it makes. A simple example is the quadratic loss function, which punishes outliers hard. There are many types of loss functions but the most commonly used ones for classification are Multiclass Hinge loss and Cross Entropy loss. However, before the loss function is applied, it can be helpful to introduce a Score function that abstracts the whole network into something less complex.

2.3.1

Score Function

In the classification examples above, the output of the network was a single score and the sign of it decided the class. This was to help create intuition. In practice, the networks typically have one output score for each class. This is also necessary for tasks with more than two classes. For the previous examples this would look like figure 2.7a.

(a) The previous example but with one output for each class

(b) The network can be seen as a score function, in this case with an arbitrary input and as many outputs as the number of classes.

Figure 2.7: Outputs as class scores

The entire network, up to this point, can be seen as a function

f(x_i, P) = S_i    (2.9)

that takes some data x_i, a set of parameters (weights and biases) P and outputs a score (prediction) S_i for a sample i. This function is called the score function. For classification, the score function yields a vector with the length of the number of classes the network should classify. In the case of this thesis, and FCN, the score function outputs an H by W by C array or tensor where H is the input image height, W is the input image width, and C is the number of classes the network is trained to recognize. In the examples below, the score for a certain class j will be denoted as s_j. Among these scores lies the score of the correct class, s_c.


2.3.2

Multiclass Hinge Loss

Multiclass hinge loss, also known as multiclass SVM loss[23], is a loss function that predominantly reacts to misclassifications and does not reward correct classifications beyond a specific certainty. The loss for the prediction of sample x_i is calculated as

L(x_i) = \sum_{j \neq c} \max(0, \Delta + s_j - s_c).    (2.10)

\Delta is a hyperparameter (a parameter the network has no control over) that determines the margin where the loss should start accumulating. This parameter normalizes the score, which is also a sort of regularization (introduced in 2.4) since it will control the size of the parameters P of the network. The parameter is often set to one.
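A minimal sketch of equation 2.10 for a single sample, assuming a NumPy vector of class scores and the index c of the correct class:

```python
import numpy as np

# Multiclass hinge loss (eq. 2.10) for one sample.
def hinge_loss(scores, c, delta=1.0):
    margins = np.maximum(0.0, delta + scores - scores[c])
    margins[c] = 0.0                  # the sum runs over j != c
    return margins.sum()

# The correct class (index 0) wins by more than the margin, so the loss is zero.
print(hinge_loss(np.array([2.0, -1.0, 0.5]), c=0))
```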

2.3.3

Cross Entropy Loss

The cross entropy loss, also known as Softmax loss[9], is the other most common loss function in classification and semantic segmentation. It treats the incoming scores as unnormalized log probabilities. From these, the function computes the corresponding class probabilities, normalizes them and then converts them back into log probabilities. The cross entropy loss for sample x_i is calculated as

L(x_i) = -\log\left( \frac{e^{s_c}}{\sum_j e^{s_j}} \right).    (2.11)

The output of this function goes to zero as the normalized probability approaches one and it theoretically yields infinite loss as the probability of the correct class goes to zero.
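A corresponding sketch of equation 2.11; the subtraction of the maximum score is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

# Cross entropy (softmax) loss (eq. 2.11) for one sample.
def cross_entropy_loss(scores, c):
    scores = scores - scores.max()                 # stability only
    probs = np.exp(scores) / np.exp(scores).sum()  # normalized class probabilities
    return -np.log(probs[c])

print(cross_entropy_loss(np.array([2.0, -1.0, 0.5]), c=0))   # roughly 0.24
```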

2.4

Regularization

The goal of most classification and semantic segmentation tasks is to get a general solution, i.e. the network which is optimized should generalize to data different from the training data. The simplest regularizations are added as a term to the total loss of the training set of N images as

L = \frac{1}{N} \sum_i L(x_i) + \lambda R(P).    (2.12)

This term should be something that scales with a property one wants to regulate. For instance, L2 regularization[9] scales with the square of the parameter vector P. This constrains the size of the parameters of the network, preventing them from growing too large.

It is important to note that regularizations often increase the training time of the network and that many regularization methods accomplish approximately the same thing. Therefore it is wise to study which methods might be relevant for the task at hand.


2.4.1

Dropout

Dropout[9] is a very common method for regularization in training of ANNs. The method adds a probability that the output of each neuron will be set to zero for each training iteration. This method improves the generalization of the network since it prevents it from over-using the same neurons. The probability p to keep a neuron activation is a hyperparameter that can be tuned. Small values of p will cause higher training time and might lead to underfitting, but a value too high might not yield enough regularization. Typically p is set to a value between 0.5 and 0.8[21]. Many modern architectures for classification have abandoned dropout, in part because the need for its regularization is smaller for the huge architectures but also because of a newer technique called batch normalization.
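A minimal sketch of dropout applied to one layer's activations; this is the common "inverted" variant, where the kept activations are rescaled by 1/p so that nothing needs to change at test time (an implementation detail not discussed above).

```python
import numpy as np

# Inverted dropout: each activation is kept with probability p and rescaled by 1/p.
def dropout(activations, p=0.5, training=True):
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p) / p
    return activations * mask
```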

2.4.2

Batch normalization

Batch normalization[12] normalizes each batch to zero mean and unit variance. This is generally applied as a layer before or after the activation function. This allows for a higher learning rate which can decrease training time. A downside with batch normalization is that it does not have a control parameter to control its regularizing effect.
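A minimal sketch of the normalization itself for a mini-batch of activations (one row per sample); a full batch normalization layer also has a learnable scale and shift and keeps running statistics for inference, which are left out here.

```python
import numpy as np

# Normalize each feature over the mini-batch to zero mean and unit variance.
def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)
```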

2.5

Optimization

With the loss function defined, the next step is for the network to learn the parameters of the network that minimize the loss for the training data set, preferably in a way that generalizes well to other data. There are several ways to do this, but the most common method for ANNs is Gradient Descent[9].

2.5.1

Gradient Descent

The classic analogy used to explain Gradient Descent[9] is to imagine that you are on a big mountain in thick fog, where the height at which you stand is the loss we want to minimize. The fog makes it impossible to see anything but the immediate surroundings. The proposed method to get down to the lowest point of the mountain is then to walk in the direction of the steepest descent. The analogy probably works best for the two-dimensional case, where each dimension would be a parameter to optimize over, but one can try to imagine a mountain in several million dimensions, which would be the case when optimizing a large neural network.

In this example, the height of the mountain is the loss function L and our ground coordinate is the current parameter setting of our two parameters. The direction of the steepest descent is the negative gradient, and thus the parameter update (the next position) is calculated as

P_{e+1} = P_e - \gamma \nabla_P L(P_e)

where e is the current epoch and γ is the length of the step that is taken (learning rate).

This basic gradient descent method calculates the gradient over the whole dataset for one update. This method is slow and very difficult to do for most applications due to memory issues. Therefore it is common to do Stochastic Gradient Descent (SGD), which is when the parameter update is done for each training sample, or Mini-Batch Gradient Descent, which is when it is done over a small batch of data. As it turns out, these stochastic methods have other benefits. For instance, the fact that each batch has a slightly different "mountain" makes it less likely to get stuck in local minima on the way to a general global solution.
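A minimal sketch of mini-batch gradient descent; grad_fn is an assumed helper that returns the gradient of the loss with respect to the parameters for a given batch.

```python
import numpy as np

# Mini-batch gradient descent over a training set (x, y).
def minibatch_sgd(params, grad_fn, x, y, lr=1e-3, batch_size=4, epochs=10):
    n = len(x)
    for _ in range(epochs):
        order = np.random.permutation(n)              # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            params = params - lr * grad_fn(params, x[idx], y[idx])
    return params
```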

There are still some problems with these gradient descent methods that often occur. For instance, even the stochastic versions have a hard time dealing with local minima and especially saddle points. It is therefore common to use slightly more sophisticated methods. One addition to the standard gradient descent is to add a momentum term. This especially helps the loss function converge in places where it has a very small but consistent slope or where a minimum has a very high second derivative (think a steep ravine).
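A minimal sketch of the momentum update; params, grad and velocity are assumed to be arrays of the same shape, and mu is the momentum coefficient.

```python
# One gradient descent step with momentum.
def momentum_step(params, grad, velocity, lr=1e-3, mu=0.9):
    velocity = mu * velocity - lr * grad     # accumulate a running direction of descent
    params = params + velocity
    return params, velocity
```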

Nesterov Momentum[10] is a method that uses momentum but calculates the gradient after the momentum update. As it turns out, this method outperforms the standard momentum update most of the time.

Adagrad[6] stands for adaptive gradient and uses an adaptive learning rate, which means that it has a tailored learning rate for every parameter in the network. This learning rate is also updated with every update step.

RMSprop[10] uses a moving average of its squared gradients to normalize the new gradient. The effect of this is a more balanced step size and a lower chance for exploding or vanishing gradients.

Adam[13] uses an adaptive learning rate and adaptive momentum. This method usually outperforms all the previously described methods and is often the method of choice. To compute the gradient, the derivatives of the loss with respect to each parameter are needed. For FCN there are over a hundred million parameters that need to be optimized. This means the network needs to compute over a hundred million derivatives for each training iteration. Fortunately, there are methods that make this task easier.

2.5.2

Backpropagation

The part where the information flows forward in the network is called the forward propagation[9] (or forward pass). This operation, together with a loss function and ground truth data, can then produce a loss. Backpropagation[9] is a way to calculate the gradients for the ANN by propagating the loss backwards through the network using the chain rule. What makes backpropagation so useful is that it saves many computations, since the loss only has to be computed once for each training sample instead of once for each parameter.


Numerical gradients

One way of calculating the derivatives is to compute them numerically using the finite-differences method. This method calculates the change in loss for a small step in the parameter of interest. Since the loss has to be computed for the whole network for each parameter, this method does not fully take advantage of the big benefits of backpropagation. It also does not compute exact derivatives; however, it is very straightforward and easy to understand. Often, gradients are computed with other methods and are then compared to the numerical gradients. This operation is called the gradient check.
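A minimal sketch of the centered finite-difference gradient and the relative error used in a gradient check; loss_fn is an assumed function that maps a parameter array to a scalar loss.

```python
import numpy as np

# Numerical gradient via centered finite differences.
def numerical_gradient(loss_fn, params, h=1e-5):
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step.flat[i] = h
        grad.flat[i] = (loss_fn(params + step) - loss_fn(params - step)) / (2 * h)
    return grad

# Gradient check: the relative error between analytic and numerical gradients should be small.
def gradient_check(analytic, numerical):
    return np.max(np.abs(analytic - numerical) /
                  (np.abs(analytic) + np.abs(numerical) + 1e-12))
```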

Analytic gradients

Another way to calculate the derivatives is doing it analytically (using calculus). This way the network does not need to compute the loss for the entire network for every parameter. The downside of this method is that the network needs to know the derivative of each operation beforehand and if a derivative is inaccurate the results will be poor.

Automatic differentiation

In modern machine learning libraries like Tensorflow or Pytorch, the gradients are computed with a method called Automatic Differentiation[3]. This method takes advantage of the fact that the network is running on a computer. A computer has a fixed number of mathematical expressions it can compute and it already breaks down big mathematical expressions into these. These simple expressions have simple derivatives and the network can use that along with backpropagation to effectively calculate the gradients.

2.6

Convolutional Neural Network

The Convolutional Neural Network (CNN) is an extension to the Artificial Neural Network[9]. It adds classical computer vision techniques to the ANN. Interestingly, in the same way that the ANN takes inspiration from the human brain, the CNN takes inspiration from the synergy between the human eye and brain. For instance, the human eye only focuses on about 2 degrees of its total field of view input[15]. Similarly, a CNN can be seen as only focusing on a part of the input image at a time. They also have other similarities like focusing on places with high contrast.

In classical computer vision, it is common to use feature extractors as a way to abstract and extract information from images. For instance, a popular way to detect faces is to use Haar features[24]. These features extract different shading information and detect places of certain contrast in the areas of the image where they are applied. Haar features are pre-determined, but in the CNN similar features can be learned. This proves to be a powerful tool combined with the ANN, which enables combining these features to create more specific feature representations.


2.6.1

Feature Maps and Filter Kernels

To understand convolutional neural networks, it can be useful to introduce the concept of feature maps and filter kernels. A feature map is an H (height) by W (width) by D (depth) array or tensor that stores information. For instance, an input color image (RGB coded) can be seen as an H by W by 3 feature map. The other tool to explain the CNN is the convolution filter kernel. It is a smaller 3D tensor, often with the same depth as the feature map on which it is applied.


Figure 2.9: A convolutional layer visualized as feature maps and filter kernels

2.6.2

Convolution

The convolution operation[9] is the main operation of a CNN. The basic concept is that an N × N kernel is multiplied across a feature map in a manner that resembles the mathematical 2D convolution (actually correlation). There are different ways to perform the convolution operation. The stride determines the size of each step between each kernel application. In figure 2.10 a stride of 1 is used. If a stride of 3 were to be used, there would be no overlap and, subsequently, the output would be of smaller spatial (H and W) size. Another thing that alters the convolution is the way it deals with edges. It is common to treat the areas outside the feature maps as zeros. This is called zero padding.


(a) Pixel-wise multiplication with a 3 × 3 × 1 kernel

(b) Stride of 1 and 3 × 3 × 1 kernel implies an overlap

Figure 2.10: Convolution with 3 × 3 × 1 kernel on 9 × 7 × 1 input image with stride of one yields 3 × 3 × 1 output

In figure 2.10 the depth is 1, but the same principle applies for a depth of D. The depth of the output is determined by the number of different kernels that are applied.
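A minimal single-channel sketch of the operation described above (a correlation, as noted), with stride and zero padding; a real convolutional layer applies many such kernels over feature maps with depth.

```python
import numpy as np

# Naive "convolution" (correlation) of one kernel over one feature map.
def conv2d(feature_map, kernel, stride=1, pad=0):
    x = np.pad(feature_map, pad)                         # zero padding around the edges
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)           # element-wise multiply and sum
    return out
```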

2.6.3

Pooling

Pooling is a form of downsampling used to reduce the spatial complexity of the feature maps[9]. As with the convolution, the pooling operation uses a kernel of size N × N and applies it with a specified stride. These two parameters determine how much the output feature map will be downsampled. It is common to use small kernel sizes and strides to avoid destroying too much information in the process. A common choice is a 2 × 2 filter with stride 2.

Figure 2.11: A max pooling operation with kernel size 2 × 2 and stride 2

The most used pooling operation is the Max Pooling operation, which stores the max of where the kernel is applied. Another choice is Average Pooling, which takes the mean of where the kernel is applied.
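A minimal sketch of the max pooling operation in figure 2.11 for a single-channel feature map:

```python
import numpy as np

# Max pooling with an N x N kernel and a given stride.
def max_pool(feature_map, size=2, stride=2):
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out
```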

2.6.4

Convolutional Neural Networks for Classification

If the techniques described above are combined with the ANN techniques in section 2.1, one obtains a powerful tool for image classification. There are many architectures for CNNs for classification, but what most of them have in common is a convolutional part that combines different layer types described above, followed by a fully connected neural network (ANN) that results in a vector of all trained classes. The convolutional layers can be seen as layers combining different low level features to create more complex feature representations, and the ANN part can combine these to create feature spaces where classes can be separated.

2.7

Semantic Segmentation

So far, the CNNs explained have been CNNs for classification. In this thesis, however, the usage of CNNs for Semantic Segmentation[16] is of greater interest. After repeated convolutions and pooling, the spatial dimensions of the feature maps have decreased significantly, even if the last layers of the classification net are removed. Therefore, some kind of upsampling of the feature maps is needed.

2.7.1

Unpooling

Unpooling[17] is a form of upsampling used to increase the spatial dimensions of the feature maps. It basically works like an inverted pooling operation. What distinguishes unpooling operations from each other is the way that the values of the new pixels in the output feature map are chosen. In figure 2.12, unpooling operations using Nearest Neighbor and Bed of Nails are shown.

Figure 2.12: The unpooling operations Nearest Neighbor (left) and Bed of Nails (right)
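A minimal sketch of the two unpooling variants in figure 2.12 for a 2 × 2 upsampling of a single-channel feature map (the index-based unpooling used by e.g. SegNet, which reuses pooling indices from the encoder, is not shown):

```python
import numpy as np

# Nearest Neighbor unpooling: each value fills its whole 2 x 2 output block.
def unpool_nearest(x):
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Bed of Nails unpooling: each value is placed in the top-left corner, the rest is zero.
def unpool_bed_of_nails(x):
    out = np.zeros((x.shape[0] * 2, x.shape[1] * 2))
    out[::2, ::2] = x
    return out
```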

2.7.2

Fractionally strided convolution

Fractionally strided convolution[16] (sometimes referred to as deconvolution or transpose convolution[11]) is a form of learnable upsampling that works like an inverse of the inherently down-sampling convolutional layer from 2.6.2. It is important to note that this is not the same as a deconvolution from classical signal processing. The goal of a fractionally strided convolution is to get from a certain spatial size to a larger one, using the convolution operation (from 2.6.2) to effectively reverse a convolutional layer. To do this, one needs a filter (which is learned), the input feature map, a specified stride and padding.


(a) Convolution with stride 2 and no padding

(b) Fractionally strided convolution with stride 2 and no padding. The stride of 2 leads to the fractioning of the input. The grey elements are zero and the filter is applied to the input with full zero padding.

Figure 2.13: How a convolution in a convolution layer is related to a fractionally strided convolution

Figure 2.13 shows how the fractionally strided convolution operation is related to a convolution of the same stride and padding. In both the convolution and the fractionally strided convolution the filter is applied to the input as in a normal convolution. The padding for the filter application in the fractionally strided convolution is inversely related to the padding used in the convolution. For instance, a convolution with full zero padding yields a fractionally strided convolution with no padding, and a convolution with no padding would yield a fractionally strided convolution with full zero padding.
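A minimal single-channel sketch of the zero-insertion view in figure 2.13b: zeros are inserted between the input elements ("fractioning") and an ordinary correlation with full zero padding is then applied. The filter is simply passed in as an array here; in a network it would be learned.

```python
import numpy as np
from scipy.signal import correlate2d

# Fractionally strided (transpose) convolution by zero-insertion.
def fractionally_strided_conv2d(feature_map, kernel, stride=2):
    h, w = feature_map.shape
    fractioned = np.zeros((h + (h - 1) * (stride - 1),
                           w + (w - 1) * (stride - 1)))
    fractioned[::stride, ::stride] = feature_map     # the grey elements in figure 2.13b are the zeros
    return correlate2d(fractioned, kernel, mode="full")
```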

2.7.3

Fully Connected Layers

After the feature maps are at the input spatial size again, the depth needs to be squeezed into the dimension of the number of classes the network should classify. This is done using a layer with 1 × 1 × D convolution kernels, with one kernel for each class. This can be seen as being fully connected for each pixel (spatial position) in the incoming feature map[16]. Then, for each pixel, the theory from section 2.3 can be applied and a loss for each pixel is obtained. This loss can then be used for parameter updates using backpropagation (2.5.2).

2.7.4

Transfer Learning

Large networks like FCN[16] have over 100 million parameters which makes them difficult to train. Therefore Transfer Learning is a very useful concept to apply. Since the early layers of a CNN only contain very basic features they will be similar for different training data. If a classification CNN has been trained to recognize cats and non-cats it can, without much effort, be retrained to classify dogs and non-dogs. The training will mostly alter the deeper layers of the networks, especially the prediction layer. The earlier layers of the network can also be frozen to make sure the training focuses on the later parts.


2.8

3D reconstruction

To get an understanding of where a masking of water or sky could help the photogrammetry[8] pipeline, a brief description of a typical 3D reconstruction pipeline is needed. The basic approach to the 3D reconstruction is to first find corresponding points in consecutive images. This is done by using some sort of feature descriptor that describes an area around a point and an algorithm to match these descriptors into pairs that it thinks are probable.

The next step is to introduce a camera transformation matrix that transforms a 3D point into a 2D point. This matrix also has a pseudo-inverse that can map a 2D point to a line in the 3D space. This means that two camera matrices theoretically can triangulate a 3D point from a 2D point pair. Now an optimization problem can be set up which tries to model camera matrices while triangulating 3D points. This results in a trade-off between the accuracy of the cameras and 3D points and the number of 3D points in the resulting point cloud. This is known as Structure from Motion (SFM)[8]. As it turns out, the accuracy is preferable since a denser point cloud can be obtained through other techniques called Multi-View Stereo (MVS)[8]. These techniques assume known cameras and use different methods to get a much denser point cloud. The next common step is to do a meshing, which is a way to go from a point cloud to a mesh of polygons. Meshing methods usually do a bad job at retaining color information and often completely remove it. Therefore a texture is often projected onto the mesh afterwards. In summary, a typical 3D reconstruction pipeline contains the steps

• feature descriptor extraction and matching,
• Structure from Motion,
• Multi-View Stereo,
• meshing and
• texturing.
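As an illustration of the triangulation step mentioned above, here is a minimal sketch of linear (DLT) triangulation of one 3D point from a matched 2D point pair and two known 3 × 4 camera matrices; this is a textbook method and not necessarily the one used in the Spotscale pipeline.

```python
import numpy as np

# Triangulate a 3D point X from image points p1, p2 and camera matrices P1, P2.
def triangulate(p1, p2, P1, P2):
    # Each image point contributes two linear constraints on the homogeneous point X.
    A = np.vstack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                      # null-space (least-squares) solution
    return X[:3] / X[3]             # back to inhomogeneous 3D coordinates
```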


3

Method

The network used in this thesis is an FCN-8s[16] with a pre-trained VGG16[20] as encoder. The network is implemented in Tensorflow[1].

3.1

Fully Convolutional Network

The CNN used in this thesis is the Fully Convolutional Network by Long et al. This is a CNN architecture that takes a network for classification and extends it with an upsampling module, making it usable for Semantic Segmentation. There are several versions of the net with different encoders. AlexNet[14], GoogLeNet[22] and VGG versions of the FCN are compared in the paper. This thesis uses the VGG16 version of FCN because of its first-rate performance and because a pre-trained model is easily available.

3.1.1

Architecture

The FCN uses a technique called skip connections[16]. As the information travels deeper through the network it gets increasingly coarse and loses the finer details in the process. The skip connections add connections from the earlier layers in the network that are fused together with the coarser information at the end. This way the network can use the deep layer information and still retain details from the earlier layers. Where this is applied can be seen in figure 3.1 which shows the FCN-8s-VGG16 architecture.


Figure 3.1: The FCN-8s-VGG16 architecture used in the thesis.

Each convolutional layer in the figure above is actually two or three convolutional layers (see 5.1) followed by a pooling layer. The fully connected layers are applied as a 1 × 1 × D_p convolution kernel, where D_p is the depth of the preceding feature map. The scoring layers work in the same way as the fully connected layers and output a feature map with the depth of the number of classes.

3.1.2

Loss function

Cross entropy loss (introduced in section 2.3) was chosen as loss function. This was the only appropriate loss function offered by Tensorflow and even though the SVM can outperform the cross entropy loss[23], the difference in performance should be small.

3.1.3

Regularization

Dropout (2.4) was used for regularization. For most experiments, the keeping probability was set to 0.5.

Since a pre-trained net trained without batch normalization was used, this regularization was not an option. No batch normalization means no normalization between the layers in training, and the network will therefore have weights of a completely different size compared to a network trained with batch normalization. Fine-tuning with batch normalization would make the pre-training obsolete.

3.1.4

Optimization

The Adam optimizer (described in 2.5.1) was chosen. It has four hyperparameters, but the default parameters are usually recommended. However, choosing a good initial learning rate can speed up convergence. Therefore, the learning rate was one of the hyperparameters chosen for optimization.


3.1.5

Initialization

The convolutional layers are initialized with weights from a pre-trained VGG16 model. The fully connected layers are initialized with zero mean and a standard deviation of 0.02. The transpose convolution layers are initialized as bilinear upsampling but are then trainable, making them able to learn more complex non-linear upsamplings.

3.1.6

Field-of-view

The theoretical field-of-view (FOV) of FCN is 404 × 404. This can be shown by following an output pixel backwards through the network.

Figure 3.2 shows how the operations of the first convolutional layer affect the FOV.

Figure 3.2: A visualization of the field-of-views of different operations of the network, more specifically the operations in the first convolutional layer.

This 2 × 2 feature map has a FOV of 8 × 8 and the single red pixel has a FOV of 6 × 6. An n × n feature map has a (2n + 4) × (2n + 4) FOV from this layer. The first two convolutional layers of FCN have this structure. The third through fifth layers have three convolutions followed by a pooling, making their contribution to the FOV 2n + 6 each. The last layer that downsamples the spatial dimensions of the feature maps is the first fully connected layer. This layer has a spatial filter kernel size of 7 × 7. If one starts from a pixel in this layer and walks backwards through the network one gets the spatial dimensions as


Table 3.1: Field-of-view contribution as one walks backwards through the different layers

Layer          FC1      Conv5        Conv4        Conv3        Conv2        Conv1
Contribution   f_0      2*f_0 + 6    2*f_1 + 6    2*f_2 + 6    2*f_3 + 4    2*f_4 + 4
Total FOV      7 × 7    20 × 20      46 × 46      98 × 98      200 × 200    404 × 404

Thus, the FCN has a maximum FOV of 404 × 404. It is likely that the network mostly uses the pixels closer to the pixel that is to be predicted, but it is important to know that the network can never use information outside that box for prediction. In figure 3.3 the maximum input information to classify a water pixel is shown in cyan.

(a) The maximum FOV the network can predict from   (b) FOV for one pixel classification

Figure 3.3: The input information the network can use to classify a water pixel in Riddarholmen
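The walk backwards through Table 3.1 is easy to reproduce: start at the 7 × 7 kernel of the first fully connected layer and apply the per-layer contributions listed above.

```python
# Field-of-view of FCN, following Table 3.1 backwards from FC1 through Conv5 ... Conv1.
fov = 7                                # FC1
for extra in [6, 6, 6, 4, 4]:          # Conv5, Conv4, Conv3 contribute 2f+6; Conv2, Conv1 contribute 2f+4
    fov = 2 * fov + extra
print(fov, "x", fov)                   # 404 x 404
```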

3.2

Training

The training of CNNs is generally done with supervised learning, which means that the network needs examples of input images and their corresponding correctly classified label maps. These image and label map pairs are what is called the training data.

The part where data is traveling forward in the network is called the forward pass and the part where the gradients are traveling backwards through the net is called the backward pass. Doing a forward pass and outputting the predictions is called doing inference. Doing a forward pass followed by a backward pass is what constitutes a training step.


3.2.1

Training Data

The network was fine-tuned (trained) with images from Spotscale. This data consisted of orthographic, oblique and ground images from outdoor areas of urban, suburban and rural places as well as images of nature. The images were manually labelled into the classes sky, water and other. One of the limitations of this thesis is to see if decent results can be attained with a limited amount of data. Therefore only about 500 images were labelled.

The image size of the original Spotscale images was over 20 Mpixels, which is far too big even for SGD (2.5.1) for a large network like FCN. Therefore the images had to be downsampled to a smaller size. Larger images have more detailed features which should lead to a more coarse output, but smaller images have a higher FOV to image size ratio which certainly could be beneficial. Furthermore, if mini-batches were to be used, the images have to be significantly smaller than what the graphics card allows. Different sizes of images were tried and images of about 1.5 Mpixels seemed to allow small batches and keep details in the images while not being too large.

(a) Ground image   (b) Orthographic image   (c) Oblique image

Figure 3.4: Examples of training images

(a) Labelled ground image   (b) Labelled orthographic image   (c) Labelled oblique image

Figure 3.5: Three examples of labelled images. The labels are superimposed on the corresponding images for visualization purposes. The label maps used for training are 2D matrices of the image's spatial dimensions with values 0, 1 and 2 for the classes other, sky and water respectively.


3.2.2

Data Augmentation

Data Augmentation[9] is a way to increase the training data. An image and a copy of itself flipped vertically are, for the network, two entirely different images and will activate completely different parts of the network in a forward pass. This fact can be used in the training of the network to increase the training data manifold. The dataset of this thesis was augmented with horizontal and vertical flips, 180 degree rotation and different resizings. This way, the number of data samples was increased by about a factor of 16. In addition to these methods, an online random cropping was applied with a certain probability when reading the image-label pairs. The crop has the same aspect ratio as the input image and a size ranging from 9% (30% on each side) to 100% of the input image size.

(a) Original image   (b) Horizontal flip

(c) Vertical flip   (d) 180 degree rotation

Figure 3.6: Data augmentation
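A minimal sketch of these augmentations, applied identically to an image and its label map so the two stay aligned; the flip/rotation set and the crop range follow the description above, but the helper itself and its probabilities are only illustrative.

```python
import numpy as np

# Randomly pick a flip/rotation and, with some probability, an online random crop.
def augment(image, labels, rng=np.random.default_rng()):
    pairs = [(image, labels),
             (image[:, ::-1], labels[:, ::-1]),         # horizontal flip
             (image[::-1, :], labels[::-1, :]),         # vertical flip
             (image[::-1, ::-1], labels[::-1, ::-1])]   # 180 degree rotation
    img, lab = pairs[rng.integers(len(pairs))]
    if rng.random() < 0.5:                              # online random crop, 30%-100% per side
        scale = rng.uniform(0.3, 1.0)
        h, w = lab.shape
        ch, cw = int(h * scale), int(w * scale)
        top, left = rng.integers(h - ch + 1), rng.integers(w - cw + 1)
        img, lab = img[top:top + ch, left:left + cw], lab[top:top + ch, left:left + cw]
    return img, lab
```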

3.2.3

Active Learning

Active learning is a way to add high quality training data. The basic idea is to do inference on a large batch of data and then label the images the network has trouble segmenting. This way, the images that will not have a big effect on the training can be ignored.


3.2.4

Class Balance

The dataset contained about 77% pixels of the class other, 7% pixels of sky and 16% pixels of water. This is a quite large imbalance. However, class imbalance does not necessarily have to be a problem. According to the FCN paper[16], their class imbalance of 75% background did not bother them. Even so, to attempt to combat this class imbalance, the dataset was augmented with arbitrary images of water not from Spotscale. This augmentation changed the class balance to 73% pixels of the class other, 7% pixels of sky and 20% pixels of water.

3.2.5

Epochs

The number of epochs the network is trained over can be optimized by overtraining the network and then, by studying the validation loss and other measures, deciding on an epoch for early stopping. The network saved a model for the lowest loss, the highest accuracy, the highest water recall and from the last epoch. Since a model is several gigabytes, a model for each epoch could not be saved. If no saved model was sufficient, the network was retrained for the desired number of epochs.

3.2.6

Validation

To track the training, the network does a hold-out validation on a separate data set after each epoch. The network calculates the combined loss, accuracy and class recalls and precisions (see 3.3.1) and stores the information. The network also saves the model with the highest accuracy and class recall for a specified class along with the model of the last epoch. This reduces the need to stop the training at the right time.

3.3

Evaluation

To properly evaluate the results, a quantitative evaluation of the semantic segmentation was done along with qualitative evaluations of the segmentation and the photogrammetry improvements.

3.3.1

Evaluation metrics

It is important to have different evaluation measurements since no measurement covers every aspect. The network outputs a prediction of which class it thinks the pixel belongs to. If the network has predicted the class correctly it is a True Positive. If it predicts it when it should not, it is a False Positive, and if it predicts a different class when the current class is the correct one, it is a False Negative. The last case is when it correctly predicts that the pixel belongs to a different class, a True Negative.


(a) Confusion matrix   (b) Precision measurement

(c) Recall measurement   (d) Accuracy measurement

Figure 3.7: The images show quantitative evaluation metrics used in this thesis. The metrics are calculated as the sum of the green boxes divided by the red boxes.

Recall

As seen in figure 3.7c, the recall measurement is the sum of all true positives divided by the sum of itself and the false negatives. An advantage of this metric is that it is not affected by potential class imbalances. However, the measurement can be misleading if only one class is studied. For instance, the network could predict everything as the class water and thus get perfect recall for water. If all class recalls are studied together, they provide a metric of how well the network is detecting each class.

Precision

Precision, seen in figure 3.7b, is the sum of true positives divided by the sum of itself and the false positives. This deals with the problem where the recall metric would be misleading. However, the precision can yield unjustly high scores if the network only predicts the class where it is absolutely sure it is correct in its prediction. As with recall, this can be counteracted if all the class precisions are studied.


Accuracy

Accuracy is a measure of the number of times the classifier is correct in its prediction, divided by the sum of itself and the number of times it is incorrect. For datasets with high class imbalance, this metric can give unjust results. For instance, if a class the system detects well is over-represented in the test set it will give a high accuracy even though it completely fails on an under-represented class. However, it can still be a useful tool to quickly see how the training is going besides the loss from the loss function. In tasks with many classes the class-specific metrics mentioned above can be hard to quickly get a grasp of.
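A minimal sketch of the three metrics in figure 3.7, computed per class from a confusion matrix; the pixel counts below are hypothetical and only chosen to roughly mimic the class distribution in this thesis.

```python
import numpy as np

# cm[i, j] counts pixels of true class i that were predicted as class j.
def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    recall = tp / cm.sum(axis=1)          # TP / (TP + FN), per class
    precision = tp / cm.sum(axis=0)       # TP / (TP + FP), per class
    accuracy = tp.sum() / cm.sum()        # correct predictions / all predictions
    return recall, precision, accuracy

cm = np.array([[770, 10, 20],             # other
               [  5, 65,  0],             # sky
               [ 30,  0, 100]])           # water (hypothetical counts)
print(per_class_metrics(cm))
```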

3.3.2

Improving Photogrammetry

There are many ways the photogrammetry can be improved by knowing the semantics of the pixels (some described in 1.1). One that is easy to illustrate, and will be shown in this thesis, is the improvement of the Multi-View-Stereo algorithm (MVS, explained in 2.8). The MVS used in this thesis uses depth maps to obtain a dense point cloud. This can in some cases cause artefacts, especially from the sky in the background. The masking can remove the depth information from these areas.

3.3.3

Qualitative evaluation

A qualitative evaluation can give information on what kinds of areas cause problems for the network. For instance, a rare type of ground could have a very unique feature representation that confuses the network. With a qualitative evaluation this can easily be identified. It can also show whether the errors the network makes are reasonable or if the network should be able to manage them.

3.3.4

Quantitative evaluation

The network was quantitatively evaluated by performing various experiments using different hyperparameter settings and augmentation techniques. The learning rate, batch size and number of epochs were assumed to be important and interesting hyperparameters to test. The importance of data augmentation was tested by using no augmentation, only upright augmentation (no horizontal flips or 180 degree rotations) and full augmentation. On top of this, different kinds of random cropping were tested. The experiment setups are detailed in Table 4.1.

3.3.5

Evaluation of Photogrammetry improvement

It is difficult to quantitatively evaluate a photogrammetry improvement, especially without a ground truth data set. Therefore, only a qualitative evaluation was made. To show a qualitative improvement of the photogrammetry, there have to be artefacts in the first place. Two datasets where the MVS had problems with the sky were found and were redone using masks, removing areas with sky (and water).


3.3.6

Experiment setups

Training the network with different parameters over a full grid search would take too much time with the limited computational power and time of this thesis work. Instead, different experiments that seemed interesting were tried. The experiments differed in learning rate, batch size, augmentation techniques used, stopping criteria and number of epochs. The experiments S1 through S7 all use stochastic gradient descent. MB1 and MB2 use a mini-batch of 4, which is as high as was possible with the VRAM available. AU1 uses only augmentation that preserves the upward direction of the images and NA1 uses no augmentation at all. The experiment setups are described more thoroughly in table 4.1.


4

Results

This chapter presents the results of this thesis.

4.1

Semantic Segmentation

The semantic segmentation was evaluated on two test sets, Riddarholmen and Norrköping. These data sets were completely unseen by the network prior to inference, making them well suited for evaluating how the network works on new data. The Riddarholmen data set contains difficult representations of water with many reflections of sky and buildings.

Figure 4.1: An example image (a) and its ground truth (b) from Riddarholmen. The ground truth is used for test-set evaluation and not for training or validation.


The Norrköping data set also contains unusual representations of water, such as reflections and stretches of fast-flowing water.

Figure 4.2: An example image (a) and its ground truth (b) from Norrköping.

First, the network was trained over many epochs to get a view of the training loss convergence, possible validation loss divergence and the performance on validation data. These metrics are plotted for two experiments in the following figures. The first experiment uses SGD (2.5.1) and the second uses a mini-batch of 4. In both experiments, the validation loss starts diverging after around 5 epochs. The class-specific metrics on other and sky seem to converge quicker than those on water.

Figure 4.3: Training loss (a) and validation loss (b) over 100 epochs. Note that the validation loss is calculated after each epoch, which means that the first data point in the graph appears after the network has already been trained on over 9000 samples. The training loss is calculated on every tenth training sample and the graph shows the mean over the whole epoch.


Figure 4.4: The network's performance on validation data over 100 epochs: accuracy (a), class recall (b) and class precision (c).

Figure 4.5: Training loss (a) and validation loss (b) over 40 epochs, trained with a mini-batch of 4.

Figure 4.6: The network's performance on validation data over 40 epochs, trained with mini-batches: accuracy (a), class recall (b) and class precision (c).

The training and validation loss show that the network is overfitting to the training data after about 3 to 5 epochs. It is important to note that, even though the loss on the validation data diverges, the divergence is not very extreme. In the examples above, the validation loss points to early stopping at around 5 epochs, while the other metrics suggest somewhat longer training. This was the pattern for most of the training experiments conducted in this thesis. In the experiments below, both stopping early before the validation loss diverges and stopping at high recall and precision are tried.
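
A minimal sketch of the kind of early-stopping rule this corresponds to (monitoring the validation loss; the patience value is an assumption, and a corresponding rule could equally well monitor recall and precision):

    def should_stop(val_losses, patience=3):
        # Stop when the validation loss has not improved for `patience` epochs in a row.
        if len(val_losses) <= patience:
            return False
        best_so_far = min(val_losses[:-patience])
        return all(loss >= best_so_far for loss in val_losses[-patience:])

With curves like the ones above, such a rule halts training at around 5 epochs, whereas waiting for high recall and precision allows a few more epochs, which corresponds to the two stopping strategies compared in the experiments below.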

4.1.1 Quantitative results

The experiments were tested on the test sets Riddarholmen and Norrköping.

Training setups

Table 4.1 shows the training setups used for the experiments. The hyperparameters tested are batch size, learning rate, keep probability for the dropout and different data augmentation techniques. The experiments are also described in section 3.3.6.

Table 4.1: Training setups

Setup | Batch size | Learning rate | Augmentation                | Keep probability | Number of training samples
S1    | 1          | 1e-6          | Full w/ 50% random crops    | 0.5              | 9314
MB1   | 4          | 1e-5          | w/o random crops            | 0.5              | 9314
S2    | 1          | 1e-5          | Full w/ 50% random crops    | 0.5              | 9314
NA1   | 1          | 1e-5          | None                        | 0.5              | 584
AU1   | 1          | 1e-5          | Upright w/ 50% random crops | 0.5              | 2879
S3    | 1          | 1e-6          | Full w/ 100% random crops   | 0.5              | 9314
S4    | 1          | 1e-4          | Full w/ 50% random crops    | 0.5              | 9314
MB2   | 4          | 1e-6          | w/o random crops            | 0.5              | 9314
S5    | 1          | 1e-5          | Full w/ 100% random crops   | 0.5              | 9314
S6    | 1          | 1e-5          | Full w/ 100% random crops   | 0.4              | 9314
S7    | 1          | 1e-5          | Full w/ 100% random crops   | 0.8              | 9314
S8    | 1          | 1e-7          | Full w/ 50% random crops    | 0.7              | 9314

The tables below show the evaluation of the different metrics on the test sets, using the experiment setups described in Table 4.1 above. Cells marked in green indicate the top performance for that specific metric on that data set.


Table 4.2: Evaluation of different setups on the data set Riddarholmen

Setup | Epochs | Accuracy | Recall other | Recall sky | Recall water | Precision other | Precision sky | Precision water
S1    | 33     | 0.984    | 0.990        | 0.964      | 0.848        | 0.985           | 0.957         | 0.902
S2    | 7      | 0.979    | 0.983        | 0.985      | 0.785        | 0.987           | 0.919         | 0.856
MB1   | 16     | 0.976    | 0.982        | 0.965      | 0.783        | 0.979           | 0.952         | 0.822
NA1   | 16     | 0.973    | 0.976        | 0.915      | 0.861        | 0.978           | 0.944         | 0.802
AU1   | 23     | 0.984    | 0.992        | 0.983      | 0.803        | 0.983           | 0.958         | 0.922
S3    | 31     | 0.984    | 0.993        | 0.965      | 0.823        | 0.985           | 0.938         | 0.950
S4    | 7      | 0.966    | 0.976        | 0.969      | 0.646        | 0.967           | 0.945         | 0.754
MB2   | 12     | 0.970    | 0.978        | 0.962      | 0.716        | 0.972           | 0.953         | 0.776
S5    | 1      | 0.980    | 0.988        | 0.975      | 0.793        | 0.986           | 0.925         | 0.890
S5    | 6      | 0.980    | 0.990        | 0.954      | 0.811        | 0.983           | 0.928         | 0.920
S6    | 2      | 0.982    | 0.987        | 0.986      | 0.812        | 0.989           | 0.925         | 0.893
S6    | 21     | 0.981    | 0.988        | 0.978      | 0.798        | 0.984           | 0.944         | 0.889
S7    | 5      | 0.983    | 0.989        | 0.968      | 0.832        | 0.984           | 0.959         | 0.890
S7    | 8      | 0.981    | 0.987        | 0.966      | 0.840        | 0.984           | 0.955         | 0.881
S8    | 43     | 0.980    | 0.992        | 0.979      | 0.737        | 0.983           | 0.912         | 0.940

Table 4.3: Evaluation of different setups on the data set Norrköping

Setup | Epochs | Accuracy | Recall other | Recall sky | Recall water | Precision other | Precision sky | Precision water
S1    | 33     | 0.986    | 0.995        | 0.994      | 0.635        | 0.981           | 0.986         | 0.899
S2    | 7      | 0.987    | 0.991        | 0.996      | 0.729        | 0.986           | 0.976         | 0.852
MB1   | 16     | 0.986    | 0.992        | 0.992      | 0.697        | 0.984           | 0.982         | 0.840
NA1   | 16     | 0.985    | 0.988        | 0.977      | 0.774        | 0.986           | 0.980         | 0.801
AU1   | 23     | 0.990    | 0.996        | 0.996      | 0.734        | 0.986           | 0.983         | 0.956
S3    | 31     | 0.986    | 0.996        | 0.993      | 0.608        | 0.980           | 0.979         | 0.927
S4    | 7      | 0.980    | 0.993        | 0.992      | 0.438        | 0.971           | 0.981         | 0.834
MB2   | 12     | 0.979    | 0.994        | 0.991      | 0.408        | 0.970           | 0.986         | 0.834
S5    | 1      | 0.985    | 0.996        | 0.993      | 0.583        | 0.979           | 0.981         | 0.932
S5    | 6      | 0.985    | 0.997        | 0.989      | 0.552        | 0.977           | 0.981         | 0.975
S6    | 2      | 0.987    | 0.996        | 0.993      | 0.651        | 0.982           | 0.978         | 0.962
S6    | 21     | 0.987    | 0.996        | 0.996      | 0.636        | 0.982           | 0.984         | 0.943
S7    | 5      | 0.988    | 0.995        | 0.988      | 0.714        | 0.985           | 0.986         | 0.904
S7    | 8      | 0.987    | 0.996        | 0.984      | 0.672        | 0.983           | 0.982         | 0.912
S8    | 43     | 0.984    | 0.996        | 0.988      | 0.545        | 0.977           | 0.977         | 0.942

4.1.2 Qualitative results

This section presents the qualitative results of the semantic segmentation. The first four images are examples where the network performs an almost perfect segmentation. The later examples show typical errors the network makes. Water with many reflections unsurprisingly seems to be difficult for the network and causes low water recall. Images of flat areas with low illumination, like figures 4.9a, 4.9b and 4.10a, sometimes cause false positives for water. Overexposed areas are sometimes classified as sky, as in figure 4.10b.


Figure 4.7: Examples of images where the network performs well: an almost perfect segmentation on an image of Riddarholmen (a), and good segmentations on images of Riddarholmen (b) and Norrköping (c, d).

Figure 4.8: Examples with low recall on water: a Riddarholmen image with reflective water and a lens flare (a), and a Riddarholmen image with low recall on water and low precision on sky (b). The water in the images reflects the sky and the skyline.


Figure 4.9: Examples where the network predicts water on other: images with low water precision from Riddarholmen (a) and Norrköping (b).

Figure 4.10: Example predictions with good overall accuracy but problems with precision: high water recall but lower water precision (a), and good overall accuracy but imperfect sky precision (b).

There are versions (training experiments) of the network that handle these error-prone areas better. Below are the setups that perform best on these areas.


Figure 4.11: Examples of training setups where the network performs better on error-prone areas: NA1 performs well on a difficult Riddarholmen water representation (a) and on another Riddarholmen image (b); S3 has the best performance on a troublesome cobblestone representation (c); S6 performs well on an otherwise error-prone image (d); S5 avoids classifying shadows as water (e); S7 manages to classify the strong sun reflection on a roof (f).


4.2 Qualitative Photogrammetry improvements

The following images show the results of adding masks to a data set that causes problems in the MVS. Most of the artefacts disappear in both examples, while keeping the quality of the central model.

Figure 4.12: Example results on a chimney: the resulting model without using masks (a) and using masks (b).

Figure 4.13: The resulting model without using masks (a) and using masks (b).


Figure 4.14: The resulting model without using masks (a) and using masks (b).


5 Discussion

This chapter discusses the results, methodology, conclusions and future work of this thesis.

5.1 Results

The results of the thesis show that the limited training data is enough to achieve good generalization for segmenting sky. Water, however, seems to be a more difficult task, which is not very surprising since it has a more diverse set of representations. The performance on water could probably be improved by adding more training data containing water.

The overall accuracy is very high, but this might be more a sign of a poorly suited evaluation metric for this task than of good performance by the network. The biggest problem seems to be achieving high recall and precision on water, especially at the same time. This is also consistent with the qualitative results, where most of the misclassifications involve water.

The experiments with reduced augmentation seem to perform well. This might be because the augmentations that flip the images upside down remove specific spatial information about the classes. For instance, sky is much more likely to be at the top of an image and water is more likely to be somewhere in the middle to bottom. An example of this is the image in figure 4.11a, where the network trained without augmentation is the only network that manages to classify the big patch of water correctly.

For masking areas for photogrammetry, the most important metrics to optimize are probably the precision for sky and water, since removing wanted information is worse than not removing unwanted information. However, if the overlap between images is large, removing foreground information is less of a problem since it will exist in other images. The metric that seems most difficult to achieve
