
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2021

Semantic Segmentation of Building Materials in Real World Images Using 3D Information

Marcus Bejgrowicz and Jonas Rydgård
LiTH-ISY-EX–21/5405–SE

Supervisors: Pavlo Melnyk, ISY, Linköpings universitet
Mikael Hägerström and Björn Kernell, Spotscale

Examiner: Per-Erik Forssén, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden


Sammanfattning

The increased popularity of drones has made it convenient to capture a large number of images of a property and then create a 3D model. The condition of a building can easily be analyzed and renovations planned. It is therefore of interest to automatically identify building materials, a task well suited for machine learning.

With access to drone images of buildings as well as depth maps and normal maps, we have created a dataset for semantic segmentation. Two different convolutional neural networks have been trained and evaluated to see how well they perform material recognition. DeepLabv3+, which uses RGB data, has been compared with Depth-Aware CNN, which uses RGB-D data, and our experiments show that DeepLabv3+ achieves a higher mean intersection over union.

To investigate whether the results can be improved using the data in the depth maps and normal maps, we have encoded the information into what we call HMN: horizontal disparity, magnitude of the normal parallel to the ground, and the normal component in the gravity direction. This three-channel input can be used to train an additional CNN jointly with one trained on RGB images, and the two predictions are then summed. Our experiments show that this leads to better segmentations for both DeepLabv3+ and Depth-Aware CNN.


Abstract

The increasing popularity of drones has made it convenient to capture a large number of images of a property, which can then be used to build a 3D model. The condition of a building can be analyzed to plan renovations. This creates an interest in automatically identifying building materials, a task well suited for machine learning.

With access to drone imagery of buildings as well as depth maps and normal maps, we created a dataset for semantic segmentation. Two different convolutional neural networks were trained and evaluated to see how well they perform material segmentation. DeepLabv3+, which uses RGB data, was compared to Depth-Aware CNN, which uses RGB-D data. Our experiments showed that DeepLabv3+ achieved a higher mean intersection over union.

To investigate if the information in the depth maps and normal maps could give a performance boost, we conducted experiments with an encoding we call HMN: horizontal disparity, magnitude of the normal parallel to the ground, and the normal component parallel with gravity. This three-channel encoding was used to jointly train two CNNs, one with RGB and one with HMN, and then sum their predictions. This led to improved results for both DeepLabv3+ and Depth-Aware CNN.


Acknowledgments

We would like to thank our supervisors Mikael Hägerström and Björn Kernell at Spotscale for their consistent support and encouragement throughout this thesis. They have allowed us to be creative and pursue our ideas while guiding us with their knowledge.

We would also like to express our thanks to Pavlo Melnyk for his deep interest in our thesis. He enthusiastically helped us with both technical inquiries and with writing the report.

Lastly, we would like to thank our examiner Per-Erik Forssén for his help and insights.

Thank you all for your great support.

Linköping, May 2021 Marcus Bejgrowicz and Jonas Rydgård


Contents

Notation

1 Introduction
  1.1 Problem Background
  1.2 Goal
  1.3 Problem Statement
  1.4 Author Contributions

2 Theory
  2.1 Object Recognition Tasks
  2.2 Convolutional Neural Networks
    2.2.1 A Convolutional Neural Network Layer
    2.2.2 Layers in an Image Classification Architecture
    2.2.3 Depth-Aware Operations
    2.2.4 Semantic Segmentation
  2.3 Training a Convolutional Neural Network
    2.3.1 Loss Function and Back-Propagation
    2.3.2 Practical Aspects
    2.3.3 Hyperparameter Optimization Using Search Methods
  2.4 Evaluation
  2.5 Depth and Surface Normals

3 Related Work
  3.1 Image Classification
  3.2 Semantic Segmentation
    3.2.1 DeepLab
    3.2.2 RGB-D Segmentation
  3.3 Material Identification

4 Methodology
  4.1 Dataset
    4.1.1 Annotating Data
    4.1.2 Depth Maps and Normal Maps
    4.1.3 Train, Validation and Test Split
  4.2 Implementation Details
    4.2.1 Using High Resolution Images
    4.2.2 DeepLabv3+
    4.2.3 Depth-Aware CNN
  4.3 Hyperparameter Optimization
    4.3.1 Selecting the Best Model from a Training Run
  4.4 Augmentations
    4.4.1 Geometric and Color Space Augmentations
    4.4.2 Class Weights
    4.4.3 Classification Images
  4.5 Encoding of 3D information

5 Experiments
  5.1 DeepLabv3+
    5.1.1 Augmentations
  5.2 Depth-Aware CNN
    5.2.1 Augmentations
    5.2.2 Impact of Depth Maps
  5.3 Extending the Dataset
  5.4 Late Fusion with Encodings of 3D Information
  5.5 Quantitative Comparison
  5.6 Qualitative Comparison
  5.7 Transferring Predictions to a 3D Model

6 Discussion
  6.1 Dataset
    6.1.1 Size and Split
    6.1.2 Distant Buildings
    6.1.3 Inpainted Depth Maps
    6.1.4 Normal Maps, HMA and HMN
  6.2 Hyperparameter Optimization
  6.3 Material Predictions
    6.3.1 Evaluation Metrics
    6.3.2 Differences Between Depth-Aware CNN and DeepLabv3+
    6.3.3 Frequently Mixed-up Materials
    6.3.4 Improvements Using 3D Encodings
  6.4 Limitations with 3D Encodings

7 Conclusions
  7.1 Future work

A Color Map


Notation

Abbreviations

Abbreviation   Meaning
Acc            Accuracy
Acc Class      Class accuracy
ASPP           Atrous spatial pyramid pooling
CNN            Convolutional neural network
FWIoU          Frequency weighted intersection over union
HHA            Horizontal disparity, height above ground, and angle with gravity
HMA            Horizontal disparity, magnitude of normal parallel to the ground, and angle with ground
HMN            Horizontal disparity, magnitude of normal parallel to the ground, normal component parallel with gravity
IoU            Intersection over union
MIoU           Mean intersection over union
RGB            A color image composed of different intensities of the colors red, green and blue
RGB-D          An image with both color and depth information


1 Introduction

Semantic segmentation is the process of assigning a class label to every pixel in an image. This is useful in several different applications such as medicine, surveillance, and autonomous driving. This thesis investigates semantic segmentation for finding different materials in images of buildings. The state-of-the-art way to accomplish semantic segmentation is by using convolutional neural networks. CNNs have gained popularity ever since AlexNet [19] won the ImageNet challenge in 2012. Today there exist many different CNNs, and most of them take three-channel RGB images as input. It is not as common to incorporate geometric data in the network to improve its predictions, but geometric data shows great potential for differentiating between materials with the help of their texture. In addition, walls and roofs can be discriminated by the orientation of the surfaces relative to the ground and sky.

1.1 Problem Background

The company Spotscale (spotscale.com) produces high-resolution 3D models of real-world buildings from images captured by drones, making it possible to inspect, analyze and keep notes on the models easily. Spotscale already labels facades, roofs and windows on buildings with machine learning. The predictions are made on multiple images, combined and transferred to their 3D models. This pipeline is well suited for predicting building materials.

Identifying what materials the walls and roofs consist of would make it easier to get an overview of how much of certain materials would be needed for a renovation, by calculating how large the areas are. It makes measuring easy and safe, and it can be scaled quickly to quantify entire neighborhoods. Another application is aiding temperature measurements using thermal cameras. These cameras measure infrared radiation, which can be converted to temperature if the emissivity of the captured material is known [2]. This can be used for identifying energy leaks, saving both heating costs and the environment.

1.2 Goal

The goal of this thesis is to identify materials in images with convolutional neural networks. To do this, we will gather images of common building materials from Spotscale’s database and annotate them so they can be used for semantic segmentation. Depth maps and surface normal maps for the images will be extracted from Spotscale’s 3D models. Focusing on the DeepLab architecture, we will train and evaluate two different networks. The most recent release, DeepLabv3+, which uses RGB data, will be compared to Depth-Aware CNN, which is based on DeepLabv1 and uses RGB-D data. This way we can assess whether it is more important to utilize the latest additions to the RGB architectures or to use depth data. A common way of exploiting 3D information with architectures that are not necessarily designed for RGB-D input is to encode it into images and feed it through the networks in the same way as RGB data. We will investigate suitable encodings for our data and run experiments to see if they improve material identification.

1.3 Problem Statement

The questions to be answered in this thesis are:

• What mean intersection over union does a state-of-the-art RGB semantic segmentation network achieve for building materials on a dataset consisting of buildings?

• Does a state-of-the-art semantic segmentation architecture using RGB-D data outperform an RGB architecture for building materials on a dataset consisting of buildings?

• Can the semantic segmentations be improved with three-channel encodings of depth maps and surface normal maps?

1.4 Author Contributions

In this thesis, the two authors have had different areas of responsibility. Jonas was responsible for training DeepLabv3+, evaluating the models with the chosen metrics, and creating confusion matrices. Marcus was responsible for training Depth-Aware CNN and creating the 3D encodings. Table 1.1 gives a detailed description of which author has contributed to each section of the report.


Table 1.1: The main contributions of each author.

Area                                   Sections           Author
Introduction                           1                  Both
Theory
  Object Recognition Tasks             2.1                Both
  A CNN Layer and Architecture         2.2.1–2.2.2        Marcus Bejgrowicz
  Depth-Aware Operations               2.2.3              Marcus Bejgrowicz
  Semantic Segmentation                2.2.4              Jonas Rydgård
  Training a CNN                       2.3                Marcus Bejgrowicz
  Evaluation Metrics                   2.4                Jonas Rydgård
  Depth and Surface Normals            2.5                Marcus Bejgrowicz
Related Work
  Image Classification                 3.1                Both
  Segmentation with DeepLab            3.2.1              Jonas Rydgård
  RGB-D Segmentation                   3.2.2              Marcus Bejgrowicz
  Material Identification              3.3                Jonas Rydgård
Methodology & Experiments
  Annotating Data & Split              4.1, 5.3           Both
  Training Methodology                 4.2.1, 4.3, 4.4    Both
  Training DeepLabv3+                  4.2.2, 5.1, 5.4    Jonas Rydgård
  Training Depth-Aware CNN             4.2.3, 5.2, 5.4    Marcus Bejgrowicz
  Preprocessing Depth & Normal Maps    4.1.2              Marcus Bejgrowicz
  Creating HMA and HMN Encodings       4.5, 5.4           Marcus Bejgrowicz
  Results from DeepLabv3+              5.5, 5.6           Jonas Rydgård
  Results from Depth-Aware CNN         5.5, 5.6           Marcus Bejgrowicz
Discussion
  Dataset                              6.1                Marcus Bejgrowicz
  Hyperparameter Optimization          6.2                Jonas Rydgård
  Material Predictions                 6.3                Jonas Rydgård
  3D Encodings                         6.3.4, 6.4         Marcus Bejgrowicz


2 Theory

Deep learning [13] is a well-suited approach to identifying materials in images. This chapter gives an introduction to convolutional neural networks and how to use them to identify materials.

2.1 Object Recognition Tasks

Figure 2.1 illustrates how the tasks of recognizing objects in images progress from coarse to fine: classification, detection, semantic segmentation, and instance segmentation. The goal of image classification is to assign the correct label to a whole image, based on the object that is in the image. Figure 2.1a shows an example where the assigned label is zebra. Object detection gives additional information about the spatial location of the objects, for example in the form of bounding boxes. In semantic segmentation, each pixel is assigned a label. The pixels in the purple area in Figure 2.1c are of the class zebra; however, it is not possible to tell the two zebras apart. Instance segmentation also differentiates multiple instances of the same class, hence the zebras in Figure 2.1d are marked with separate colors. Convolutional neural networks that are used for image classification can be used as building blocks for segmentation architectures [11].

2.2 Convolutional Neural Networks

Convolutional neural networks are a special kind of neural network that have been successful in tasks such as image classification and segmentation [11].


Figure 2.1: Object recognition from coarse-grained to fine-grained: (a) classification, (b) object detection, (c) semantic segmentation, (d) instance segmentation. Image from pixabay.com, used for illustration purposes only.

2.2.1 A Convolutional Neural Network Layer

Goodfellow et al. [13] describe how a convolutional neural network is usually created by stacking a number of layers, each containing the operations convolution, non-linear activation function and pooling.

Following Wang and Neumann [29] and standard machine learning notation, 2D convolution is defined as

$$y(p_0) = \sum_{p_n \in \mathcal{R}(p_0)} w(p_n) \cdot x(p_0 + p_n)\,, \qquad (2.1)$$

where $\mathcal{R}$ is the local grid around the pixel location $p_0$ in $x$, and $w$ is the convolution kernel. Most popular CNNs use the ReLU (rectified linear unit) activation function

$$y(p_0) = \max(0, x(p_0))\,. \qquad (2.2)$$

Two frequently used pooling functions are average pooling

$$y(p_0) = \frac{1}{|\mathcal{R}(p_0)|} \sum_{p_n \in \mathcal{R}(p_0)} x(p_0 + p_n)\,, \qquad (2.3)$$

and max pooling

$$y(p_0) = \max_{p_n \in \mathcal{R}(p_0)} x(p_0 + p_n)\,. \qquad (2.4)$$

Figure 2.2: Convolution with zero padding, ReLU and max pooling applied to a 4 × 4 × 1 input image.

Figure 2.2 shows how convolution, ReLU and max pooling are applied to an input matrix. The convolution uses a local grid R of size 3×3×1 and zero padding so the dimensions before and after the convolution are the same. The ReLU step sets all negative values to zero, and the max pooling step reduces the size of the matrix by finding the maximum value in each 2 × 2 region.
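To make the three operations concrete, the following sketch applies a zero-padded 3 × 3 convolution, ReLU and 2 × 2 max pooling to a small single-channel input with PyTorch, mirroring the steps in Figure 2.2. The input and kernel values are arbitrary illustration values, not taken from the figure.

```python
import torch
import torch.nn.functional as F

# A 4x4 single-channel input, shaped (batch, channels, height, width).
x = torch.tensor([[[[ 1.,  4., -5., -5.],
                    [-5., -4., -4., -4.],
                    [ 2.,  2.,  2., -3.],
                    [-3.,  3., -1.,  0.]]]])

# A 3x3 kernel, shaped (out_channels, in_channels, kH, kW).
w = torch.tensor([[[[ 0.,  0.,  0.],
                    [-1., -2., -1.],
                    [ 1.,  2.,  1.]]]])

y = F.conv2d(x, w, padding=1)        # zero padding keeps the 4x4 spatial size
y = F.relu(y)                        # negative responses are set to zero
y = F.max_pool2d(y, kernel_size=2)   # each 2x2 region is reduced to its maximum
print(y.shape)                       # torch.Size([1, 1, 2, 2])
```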

2.2.2 Layers in an Image Classification Architecture

VGG-11 [28] is an example of a convolutional neural network used for image classification. As seen in Figure 2.3, it has eight convolutional layers of the kind described in subsection 2.2.1, but with some layers omitting the pooling operation. It uses RGB images of size 224 × 224 px as input and has an output vector of size 1000, one entry per class. The output shows how likely it is that the input image belongs to each of the 1000 classes in the ImageNet dataset [25]. The number of parameters (weights in convolution kernels and fully connected layers) is 133 million.
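For reference, a pretrained VGG-11 can be inspected directly with torchvision (assuming version 0.13 or later); the snippet below is only a sanity check of the input size, output size and parameter count quoted above.

```python
import torch
from torchvision import models

# Load VGG-11 with weights pretrained on the 1000-class ImageNet dataset.
vgg11 = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1)

n_params = sum(p.numel() for p in vgg11.parameters())
print(f"{n_params / 1e6:.0f} million parameters")   # roughly 133 million

scores = vgg11(torch.randn(1, 3, 224, 224))         # one 224x224 RGB image
print(scores.shape)                                  # torch.Size([1, 1000])
```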

Figure 2.3: VGG-11, a CNN used for image classification.

2.2.3 Depth-Aware Operations

Wang and Neumann [29] introduce depth-aware operations as extensions of the convolution and average pooling operations normally found in convolutional neural networks. The depth-aware operations are based on the intuition that pixels with similar depth are more likely to belong to the same object. They make use of the depth similarity function

$$F_D(p_i, p_j) = \exp\!\left(-\alpha\,|D(p_i) - D(p_j)|\right), \qquad (2.5)$$

where $D(p_i)$ is the depth value of the pixel $p_i$ and $\alpha$ is a constant. $F_D$ decreases as the depth difference between the pixels $p_i$ and $p_j$ increases. Figure 2.4 shows the depth similarity between points in a 3 × 3 grid. There is a high depth similarity between points on the roof and a low depth similarity between points on the roof and on the grass.

Depth-aware convolution is defined as

$$y(p_0) = \sum_{p_n \in \mathcal{R}(p_0)} w(p_n) \cdot F_D(p_0, p_0 + p_n) \cdot x(p_0 + p_n)\,, \qquad (2.6)$$

and depth-aware average pooling is defined as

$$y(p_0) = \frac{1}{\sum_{p_n \in \mathcal{R}(p_0)} F_D(p_0, p_0 + p_n)} \sum_{p_n \in \mathcal{R}(p_0)} F_D(p_0, p_0 + p_n) \cdot x(p_0 + p_n)\,. \qquad (2.7)$$
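As an illustration of how the depth similarity in (2.5) weights the convolution in (2.6), the sketch below implements the operation as a plain NumPy loop for a single-channel image. It is written for clarity rather than efficiency and is not the CUDA implementation used by Depth-Aware CNN; the zero padding of the depth map at the borders is a simplifying assumption.

```python
import numpy as np

def depth_similarity(d_center, d_neighbors, alpha=1.0):
    """Equation (2.5): similarity decays with the absolute depth difference."""
    return np.exp(-alpha * np.abs(d_center - d_neighbors))

def depth_aware_conv2d(x, depth, w, alpha=1.0):
    """Equation (2.6) for a single-channel image x and an odd-sized kernel w."""
    k = w.shape[0] // 2
    x_pad = np.pad(x, k)
    d_pad = np.pad(depth, k)
    y = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = x_pad[i:i + w.shape[0], j:j + w.shape[1]]
            d_patch = d_pad[i:i + w.shape[0], j:j + w.shape[1]]
            f_d = depth_similarity(depth[i, j], d_patch, alpha)  # F_D(p0, p0 + pn)
            y[i, j] = np.sum(w * f_d * patch)
    return y
```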

2.2.4 Semantic Segmentation

The DeepLab architectures [3–6] introduced atrous convolution and atrous spatial pyramid pooling. Another upgrade added to the DeepLab network is an encoder-decoder structure.


Figure 2.4: Depth similarity $F_D$ between the blue point and the pink points, indicated by the dot size: (a) image, (b) depth map. The point to the right of the center point has depth similarity 0.9 and the point to the left of the center point has depth similarity 0.003.

Figure 2.5: Illustration of atrous convolution with three different rates. Rate = 1 corresponds to standard convolution.

Atrous Convolution

The atrous algorithm used in DeepLab makes it possible to arbitrarily control the field of view [3]. This is possible as atrous convolutions can increase the field of view while keeping the same resolution. The convolution kernels cover larger areas by skipping a number of pixels between the ones used in the convolution. The rate tells how many pixels are skipped. Figure 2.5 shows a 3 × 3 kernel with rates 1 (same as regular convolution), 6 and 24.

Atrous Spatial Pyramid Pooling

The idea of atrous spatial pyramid pooling is to find objects and image context of different sizes. ASPP does this by using multiple parallel atrous convolutions with different sampling rates and effective fields of view. The extracted features are then fused. An illustration of ASPP is seen in Figure 2.6.


Figure 2.6: Illustration of Atrous Spatial Pyramid Pooling. To classify the center pixel, ASPP uses multiple atrous convolutions (rates 6, 12, 18 and 24) applied to the input feature map. The effective fields of view are shown in different colors.

Encoder-decoder

The encoder-decoder structure improves segmentation results in general and gives sharper boundaries along the segmented objects. The encoder involves pooling and downsampling. The decoder has layers that gradually recover spatial information to obtain a dense classification.

2.3 Training a Convolutional Neural Network

As seen in subsection 2.2.2, a convolutional neural network can have over 100 million parameters. Using images with known desired output, it is possible to fit the model parameters to the data. This is known as training the network.

2.3.1 Loss Function and Back-Propagation

In image segmentation, the network predicts, for each pixel, how likely it is that the pixel belongs to each of the $N_{cls}$ classes in the dataset. One type of classifier is the softmax classifier [13], which applies the softmax function

$$f_j(z) = \frac{e^{z_j}}{\sum_{k=1}^{N_{cls}} e^{z_k}}\,, \qquad (2.8)$$

to the outputs $z$ from the network. The score $f_j$ can be interpreted as the probability that a pixel belongs to class $j$, since the scores for each pixel sum to one and each value lies between zero and one. To mathematically describe how well the predictions correspond with the ground truth, a loss function is used. The softmax classifier can be used together with the cross-entropy loss [23], defined as

$$L_i = -\log\!\left(\frac{e^{z_{y_i}}}{\sum_{j=1}^{N_{cls}} e^{z_j}}\right) = -z_{y_i} + \log\!\left(\sum_{j=1}^{N_{cls}} e^{z_j}\right), \qquad (2.9)$$

where $y_i$ denotes the correct label for pixel $i$. The goal when training a neural network is to update the parameters so that the outputs $z$ yield as low a loss as possible.

Minimizing the loss is an iterative process that alternates two steps: first computing the gradient using back-propagation, and then updating the weights of the network using an optimization algorithm [13]. Back-propagation is an algorithm that can be used for calculating the gradient of the loss with respect to the weights of the network for a pair of inputs and outputs. The gradients are calculated using a recursive application of the chain rule, starting at the loss function in (2.9) and differentiating it with respect to each $z_j$, then continuing to the layer before and eventually propagating all the way back to the inputs. There are many algorithms for updating the weights of the network based on the values of the gradient. Two common ones are stochastic gradient descent [24] and Adam [18]. The size of the learning rate determines how much the weights are updated in each step.

2.3.2 Practical Aspects

During training, the images in the dataset are divided into random selections of batches [13]. The process of calculating the loss for a batch and taking a step with the optimization algorithm is referred to as an iteration. An epoch denotes iterating over the whole dataset. For example, when training on a dataset with 200 images, while using a batch size of 50, it takes four iterations to complete one epoch.

There are machine learning frameworks such as PyTorch [23] and Tensorflow [1] that have implementations of various loss functions, derivatives of operations common in CNNs, and different optimization algorithms. This way the process of training a CNN is centered around choosing an appropriate set of hyperparameters: determining a suitable loss function and optimization algorithm, finding the right learning rate, and applying data augmentation in order to achieve as high performance as possible. The frameworks also support GPU acceleration to achieve faster training than what is possible on a CPU.
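To illustrate how these pieces fit together, the sketch below shows one epoch of training in PyTorch: a forward pass, the softmax cross-entropy loss from (2.9), back-propagation and an optimizer update. The model, data loader and the ignore_index value for the ignored class are placeholders, not the actual training code used in the thesis.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    criterion = nn.CrossEntropyLoss(ignore_index=255)  # skip ignored pixels, e.g. distant building
    model.train()
    for images, labels in loader:            # one batch per iteration
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)               # per-pixel class scores z, shape (B, N_cls, H, W)
        loss = criterion(logits, labels)     # softmax + cross-entropy, eq. (2.9)
        loss.backward()                      # back-propagation
        optimizer.step()                     # weight update (Adam or SGD)

# optimizer = torch.optim.Adam(model.parameters(), lr=7e-5)  # example learning rate
```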

2.3.3 Hyperparameter Optimization Using Search Methods

Goodfellow et al. [13] suggest two automatic search methods, grid search and random search, which can be used to search for hyperparameters with low loss and good generalization. Grid search usually means picking hyperparameter values uniformly (e.g. batch sizes {2, 4, 6}) or log-uniformly (e.g. learning rates {10⁻¹, 10⁻², 10⁻³}) within a range. In random search, the hyperparameters are sampled from a probability distribution. In both cases, the search for good hyperparameters is conducted from coarse to fine: first, values in a large range are tried out, and then new experiments focus on refining the results based on the conclusions from the first run.

Figure 2.7: Illustration of a confusion matrix for three classes (brick, metal, wood), with true labels as rows and predicted labels as columns. The example entries are [[9, 1, 3], [0, 500, 0], [15, 10, 35]].
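A coarse grid search of this kind can be expressed as a loop over all hyperparameter combinations. The sketch below assumes a train_and_validate function that trains for a fixed number of epochs and returns the best validation MIoU; that function is a placeholder and not part of the thesis code.

```python
import itertools

def grid_search(train_and_validate, learning_rates, batch_sizes):
    """Try every (learning rate, batch size) pair and return the best combination."""
    results = {}
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        results[(lr, bs)] = train_and_validate(learning_rate=lr, batch_size=bs)
    best = max(results, key=results.get)
    return best, results

# Coarse round: log-uniform learning rates and a few batch sizes, e.g.
#   grid_search(train_and_validate, [1e-3, 1e-4, 1e-5, 1e-6, 1e-7], [2, 3, 4])
# A finer round is then run in a narrower range around the best values.
```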

2.4 Evaluation

Plenty of different evaluation metrics exist for semantic segmentation. As they cover different aspects, it is common to report several of them. We have chosen the metrics accuracy, class accuracy, mean intersection over union and frequency weighted intersection over union for comparing predictions with ground truth.

To calculate the metrics in Equation 2.10 through Equation 2.13 [11], a confusion matrix is usually used. Confusion matrices are also a tool for visualizing some metrics; a simple illustration with three classes is shown in Figure 2.7. The diagonal elements $p_{ii}$ represent the number of true positive pixels for the corresponding class, while $p_{ij}$ and $p_{ji}$ can be interpreted as false positive and false negative pixels when $i \neq j$. The number of classes is denoted $N_{cls}$.

Accuracy

Accuracy shows how many pixels are correctly predicted. It is calculated as

$$\text{Accuracy} = \frac{\sum_{i=1}^{N_{cls}} p_{ii}}{\sum_{i=1}^{N_{cls}} \sum_{j=1}^{N_{cls}} p_{ij}}\,. \qquad (2.10)$$

It can be misleading if the classes of the dataset vary in size, because frequent classes will contribute more than infrequent ones. The confusion matrix in Figure 2.7 has an accuracy of 95%, mainly because all 500 pixels of metal are correctly labeled.

Figure 2.8: Intersection over union: the area of overlap between the prediction and the ground truth divided by the area of their union.

Class Accuracy

Class accuracy is an extension of accuracy. It shows the average accuracy over all classes and is given by

$$\text{Class Accuracy} = \frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} \frac{p_{ii}}{\sum_{j=1}^{N_{cls}} p_{ij}}\,, \qquad (2.11)$$

which leads to a class accuracy of 76% for the example above. This metric gives a quick overview of the accuracy of all classes. It can misrepresent the situation if a small class gets a low accuracy.

Intersection over Union

Intersection over Union is the ratio of the intersection of the ground truth mask and the prediction mask for a particular class over their union. A visualization can be seen in Figure 2.8. The number of pixels that overlap between the ground truth and the prediction is divided by the number of pixels in the union of the ground truth and the prediction.

Mean Intersection over Union

Mean Intersection over Union takes the average IoU of all classes according to

$$\text{MIoU} = \frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} \frac{p_{ii}}{\sum_{j=1}^{N_{cls}} p_{ij} + \sum_{j=1}^{N_{cls}} p_{ji} - p_{ii}}\,. \qquad (2.12)$$

Frequency Weighted Intersection over Union

Frequency Weighted Intersection over Union is similar to MIoU, but classes that are more frequent contribute more. This metric is calculated as

$$\text{FWIoU} = \frac{1}{\sum_{i=1}^{N_{cls}} \sum_{j=1}^{N_{cls}} p_{ij}} \sum_{i=1}^{N_{cls}} \frac{\left(\sum_{j=1}^{N_{cls}} p_{ij}\right) p_{ii}}{\sum_{j=1}^{N_{cls}} p_{ij} + \sum_{j=1}^{N_{cls}} p_{ji} - p_{ii}}\,, \qquad (2.13)$$

giving an FWIoU of 92% for the predictions in Figure 2.7.
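All four metrics in (2.10) to (2.13) can be computed from a single confusion matrix. The sketch below does so with NumPy, using the example matrix from Figure 2.7; it is an illustrative implementation, not the evaluation code used in the thesis.

```python
import numpy as np

def segmentation_metrics(cm):
    """cm[i, j] = number of pixels with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)        # p_ii
    row = cm.sum(axis=1).astype(float)    # sum_j p_ij, pixels of true class i
    col = cm.sum(axis=0).astype(float)    # sum_j p_ji, pixels predicted as class i
    iou = tp / (row + col - tp)
    return {
        "Acc": tp.sum() / cm.sum(),             # eq. (2.10)
        "Acc Class": np.mean(tp / row),         # eq. (2.11)
        "MIoU": np.mean(iou),                   # eq. (2.12)
        "FWIoU": np.sum(row * iou) / cm.sum(),  # eq. (2.13)
    }

cm = np.array([[9, 1, 3], [0, 500, 0], [15, 10, 35]])  # example from Figure 2.7
print(segmentation_metrics(cm))  # roughly Acc 0.95, Acc Class 0.76, MIoU 0.62, FWIoU 0.92
```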

2.5 Depth and Surface Normals

When working with cameras in a 3D world, it is natural to choose the camera centered coordinate system so that the first two axes point to the right and down in the image plane, and the third axis points in the viewing direction of the camera, see Figure 2.9. This way, for any 3D point, depth is given by the third component of the camera centered coordinate system. In the same way that a picture describes the color of every point in the 3D world that it captures, a depth map describes the depth to that point. A surface normal map describes the orientation of the surface that the point lies on. The normals can be written in the camera coordinate system, but since a camera can move and rotate while taking images of a building, it is of interest to work with a world coordinate system. The world coordinate system is chosen so that the z-axis points towards the sky. Then the surface normal of a wall will only have x and y components, since the normal is orthogonal to the z-axis. The camera coordinate system and the world coordinate system are related via a rotation and a translation, such that

$$x_{cam} = R\,x + t\,, \qquad (2.14)$$

holds between the point $x$ in the world coordinate system and the same point $x_{cam}$ in the camera coordinate system. $R$ is a 3 × 3 rotation matrix and $t$ is a translation vector [15].
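Because only the orientation matters for surface normals, they can be moved between the two coordinate systems using the rotation part of (2.14) alone. The sketch below rotates a per-pixel normal map from camera to world coordinates; the (H, W, 3) array layout is an assumption made for illustration.

```python
import numpy as np

def normals_camera_to_world(normals_cam, R):
    """Rotate unit normals from camera coordinates to world coordinates.

    normals_cam: (H, W, 3) unit normals in the camera coordinate system.
    R: 3x3 rotation matrix with x_cam = R x_world (eq. 2.14), so normals are
       brought back to world coordinates with the inverse rotation R^T.
    """
    h, w, _ = normals_cam.shape
    n_world = normals_cam.reshape(-1, 3) @ R   # row-vector form of R^T @ n
    return n_world.reshape(h, w, 3)

# In the world frame, a wall normal has a z-component near zero,
# while the normal of a flat roof has a z-component near one.
```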


Figure 2.9: Depth relates to the camera coordinate system. Surface normals can be written in both the camera coordinate system and the world coordinate system.


3 Related Work

Depth-Aware CNN and DeepLabv3+, the CNNs used for semantic segmentation of building materials in this thesis, are built upon different versions of DeepLab. To understand the difference between the two CNNs, we present work related to DeepLab in this chapter. We also mention classification networks and networks pretrained on ImageNet, since ImageNet pretraining is used by all versions of DeepLab. As the goal of the thesis is to segment different materials and attempt to improve the results with the help of 3D information, we also present related work on material identification and depth information.

3.1 Image Classification

The ImageNet Large Scale Visual Recognition Challenge [25] had an annual image classification task on a dataset containing 1000 categories and more than one million images. When training CNNs on small datasets, it is common to initialize the networks with weights pretrained on ImageNet. First-layer features tend to be general and applicable across many datasets and tasks, and initialization with transferred features can boost generalization after finetuning on a target dataset [32]. Machine learning frameworks such as PyTorch [23] and Tensorflow [1] provide easy access to the optimized parameters for popular classification architectures.

Simonyan and Zisserman [28], members of the Visual Geometry Group at the University of Oxford, presented the architecture VGG16 in 2014. They were the first to use 3 × 3 convolutional filters throughout the network. The small filters made it possible to push the depth of the network to 16 weight layers.

ResNet [16] is a residual learning framework, i.e., a CNN with residual blocks and bottleneck residual blocks. They make it possible to increase the depth of the networks, since they are easier to optimize and can gain accuracy from increased depth. The deepest ResNet is 152 layers deep, eight times deeper than the VGG nets.

3.2 Semantic Segmentation

The task of semantic segmentation involves inferring a class label for each pixel in an image.

3.2.1 DeepLab

DeepLab is a semantic segmentation architecture by Chen et al. that was first released in 2014 [3]. Since then, it has seen progress with three newer versions: DeepLabv2 [4], DeepLabv3 [5] and DeepLabv3+ [6].

The first version of DeepLab [3] is based on repurposing and finetuning the ImageNet-pretrained classification network VGG16. The VGG16 network is improved with atrous convolutions and a fully connected conditional random field to achieve more accurate segmentation.

In DeepLabv2, Chen et al. [4] introduce atrous spatial pyramid pooling, as described in subsection 2.2.4, and in DeepLabv3, Chen et al. [5] refine the ASPP by using four atrous convolutions. They also explore different modes for modifying the field of view with the ASPP.

The encoder-decoder structure in Figure 3.1 is added in DeepLabv3+ [6]; see subsection 2.2.4 for a theoretical background. At the beginning of the encoder, low-level features are extracted to use later during decoding. The encoder uses an improved version of Xception [7] as its backbone model.

3.2.2 RGB-D Segmentation

Range sensors such as the Microsoft Kinect and lidar [29], as well as stereo cameras [8], have made depth images more accessible in the past years. This has increased the interest in semantic segmentation with RGB-D images.

Silberman et al. [27] created the NYU Depth Dataset V2, consisting of 1449 RGB-D images capturing 464 indoor scenes. The images were taken with a Microsoft Kinect, which uses structured light for depth sensing. The depth maps have areas with missing information, but this is resolved with inpainting using the colorization scheme of Levin et al. [20]. The NYUD2 dataset has been used for benchmarking by Gupta et al. [14], Long et al. [21], and Wang and Neumann [29].

There are many suggested methods to fully leverage the depth information, both regarding depth encoding and efficiently incorporating depth into existing RGB architectures. Gupta et al. [14] proposed to encode the depth map into a three-channel image called HHA (horizontal disparity, height above ground, and angle with gravity) to allow a CNN to learn more effectively. Horizontal disparity is proportional to the inverse of the depth. The gravity direction is estimated by finding the direction that is most aligned to, or most orthogonal to, locally estimated surface normal directions. Once the gravity direction is known, the height above ground is calculated as the height above the lowest point in the image. Experiments by Gupta et al. [14] show that RGB and HHA images have a similar structure and that it is reasonable to use an ImageNet-trained CNN as initialization for training on HHA images.

Figure 3.1: Encoder-decoder structure of DeepLabv3+. The encoder module encodes with the help of atrous convolutions and ASPP, while the decoder module refines segmentation results along object boundaries. Figure by Chen et al. [6], reproduced with permission.

Long et al. [21] compared two ways of training networks with depth information. First, they experimented with using all four channels of RGB-D as input to one model. Then they jointly trained two models, one with RGB and one with HHA, and summed their predictions. This late fusion of RGB and HHA gave better results on the NYU Depth Dataset V2 for their fully convolutional network.

Wang and Neumann [29] presented Depth-Aware CNN, which integrated the operations depth-aware convolution and depth-aware average pooling into the existing RGB architecture DeepLabv1. The depth-aware operations have the advantage that they incorporate depth information without adding parameters to the network. The operations work under the assumption that there is a similarity in depth between pixels belonging to the same class. Depth-Aware CNN showed a performance increase on NYUD2 with late fusion of RGB and HHA. Figure 3.2 illustrates how the predictions are summed.

Depth-Aware CNN is based on DeepLabv1 [3], but without implementing a conditional random field after the final convolutional layer. The CRF is an important part of the DeepLabv1 architecture, since it enables predictions with detailed local structure and leads to a substantial performance boost, almost 4% mean intersection over union on the Pascal dataset [10]. Chen et al. [3] also use multi-scale features and a large field of view to further increase performance on the Pascal dataset, but these methods are also not utilized in Depth-Aware CNN.


Figure 3.2: Late fusion of Depth-Aware CNN, where RGB-D and HHA-D predictions are summed.

3.3 Material Identification

Bell et al. [26] approached semantic segmentation of materials in three steps. First, they created a dataset consisting of patches from 23 different materials. Second, they trained a CNN to classify these patches. Lastly, they classified patches of an image together with a dense conditional random field to construct an image with every pixel labeled as a material class. They also created a dataset for semantic segmentation of materials with indoor images.

The dataset named GTOS, made by Xue et al. [31], contains over 30 000 close-up ground terrain images and depth maps for image classification. They developed a texture-encoded angular network, which combines two streams: one for an RGB image and the other for a differential angular image created from multiple viewpoints.

DeGol et al. [9] investigated how 3D geometry can improve material classification. They introduced the dataset GeoMat, consisting of close-up images and geometry data for isolated walls and ground areas. The geometry data comprises segmented point clouds and normal vectors. From their results, they concluded that 3D geometry can improve mean classification accuracy. They also concluded that 3D geometry helps with categories that look the same visually but have different 3D geometry.


4 Methodology

In our work, we use semantic segmentation for identifying building materials. The dataset has been created and annotated by us together with Spotscale. Publicly available code for DeepLabv3+ and Depth-Aware CNN has been adapted to work with our data.

4.1 Dataset

Spotscale gave us access to drone imagery from different projects, captured mostly at a distance of 5–50 meters. To get a good distribution of materials in our dataset, we selected roughly ten images per building. We strived for as large a variation as possible, i.e., shots from different angles and distances.

4.1.1 Annotating Data

The different building materials had to be grouped into classes. In some cases, the same material can have many different appearances due to the building technique. For roofing, we decided on the classes curved roof tiles, seam metal roof, corrugated sheet metal roof and asphalt roof. For walls, we decided on the classes brick wall, flat facade, horizontal wooden panel, vertical wooden panel, seam metal wall and corrugated sheet metal wall. Together with unclassified, the number of classes in the dataset was 11. Example patches of the materials can be seen in Figure 4.1. All colors used for ground truth and predictions in this report can be seen in Appendix A.

Figure 4.1: Examples of the ten classes in the dataset. Roof materials: curved roof tiles, seam metal roof, corrugated sheet metal roof and asphalt roofing. Wall materials: brick wall, flat facade, horizontal wooden panel, vertical wooden panel, seam metal wall and corrugated sheet metal wall. The images are from pixabay.com, for illustration purposes only.

In semantic segmentation, each pixel of an input image is labeled with a material category. This means that we had to create ground truth images of the same size as the images we selected. We assigned a color to each of the materials mentioned above and then painted over the images. An example can be seen in Figure 4.2. The surroundings, such as the sky and ground, were marked as unclassified. An additional class, distant building, was used in cases where buildings were far away or it was hard to determine the material because of the angle the image was taken at. The class distant building was then ignored when calculating the loss function during training; thus, it did not contribute to the input gradient. To speed up the work, we used an annotation tool which divided the image into superpixels, i.e., larger areas with a similar color. This allowed us to annotate a roof with only a few clicks. A list of edge cases was created containing situations that were difficult to annotate, such as sparse occlusion by trees and windows.

4.1.2 Depth Maps and Normal Maps

Depth maps and surface normal maps were generated from Spotscale’s 3D models of the buildings. The depth maps contained areas with missing information, which were filled in using the code provided by Silberman et al. [27] in the toolbox for the NYU Depth Dataset V2. It is a modification of the colorization method by Levin et al. [20], which was originally developed to colorize grayscale images with a few color scribbles. Figure 4.3 shows how the parts with no depth information between the red buildings have been inpainted.

The normal maps were given in camera coordinate systems, so the normal of the same 3D point would look different depending on the rotation of the camera when the image was taken. The rotation R between the world coordinate system and each camera coordinate system was known, so all normal maps were rotated to the same coordinate system. This way, the normal components of walls and roofs were consistent across different images.


Figure 4.2: Image and ground truth label with the materials vertical wooden panel (light blue), curved roof tiles (red), brick wall (pink), flat facade (orange), and distant building (dark blue).

Figure 4.3: Inpainting fills in areas with missing depth information, i.e., areas with zero depth: (a) image, (b) before inpainting, (c) after inpainting.

4.1.3 Train, Validation and Test Split

For the first small dataset, 191 images were annotated and the dataset was split into three parts: train, validation and test. First, three buildings with 25 images were picked out for the test set. These were kept unused until the very end of the project to determine how well the trained models generalized to images of previously unseen buildings. The remaining images were split randomly so that 70% of the images were in the training set and 30% in the validation set.

In a second step, more images were annotated and added to create an extended dataset. The initial test set with 25 images was expanded with 25 additional images. Histograms were used to get an even size distribution of material classes. The remaining images were added to the training and validation sets, so that the extended dataset had 244 training images and 106 validation images.
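The random 70/30 split of the non-test images can be sketched as below; the fixed seed and the sorted starting order are assumptions added so that the split is reproducible.

```python
import random

def split_train_val(image_names, val_fraction=0.3, seed=0):
    """Randomly split the remaining (non-test) images into training and validation sets."""
    names = sorted(image_names)           # deterministic starting order
    random.Random(seed).shuffle(names)    # fixed seed so the split can be recreated
    n_val = round(val_fraction * len(names))
    return names[n_val:], names[:n_val]   # (train, validation)

# Example: train_names, val_names = split_train_val(all_names_except_test)
```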

4.2 Implementation Details

In this section, we explain the technical details about how the code for training the networks was implemented.


Figure 4.4: Cropping out a region of 750 × 500 px from images with different base heights.

Hardware specifications:
GPU: Nvidia Quadro P5000, 16 GB
CPU: Intel Core i7-8086K, 4.00 GHz
RAM: 64 GB

4.2.1 Using High Resolution Images

The images in the dataset are of such high resolution that they are too large to be stored in the memory of the graphics card during training. To still get a high level of detail in buildings far away from the camera, we decided to apply scaling and cropping in two steps. First, the images were downscaled to a base height while keeping their original aspect ratio. Afterwards, a random region with aspect ratio 3:2 was cropped out. The random crop was not applied during validation, so that the validation set remained consistent. As seen in Figure 4.4, the red rectangle that would be used as input to the network can contain a small part of the object with a high level of detail or a large part of the object with a low level of detail, depending on the base height that the image is downscaled to.

When scaling images, interpolation artifacts can appear due to aliasing [12]. This could lead to unwanted behaviors such as the arches seen on the roof in Figure 4.5. The corrugated roof has a wave pattern with a higher frequency than what can be described by the sampled pixels. To combat this problem, we tried to balance scaling down the image as little as possible against keeping a representative part of the image as input. We experimented to find the combination of base height and crop size that worked best for the networks used.
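The two-step scale-and-crop can be sketched with torchvision's functional transforms as below. The nearest-neighbour interpolation for the label map and the example sizes are our illustrative choices; this is not the exact data loading code of the thesis.

```python
import random
import torchvision.transforms.functional as TF

def scale_and_crop(image, label, base_height, crop_w, crop_h):
    """Downscale to a base height (keeping the aspect ratio), then crop a random region."""
    w, h = image.size                                  # PIL images report (width, height)
    new_w = round(w * base_height / h)
    image = TF.resize(image, [base_height, new_w])     # bilinear interpolation by default
    label = TF.resize(label, [base_height, new_w],
                      interpolation=TF.InterpolationMode.NEAREST)  # keep labels discrete
    left = random.randint(0, new_w - crop_w)
    top = random.randint(0, base_height - crop_h)
    image = TF.crop(image, top, left, crop_h, crop_w)
    label = TF.crop(label, top, left, crop_h, crop_w)
    return image, label

# Example: scale_and_crop(img, gt, base_height=1547, crop_w=1153, crop_h=769)
```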

Figure 4.5: The corrugated sheet metal roof is affected by arch-shaped artifacts when the image is scaled down. The scaling is exaggerated here for illustration purposes; the artifacts are not as apparent at the resolutions used for training and evaluating the models.

4.2.2 DeepLabv3+

The implementation of DeepLabv3+ uses code from the jfzhang95 [17] GitHub repository. It uses ResNet-101 as backbone and an output stride of 16. The following changes have been made to the code:

• A new dataloader for handling Spotscale’s material images has been created.
• The training images are randomly resized and cropped as described in subsection 4.2.1.
• The Nesterov SGD optimizer [13] with learning rate decay is replaced by the Adam optimizer, with the learning rate reduced on validation loss plateaus.
• Color shifting [30] has been implemented.
• Evaluation of validation images is now done on whole images.
• Jointly training two networks has been implemented in a similar way as in the official code for Depth-Aware CNN [29].

4.2.3 Depth-Aware CNN

The implementation of Depth-Aware CNN uses the code made available by Wang and Neumann [29], which is a modified version of DeepLab with a VGG16 encoder pretrained on ImageNet. The code uses α = 1 for the depth similarity function in Equation 2.5. The following alterations have been made to the code:

• A new dataloader for handling Spotscale’s material images has been created.
• The training images are randomly resized and cropped as described in subsection 4.2.1.
• The SGD optimizer with learning rate decay is replaced by the Adam optimizer.
• Loss is calculated separately for training and validation.
• Minor bugs have been fixed, for example regarding color jitter.

4.3 Hyperparameter Optimization

The hyperparameter optimization relied on grid search over a few manually selected hyperparameters at a time. Learning rate, image resolution and batch size were considered hyperparameters that would have a large impact on the training results. The first experiments focused on finding a suitable learning rate. The input image resolution and batch size were fixed and learning rates were varied on a logarithmic scale: {10⁻³, 10⁻⁴, ..., 10⁻⁷}. This initial coarse search was only performed for a few iterations, as for most learning rates the training loss became too large. In a second round, a search was performed in a narrower range around the learning rates that previously showed the best decrease in training loss. The learning rates that showed the highest mean intersection over union on the validation set were used in the search for image resolutions and batch sizes. It was always made sure that the best values were not on the border of the search interval, as this is an indication that the search is being conducted in the wrong range and that there could be a more optimal hyperparameter setting outside the interval.

4.3.1 Selecting the Best Model from a Training Run

For both architectures, the toolkit Tensorboard [1] was used during training to track and visualize the training loss as well as the validation loss and MIoU. The training algorithms aim to minimize the loss on the training images. To get a model which generalizes well, it is better to use the validation images to determine how long to train. Since MIoU is one of the most common metrics for benchmarking different datasets, our main focus during training was to maximize this metric.

The training code for DeepLabv3+ and Depth-Aware CNN included writing scalars to Tensorboard at the end of each epoch or a subset of epochs. This makes it possible to track the progress of the training in real time. The model which gets the highest MIoU on the validation set is considered the best, and a copy of the model is saved. That way, it is possible to export its predictions as images after the training and compare the results of different experiments. As the validation metrics can jump up and down between epochs, it is also of interest to look at the progress over time. Even if two experiments with different hyperparameters reach a similar MIoU on the validation set after training for ten hours, it might be better to use the hyperparameters which led to a faster increase of MIoU.
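The selection rule above amounts to saving a checkpoint whenever the validation MIoU improves. A minimal sketch, assuming train_one_epoch and evaluate functions that return the training loss and the validation MIoU; these, and the log directory, are placeholders.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

def train_with_checkpointing(model, train_one_epoch, evaluate, n_epochs, logdir="runs/exp"):
    writer = SummaryWriter(logdir)
    best_miou = 0.0
    for epoch in range(n_epochs):
        train_loss = train_one_epoch(model)      # assumed to return the mean training loss
        val_miou = evaluate(model)               # assumed to return the validation MIoU
        writer.add_scalar("loss/train", train_loss, epoch)
        writer.add_scalar("miou/val", val_miou, epoch)
        if val_miou > best_miou:                 # keep a copy of the best model so far
            best_miou = val_miou
            torch.save(model.state_dict(), "best_model.pth")
    writer.close()
    return best_miou
```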

4.4 Augmentations

A problem that arises with limited training data is overfitting, i.e., when the network learns to perfectly model the images in the training set but performs poorly on new images. A well-established strategy to help models generalize better is to use data augmentation to enhance the size and quality of the dataset. In our work, we make use of geometric transformations (flipping, cropping, etc.) and color space augmentations (altering the colors of images) to add variation to the dataset without adding any new images. We also experiment with adding classification images to the dataset, i.e., images which are cut-out patches of only one building material.

Figure 4.6: Visualization of the data augmentations used during training: (a) original, (b)–(c) color shift, (d) flip, (e)–(f) color jitter.

4.4.1 Geometric and Color Space Augmentations

Figure 4.6 showcases the geometric and color space augmentations that we used during training. They are all small alterations which are applied to the images randomly in each iteration during the training. Figure 4.6d shows a horizontal flip. Random crop as described in subsection 4.2.1 is also considered an augmentation technique. Color shifting [30] is performed by adding or subtracting a random value from each of the RGB channels with a certain probability. Figures 4.6b and 4.6c show what happens when the red channel, with values in the range 0 to 255, is increased or decreased by 20. Color jittering [23] is performed by converting the image from the RGB color space to HSV (hue, saturation, and value). Then a random value is added to or subtracted from each of the HSV channels and the image is converted back to RGB. Figures 4.6e and 4.6f show two examples of random color jittering.
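A minimal sketch of the two color space augmentations, assuming the image is a uint8 RGB array and using OpenCV for the HSV conversion. The shift magnitude of 20 follows the example in the text, while the probability and the jitter ranges are illustrative assumptions.

```python
import numpy as np
import cv2

def color_shift(image, max_shift=20, p=0.5):
    """Add or subtract a random value from each RGB channel with probability p."""
    shifted = image.astype(np.int16)
    for c in range(3):
        if np.random.rand() < p:
            shifted[..., c] += np.random.randint(-max_shift, max_shift + 1)
    return np.clip(shifted, 0, 255).astype(np.uint8)

def color_jitter(image, max_jitter=(10, 20, 20)):
    """Add a random value to each HSV channel, then convert back to RGB."""
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.int16)
    for c in range(3):
        hsv[..., c] += np.random.randint(-max_jitter[c], max_jitter[c] + 1)
    hsv[..., 0] %= 180                              # hue wraps around (OpenCV range 0-179)
    hsv[..., 1:] = np.clip(hsv[..., 1:], 0, 255)    # saturation and value are clipped
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
```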


Figure 4.7: Example of a classification image (a) and the corresponding ground truth label (b). Image from pixabay.com.

4.4.2 Class Weights

PyTorch [23] supports setting weights that control how much each class contributes to the prediction loss, which can improve results on a dataset with varying class sizes. Paszke et al. [22] suggest using the weight formula

$$w_j = \frac{1}{\ln\!\left(c + \frac{\text{pixels of class } j}{\text{total pixels}}\right)}\,, \qquad (4.1)$$

where the numbers of pixels refer to the whole training set. Setting c = 1.02 restricts the class weights to the range 1–50. When using class weights, the cross-entropy loss function in Equation 2.9 is multiplied by the weight term and becomes

$$L_i = w_{y_i}\left(-z_{y_i} + \log\!\left(\sum_{j=1}^{N_{cls}} e^{z_j}\right)\right). \qquad (4.2)$$

This leads to a larger loss for pixels $i$ where the correct class $y_i$ is infrequent in the dataset.
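A sketch of how the weights in (4.1) can be computed from the training labels and passed to the loss function in PyTorch; the integer label maps and the handling of indices outside the class range are assumptions about the data format.

```python
import numpy as np
import torch
import torch.nn as nn

def class_weights(label_maps, n_classes, c=1.02):
    """Equation (4.1): w_j = 1 / ln(c + fraction of training pixels belonging to class j)."""
    counts = np.zeros(n_classes)
    for labels in label_maps:   # one integer label map per training image
        # indices outside the class range (e.g. an ignored label) are dropped
        counts += np.bincount(labels.ravel(), minlength=n_classes)[:n_classes]
    freq = counts / counts.sum()
    return 1.0 / np.log(c + freq)

# weights = class_weights(train_label_maps, n_classes=11)
# criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```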

4.4.3 Classification Images

It is a much faster process to label classification images, which are patches of only one building material, than to annotate real-life images of buildings that contain background, doors, windows, etc. To label a classification image, we made a new image of the same size and filled it with one color, as in Figure 4.7.

4.5 Encoding of 3D information

One way to utilize 3D information in convolutional neural networks designed for 2D images is to encode the 3D information into an HHA [14] (horizontal disparity, height above ground, angle with gravity) image. For any pixel representing a 3D point with depth D, height h above ground, and a surface normal $n = (n_x, n_y, n_z)^T$, $|n| = 1$, in a coordinate system where the z-axis is pointing towards the sky, the HHA encoding is calculated as

$$H \leftarrow \frac{1}{D}\,, \qquad H \leftarrow h\,, \qquad A \leftarrow \arccos(n_z)\,. \qquad (4.3)$$

Figure 4.8: Illustration of height above ground, angle with gravity, angle with ground, normal with sky and magnitude with ground for a point on a roof.

The height above ground and the angle extracted from the normal parallel with gravity are illustrated in Figure 4.8. Two networks can then be trained, one with RGB and one with HHA, and their predictions can be summed [21, 29]. The networks are trained jointly, with one loss for the summed prediction and one optimizer for the parameters of both networks.
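A sketch of this late-fusion setup in PyTorch: two networks produce per-pixel class scores, the scores are summed, and a single optimizer updates the parameters of both networks. The module and variable names are placeholders rather than the thesis code.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Sum the per-pixel class scores of an RGB network and an encoding network."""
    def __init__(self, rgb_net, enc_net):
        super().__init__()
        self.rgb_net = rgb_net
        self.enc_net = enc_net

    def forward(self, rgb, encoding):
        return self.rgb_net(rgb) + self.enc_net(encoding)

# fusion = LateFusion(rgb_net, hha_net)
# optimizer = torch.optim.Adam(fusion.parameters())                  # one optimizer for both networks
# loss = nn.CrossEntropyLoss()(fusion(rgb_batch, hha_batch), labels) # one loss on the summed prediction
```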

The HHA encoding of depth maps is popular for the indoor dataset NYUD2 [27]. Height above ground in indoor scenes can be estimated from the lowest point visible in an image, as it is often the floor or the bottom part of a wall. It is also a reasonable measure, as TVs and lamps tend to be placed at similar heights in all rooms. In outdoor scenes and images of buildings, the estimation of height above ground is more problematic. The Spotscale dataset contains images where only facade or roofing is visible. This would lead to large inconsistencies if the height were estimated from the lowest point in each image. Furthermore, depending on how tall the buildings are, there are large variations in the heights at which roofs are located.

Figure 4.9: A depth map and a normal map can be combined into HMN using Equation 4.5: (a) depth map, (b) normal map, (c) HMN. Normal legend: X→red, Y→green, Z→blue. HMN legend: H→red, M→green, N→blue.

For this reason, we decided to replace HHA with two alternative encodings. We named the first encoding HMA, where the height above ground is replaced with the magnitude of the normal parallel to the ground, and the angle is calculated from the ground (orthogonal to the gravity direction) rather than from the gravity direction. Following the prerequisites for HHA in Equation 4.3, the HMA encoding is calculated as

$$H \leftarrow \frac{1}{D}\,, \qquad M \leftarrow \sqrt{n_x^2 + n_y^2}\,, \qquad A \leftarrow \arcsin(n_z)\,. \qquad (4.4)$$

Calculating the angle from the ground leads to walls having a high magnitude value and a low angle value, and vice versa for roofs. If there is no normal data, i.e., $n = 0$, both the magnitude and the angle will be zero.

We also experimented with another encoding, where the normal component parallel with gravity was used without calculating any angle. We call this encoding HMN:

$$H \leftarrow \frac{1}{D}\,, \qquad M \leftarrow \sqrt{n_x^2 + n_y^2}\,, \qquad N \leftarrow n_z\,. \qquad (4.5)$$

For both HMA and HMN, all channels are scaled linearly according to the range of values observed across the training dataset. Figure 4.9 shows a depth map and a normal map that have been combined into HMN.
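A sketch of the HMN encoding in (4.5), computed from a depth map and a world-frame normal map. The linear scaling to an 8-bit image range with per-channel limits taken from the training set is our interpretation of the scaling described above.

```python
import numpy as np

def hmn_encoding(depth, normals_world, eps=1e-6):
    """Stack the three HMN channels from a depth map and a world-frame normal map.

    depth:         (H, W) depth values, zero where no 3D data is available.
    normals_world: (H, W, 3) unit normals with the z-axis pointing towards the sky.
    """
    h = np.where(depth > 0, 1.0 / (depth + eps), 0.0)           # horizontal disparity
    m = np.hypot(normals_world[..., 0], normals_world[..., 1])  # magnitude parallel to the ground
    n = normals_world[..., 2]                                   # normal component along gravity
    return np.stack([h, m, n], axis=-1)

def scale_to_uint8(hmn, channel_min, channel_max):
    """Linearly map each channel to 0-255 using ranges observed on the training set."""
    lo, hi = np.asarray(channel_min), np.asarray(channel_max)
    scaled = (hmn - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)
```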


5 Experiments

As annotation of images is a time-consuming process, we began experimenting with DeepLabv3+ and Depth-Aware CNN on a smaller dataset with 106 training images and 51 validation images. This allowed us to explore different hyperparameter configurations with a faster training time per epoch. Once we had access to more data, we trained the networks using the hyperparameters determined from the small dataset, and ran experiments to see if late fusion with normal maps and HMN improved the predictions.

5.1 DeepLabv3+

The search for suitable hyperparameters, performed on the small dataset, included a mixture of manual and grid search to reduce the number of times we needed to train the networks. The results from a grid search to pinpoint the best batch size and learning rate are seen in Table 5.1. It shows that the learning rate 7 · 10⁻⁵ and batch size 4 were suitable choices with base height 1547 px and crop size 1153 × 769 px.

5.1.1 Augmentations

The grid search showed that the hyperparameters in Table 5.5 were well suited for DeepLabv3+. To improve the baseline results from the grid search, experiments with augmentations were made. The models from the epochs with the highest validation MIoU for each augmentation are presented in Table 5.2. Random flip of the input images led to an improvement of the validation metrics, whereas color shift caused a deterioration.

Table 5.1: Grid search results on validation images from training DeepLabv3+ for 70 epochs. The training images were resized to height 1547 px and cropped to 1153 × 769 px. The metrics are from the training epoch with the highest MIoU on the validation set.

Batch Size   Learning Rate   Acc (%)   Acc Class (%)   MIoU (%)   FWIoU (%)
2            1 · 10⁻⁵        91.1      78.5            68.1       84.0
2            3 · 10⁻⁵        91.0      90.0            71.9       85.0
2            5 · 10⁻⁵        92.4      89.0            71.1       84.3
2            7 · 10⁻⁵        90.7      88.2            69.6       82.7
3            5 · 10⁻⁵        92.4      91.1            73.8       86.7
3            7 · 10⁻⁵        92.8      91.5            75.4       87.1
3            9 · 10⁻⁵        93.3      90.1            75.5       88.0
3            1.1 · 10⁻⁴      93.9      87.4            78.4       88.7
3            1.3 · 10⁻⁴      94.0      85.6            78.0       88.7
4            5 · 10⁻⁵        94.4      92.1            80.9       89.7
4            7 · 10⁻⁵        94.7      92.6            81.6       90.2
4            9 · 10⁻⁵        94.1      91.9            81.3       89.1

Class weights were calculated for the training set using Equation 4.1 and then used in the loss function during training. The confusion matrix in Figure 5.1c shows that smaller classes, especially seam metal wall, got a higher accuracy when class weights were used during training. The baseline model got a higher accuracy for unclassified. Overall, the metrics on the validation set were similar between the baseline model and the model with class weights. Class accuracy decreased by 1.4%, while the MIoU remained unchanged, and FWIoU and accuracy increased slightly.

To investigate if augmenting the training dataset with classification images could decrease the frequency of mixing up brick wall and horizontal wooden panel, we added 36 stock photos from Pixabay to the training set. The classification images were close-ups of the building materials, which was not the case for the other images in the dataset. Hence, the images were rescaled to a base height of 400 px to match the pixel height of brick blocks and wooden panels in the dataset. A comparison of the metrics for the baseline model and the model with added classification images is shown in Table 5.2. The model trained with added classification images did not improve accuracy on either brick wall or horizontal wooden panel, and it did not fix the mix-up between the classes either, see Figure 5.1.
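The rescaling of the classification images is a plain aspect-ratio-preserving resize; a sketch using PIL, where the base height of 400 px is the value mentioned above:

    from PIL import Image

    def resize_to_base_height(image, base_height=400):
        # Scale the image so its height equals base_height, keeping the aspect ratio.
        scale = base_height / image.height
        new_width = round(image.width * scale)
        return image.resize((new_width, base_height), Image.BILINEAR)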


Figure 5.1: Comparison of confusion matrices for DeepLabv3+ on the small validation set: (a) baseline model, (b) model with added classification images, (c) model with class weights. The blue color indicates the number of pixels for each combination of predicted and true label. The colored classes highlight the classes classification images were added to.


Table 5.2: Validation metrics (%) for DeepLabv3+ after adding different improvements.

                              Acc    Acc Class   MIoU   FWIoU
Baseline                      95.9   91.9        84.5   92.3
Color shift                   95.0   90.6        82.0   90.7
Flip                          96.5   94.5        87.4   93.3
Flip + color shift            96.0   92.9        85.3   92.3
Weighted loss                 95.8   93.3        84.5   92.0
Added classification images   95.2   90.3        82.9   90.9

5.2 Depth-Aware CNN

Experiments to find optimal hyperparameters were carried out on the small dataset. An initial coarse search, where the height of the input images was 1000 px and the batch size was 4, indicated that a suitable learning rate was 5 · 10⁻⁵. The results of a finer search with input image height 1000 px and random crop size 570 × 380 px are shown in Table 5.3. The crop size was the largest possible for batch size 8 without running out of memory on the GPU. The learning rates 3 · 10⁻⁵ and 5 · 10⁻⁵ were used for all batch sizes, and an additional experiment was conducted depending on which of the learning rates performed best. The configuration with learning rate 3 · 10⁻⁵ and batch size 2 showed both the overall highest MIoU and a fast increase in MIoU during the first epochs of training. It was therefore considered a suitable starting point for further experiments.

A grid search over the crop sizes {570 × 380, 750 × 500, 900 × 600} px showed that 750 × 500 px gave a high MIoU over multiple epochs. Keeping the crop size at 750 × 500 px and varying the height to which the images were scaled down, over {900, 1000, 1100, 1300, 1500} px, showed only small variations in the validation metrics.

5.2.1 Augmentations

The first augmentation tested for Depth-Aware CNN was sampling the base height of the training images uniformly from the range 750–1250 px. In a second experiment, the training images were additionally flipped randomly with probability 0.5, and finally, color jittering was added. As seen in Table 5.4, each of the augmentations led to improvements in the validation metrics.
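The random scaling has to be applied consistently to the RGB image, the depth map and the label mask. A sketch assuming PIL-style inputs is given below; nearest-neighbour interpolation is used for the labels so that no new class values are introduced at borders.

    import random
    from PIL import Image

    def random_scale(image, depth, mask, min_height=750, max_height=1250):
        base_height = random.randint(min_height, max_height)
        scale = base_height / image.height
        size = (round(image.width * scale), base_height)
        image = image.resize(size, Image.BILINEAR)
        depth = depth.resize(size, Image.BILINEAR)   # depth values themselves are unchanged
        mask = mask.resize(size, Image.NEAREST)      # labels must not be interpolated
        return image, depth, mask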

At the end of each epoch, predictions were made on the validation images to monitor how the training progressed. Figure 5.2 shows the validation loss and MIoU as plotted in Tensorboard. Flipping and color jittering helped to speed up the training since a low loss and a high MIoU were achieved after fewer training iterations. All three augmentations were therefore kept for later experiments with the extended dataset.
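The smoothing applied to the curves in Figure 5.2 is a standard exponential moving average, the same kind of smoothing Tensorboard offers; a minimal sketch with an arbitrarily chosen smoothing factor:

    def exponential_moving_average(values, smoothing=0.9):
        # Each point is a weighted mix of the previous smoothed value
        # and the current raw value.
        smoothed = []
        last = values[0]
        for value in values:
            last = smoothing * last + (1.0 - smoothing) * value
            smoothed.append(last)
        return smoothed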


Table 5.3: Grid search results on validation images after training Depth-Aware CNN for 70 epochs. The training images were resized to height 1000 px and cropped to 570 × 380 px. The metrics are from the training epoch with the highest MIoU on the validation set.

Hyperparameters              Validation Metrics (%)
Batch Size   Learning Rate   Acc    Acc Class   MIoU   FWIoU
2            2 · 10⁻⁵        90.6   78.8        70.6   83.2
2            3 · 10⁻⁵        92.5   84.1        76.1   86.2
2            5 · 10⁻⁵        91.8   83.5        73.2   85.3
4            3 · 10⁻⁵        90.9   80.6        71.0   83.5
4            5 · 10⁻⁵        91.8   79.4        72.1   85.0
4            7 · 10⁻⁵        90.8   80.9        70.6   83.8
6            3 · 10⁻⁵        90.5   75.2        68.1   82.8
6            5 · 10⁻⁵        90.8   80.7        69.5   84.0
6            7 · 10⁻⁵        90.9   80.0        68.9   83.9
8            2 · 10⁻⁵        89.8   75.2        66.6   81.9
8            3 · 10⁻⁵        90.5   77.9        68.7   83.2
8            5 · 10⁻⁵        90.3   77.4        68.1   82.5

Figure 5.2: Training progress for Depth-Aware CNN as visualized in Tensorboard: (a) validation loss (log scaling), (b) validation MIoU, for the baseline model and the augmented models (scale; scale + flip + jitter). An exponential moving average has been applied to smooth the graphs.


Table 5.4: Validation metrics (%) for Depth-Aware CNN, comparing the baseline with augmentations and with disabled depth inpainting.

                                     Acc    Acc Class   MIoU   FWIoU
Baseline (missing depth inpainted)   93.8   87.4        78.9   88.6
Scale                                94.1   88.9        80.2   88.9
Scale + flip                         93.9   88.0        80.6   88.6
Scale + flip + color jitter          93.8   89.2        81.1   88.3
Missing depth not inpainted          93.3   85.8        77.3   87.6
No depth maps                        92.4   83.6        74.1   86.2

5.2.2 Impact of Depth Maps

To investigate the impact of the depth maps on the segmentation predictions by Depth-Aware CNN, experiments were conducted with different depth inputs. In the first experiment, depth maps without any preprocessing were used, meaning that areas where depth information was missing had a depth value of zero. In a second experiment, the depth-aware operations were replaced with RGB-only operations. The results in Table 5.4 show that using depth maps gives better segmentation results and that preprocessing the depth maps helps. Figure 5.3 shows the predictions for an image where the depth data is missing at the far left side. All three predictions look similar, and there are no obvious errors in the prediction without inpainting in the areas with missing depth data. Visual inspection of the predictions gave no clear explanation for the decline in performance.
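The inpainting itself is described in the method chapter. As a stand-in for that procedure, the sketch below fills missing (zero) depth values with the value of the nearest valid pixel; this is only an illustrative assumption, not the exact preprocessing used.

    import numpy as np
    from scipy import ndimage

    def inpaint_missing_depth(depth):
        # Fill pixels with zero depth (missing data) with the value of the
        # nearest valid pixel (a simple nearest-neighbour inpainting).
        missing = depth == 0
        if not missing.any():
            return depth
        _, nearest = ndimage.distance_transform_edt(missing, return_indices=True)
        return depth[tuple(nearest)]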

5.3 Extending the Dataset

When access was given to more annotated images, the dataset was extended such that the images that had previously been in the training and validation sets remained there. The extended dataset contained 244 training images, 106 validation images and 50 test images. The hyperparameter optimization on the small dataset showed that the configurations in Table 5.5 were the most suitable. For DeepLabv3+, only flipping was used as augmentation, as color shifting, class weights in the loss function and extending the dataset with material patches had not proven beneficial. For Depth-Aware CNN, the augmentations flipping, color jittering and random scaling were used. These setups were trained for at least 200 epochs and serve as a baseline for improving the results with late fusion.
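Late fusion here means summing the per-class score maps of the RGB network and the HMN network and taking the per-pixel argmax. A minimal sketch, assuming two PyTorch models whose outputs have the same shape (batch, classes, H, W):

    import torch

    @torch.no_grad()
    def late_fusion_prediction(rgb_model, hmn_model, rgb_image, hmn_image):
        rgb_scores = rgb_model(rgb_image)   # (batch, classes, H, W)
        hmn_scores = hmn_model(hmn_image)
        fused = rgb_scores + hmn_scores     # element-wise sum of class scores
        return fused.argmax(dim=1)          # per-pixel class indices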


Figure 5.3: Depth-Aware CNN gives similarly looking predictions regardless of the depth input: (a) image, (b) ground truth segmentation, (c) raw depth map, (d) prediction with raw depth map, (e) inpainted depth map, (f) prediction with inpainted depth, (g) prediction without depth data.
