
Automatic map generation from nation-wide data sources using deep learning


Academic year: 2021


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2020 | LIU-IDA/STAT-A--20/018--SE

Automatic map generation from nation-wide data sources using deep learning

Gustav Lundberg

Supervisor: Annika Tillander | Examiner: Fredrik Lindsten



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The last decade has seen great advances within the field of artificial intelligence. One of the most noteworthy areas is that of deep learning, which is nowadays used in everything from self-driving cars to automated cancer screening. During the same time, the amount of spatial data encompassing not only two but three dimensions has also grown, and whole cities and countries are being scanned. Combining these two technological advances enables the creation of detailed maps with a multitude of applications, civilian as well as military.

This thesis aims at combining two data sources covering most of Sweden, laser data from LiDAR scans and a surface model from aerial images, with deep learning to create maps of the terrain. The target is to learn a simplified version of orienteering maps, as these are created with high precision by experienced map makers and represent how easy or hard it would be to traverse a given area on foot. The performance on different types of terrain is measured, and it is found that open land and larger bodies of water are identified at a high rate, while trails are hard to recognize.

It is further researched how the different densities found in the source data affect the performance of the models. Some terrain types, trails for instance, benefit from higher density data, while other features of the terrain, like roads and buildings, are predicted with higher accuracy from lower density data.

Finally, the certainty of the predictions is discussed and visualised by measuring the average entropy of predictions in an area. These visualisations highlight that although the predictions are far from perfect, the models are more certain about their predictions when they are correct than when they are not.


Acknowledgments

No work is that of a single man, and a number of people have been instrumental to this thesis. I'd like to express my gratitude towards Peter Svenmark of FOI for giving me the opportunity to do my thesis with them, as well as my supervisor at FOI, Maria Axelsson, for all the patience, advice and thorough commentary on my work.

My supervisor and examiner at Linköping University, Annika Tillander and Fredrik Lindsten, also deserve a thank you for the discussions and ideas during the making of the thesis.

To all the classmates in the Statistics and Machine Learning programme, and in particular my opponent Hector Plata for sharing all of his knowledge! It's been a couple of very interesting years and I've learned a lot from you all. Faton, Alexander and Maximilian also deserve a special mention for being sounding boards and getting me through this.

To Axel for the motivational talks. And to mom and dad, for all your sacrifices; I wouldn't be where I am without you! To my wife, Martina, for all your support, love and patience with weekends and evenings lost over the last couple of years. Finally, to my girls at home, Stella and Klara, for always having a hug to spare.


Contents

Abstract
Acknowledgments
Contents
Notation
Glossary
List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Proposed solution
   1.3 Aim
   1.4 Research questions
   1.5 Delimitations
   1.6 Ethical considerations
   1.7 Organisation of report

2 Data
   2.1 Types of data
   2.2 Data Sources
       2.2.1 Orienteering maps
       2.2.2 Laser data
       2.2.3 Surface model
   2.3 Joining the data sets
       2.3.1 Adding information from surface model
       2.3.2 Adding labels from orienteering map
   2.4 Output format
       2.4.1 Structural similarity to S3DIS
       2.4.2 Test and validation data sets

3 Theory
   3.1 Multinomial regression and Multilayer perceptrons
       3.1.1 Logistic regression
       3.1.2 Multinomial logistic regression
       3.1.3 Weight estimation
   3.2 Multilayer perceptron - automating feature engineering
       3.2.2 MLP as feature engineer
       3.2.3 Weight estimation
       3.2.4 Loss functions
       3.2.5 Activation functions
       3.2.6 Batch normalisation
       3.2.7 Dropout
       3.2.8 Skip link concatenation
   3.3 Convolution
       3.3.1 Convolutional neural networks
       3.3.2 Convolution in three dimensions
       3.3.3 Pooling
   3.4 Performance metrics
       3.4.1 Accuracy
       3.4.2 Intersect over union
   3.5 Voxelisation
       3.5.1 Alternatives to voxelisation

4 Method
   4.1 Multinomial logistic regression
       4.1.1 Evaluation of baseline performance
   4.2 PVCNN++
       4.2.1 Network structure
       4.2.2 Loss function weights
       4.2.3 Evaluation of model performance
   4.3 Evaluation of point density effect on performance

5 Results
   5.1 Multinomial logistic regression
   5.2 PVCNN++
       5.2.1 High resolution voxels
       5.2.2 Medium resolution voxels
       5.2.3 Low resolution voxels
   5.3 Model comparison
   5.4 Point density effect on accuracy

6 Discussion
   6.1 Results
       6.1.1 Performance for different terrain types
       6.1.2 Performance for different voxel sizes
       6.1.3 Deep learning improvement over baseline
       6.1.4 Effect of point density on model performance
   6.2 Data
       6.2.1 Choice of predictors
       6.2.2 Creation of ground truth
   6.3 Method discussion
       6.3.1 PVCNN++ for terrain classification
       6.3.2 Use of uncertainty maps
       6.3.3 Gridding
       6.3.4 Sources of error

7 Conclusion
   7.1 Research question

Bibliography
Appendices
   A Point distributions in training data per area
   B Entropy limits and performance metrics for PVCNN++


Notation

Below is some of the mathematical notation used in the thesis:

x : A vector containing elements x1, x2, ..., xn

{a, b, c} : A set of elements a, b and c

a · b : The dot product of the K-length vectors a and b, calculated as Σ_{k=1}^{K} a_k b_k

||x|| : The norm, or length, of a vector x, calculated as √(x · x)

⌊x⌋, ⌈x⌉ : The lower and upper bound of x, in essence the integer part of x and the integer part of x + 1

N_d(x) : The neighbourhood of x, i.e. the area with radius d around x

I(a) : Indicator function evaluating a. Returns 1 if a is TRUE and 0 otherwise


Glossary

Below is a non-exhaustive list of terms, abbreviations and notation that may be helpful when reading the thesis.

• Pixel: A pixel represents a unit on a two-dimensional raster. On top of the coordinates of the pixel, it holds at least one piece of information, also called a channel or feature. To create a colour image, either three or four channels are used for different intensities depending on the colour model. Additional features can for instance denote category or opacity.

• Voxel: The basic unit in a three-dimensional grid, basically a pixel with an added coordinate for the third dimension.

• Spatial data: Data where some of the features represent coordinates in a geometric space. This also implies that there is a distance metric that can be used to denote the similarity of two observations in that space; the distance metric may however differ between different geometries (spherical, Euclidean etc.). Such data can be described as x = {x_i} = {(c_i, f_i)}, with c_i being the coordinates and f_i the features associated with the i-th observation.

• Euclidean space: The space of all K-tuples of real numbers (x1, x2, ..., xK), collectively referred to as R^K. R^3 is the three-dimensional space in which the physical position of real-world objects can be defined.

• Euclidean distance: The shortest distance between two points a and b in a K-dimensional Euclidean space, calculated as distance = √(Σ_{k=1}^{K} (a_k − b_k)²).

• RGB-space: A closed, three-dimensional space in which an observation's values/intensities on the red, green and blue colour channels can be found. Normally this is [0, 1]³ if infinite resolution is assumed, or [0, 255]³ or [0, 65535]³ if discrete 8-bit or 16-bit resolution is used.

• Expected value: The expected value of a function f(x), with x being a random variable distributed according to some distribution p(x), is the mean value f(x) will have when x is drawn from p(x). For discrete x it is a sum, E[f(x)] = Σ_x p(x) f(x), while for continuous x it is an integral, E[f(x)] = ∫ p(x) f(x) dx.

• SWEREF99-TM: Geo-referencing system used when the entirety of Sweden should be referenced in the same system. For further information, see Lantmäteriet.

• .las/.laz files: Standardised formats for point clouds. They mandate X, Y and Z coordinates and can also hold intensity, classification and colour information for each point.

• Features: The properties gathered for each individual observation, for instance coordinates, colour and intensity. Also referred to as variables.


List of Figures

2.1 Orienteering map example
2.2 Artefact cleanup in orienteering maps
2.3 Colour codes for orienteering maps
2.4 Category distribution in orienteering maps
2.5 Laser data example
2.6 Surface model example
2.7 Fuzzy join overview
2.8 Flowchart of data manipulation and modelling
2.9 Segment splits
2.10 Point distribution per split

3.1 Logistic and multinomial regression schematic
3.2 MLP schematic
3.3 Deep learning schematic

4.1 Schematic of PVCNN++
4.2 Branch detail of PVConv
4.3 Åkerbo SV point density

5.1 PVCNN++ prediction, c=1
5.2 PVCNN++ prediction, c=0.5
5.3 PVCNN++ prediction, c=0.25
5.4 Proportion correct grid units per terrain and density

6.1 Certain road prediction
6.2 Trail or forest
6.3 Open land predicted as building
6.4 Uncertain trails
6.5 Missing waters


List of Tables

2.1 Map neighbour weights
2.2 Laser data example
2.3 Surface model example
2.4 Output data format

3.1 Confusion matrix example

4.1 Class specific loss weights

5.1 Confusion matrix for multinomial regression evaluated on test samples
5.2 Confusion matrix for PVCNN++, c=1 evaluated on test data points
5.3 Confusion matrix for PVCNN++, c=1 evaluated on test data grid units
5.4 Confusion matrix for PVCNN++, c=0.5 evaluated on test data points
5.5 Confusion matrix for PVCNN++, c=0.5 evaluated on test data grid units
5.6 Confusion matrix for PVCNN++, c=0.25 evaluated on test data points
5.7 Confusion matrix for PVCNN++, c=0.25 evaluated on test data grid units
5.8 Comparison of model Accuracy
5.9 Comparison of model IoU
5.10 Comparison of model prediction rate


1 Introduction

Over the last decade, artificial intelligence has surpassed human performance in a multitude of applications, and while there is an argument to be made that humans can reason about new information in ways that computers can't, computers can perform their tasks at superior speeds for days on end. This is in no small part thanks to the advances made in artificial neural networks in recent years. While their foundations were theorised almost 80 years ago, the deep networks of 2020, with millions of components, can be used to recolour old videos, drive cars and analyse human behaviour to fight terrorism.

Computer vision is a key component in all of the above examples, and an area where deep learning is particularly useful. From deciding whether a picture shows a dog or a muffin to identifying where the pedestrians around a car are, the algorithms have evolved to make use of not only flat images but also three-dimensional data. And as the amount and availability of three-dimensional data grows, so does the number of algorithms that can use them and the number of fields that employ these algorithms.

The Swedish military is no exception to this rule, and amongst other applications, they wish to look into how deep learning can be utilised in the planning of military operations in general and in terrain analysis in particular. This thesis was conducted at the C4ISR division of the Swedish Defence Research Agency (FOI).

1.1 Motivation

In military terrain assessment, a crucial part of planning a logistic operation is knowing which areas can be used in the manoeuvre. If the operation involves light, wheel-based vehicles you are mainly confined to roads or other hard and flat surfaces, whilst tracked vehicles allow you to traverse marshy areas as well as small ditches and trenches. Soldiers, on the other hand, can navigate rather dense forests, but can get stuck in low vegetation that heavier vehicles may not be concerned by. Furthermore, steep elevation may be traversable in some directions but not others, i.e. going down a slope may be possible but not going back up.

Military terrain assessment is commonly done more or less by hand; either by looking at map layers one by one, marking areas of interest and then combining these into a suggested path1, or by weighing different areas on different map layers by their driveability2. These approaches can combine personal knowledge about certain areas with standardised doctrine and be tailored to the vehicle types at hand, resulting in a complete plan for a manoeuvre. On the other hand, manual methods are time consuming, require highly trained and experienced staff and are vulnerable to situation changes.

In discussions with military decision makers, being able to superimpose information about which areas are and are not possible to traverse onto a conventional map is something that is reported to significantly decrease the amount of time spent on planning a manoeuvre. This of course requires some technical equipment, but also that the terrain in question has been analysed beforehand, since computational power is generally not available at the forward command posts. Which areas should then be analysed? Preferably all of Sweden, since it cannot be known beforehand which areas will need to be accessible at what moment. Assigning staff to do this analysis is not time well spent, since a lot of the work would never be used, and what is used runs the risk of being outdated when it is needed. Thus it would probably be better to let a computer do the analysis, something that the last decade's advances in machine learning could be useful for, given the vast areas that need to be processed.

1As explained in discussions with battalion officers

1.2 Proposed solution

For a machine to do terrain assessment of an entire country, a couple of things are needed:

• A nation-wide data source, preferably one that can be refreshed without requiring a lot of manpower.

• A ground truth, so that the method knows what it should learn.

• A machine learning method that can handle the large amounts of data, both when it is learning the task and when it is used to predict unseen areas.

Source of data: The first requirement can be solved by using data from the Swedish mapping, cadastral and land registration authority (henceforth referred to by its Swedish name, Lantmäteriet). In particular, Lantmäteriet provides two sources of data that cover most of Sweden and are collected using airplanes: airborne LiDAR scans [10] and aerial images [19]. These data are in turn used by Lantmäteriet to create a wide range of other maps with varying degrees of human interaction, like cadastral and topographic maps.

Ground truth: The second requirement is harder, since the plans created by military personnel are not created in an easy-to-use digital format, nor are they available to the public. Therefore, this thesis will not present a direct solution to the problem of nation-wide terrain assessment for military use, but it will investigate a similar problem by using a ground truth made for a similar task that is available to the public: orienteering maps.

In the making of orienteering maps, aerial images have long been used to create a basic map that is then fine-tuned by sending a map maker out into the field. The latter part can take a large amount of time, but generally results in maps with a high amount of detail on the types of obstacles to be found. Another benefit of orienteering maps is that they are created in standardised digital formats and thus serve as an accessible source of subjective interpretation made by experienced professionals.

In more recent years, aerial LiDAR scans have been utilised to create orienteering maps of higher detail; however, this still requires manual work. An example of this is the MapAnt project [17], where LiDAR scans were used to create a rough orienteering map of most of Finland. The project relies on man-made maps for some features, like buildings and roads, but still showcases the use of LiDAR data in terrain analysis. The underlying software, Karttapullautin, also relies on the input LiDAR data being classified in terms of which terrain types the individual points belong to. This use of LiDAR data in the creation of orienteering maps supports the notion that such data can be used to create maps detailing the possibility to traverse terrain.

Choice of method: The last of the requirements above is that of an efficient method capable of handling the source data. While having access to large amounts of training data is usually beneficial for deep learning models, the fact that the data is of a spatial nature means that a large number of spatial relations are to be found in the data. Learning and making use of these relations can be time consuming and require significant computational power.

A class of deep learning models that makes use of spatial relations is Convolutional Neural Networks (CNNs). CNNs have been used with great success in various computer vision tasks like object detection and image classification in 2D. Applying CNNs to 3D data can be done in a similar fashion as in 2D, provided the data is of a regular structure, just like the pixels of a 2D image. 3D point clouds are however not of a regular nature, and therefore traditional CNNs cannot be directly applied to them. Solutions to this issue have involved projecting the point clouds to 2D, or voxelising them into a regular 3D grid, and then applying convolutional networks. These methods can learn the spatial context of an area, but have the apparent drawback that some information is lost, while at the same time being computationally heavy.

In 2016, Qi et al. proposed a third approach, where a deep learning model works directly on the point clouds, and named it PointNet [16]. PointNet learns a set of features common to all points in a scene and combines these with the features of the individual points to enable per-point classification. PointNet has since been the base of several other methods, many of which are listed in a survey by Guo et al. [6] released just prior to the start of this thesis.

PointNet was recently challenged by a network design combining the detailed information gained by using the raw point cloud with the information about a neighbourhood achieved in voxel-based methods. The network, aptly named Point-Voxel Convolutional Neural Network or PVCNN, was designed by Liu et al. in 2019 [12] and has shown state-of-the-art results on indoor scene segmentation, both in terms of performance metrics and from a standpoint of computational efficiency, making PVCNN a good candidate for a model that is to be run on a nation-wide scale. This thesis is the first known attempt to use PVCNN for terrain analysis.

As a baseline for assessing PVCNN's performance, a multinomial logistic regression model will be used, as this is one of the statistical go-to models for classification. More complex statistical models like Gaussian Markov random fields are used for creating models of the ground elevation (DEMs) [1], while Niemayer et al. used conditional random fields (CRFs) for LiDAR point classification [13], albeit with a higher point density than the data used in this thesis, and with some handcrafted features for each point.

There are other deep learning methods that have been tested on outdoor data, but these focused on urban areas rather than terrain, while also using point clouds of higher density than those available on a nation-wide scale, for instance Huang & You [8] and Zhao et al. [21]. Since the nation-wide LiDAR scans do have some areas of higher point density, the thesis will also research whether the performance of the method varies depending on the point density.

Solution summary: In short, the thesis will utilise 3D point clouds from LiDAR scans and a surface model from aerial images to learn the underlying terrain type from orienteering maps. To do this, a deep learning method called PVCNN will be used, as it aims to combine computational efficiency with accurate predictions. This is done by combining the fine-grained features of each point with the coarse-grained features of its surroundings. A graphic of the process is found in figure 2.8.

1.3 Aim

The aim of this master's thesis is to evaluate whether the main types of underlying terrain can be learned by a deep neural network by feeding it data from nation-wide sources provided by Lantmäteriet. It will also research the effect of point density in the input data.


1.4 Research questions

This thesis seeks to answer the following questions:

1. Can deep learning using 3D point clouds as input be used to construct a 2D map of terrain types? Is a level where a majority of each class is correctly predicted, i.e. 50 percent accuracy, obtainable?

2. Do deep learning models using point clouds result in higher accuracy maps than a multinomial regression using voxelised features?

3. Does the accuracy differ between areas with different resolution in the nation-wide data sources?

1.5 Delimitations

The thesis will focus on public data sources, since they are open and available for research, and on researching the generation of simplified orienteering maps. It does not seek to result in a map showing where every kind of vehicle can drive, nor through which places you can actually move a battalion of vehicles without the last ones getting stuck in meter-deep mud. The thesis rather aims to act as a precursor to such work.

1.6 Ethical considerations

The two datasets from Lantmäteriet are open for use in academic research, and raster images of them may be included in, for example, thesis reports as background information. The example tables 2.2 and 2.3 cannot be considered to contain any useful amount of information and thus cannot be considered a publication of vector data.

Permission for using the orienteering maps was granted by Marcus Henriksson of Linköpings Orienteringsklubb.

The data does not contain any information about individuals or otherwise sensitive information, and thus no further special considerations need to be taken.

The commissioner has not attempted to influence which results should be presented in the thesis.

1.7 Organisation of report

The data used are detailed in chapter 2, along with how they are combined to create the data fed to the models. Relevant theoretical concepts and the models themselves are outlined in chapters 3 and 4 respectively, while the results of applying the models to the data are presented in chapter 5.

Chapter 6 contains a discussion of the results and chapter 7 contains the conclusions of the thesis and also some suggestions of future work.


2 Data

The following chapter details the data used in the thesis and how it is organised before being used as input to the models. All data management and transformation is done in RStudio 1.2 running R 3.6, if not stated otherwise. While plenty of libraries were used for this task, data.table and lidR were crucial, as they provided functionality that sped up the data management significantly.

2.1 Types of data

This section details the raw data sources used in the thesis; their structure, peculiarities and how they are collected.

There are two types of data used in the final data set: data in the form of a regular 2D grid, and irregular data in 3D space, also known as a point cloud.

The regular 2D data has exactly one set of information for every point in an evenly spaced grid within its boundaries. It can thus be represented as an X × Y × K-dimensional matrix, where X and Y are the number of grid units along the two spatial axes and K is the number of features associated with each point. Another way of representing such data is an image where each data point, or pixel, is coloured by one or more of the K features.

Irregular 3D data doesn't necessarily have exactly one set of information per point in space; it can also have multiple or none. Such data are therefore represented as a table with columns for the spatial axes as well as for the features of each point.
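The two representations can be illustrated with a small sketch. This is illustrative only (the thesis's data work is done in R; NumPy and the array shapes here are assumptions):

```python
import numpy as np

# Regular 2D grid: an X x Y x K array with exactly one
# feature vector per grid unit.
X, Y, K = 4, 3, 2                  # e.g. a 4 x 3 grid, 2 features per pixel
grid = np.zeros((X, Y, K))         # every cell exists exactly once
grid[0, 0] = [0.5, 1.0]            # features of the pixel at (0, 0)

# Irregular 3D point cloud: a table with one row per point and
# columns for the spatial axes plus the per-point features.
points = np.array([
    # x    y    z    f1   f2
    [0.1, 0.2, 5.0, 0.5, 1.0],
    [0.1, 0.2, 7.5, 0.3, 0.0],     # two points can share (x, y) ...
    [3.9, 2.8, 1.2, 0.9, 1.0],
])                                 # ... while other locations have none

print(grid.shape)    # (4, 3, 2)
print(points.shape)  # (3, 5)
```

The grid's cell count is fixed by its extent, while the point-cloud table can hold any number of rows, which is exactly why the two forms need different handling downstream.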

A point cloud such as the one used in this thesis has a number of properties that put demands on the algorithms that utilise it [16, 15]:

• Unordered: The point cloud is an unordered set of points in Euclidean space. Unlike a model designed for handling a pixel or voxel array, a point cloud model needs to be invariant to the order of the points.

• Interaction: Since a distance, and thus a similarity measure, can be defined between any two points, the points cannot be considered isolated. The neighbourhood of a point is a space whose properties need to be captured by the model, as is the interaction between neighbourhoods.

• Non-uniformity: The density of points may not be uniform across the entire space, and thus the model needs to handle neighbourhoods representing similar volumes but being represented by different numbers of points.
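The "unordered" property can be made concrete: any per-cloud summary must be a symmetric function of the points, so that permuting the rows changes nothing. A minimal illustration, not taken from the thesis (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))          # 100 points in R^3
shuffled = rng.permutation(cloud, axis=0)  # same points, different row order

# A symmetric aggregation (here: per-axis max, the kind of pooling used in
# PointNet-style models) is invariant to the ordering of the points ...
assert np.allclose(cloud.max(axis=0), shuffled.max(axis=0))

# ... whereas an order-dependent operation, such as reading off the first
# row, generally gives a different result for the two orderings.
print(np.allclose(cloud[0], shuffled[0]))
```

This is why point-cloud networks aggregate with order-insensitive operations (max, sum, mean) rather than treating the rows as a sequence.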

3D data can also be represented in a regular grid by a matrix that, compared to the 2D regular data, has an additional dimension Z, resulting in an X × Y × Z × K-dimensional matrix. The individual points are then known as voxels, and the process of transforming an irregular point cloud into regular 3D data is known as voxelisation, covered in section 3.5.
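At its core, voxelisation amounts to binning each point into a cube of side length c. A minimal sketch, under the assumption that the cloud is first shifted so its minimum corner sits at the origin (a common convention; the thesis's exact offsets may differ):

```python
import numpy as np

def voxelise(points, c):
    """Map each 3D point to the integer index of its voxel of side c.

    Uses floor(coordinate / c) per axis after shifting the cloud so that
    the minimum corner lies at the origin.
    """
    shifted = points - points.min(axis=0)
    return np.floor(shifted / c).astype(int)

pts = np.array([[0.2, 0.1, 0.0],
                [0.9, 1.4, 0.3],
                [2.0, 0.0, 1.1]])
idx = voxelise(pts, c=1.0)
print(idx)
# Distinct rows of `idx` are the occupied voxels; grouping the points by
# row gives the per-voxel point sets from which voxel features can be
# computed (counts, mean height, etc.).
```

Note that empty voxels simply never appear as an index, which is what makes the dense X × Y × Z × K array far more memory-hungry than the index form for sparse outdoor scenes.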


2.2 Data Sources

2.2.1 Orienteering maps

In order to annotate the input data, a ground truth to be learned by the model is needed. The choice fell on digital orienteering maps, as these are constructed with a high level of detail and accuracy, while at the same time being made to show how easy it is to run through the terrain. Such maps are created by most orienteering clubs using a combination of map-maker expertise and data such as LiDAR scans and aerial images, and as such there is a lot of material that could be used, provided the clubs allow it. The orienteering maps used in this thesis have been provided by Linköpings Orienteringsklubb and were selected using criteria such as a reliable georeference, as well as having been created reasonably close in time to when the laser data was gathered.

Quite a lot of the symbols used are, however, unlikely to be detected by any currently available model fed only with aerial data (significant boulders and trees, fences etc.) and are hence discarded. Further symbols are differentiated by features that don't register on aerial photos or LiDAR scans (passable or impassable water stream; river, lake or sea; road or paved area etc.) and are hence merged.

In cases where two symbols overlap (for instance swampy ground in a forest), the one that is furthest from the ground is represented in the final material, as combinations of all kinds of symbols would lead to too many possible categories, some of which would occur very rarely. An example of an exported map is found in figure 2.1.

Figure 2.1: Example of an orienteering map used in the training. This is the south-western part of the Åkerbo map, slightly north of Linköping, referred to in the data sets as Area 3.

The pre-processing of the maps is done by hand in OpenOrienteeringMapper 0.9.1 [14] and then exported as .png-files with a resolution of 150 dpi. Each pixel of the exported image corresponds to an area of 2.54m ˆ 2.54m. The result is an image with smoothed transitions between colours, which - while pleasing too look at - is a problem when the colours are

(21)

2.2. Data Sources

to represent discrete classes and not continuous values. This is resolved by replacing the value for each pixel with the one that is closest among the true label colours used. Since each pixel is represented as a point in the three dimensional space made up of its values in each colour channel, this is done by calculating euclidean distance from the pixel’s colour to all true colours. After each pixel is assigned one of the true label colours there are still some artefacts on edges between categories. These are resolved by a second pass over the pixels that counts the categories of pixels within a neighbourhood of two steps from the current pixel and ensuring that at least wkneighbours are of the same kind. If not, the counts

of neighbouring pixels’ categories are weighed and the category with the highest weighted count is assigned to the pixel as per equation 2.1. The process is visualised in figure 2.2.

\[
\mathrm{category}(x_i) =
\begin{cases}
\mathrm{category}(x_i) & \text{if } N_k \ge w_k \\
\operatorname{argmax}_k \left( \dfrac{N_k}{w_k} \right) & \text{otherwise}
\end{cases}
\qquad (2.1)
\]
\[
N_k = \sum_{x_j \in \mathcal{N}(x_i)} I(\mathrm{category}(x_j) = k)
\]

Figure 2.2:Correcting artefacts after export of orienteering maps. From left to right: raw export with gradient, all pixels are one of the true colours but artefacts on borders, final result

The weights used are found in table 2.1 and were determined empirically. Colour codes and corresponding categories are found in figure 2.3.

0: roads (#000000), 1: water (#64F0FF), 2: marsh (#1478C8), 3: openground (#96FF8A), 4: building (#FB3299), 5: trail (#E31A28), 6: mediumforest (#FF7F00), 7: forest (#FFFFFF)

Figure 2.3:Colour codes and corresponding labels used in the modified orienteering maps
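As an illustration of the recolouring step, the nearest-colour assignment can be sketched in a few lines of numpy. This is a hypothetical sketch, not the thesis implementation; the palette follows the colour codes of figure 2.3 and the function name is made up.

```python
import numpy as np

# Palette from figure 2.3: category index -> RGB colour of the label.
PALETTE = np.array([
    [0x00, 0x00, 0x00],  # 0: roads        #000000
    [0x64, 0xF0, 0xFF],  # 1: water        #64F0FF
    [0x14, 0x78, 0xC8],  # 2: marsh        #1478C8
    [0x96, 0xFF, 0x8A],  # 3: openground   #96FF8A
    [0xFB, 0x32, 0x99],  # 4: building     #FB3299
    [0xE3, 0x1A, 0x28],  # 5: trail        #E31A28
    [0xFF, 0x7F, 0x00],  # 6: mediumforest #FF7F00
    [0xFF, 0xFF, 0xFF],  # 7: forest       #FFFFFF
], dtype=float)

def quantise(image):
    """Replace every pixel by the index of the Euclidean-nearest palette colour."""
    # Distance from each pixel (H, W, 1, 3) to each palette colour (1, 1, 8, 3).
    dist = np.linalg.norm(image[:, :, None, :] - PALETTE[None, None, :, :], axis=-1)
    return dist.argmin(axis=-1)

# A smoothed near-black pixel and a near-white pixel snap to roads and forest.
img = np.array([[[10, 10, 10], [250, 250, 250]]], dtype=float)
labels = quantise(img)
```

The same distance computation works for the second, artefact-correcting pass, where neighbourhood counts rather than colour distances decide the category.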

The distribution of terrain types in the different maps used can be found in figure 2.4. Note that there are a total of 14 areas created from five main maps, as indicated by the names (with valla being part of the Linköping map). This is done in part since the unprocessed orienteering maps are neither square nor the same size, but also since parts of certain maps were drawn in lower quality or not close in time to the laser data collection used in the analysis. Naturally, the extents of the 14 areas are also used to delimit what parts of the other data sources are used.


Table 2.1: Weights used in the recolouring of the orienteering maps

        roads  water  marsh  openland  building  trail  medforest  forest
weight  2.6    2.6    3.0    7.0       3.6       2.6    7.0        8.0

[Figure 2.4: Distribution of terrain types per area (bar charts of per-category point counts for the 14 areas: akerbo_nv, akerbo_o, akerbo_sv, grytstorp_nv, grytstorp_v, linkoping_s, linkoping_sv, linkoping_vidingsjo, prasttomta_nv, prasttomta_v, sodero_mitt, sodero_nv, sodero_sv and valla).]


2.2.2 Laser data

The main data set is point clouds derived from aerial LiDAR scans, delivered as standardised laz-files with coordinates in SWEREF99-TM format [20]. The LiDAR scans are collected on behalf of Lantmäteriet by a low- and slow-flying airplane and are available at varying resolutions for most of Sweden [10]. A rough cleaning of the data is performed in conjunction with the collection process, but outlier points are still present in the downloaded data. For this reason, only points in [−5, 170] metres above sea level are retained. For the areas used, the resolution is such that a 2.5 × 2.5 km tile consists of roughly eight million data points.

The points are not aligned on any type of grid but are somewhat irregularly spread out according to the scan pattern of the laser beam in the LiDAR. The collection process also has a certain overlap between parallel flight passages, resulting in areas with almost twice the resolution. In the acquired data that overlap is roughly 30 percent, meaning that 40 percent of the data has normal resolution and 60 percent has double resolution in a direction roughly parallel to the Y-axis of the data.

Large bodies of water pose a problem for the LiDAR since water is very efficient at absorbing the frequencies used in airborne LiDAR. This means that lakes and wide rivers will not be fully represented in the data, and waves along shorelines may cause anomalies.

Finally, there can be multiple points on the same planar coordinate, with different height values. This is due to the fact that certain materials allow the laser beam to pass through them and hit another object further away from the LiDAR. Both hits will be registered by the receiver but will have different height values and possibly different intensity values. The intensity values are scaled from [0, 255] to [0, 1] to simplify the implementation of models sensitive to values in different ranges, should such models be wanted or needed. The normalisation is also done in the original implementation of PVCNN.

Table 2.2: Example and specification of variables in laser data

      X                   Y                   Z                     I
      SWEREF99 easting    SWEREF99 northing   Height in metres      Intensity of laser
      coordinate          coordinate          according to RH2000   return, [0, 1]
1     539324.4            6487500             33.71                 0.0902
2     539387              6487500             33.72                 0.0235
3     539379.5            6487500             33.76                 0.2157
...
n-1   537508.5            6490000             102.2                 0.2235
n     537501              6490000             79.15                 0.6784

2.2.3 Surface model

Lantmäteriet also creates and supplies a surface model [19], delivered as standardised laz-files with coordinates in SWEREF99-TM format. The surface model point cloud has a higher resolution than the laser data; a 2.5 × 2.5 km tile consists of roughly 21 million data points.

A surface model is created using two or more aerial images of the same coordinates to create a stereoscopic view of what is visible from the air. It is not true three-dimensional data, but rather 2.5D, since only the top of every object is represented. Due to the process used, some anomalies are present in the data, mostly due to reflections in bodies of water resulting in a higher elevation of the point than it should have. This is thought to have a rather small effect since the data augmentation described in section 2.3 is only done using the XY-coordinates.


Figure 2.5:Example of laser data at an angle of roughly 45 degrees from the ground, coloured by the intensity of the laser return. On the left side is an area of overlap between two flights having twice the resolution of the right hand side. The dent in the upper center of the picture is a lake without any data. Forest canopy generally has lower intensity and thus more blue colour while ground and roofs are more yellow. Source: Laser data NH, © Lantmäteriet [10]

The surface model comes in two varieties: either with conventional red-green-blue colour data, or with near infrared-red-green colours. The data used is the latter, although the variables are still named R, G and B in the actual data. The values for the colours are given on a scale of [0, 65536] and are normalised into [0, 1] by dividing all values by 65536, for the same reason as the intensity normalisation of the laser data.

Figure 2.6: Example of surface model data, channel order RGB (which equals near infrared, red and green colours) and then all channels together visualised as RGB. The general area is the same as in figure 2.5. Objects like roads are clearly visible in the red and green channels, while the infrared channel varies for different types of vegetation. Source: Surface model from aerial images, © Lantmäteriet [19]

Table 2.3: Example and specification of variables in surface model

      X                   Y                   Z                     R               G          B
      SWEREF99 easting    SWEREF99 northing   Height in metres      Near infrared   Red        Green
      coordinate          coordinate          according to RH2000   value           value      value
1     529605.2            6494156             91.68                 0.2952          0.1391     0.1979
2     529605.2            6494156             91.58                 0.2383          0.1128     0.1683
5     529605.2            6494157             91.53                 0.1916          0.0924     0.1459
...
n-1   532098.8            6495630             105.97                0.3657          0.1863     0.2334
n     532098.8            6495630             105.94                0.3696          0.1848     0.2334


2.3 Joining the data sets

Since the points in the data sets do not share the exact same XY-coordinates, augmenting the laser points with information from the surface model and orienteering map cannot be done with a conventional join operation. Instead, a fuzzy join is performed where points are joined if they are reasonably close to each other. While there exist libraries for doing such joins in R, they proved unable to cope with the large data sets used in this thesis. For this reason, the fuzzy join was built specifically for the data used.

For performance reasons, these joins are not done on the full target data. The source data are subset into squares of 120 m × 120 m and the target data into squares of (120 + θ) m × (120 + θ) m, where θ is a distance within which any point in the source data is expected to have at least one point in the target data.

Next, for each source data point, the target data is subset so that only points within θ m of the source point coordinates, both horizontally and vertically, are retained. Within this subset, the distance from the source data point to each target data point is calculated and the R, G and B values from the closest point are added to the source data point. If no points are found within θ m from the source point, the subset is increased to 2θ m and then 3θ m. If there are still no target points found, the source point is not included in the final result. The process is visualised in figure 2.7. Details can be found in the GitHub repo of the thesis.

Figure 2.7: Fuzzy join of source data (green points) and target data (blue squares). In a the data is subset to segments of 120 m. In b the target data is again subset to all points within θ m from a source data point. Finally, in c the distance d to all target data points is calculated and the features from the closest target point (at distance d* away) are added to the source point.
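The expanding-window lookup can be sketched as follows. This is a simplified numpy illustration of the search logic only (the actual implementation is written in R and operates on the 120 m squares described above); all names are hypothetical.

```python
import numpy as np

def fuzzy_join(source_xy, target_xy, target_feats, theta):
    """For each source point, find the nearest target point, searching first
    within theta metres, then 2*theta, then 3*theta. Rows stay NaN where no
    target point was found, mirroring the exclusion of such source points."""
    n, k = source_xy.shape[0], target_feats.shape[1]
    out = np.full((n, k), np.nan)
    for i, (x, y) in enumerate(source_xy):
        for radius in (theta, 2 * theta, 3 * theta):
            # Subset target points to a square window around the source point.
            mask = (np.abs(target_xy[:, 0] - x) <= radius) & \
                   (np.abs(target_xy[:, 1] - y) <= radius)
            if mask.any():
                cand = target_xy[mask]
                d = np.hypot(cand[:, 0] - x, cand[:, 1] - y)
                out[i] = target_feats[mask][d.argmin()]
                break
    return out

src = np.array([[0.0, 0.0], [100.0, 100.0]])
tgt = np.array([[1.0, 0.0], [5.0, 5.0]])
feats = np.array([[10.0], [20.0]])
joined = fuzzy_join(src, tgt, feats, theta=2.0)
```

In the example the first source point picks up the feature of its nearest target, while the second, isolated point stays unmatched even at 3θ.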

2.3.1

Adding information from surface model

The fuzzy join of the laser data and surface model is done by looking up the closest point in the target data and joining these two, retaining the XY-coordinate from the laser data only. As such, the laser data is the source data and the surface model the target data. The laser data is chosen as source data as it is three dimensional. Selecting the surface model as source data would have meant that the distribution of points along the Z-coordinate would have been lost.

The value of θ is set to 2 when joining the laser data and the surface model.

2.3.2

Adding labels from orienteering map

With the result of the join of laser data and surface model as source data, the orienteering maps are next used as target data. Due to the lower resolution of these maps, θ is assigned a value of 5.


2.4 Output format

The result of the data join operation is an augmented version of the laser data point cloud that is split into segments of 120 m × 120 m. The segments are further split into blocks of 20 m × 20 m and the points shuffled and resampled in such a way that all blocks contain the same number of points (which for some blocks means the same point is represented twice). The blocks are also created with a 10 m offset in both X and Y to help ensure that the relative positions of objects in a block are not defining features for the object (i.e. buildings are always found in the centre etc.). The block split routine is the same as is used in [12, 11].
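The block split and resampling described above can be sketched as below. This is a simplified stand-in for the routine from [12, 11], not a transcription of it; the points-per-block count is a hypothetical parameter, as are all names.

```python
import numpy as np

def split_into_blocks(points, block=20.0, offset=10.0, n_points=16, seed=0):
    """Split a segment's point cloud into block x block metre tiles (with the
    grid origin shifted by `offset` in X and Y) and resample every tile to
    exactly n_points shuffled points, sampling with replacement when a tile
    holds fewer points than that."""
    rng = np.random.default_rng(seed)
    # Tile index of every point from its X and Y coordinates.
    ij = np.floor((points[:, :2] - offset) / block).astype(int)
    blocks = []
    for key in {tuple(k) for k in ij}:
        idx = np.where((ij[:, 0] == key[0]) & (ij[:, 1] == key[1]))[0]
        sample = rng.choice(idx, size=n_points, replace=len(idx) < n_points)
        blocks.append(points[rng.permutation(sample)])
    return blocks

# Five identical points (x, y, z, intensity) all fall in one tile and are
# resampled with replacement up to the fixed block size.
pts = np.zeros((5, 4))
blocks = split_into_blocks(pts, n_points=16)
```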

Table 2.4: Example of the output data format

      X       Y        Z       Intensity  R      G      B      Classification
1     539005  6489092  65.29   0.173      0.313  0.176  0.233  7
2     539005  6489406  58.83   0.447      0.484  0.197  0.272  7
3     539005  6489502  62.03   0.718      0.550  0.223  0.292  3
...
n-2   540790  6489636  93.36   0.043      0.397  0.137  0.198  6
n-1   540790  6489670  101.44  0.204      0.519  0.193  0.280  6
n     540790  6489719  106.41  0.000      0.246  0.104  0.166  6

The entire data manipulation pipeline, along with the model and output, is displayed in figure 2.8.

[Figure 2.8 diagram: laser data and surface model enter a fuzzy join (X, Y, Z, intensity plus R, G, B); the O-map (2.5-3.5 km², .png export, 8 colours, neighbourhood majority) supplies categories via a second fuzzy join on X, Y; the result is cropped into 120 m × 120 m segments and 20 m × 20 m blocks, the points are shuffled and passed to PVCNN++, whose pointwise predictions and entropy are gridded into a prediction map and an uncertainty map.]

Figure 2.8: Visualisation of the entire modelling process. Input data are dark blue, output is yellow. The text on arrows denotes what features are sent to the next step. The teal objects are from [12] while the remainder is part of the thesis work.

2.4.1 Structural similarity to S3DIS

The main method of the thesis, described in section 4.2, is built for handling the Stanford Large-Scale 3D Indoor Spaces Dataset, or S3DIS [2] for short. This data set contains point clouds from indoor scans of six main areas of a large office building, further divided into a number of rooms.

In order to mimic the hierarchy used in the S3DIS-data set, the segments created in the data augmentation step are retained and used in place of the rooms level of S3DIS. The 14 areas created from the five different orienteering maps are used in place of the building areas in S3DIS. Using the same data structure as in [12] means that the entire data management procedure doesn’t need to be reinvented. The correlation of traits found in rooms from the same area in S3DIS is also reflected in the fact that segments from the same orienteering map will have similar traits, although they are topographical rather than architectural.


2.4.2 Test and validation data sets

The segmented data is split into three disjoint sets: a training set, a test set and a validation set. This is done by randomly assigning each segment to one of the three sets with probabilities 0.70, 0.15 and 0.15 respectively. Details on the number of segments from each area assigned to each set can be found in figure 2.9, while the proportion of terrain points per split is found in figure 2.10. These splits were empirically chosen so that the test and validation areas do not contain far more segments than the largest training areas and at the same time have roughly the same distribution of terrain types as the training set.

The training set is used to train the PVCNN++ models, while the validation set is used to select the best training iteration. The test set is then used to compare these best iterations between the models and select the best model.
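The random assignment of segments can be sketched as follows; the use of numpy and the fixed seed are illustrative assumptions, not details from the thesis.

```python
import numpy as np

def assign_splits(segment_ids, probs=(0.70, 0.15, 0.15), seed=0):
    """Randomly assign each segment to the training, test or validation set
    with the given probabilities."""
    rng = np.random.default_rng(seed)
    labels = np.array(["train", "test", "valid"])
    draws = rng.choice(labels, size=len(list(segment_ids)), p=probs)
    return dict(zip(segment_ids, draws))

splits = assign_splits(range(1000))
n_train = sum(1 for s in splits.values() if s == "train")
```

With 1000 segments roughly 700 land in the training set, with binomial variation around that count.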


Figure 2.9:Details on number of segments per split and area

[Figure 2.10: Distribution of terrain types per split, with one panel each for the Training, Validation and Test sets showing the proportion of each terrain type; the three splits have very similar distributions.]


3 Theory

This chapter goes through the theoretical concepts used in the thesis.

Although the reader is expected to have some training in statistical concepts, some theory is repeated since it is used as a basis for more advanced methods.

3.1 Multinomial regression and Multilayer perceptrons

3.1.1 Logistic regression

Logistic regression is used to model the probability of a binary variable y taking the value 1 given a vector of explanatory variables x and a corresponding weight vector w. The probability for each individual y is modelled as [9] (p. 571):

\[
P(y = 1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}
\qquad (3.1)
\]

where σ(·) is referred to as the sigmoid function.

The probability of y not being 1 is then, due to the law of total probability:

\[
P(y = 0 \mid \mathbf{w}, \mathbf{x}) = 1 - P(y = 1 \mid \mathbf{w}, \mathbf{x})
\qquad (3.2)
\]

The explanatory variables should contain a 1 to act as an intercept or bias; this is interpreted as the overall difference in probability between the two classes given that all other variables are fixed.

3.1.2 Multinomial logistic regression

Extending logistic regression to allow for a non-binary response with K possible outcomes results in multinomial logistic regression, where there are K sets of weights instead. The probability of each outcome is modelled similarly to logistic regression, with the exception of the sigmoid function being replaced by its multiclass counterpart softmax(·).

\[
P(y = 1 \ldots K \mid \mathbf{w}, \mathbf{x}) = \mathrm{softmax}(\mathbf{w}\mathbf{x})
\qquad (3.3)
\]

where w is a matrix formed from the weight vectors of the K classes:

\[
\mathbf{w} =
\begin{pmatrix}
\mathbf{w}_1^t \\
\mathbf{w}_2^t \\
\vdots \\
\mathbf{w}_K^t
\end{pmatrix}
\]


softmax(·) takes a vector of numbers and transforms it into a probability distribution, i.e. all values are in [0, 1] and their sum is equal to 1, and is formulated as

\[
\mathrm{softmax}(\mathbf{w}\mathbf{x}) =
\begin{pmatrix}
\dfrac{e^{\mathbf{w}_1 \cdot \mathbf{x}}}{\sum_{k=1}^{K} e^{\mathbf{w}_k \cdot \mathbf{x}}} \\
\dfrac{e^{\mathbf{w}_2 \cdot \mathbf{x}}}{\sum_{k=1}^{K} e^{\mathbf{w}_k \cdot \mathbf{x}}} \\
\vdots \\
\dfrac{e^{\mathbf{w}_K \cdot \mathbf{x}}}{\sum_{k=1}^{K} e^{\mathbf{w}_k \cdot \mathbf{x}}}
\end{pmatrix}
=
\begin{pmatrix}
P(y = 1) \\
P(y = 2) \\
\vdots \\
P(y = K)
\end{pmatrix}
\qquad (3.4)
\]

A schematic for both logistic and multinomial logistic regression is found in figure 3.1.


Figure 3.1: Logistic and multinomial regression model for an input of two variables plus an intercept/bias term. The output of logistic regression is the probability of y being class 1, while for the multinomial regression the output is a vector with the probability of y being each of the k classes.
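A direct numpy transcription of equation 3.4 can be sketched as below; the subtraction of the maximum is a standard numerical-stability trick that does not appear in the formula but leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to a probability distribution (equation 3.4).
    Subtracting max(z) avoids overflow in exp without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Scores w_k . x for three hypothetical classes.
p = softmax(np.array([2.0, 1.0, 0.1]))
```

Every entry of `p` lies in (0, 1), the entries sum to one, and the ordering of the scores is preserved.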

3.1.3 Weight estimation

Finding the optimal weights in binary and multinomial logistic regression is done by optimising a loss function, since there is no closed form solution. In the simple case this loss function is the likelihood, which is maximised, but it can be extended by using regularisation on the weights.

If each of the K classes is given its own response variable y_ik for observation i, such that y_ik = 1 if y_i = k and 0 otherwise, then the log-likelihood is [9]:

\[
\log(P(y_1, \ldots, y_n)) =
\sum_{i=1}^{n} \left(
\sum_{k=1}^{K-1} y_{ik} (\mathbf{w}_k \cdot \mathbf{x}_i)
- \log\left(1 + \sum_{k=1}^{K-1} e^{\mathbf{w}_k \cdot \mathbf{x}_i}\right)
\right)
\qquad (3.5)
\]

and the optimal set of weights are those that maximise the log-likelihood.

3.2 Multilayer perceptron - automating feature engineering

Not all relationships between y and x are linear, and although this can be remedied by transforming the inputs into new features in various ways, the number of possible transformations quickly becomes a problem as the dimensionality of x increases. Manually determining the weights of each input variable into the transformed ones is another problem. So what if there was a way of automatically creating new features from the input variables and having them weighted at the same time? Adding a layer between x and y whose inputs are weighted sums of x and whose outputs are non-linear transformations of these sums results in the multilayer perceptron, a.k.a. an MLP or a neural network.


\[
P(y = 1 \ldots K \mid \theta, \mathbf{x}) = \mathrm{softmax}(\mathbf{w}^{(2)} \cdot \mathbf{z})
\qquad (3.6)
\]
\[
= \mathrm{softmax}(\mathbf{w}^{(2)} \cdot g(\mathbf{w}^{(1)} \cdot \mathbf{x}))
\qquad (3.7)
\]

where θ is the set of weights for each layer, {w^(1), w^(2)}, and g(·) is a differentiable non-linear function, commonly referred to as an activation function. The middle layer holding the transformed inputs is commonly referred to as a hidden layer, while the units within the hidden layer are called neurons or simply hidden units. An overview of the MLP structure is found in figure 3.2.

Figure 3.2: Overview of an MLP with one hidden layer. Input is two variables (and a 1 for the bias/intercept), the hidden layer is n units wide with activation function g(·), while the output is k classes. For clarity, the set of weight matrices θ is denoted as w and v.
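A forward pass of equations 3.6 and 3.7 can be sketched in numpy. The layer widths, the tanh activation and the random weights below are arbitrary illustrations, not values from the thesis.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, w1, w2):
    """One-hidden-layer MLP: z = g(w1 @ x) then softmax(w2 @ z),
    with tanh standing in for the activation g."""
    z = np.tanh(w1 @ x)
    return softmax(w2 @ z)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.3])    # input, including the bias term 1
w1 = rng.normal(size=(4, 3))      # 4 hidden units
w2 = rng.normal(size=(2, 4))      # 2 output classes
p = mlp_forward(x, w1, w2)
```

The output is a length-2 probability vector, exactly what the softmax layer of figure 3.2 would emit.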

3.2.1 Deep learning

While an MLP with a single hidden layer can be used to approximate any function f : x → y provided an infinite number of units in the hidden layer, it is usually easier and requires less computational power to add an additional hidden layer instead [5]. Adding more hidden layers is said to increase the depth of the model, creating a deep neural network capable of learning to approximate a function; hence the term deep learning.

The same rules apply to all the hidden layers as to the lone one in the single layer case; each weights and sums up the outputs of the previous layer and applies a non-linear and differentiable function to the sums. A schematic for a deep neural network is found in figure 3.3.

Figure 3.3: Overview of a deep neural network with two input variables (plus a bias/intercept), m hidden layers and k classes. Each of the m hidden layers has n_m units where a non-linear activation function is applied to a weighted sum of the previous layer's output. Neither the number of units nor the activation function needs to be the same between the layers.


3.2.2 MLP as feature engineer

Since the hidden layers of an MLP transform the inputs into new variables, their output can be used as input to other models as well. It does not have to end with a logistic or multinomial regression layer, but can instead be used as input to a clustering algorithm, for instance. It is also possible to run multiple networks in parallel on the same input data and combine their outputs just before the final classification layer. This is the core principle of PVCNN [12], where one MLP is run on the area surrounding a point and another on the point itself, with the two combined at the end to provide both individual and surrounding features to the classifier.

3.2.3 Weight estimation

The weights of an MLP are still estimated by optimising a loss function, just as in logistic regression. Since all functions involved are differentiable, the effect that each weight layer w^(i) has on the loss function L can be found by calculating the partial derivative of L w.r.t. w^(i). This effect is then used to update the weights for each layer, so that the weights of layer i at optimisation iteration j are

\[
\mathbf{w}^{(i,j)} = \mathbf{w}^{(i,j-1)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{w}^{(i,j-1)}}
\qquad (3.8)
\]

where η is a learning rate that determines how large steps are taken towards the optimal weights in each optimisation iteration. Since the loss is propagated backwards through the network, the process of calculating the gradients is known as back propagation, while the whole process of weight updating is known as gradient descent.
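Equation 3.8 can be made concrete with a toy one-parameter example; the quadratic loss and learning rate below are chosen purely for illustration.

```python
def gradient_step(w, grad, eta=0.1):
    """One update of equation 3.8: w <- w - eta * dL/dw."""
    return w - eta * grad

# Minimise the toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for _ in range(100):
    w = gradient_step(w, 2 * (w - 3))
```

Each step shrinks the distance to the minimiser w = 3 by a constant factor, so the iterates converge geometrically.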

3.2.4 Loss functions

The loss function commonly used in classification problems is cross-entropy. Cross-entropy is the sum of the entropy of a distribution P(x) and the Kullback-Leibler distance between that distribution and another distribution Q(x) (assumed to be for the same random variable x). As such it is a measurement of how similar P and Q are. Entropy is a general measurement of how much uncertainty there is in a probability distribution, taking the value 0 for deterministic processes and higher values the more uncertainty there is. For binary random variables the 2-logarithm is normally used, giving a maximum value of 1. For variables taking more than two values the natural logarithm is used, and the maximum entropy for a variable taking eight different values is roughly 2.08 [5, 7].

\[
\text{Entropy:} \quad H(P) = -\mathbb{E}_x[\log(P(x))] = -\sum_{k=1}^{K} P(x_k) \log(P(x_k))
\qquad (3.9)
\]

\[
\text{KL-distance:} \quad D(P \| Q) = \mathbb{E}_x\left[\log \frac{P(x)}{Q(x)}\right] = \sum_{k=1}^{K} P(x_k) \log\frac{P(x_k)}{Q(x_k)}
\qquad (3.10)
\]

\[
\text{Cross entropy:} \quad H(P, Q) = -\mathbb{E}_x[\log Q(x)] = -\sum_{k=1}^{K} P(x_k) \log(Q(x_k))
\qquad (3.11)
\]

With a bit of algebra, well explained in [7], it is easily shown that H(P, Q) = H(P) + D(P||Q).

When used as a loss function, P(x) is the true probability of each class (i.e. a one for the true class and zeros for the rest) and Q(x) is the estimated probabilities for each of the K classes output by a softmax function. P(x) is thus a vector of length K with zeros in all places except for the true class where there is a one, meaning there is no uncertainty about what class x actually belongs to. The goal is then to minimise the distance between the true


distribution P(x) and the distribution Q(x) returned by the model. This is in fact the same thing as using maximum likelihood estimation of the weights, and the two methods arrive at the same solution, again described well in [7].
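For a one-hot P(x), equation 3.11 reduces to the negative log of the probability the model assigns to the true class; a short sketch makes this concrete (the example distributions are made up).

```python
import numpy as np

def cross_entropy(p, q):
    """Equation 3.11: H(P, Q) = -sum_k p_k * log(q_k), using the natural log."""
    return -np.sum(p * np.log(q))

# One-hot true distribution and a softmax-style model prediction.
p = np.array([0.0, 1.0, 0.0])
q = np.array([0.1, 0.8, 0.1])
loss = cross_entropy(p, q)   # = -log(0.8)
```

The loss shrinks towards 0 as the probability on the true class approaches 1, matching the maximum likelihood view.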

Finally, it should be mentioned that the implementations used in this thesis allow for adjusting the loss for each observation depending on the actual class. This reweighting helps overcome the issue of different classes having different numbers of occurrences in the data, something that may occur when trying to differentiate roads from forests. One way of determining the class weights is to use the relative frequencies of each class in the training data:

\[
\text{Class specific loss weight:} \quad
w_k = \frac{\frac{1}{n_k}}{\sum_{i=1}^{K} \frac{1}{n_i}}
\qquad (3.12)
\]

where n_k is the number of observations of class k in the training data.
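Equation 3.12 translates directly into a few lines of numpy; the counts in the example are made up.

```python
import numpy as np

def class_weights(counts):
    """Equation 3.12: weight each class by its inverse frequency,
    normalised so that the weights sum to one."""
    inv = 1.0 / np.asarray(counts, dtype=float)
    return inv / inv.sum()

# A rare class (100 observations) gets four times the weight of a
# class that is four times as common (400 observations).
w = class_weights([100, 400])
```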

3.2.5 Activation functions

In order for the MLP to perform more than a linear transformation, a non-linear function needs to be applied to the weighted sums at each layer. These functions are referred to as the activation function of the layer. The sigmoid function mentioned in section 3.1.1 can be used as activation function, however its derivative gets closer to zero the further from zero the input is. This in turn can cause the back propagation to fail, since it uses the derivatives in the chain rule, an issue known as vanishing gradient.

Another commonly used activation function is the Leaky Rectified Linear function or leaky ReL for short. Leaky ReL is defined as:

\[
\text{Leaky ReL}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{otherwise}
\end{cases}
\qquad \alpha \neq 1
\qquad (3.13)
\]

Special cases are α = 0, which gives the Rectified Linear function or ReL, and α = −1, which returns the absolute value of x. The fact that leaky ReL has a simple derivative (either 1 or α) means that the hidden unit for which it is computed can be considered either to be on or off. ReL is also faster to compute than for instance the sigmoid, since it does not involve the exponential function. A hidden unit using ReL as activation function is commonly referred to as a ReLU, or Rectified Linear Unit [5].
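Equation 3.13 is a one-liner in numpy; the α = 0.01 default below is a common but here purely illustrative choice.

```python
import numpy as np

def leaky_rel(x, alpha=0.01):
    """Equation 3.13: x where x > 0, alpha * x otherwise.
    alpha = 0 recovers ReL; alpha = -1 gives |x|."""
    return np.where(x > 0, x, alpha * x)

# Negative inputs are scaled down instead of being zeroed out entirely.
y = leaky_rel(np.array([-2.0, 3.0]))
```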

3.2.6 Batch normalisation

Normalisation takes a set of observations in k variables and transforms each of them so that the i-th observation's value for variable k is:

\[
\hat{x}_{i,k} = \frac{x_{i,k} - \mu_k}{\sigma_k}
\qquad (3.14)
\]

where μ_k and σ_k are the mean and standard deviation of the k-th variable for all

observations. The output of the batch normalisation undergoes a second transformation so that the actual output is y_{i,k} = γx̂_{i,k} + β, where γ and β are layer-specific parameters learned during training (thus common to all batches). This is done so that the activation function following the batch normalisation layer does not have to act on values constrained to a small portion of the real domain. With the appropriate γ and β, the entire batch normalisation layer can actually be equivalent to the identity function.

If an MLP is trained in batches of observations, normalising each batch is referred to as batch normalisation. Since the batch is a random subset of all observations, the mean and standard deviation will vary randomly between batches. This introduces noise to the model and forces a more robust learning. The main advantage of batch normalisation is however


that even though it adds parameters that the network has to learn, it speeds up the learning by decreasing the number of iterations needed for the network to converge [5].
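The two-step transform, equation 3.14 followed by the learned scale and shift, can be sketched as below. This is a minimal numpy illustration with γ and β as fixed values rather than learned parameters, and with a small eps added to the denominator as a hypothetical numerical safeguard.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each column of a batch to zero mean and unit standard
    deviation (equation 3.14), then apply the scale gamma and shift beta."""
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return gamma * (x - mu) / (sigma + eps) + beta

# Two observations of two variables on very different scales.
x = np.array([[1.0, 10.0], [3.0, 30.0]])
y = batch_norm(x)
```

After the transform every column is centred on zero with roughly unit spread, regardless of the original scale of the variable.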

3.2.7 Dropout

A dropout layer takes a vector of inputs and randomly multiplies each input by either zero or one. The probability of a zero is referred to as the dropout rate. Dropout layers effectively turn the MLP into an ensemble model, since each observation goes through a slightly different network of units.

Dropout is normally only used during the training phase and turned off once the network is evaluated. It can however be used in the prediction phase as well, and if all samples are sent through the network multiple times this results in a distribution of predictions for each sample. As such the network can be used as a Bayesian approximation, as proved by Gal and Ghahramani in [4].
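Monte Carlo dropout at prediction time can be sketched with a toy example; the "network" below is just a sum of its inputs, standing in for repeated stochastic forward passes through a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5):
    """Randomly zero each input with probability `rate`."""
    return x * (rng.random(x.shape) >= rate)

# Repeated stochastic passes over the same input yield a distribution
# of predictions, whose spread can be read as uncertainty.
x = np.ones(100)
preds = np.array([dropout(x).sum() for _ in range(200)])
mean, spread = preds.mean(), preds.std()
```

With a dropout rate of 0.5 the passes centre on half the deterministic sum, and the standard deviation of the 200 passes quantifies the prediction uncertainty.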

3.2.8 Skip link concatenation

Skip link concatenation is a way of using information from several layers away in the network. For example, the coordinates of all points before a downsampling layer are needed so that an upsampling layer can recreate the original points and add the information from after the downsampling to them. For a network G(·) taking x as inputs and generating output y, a skip link from the input to the output layer would result in the input features being concatenated with the features learned in G(·):

\[
y = (\mathcal{G}(x), x)
\qquad (3.15)
\]

3.3 Convolution

The previous models have worked under the assumption that the observations are independent of one another. This assumption can sometimes be violated without any major issues, but since the lack of independence implies that one observation holds information relevant to the next, it is often useful, if not necessary, to take the dependence into account. One method of doing this is convolution.

Convolution is a mathematical operation on two functions that in discrete space can be formulated as:

\[
y_i = \sum_{x_j \in \mathcal{N}_d(x_i)} \mathcal{K}(c_i, c_j) \times \mathcal{F}(x_j)
\qquad (3.16)
\]

where the set of observations {x} = {(c_i, f_i)}, with c_i being the coordinates and f_i the features associated with the i-th observation. F(·) is a vector-valued function outputting a range of values from its inputs.

The convolution is iteratively centred on each of the n inputs, indexing its neighbours in N_d(x_i) and convolving their associated features F(x_j) with the kernel K(c_i, c_j). The kernel K(·) can be any function that returns a value that can be used to weigh the observations in the neighbourhood. For geometric data it is normally calculated using the coordinates only, but in principle it may also calculate weights using other features. If a stride s > 1 is used on a regular input, the convolution does not centre on every input, but on every s-th input.

The result is a new data point y_i with the outputs of F(·) weighed and summed, including a set of coordinates.
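Equation 3.16 can be sketched for point clouds with a naive O(n²) neighbour search; the Gaussian kernel and the identity feature function below are hypothetical choices for illustration.

```python
import numpy as np

def point_conv(coords, feats, d=1.5):
    """Equation 3.16 sketch: for each point, sum the features of its
    neighbours within distance d, weighted by a Gaussian kernel on the
    coordinate distance (the kernel K(c_i, c_j) here is an example)."""
    out = np.zeros_like(feats)
    for i, ci in enumerate(coords):
        dist = np.linalg.norm(coords - ci, axis=1)
        nbrs = dist <= d                        # neighbourhood N_d(x_i)
        w = np.exp(-dist[nbrs] ** 2)            # kernel weights
        out[i] = (w[:, None] * feats[nbrs]).sum(axis=0)
    return out

# Two points too far apart to be neighbours: each only convolves itself.
coords = np.array([[0.0, 0.0], [10.0, 10.0]])
feats = np.array([[1.0], [2.0]])
out = point_conv(coords, feats)
```

The loop over all points and the per-point distance computation is exactly the expensive part that voxel-based convolution avoids, as discussed in section 3.3.2.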


3.3.1 Convolutional neural networks

A convolutional neural network is an MLP where at least one layer employs convolutions instead of the normal weight matrix multiplication. The output of such layers can be either the same size as their inputs (but with feature values calculated from neighbouring points, thus smoothing out the values of the inputs), smaller than the input (downsampling the input feature values) or larger than the input (interpolating the feature values). Changing the dimensionality involves a stride parameter s that determines the increase or decrease in dimensionality, a padding parameter p denoting the number of units outside the input that are filled with zeroes (so that the convolution actually starts at the first data point and not further in due to the kernel size), and the size of the kernel k.

\[
\|\text{output}\| = \left\lfloor \frac{\|\text{input}\| + 2p - k}{s} \right\rfloor + 1
\qquad (3.17)
\]

Convolution with same-size in- and outputs can also be seen as a way to increase the perceptive field of a point, adding information about its surroundings. The same is true when the output is smaller than the input, as one unit of output then represents the information contained in a larger number of input units.

Depending on the kernel, dimensionality reduction may also be applied to the feature dimensions of the input, for instance by reducing channels representing red, green and blue values into a single value.
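Equation 3.17 translates into a one-line function; with p = 0 the same relation also describes pooling layers. The example values below are illustrative.

```python
def conv_output_size(n, k, s=1, p=0):
    """Equation 3.17: floor((n + 2p - k) / s) + 1 for an input of n units,
    kernel size k, stride s and padding p."""
    return (n + 2 * p - k) // s + 1

conv_output_size(28, k=3, s=1, p=1)   # 'same' convolution keeps the size
conv_output_size(28, k=2, s=2)        # stride-2 downsampling halves it
```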

3.3.2 Convolution in three dimensions

Convolution in three dimensions can be done in a couple of different ways;

• 2D convolution uses a kernel that along one axis has the same dimension as the input of that dimension. This results in a flattened image (i.e. projecting 3D to 2D).

• 3D convolution with stride = 1 results in data that is still in three dimensions but where each data point now has an increased field of perception. This is the only type of 3D convolution used in this thesis.

• 3D convolution with stride > 1 results in data in three dimensions but with a decreased resolution; one unit of output represents multiple units of input.

When working with unordered point clouds, establishing the neighbourhood N(x_k) of a point requires some form of nearest-neighbour calculation, which is a computationally expensive operation. Furthermore, the computation of the kernel cannot be done in an efficient manner, since the relative position of each neighbour is unpredictable, with points spread out in ℝ³.

In voxel-based convolution, the neighbours are only found at an offset of {−1, 0, 1} on each axis. This means that there is a known number of neighbours with known positions, and collecting their features requires only a single scan of the data.
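The fixed offset set makes neighbourhood lookup trivial. The sketch below (a hypothetical helper, with the grid stored as a dictionary keyed by voxel coordinates) enumerates the 27 offsets in {−1, 0, 1}³ directly, with no nearest-neighbour search:

```python
from itertools import product

def voxel_neighbours(grid, x, y, z):
    """Collect features of the 3x3x3 neighbourhood around voxel (x, y, z).

    `grid` maps (x, y, z) -> feature; empty voxels are simply skipped.
    The offsets are the fixed set {-1, 0, 1}^3, so each lookup is O(1).
    """
    feats = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        key = (x + dx, y + dy, z + dz)
        if key in grid:
            feats.append(grid[key])
    return feats

grid = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (5, 5, 5): 9.0}
print(voxel_neighbours(grid, 0, 0, 0))  # [1.0, 2.0] — the distant voxel is ignored
```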

3.3.3

Pooling

Another type of layer used in convolutional networks is the so-called pooling layer. Pooling layers are normally used to downsample their inputs and do not use any padding, resulting in the output-to-input size relation

||output|| = ⌊(||input|| − k) / s⌋ + 1    (3.18)
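A one-dimensional max-pooling sketch makes the size relation in (3.18) concrete (illustrative code, not tied to any particular library):

```python
def max_pool_1d(values, k=2, s=2):
    """Max pooling without padding; output length follows eq. (3.18)."""
    n_out = (len(values) - k) // s + 1
    return [max(values[i * s : i * s + k]) for i in range(n_out)]

print(max_pool_1d([1, 3, 2, 5, 4, 4], k=2, s=2))  # [3, 5, 4]
```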



3.4

Performance metrics

Evaluating a number of models requires metrics with which they can be compared. For classification problems, the models are evaluated by comparing the actual classifications of the observations to those predicted by the model. Tabulating these in a two-way table with predictions in the rows and actual classes in the columns results in a confusion matrix, which for each class can be reduced to four types of values:

• True positive (TP): observation is positive and predicted as such

• True negative (TN): observation is not positive and not predicted as such

• False positive (FP): observation is not positive but predicted as such

• False negative (FN): observation is positive but not predicted as such

An example is shown in table 3.1.

Table 3.1: Example of a confusion matrix. For the road class, TP is 120, TN is 191+6+12+384 = 593, FP is 12+19 = 31 and FN is 21+17 = 38. Thus accuracy for the road class is (120+593)/782 ≈ 0.91 and IoU is 120/(120+31+38) ≈ 0.63. Overall accuracy is (120+191+384)/782 ≈ 0.89.

                       Actual
               Road   Water   Forest
Prediction
    Road        120      12       19
    Water        21     191        6
    Forest       17      12      384

3.4.1

Accuracy

Accuracy is calculated as the proportion of observations on the diagonal of the confusion matrix:

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (3.19)

3.4.2

Intersect over union

Intersect over union, abbreviated IoU, is commonly used in segmentation problems and calculates the proportion of overlap between prediction and actual classification relative to their sum, or in other words, their intersection over their union. IoU is equivalent to the Jaccard index.

IoU = TP / (TP + FP + FN)    (3.20)

IoU is better suited to problems where there is an imbalance between classes. It is related to the F1 (Dice) score, but gives less weight to true positives.
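Both metrics can be read straight off a confusion matrix. The snippet below recomputes the road-class IoU and the overall accuracy for the matrix in table 3.1 (rows are predictions, columns are actual classes):

```python
# Confusion matrix from table 3.1 (rows = predictions, columns = actual)
cm = [
    [120, 12, 19],   # predicted road
    [21, 191, 6],    # predicted water
    [17, 12, 384],   # predicted forest
]

def overall_accuracy(cm):
    """Proportion of observations on the diagonal, eq. (3.19)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def iou(cm, c):
    """Intersect over union for class c, eq. (3.20)."""
    tp = cm[c][c]
    fp = sum(cm[c]) - tp                 # predicted c, actually another class
    fn = sum(row[c] for row in cm) - tp  # actually c, predicted another class
    return tp / (tp + fp + fn)

print(round(iou(cm, 0), 2))            # 0.63
print(round(overall_accuracy(cm), 2))  # 0.89
```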

3.5

Voxelisation

Conversion of an unordered point cloud to a regular grid is known as voxelisation. Using the notation of spatial data for observation i, (c_i, f_i), with c_i = (c_{x,i}, c_{y,i}, c_{z,i}) being the
