
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2016

Extraction of DTM from satellite images using neural networks

Gustav Tapper
LiTH-ISY-EX--16/5017--SE

Supervisor: Gustav Häger, ISY, Linköpings universitet; Pelle Carlbom, Vricon Systems
Examiner: Klas Nordberg, ISY, Linköpings universitet

Division of Automatic Control, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden

Copyright © 2016 Gustav Tapper


Abstract

This thesis presents a way to generate a Digital Terrain Model (DTM) from a Digital Surface Model (DSM) and multispectral images (including the Near Infrared (NIR) color band). An Artificial Neural Network (ANN) is used to pre-classify the DSM and the multispectral images. This classification is in turn used to filter the DSM into a DTM.

The use of an ANN as a classifier provided good results, and the addition of the NIR color band improved the accuracy of the classifier.

Using the classifier, a DTM was easily extracted without removing natural edges or height variations in forests and cities. These challenges are handled considerably better than by earlier methods.


Acknowledgments

I would first like to thank my supervisor, Pelle Carlbom, at Vricon. Pelle gave me the opportunity to ask questions and discuss problems as soon as they appeared. Being able to have a continuously open dialogue about the work helped the workflow, and I really appreciated that.

Secondly, I would like to thank all other employees of Vricon, who showed great interest in my work. In particular, I would like to thank Leif Haglund, who gave me the opportunity to work on something meaningful and relevant to their current work.

I would also like to acknowledge the examiner, Klas Nordberg, who shared information generously and answered questions quickly.

Last but not least, I would like to show my gratitude to my girlfriend who supported me throughout the whole work.

Linköping, August 2016 Gustav Tapper


Contents

Abbreviations

1 Introduction
1.1 Motivation
1.2 Aim and goals
1.3 Problem formulation
1.4 Limitations

2 Theory
2.1 Related work
2.1.1 Filter methods
2.1.2 Classification methods
2.2 Vegetation index
2.3 Artificial neural network
2.3.1 Theory

3 Method
3.1 System overview
3.2 Feature extraction
3.2.1 Height features
3.2.2 Color and texture features
3.2.3 Vegetation and NIR features
3.3 Classification
3.3.1 Training and validation data
3.4 DTM generation
3.4.1 Ground-point filter
3.4.2 Man-made filter
3.4.3 Vegetation filter
3.4.4 Post processing
3.5 Data
3.5.1 Additional data
3.6 Parameter optimization and evaluation

4 Results
4.1 Classification
4.1.1 Feature evaluation
4.1.2 Learning evaluation
4.1.3 Final parameters
4.2 DTM
4.2.1 Parameter optimization
4.2.2 Evaluation of the DTM result

5 Discussion
5.1 Classification
5.2 DTM generation
5.2.1 Newark reference DTM evaluation
5.3 Overall limitations and errors

6 Conclusions
6.1 Further work

A Appendix
A.1 Window size test results

Abbreviations

Abbreviation   Full text
2D             Two-dimensional
3D             Three-dimensional
ANN            Artificial Neural Network
AOI            Area Of Interest
ASM            Angular Second Moment
DSM            Digital Surface Model
DTM            Digital Terrain Model
EVI            Enhanced Vegetation Index
GLCM           Gray Level Co-Occurrence Matrix
MSE            Mean Squared Error
NDVI           Normalized Difference Vegetation Index
NIR            Near Infra-Red wavelength band
PCA            Principal Component Analysis
SGD            Stochastic Gradient Descent
VARI           Visible Atmospherically Resistant Index

1 Introduction

Two, or more, images of the same object can be used to reconstruct a three-dimensional (3D) object. Utilizing satellite images for the reconstruction makes it possible to reconstruct the Earth at global scale. Being able to visualize the world in 3D has many uses, such as infrastructure planning, military operations and scientific analysis. The 3D model, generated using multiple satellite images, describes the surface of the world as seen by the onboard satellite cameras. This surface model includes all the objects on the ground, such as buildings, cars, trees, bridges and bushes. In many cases, these objects need to be excluded, replaced with the ground level, and modeled as separate objects. A model which replaces these objects, or all objects which do not belong to the surface, with the ground level is a 3D model describing the terrain. One example where only the terrain needs to be considered is flooding analysis, where the analysts are only interested in what would stop the water from spreading. Since the surface model describes the surface seen by the onboard camera, the water would not be able to spread under trees or buildings; the trees and buildings would act as barriers in the model and therefore have to be replaced with the ground level. Another application is the calculation of height curves for map generation. Being able to model the objects above the terrain as separate objects is necessary for example in flight simulators, where modeling these objects separately makes it possible to create visual effects when interacting with them.

Digital Surface Models (DSM) are commonly used to describe the 3D surface based on aerial images. The DSM is a 2D height map which includes the objects on the ground. The terrain model, which excludes objects from the DSM, is called a Digital Terrain Model (DTM) and describes the ground level. The DTM mainly excludes buildings, other man-made objects, trees and other vegetation. The terrain model keeps the data points which belong to the ground and the bare earth; in other words, it excludes points which belong to non-ground objects and replaces them with the ground level. An example of a 3D model, DSM and DTM is shown in Figure 1.1. Since analysis of terrain today is done at a global scale, automatic methods are required for reducing the height of objects above ground in the DSM in order to calculate an accurate and useful DTM.

Figure 1.1: Example of a 3D model, DSM and DTM of the Linköping University area: (a) 3D model (DSM with texture), (b) DSM, (c) DTM.

A DTM filter is a filter which reduces the height of the DSM surface to create a DTM. Most DTM filters used today only consider the DSM. A common flaw in these filters is that large objects are preserved. Forests are one example, where the filters flatten the surface of the trees at the edge of the forest but only flatten the tree tops further in; this can be seen in Figure 1.2. The problem is similar for large storage units and other buildings that are larger than the filter size. Another common flaw is cutting off cliff edges and mountain tops, which are detected as objects instead of terrain. Human interference, either by editing the surface by hand or by manually setting parameters, is today most often necessary to obtain a good DTM. The existing methods have different approaches to classifying which points in the DSM are ground points and which are not. In this work, a method using classification of different objects was developed to obtain information on how the surface should be filtered. An Artificial Neural Network (ANN) was implemented as the classifier and the result was used to create a state-of-the-art DTM.

The Normalized Difference Vegetation Index (NDVI), which is further explained in Section 2.2, can be calculated from a Near-Infrared channel (NIR). NIR is the invisible color spectrum with wavelengths between 700 nm and 1 mm (visible wavelengths end at 700 nm). The vegetation index, NDVI, is commonly used to obtain information on the density of vegetation within an area and can be used to segment vegetated areas. The NDVI and NIR images are, however, a storage burden. Since multiple RGB images are saved for every object or area in order to create the 3D model, the NIR images would increase the total storage significantly: the storage size would grow to roughly 4/3 of that of storing only the RGB images (RGB images contain three color bands and NIR images contain one). Also, NIR images are not always captured by the onboard satellite camera. Since the storage and the accessibility of the images are a problem, the use of NIR images and NDVI was evaluated.


Figure 1.2: Example of a DTM in a dense forest: (a) DSM with texture, (b) DSM without texture, (c) DTM without texture. As can be seen in (c), the trees at the edge of the forest are flattened. Further in, artifacts of the forest are still present due to the flawed filtering.

Vricon provides a 3D model of the earth. This 3D model contains a fully textured DSM. This master thesis was carried out at Vricon and the algorithm development was done to improve the existing DTM algorithms.

1.1 Motivation

Natural disasters are more frequent than ever and analysis suggests that they will become more common in the future [6][11]. Flooding is one natural disaster which affects tens of millions of people every year. A report from the Organization for Economic Cooperation and Development found that coastal flooding costs $3 trillion in damages worldwide every year [11]. Being able to predict flooding and the spreading of the water is therefore a challenge which could save many lives. Hence, access to a 3D model which describes the ground level is of interest. By analyzing the ground level, a model of how the water will spread could be calculated. This would give a more accurate result than using a DSM, since the water in the model would be able to spread regardless of whether trees or buildings are in the way. Therefore, the DTM is necessary for this kind of analysis.

Since the human impact on the environment has become global, the analysis needs to be done globally as well. Creating a DTM of the world by manually editing the DSM would be too time consuming. Therefore, an automatic method to create a DTM from a DSM is needed.

1.2 Aim and goals

The aim of this master thesis project is to develop a state-of-the-art DTM filter using satellite images as input. The resulting filter should be general enough to be useful in a wide range of areas without human fine-tuning for the specific region. To achieve this, the use of an ANN as classifier is investigated. The classification step should improve the filter and make it more suitable for various terrain types, since information about the terrain is extracted. Since the storage and accessibility of NIR images can be a problem, an investigation of how much NDVI and NIR as input data affect the result is also carried out. The result of the work should be an improvement compared to commonly used DTM filters. In this work the resulting method is compared with robust hierarchical interpolation.

1.3 Problem formulation

This report will answer the following questions:

• Is an ANN more useful for classification of ground points and non-ground objects than the methods used in DTM generation today?

• Do NIR and NDVI contain useful information for the classification problem?

• Is the DTM from the method developed in this project better than that of other DTM generation methods used today?

1.4 Limitations

This work was carried out only using data provided by Vricon. Additionally, the problem formulation stated that an ANN had to be used to classify at least ground points and non-ground objects, and that the problem should be solved using RGB images, NIR images and a DSM.

2 Theory

This chapter contains useful background information, related work and theory behind the methods used in this work.

2.1 Related work

This section describes related work on both DSM-to-DTM filters and classification methods for aerial images. It is divided into two parts, one concerning DTM filters and one concerning classification of aerial images.

2.1.1 Filter methods

Perko et al. [12] present a way to generate a DTM by only using a DSM, which in turn was generated from airborne LiDAR. The proposed method extracted slopes using 2D Gaussian filters with different directions. The filtered surface, from each filter direction, was subtracted from the original DSM. This filtering technique was used to extract the slopes from the DSM in order to classify which points were on the ground and which were not. A point was labeled as a ground point if it was within a certain distance from the filtered surface. The slope extraction and classification were applied in eight different directions, each of which voted for the class of each point. If more than five directions considered the point as non-ground, the point was considered a non-ground point. This resulted in a good DTM for all objects smaller than the Gaussian filter and the used threshold. Larger objects, like forests, could not be flattened in a reliable way.

Bretar et al. [3] propose a method using NDVI to extract forests and vegetation from the DSM in a robust way. The results from the proposed method show that the use of NIR, to calculate NDVI, was useful when creating masks for vegetation. Low vegetation, like grass, was excluded from the masks by assuming that any vegetation needs to be higher than one meter compared to a surrounding neighborhood.

C. Bauerhansl [2] uses a hierarchical robust linear interpolation to generate the DTM. This was done by selecting only the lowest point in a neighborhood and assuming that it was a ground point. The other points were weighted by linearly predicting a confidence for the specific point being non-ground. The confidence, or weight, was used to interpolate a new surface. Any point which departed from the intermediate surface by more than a certain distance was removed. By doing this for successively smaller neighborhoods, a DTM was obtained. This proved to be a robust method in urban areas, but showed problems for larger forests and vegetated areas.

D. Tóvári [15] uses a region-growing method to segment areas which belong to the same linear plane. By comparing the differences between the normal vectors of all neighboring points, points were grouped together and considered to belong to the same surface. If the difference of the normal vectors and the distance to a candidate point were within certain limits, the candidate was added to the segment. The segments were used to calculate weights for all points by computing the signed distance to a pre-estimated surface. The computed weights were used to interpolate the DSM to obtain the resulting DTM. This method gave a good result for points along edges, since these points only depended on the points within the same segment. Segments which were considered as non-ground could have zero weight and were thereby eliminated.

E. Elmqvist [5] managed to use active shapes for calculating a DTM. By calculating internal and external forces from the DSM, the active surface was modeled to obtain a DTM. These forces made the surface stick to the contour, which in this case corresponds to the ground surface. The result proved to be good for hills and slopes but failed in areas with large forests.

2.1.2 Classification methods

In aerial image classification, neural networks are a leading approach. For most machine learning approaches, finding appropriate training data with labeled ground truth is a major issue. It is also necessary to extract features which are sufficient to separate the classes. [14]

A.P. Dal Poz [13] evaluates different features as input to a neural network for classifying aerial images. The evaluated inputs were different combinations of RGB, a locally height-reduced DSM and a filtered intensity image. The results showed that RGB data alone was not enough as input for the classifier and that the combination of all the features gave the best accuracy, which was fairly good at about 90%. They showed that the height difference improved the classification.

Volodymyr Mnih and Geoffrey E. Hinton [9] use Principal Component Analysis (PCA) on RGB images as input to a neural network to classify roads in aerial images. PCA is a method to describe the information within a dataset optimally, by using the components with the highest variance and no covariance. This approach reduces the dimensionality of the data while retaining most of the important structure within an image. The result of the classification proved to be at least 10% better than the compared road detectors for all test areas.

Another road detector was developed by M. Mokhtarzade et al. [10], who use texture features. The features were extracted using gray level co-occurrence matrices. Features like energy and homogeneity of the texture proved to contain useful information for classification. The classifier reached above 90% correctly classified pixels.

There are also classification methods which are not based on an ANN. One was presented by M. Bandyopadhyay et al. [1]. They use hard classification, separating classes with thresholds, to create a building and vegetation segmentation. A region-growing method based on flatness and height thresholding was used to create a building mask. Like Bretar [3], they used NDVI to create a vegetation mask. This resulted in a classifier with about 90% overall accuracy for the vegetation mask.

2.2 Vegetation index

M. Bandyopadhyay et al. [1] present a way to calculate NDVI using the NIR wavelength band. NIR wavelengths are reflected by the chlorophyll within leaves and grass and can be measured by an IR camera. By utilizing the red channel in a regular RGB image, the NDVI can be calculated as a normalized difference between the channels as expressed in Equation (2.1).

$$NDVI = \frac{NIR - RED}{NIR + RED} \tag{2.1}$$

In Equation (2.1), RED and NIR are the intensities of the respective color bands. [1]

2.3 Artificial neural network

An ANN is a machine learning method which can be used to classify data of any kind. The neural network is inspired by a biological system, such as a brain. An ANN uses nodes to describe neurons and weights to mimic the axons within human nerve cells. In the artificial case, a node receives a signal as a weighted sum of all its input data. The strength of the signal activates the node through an activation function that generates the output of the node. The output of the node is multiplied by a weight to calculate the input to the next neuron. A schematic view of a neuron can be seen in Figure 2.2 and is further explained in Section 2.3.1.

The ANN is fed with all the acquired and extracted data from the object that is to be classified. The input data is assigned to one node per extracted input feature; therefore, the input layer must have as many nodes as the number of extracted features (and commonly one extra node as bias input, usually set to 1). The number of output nodes is equal to the number of classes into which the data is to be labeled, in this case buildings, vegetation, bare earth and so on. A generalized example of an ANN can be seen in Figure 2.1.


Figure 2.1: Schematic example of an artificial neural network with three input nodes, two output nodes and four hidden neurons. A hidden node is a node which is not within the output or input layer. This type of ANN is called a three-layer ANN. The input nodes contain raw input which is multiplied with the weights (arrows) and forwarded to the next nodes. Depending on the input to a hidden node, a new signal is transmitted to the output nodes.

2.3.1 Theory

The number of input nodes of the ANN is chosen by the user and depends on how many features are extracted from the data. The input features are normalized into the interval [−1, 1]. The nodes within the hidden layer of the ANN receive the sum of the input nodes multiplied with the specific weights for the particular node (arrows in Figure 2.1). The nodes use an activation function, which often is a sigmoid function, Equation (2.3).

$$x_{tot} = \sum_i \omega_{ij} x_i \tag{2.2}$$

$$S(x_{tot}) = \frac{1}{1 + e^{-x_{tot}}} \tag{2.3}$$

In Equation (2.2), $\omega_{ij}$ is the weight from node $i$ to node $j$ and $x_i$ is the input from node $i$, where $j$ is the index of the node of interest. The output of a neuron, as can be seen in Equation (2.3), is a function of the sum of the inputs $x_i$ (Equation (2.2)) and is schematically shown in Figure 2.2.
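To make Equations (2.2) and (2.3) concrete, here is a minimal sketch of the forward pass of one fully connected layer. The weight-matrix layout and the appended bias input are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def sigmoid(x_tot: np.ndarray) -> np.ndarray:
    """Activation function, Equation (2.3)."""
    return 1.0 / (1.0 + np.exp(-x_tot))

def layer_forward(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One layer: the weighted sum of Equation (2.2) followed by the sigmoid.
    A bias input fixed to 1 is appended, as described above; weights has
    shape (number of nodes j, number of inputs i + 1)."""
    x_with_bias = np.append(x, 1.0)
    x_tot = weights @ x_with_bias  # one weighted sum per node j
    return sigmoid(x_tot)
```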

Backpropagation and momentum

Backpropagation is an algorithm for optimizing the weights within an ANN to obtain a given output from a given input. The backpropagation algorithm adjusts the weights within the ANN proportionally to the error between the output of the ANN and the correct result. The calculated errors are propagated backwards through the layers of the ANN in order to update the weights to achieve the correct output. [14]

Figure 2.2: Schematic view of the output of the j:th node in a neural network. The output is computed as described in Equations (2.2) and (2.3). The node output is the output of the activation function. The input to the activation function is the sum of the weighted inputs, $x_{tot}$.

Backpropagation is the most common technique for training, and a useful tool for the learning process. The main goal for the backpropagation algorithm is to solve the credit assignment problem by iteratively adjusting the weights to minimize the error. Here the credit assignment problem refers to finding the optimal set of weights to solve a specific problem and minimize the output error. [14]

One way to calculate the weight adjustment during the learning process is to use stochastic gradient descent (SGD). The gradient descent method computes how much the weights should be adjusted corresponding to the error. The error can for example be computed using the Mean Squared Error (MSE), Equation (2.5). The adjustment of the weights is propagated backwards using the backpropagation algorithm. SGD uses the gradient computed for each training example as a noisy estimate of the average gradient. This procedure is repeated for every training sample several times, until the measured output error no longer decreases. [8]

The backpropagation algorithm uses a given learning rate and momentum. The learning rate specifies how much of the error from a given input is used to adjust the weights, in other words, how much the ANN learns from each training sample. The momentum specifies how much of the previous adjustment is reused when updating the weights for the current training sample. The momentum describes how much the learning will accelerate or decelerate and is used to avoid local minima in the credit assignment problem. The learning rate is denoted $\eta$ and the momentum $\alpha$ in Equation (2.4). The $\delta_j$ in Equation (2.4) is derived from the output error utilizing the chain rule in Equation (2.6) and is in [14] referred to as the propagated error.

$$\Delta\omega_{ij}(n) = \eta \delta_j o_{ij} + \alpha \Delta\omega_{ij}(n-1) \tag{2.4}$$

In Equation (2.4), $o_{ij}$ is the output from neuron $i$ fed to node $j$.

To compute how much the output differs from the target output, the MSE may be used. The MSE is calculated as

$$\epsilon = \frac{1}{2} \sum_{i=1}^{N} (t_i - o_i)^2, \tag{2.5}$$

where $t_i$ is the target output and $o_i$ is the actual output of output node $i$. $N$ is the number of output nodes, which is equal to the number of classes used. [14] The derivatives of the error are used by the backpropagation algorithm to adjust the weights. The derivative of the error with respect to the given input can be expanded as in Equation (2.6). [8][14]

$$\frac{\partial E}{\partial y_{input}} = \frac{\partial E}{\partial y_{output}} \frac{\partial y_{output}}{\partial y_{hidden}} \frac{\partial y_{hidden}}{\partial y_{input}} \tag{2.6}$$

In Equation (2.6), $\partial E / \partial y_{output} = y_{output} - t$, where $t$ is the target, and $y_n$ is the output of the specified layer $n$. This expansion of the derivatives of the error is used to propagate the error backwards through the layers in the backpropagation algorithm. [8][14]
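A sketch of the weight update of Equation (2.4) with momentum follows; the array shapes and the default values for $\eta$ and $\alpha$ are assumptions for illustration, not the values used in this work (those are evaluated in Chapter 4).

```python
import numpy as np

def update_weights(weights, delta, outputs, prev_update, eta=0.01, alpha=0.5):
    """Apply Equation (2.4): each weight change is eta * delta_j * o_ij plus
    alpha times the previous change. Assumed shapes: weights (J, I),
    delta (J,), outputs (I,), prev_update (J, I)."""
    update = eta * np.outer(delta, outputs) + alpha * prev_update
    return weights + update, update
```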

Overfitting

Overfitting is a phenomenon where the network weights have been adapted to give a very good result on the training data, but generalize poorly to data not present in the training set. This may occur if the amount of training data is too small compared to the use case, or if the variation in the training data is lower than in the general case. Fitting the ANN to this kind of training data leads to poor generalization of the net. To avoid overfitting, the data can be divided into two batches, training data and validation data. Using two separate sets of training and validation data prevents overfitting and improves the generalization and the performance of a neural network. The validation data is used to evaluate the ANN during training: when the ANN starts adapting too much to the training data, the error on the validation data generally increases. This can be used to stop the training at an earlier stage and prevent the ANN from overfitting to the training data. This method of stopping the training is often referred to as "early stop" [4]. [14]

Network design

The network design refers to the structure of the network, for instance the number of layers and the number of nodes in each layer of the ANN. An ANN with many hidden layers is generally harder to train than a network using fewer layers. Kevin L. Priddy et al. [14] wrote that a good network designer should be able to solve almost any problem using just one or two hidden layers, if good features for solving the problem are extracted. The most important part is the input data; if the mapping is simple, the network learns best. A good rule of thumb, when using one hidden layer, is to use twice as many nodes in the hidden layer as the number of inputs. [14]

One issue with three- or four-layer neural networks, compared to deep learning, is that they require hand-crafted features that are sufficiently powerful for solving the problem. The advantage of deep learning is that the network can be fed with raw pixel values. The multi-layer network learns to distinguish features from the pixel values and to solve the problem without using a feature extractor. A problem with deep-learning networks is that the computational time of the evaluation increases, which can be a problem when using a large set of data. [8]

3 Method

This chapter describes the methods presented in this thesis. The methods include a classifier in the form of an ANN and two main DSM-to-DTM filters: one filter for vegetation and one filter for man-made objects. The classifier was developed to use four output classes: ground, man-made objects, vegetation and water. The "ground" class contains points which describe true ground. The "man-made" and "vegetation" classes contain points which should be filtered using the corresponding filter.

The chapter is divided into sections corresponding to the different subsystems. The system can be divided into three main subsystems: feature extraction, classification and DTM filters. At the end, there is a description of the data used for training, validation and testing.

3.1 System overview

The system is divided into two main modules: the classifier and the DTM generator. The classifier consists of a trained ANN and uses features extracted from the RGB image, the NIR image and the DSM as input. The output of the classifier is a labeled map describing which data points belong to which class. The input to the DTM generator is the output of the classifier and the DSM. The raw output of the classifier is used to compute masks, on which different filter methods are applied to filter the DSM into the DTM. The classification module is separated into a training and a testing module, which is discussed further in Section 3.3. A flowchart of the system is shown in Figure 3.1.

Figure 3.1: Schematic view of the system. The features for the classifier are extracted from the RGB image, the NDVI and the DSM. The extracted features are fed into the classifier, which produces a labeled map. The DSM-to-DTM filter, or DTM generator, uses the output of the classifier to create the resulting DTM.

3.2 Feature extraction

The features are computed for each pixel. The features are divided into three types: height features, texture features and vegetation indices. All evaluated features are listed in Table 3.1, together with the input data (RGB, NIR or DSM) each feature is extracted from. The features shown in Figures 3.3 to 3.5 were extracted from an aerial image over Linköping, Sweden. The inputs to the feature extraction are shown in Figure 3.2. The extracted features are a joint set of features from the literature adjusted to fit the purpose of this work. The customer/outsourcer wanted to evaluate these features, as well as the set of NIR-dependent features.

3.2.1 Height features

Height-dependent features were extracted from the DSM. The extracted height features are divided into two types: features depending on the height relative to a certain neighborhood and features depending on the texture of the DSM within a local area.

Relative height

The relative height was considered useful since both [12] and [3] compared the height of the point of interest to a low-pass filtered surface. Also, selecting the lowest point within a neighborhood [2] generated satisfying results.


Table 3.1: List of evaluated features.

Feature                              Extracted from
RGB (three features)                 RGB
lowpass RGB (three features)         RGB
variance of RGB (three features)     RGB
relative height 60 m                 DSM
relative height 120 m                DSM
relative height max/min 60 m         DSM
height variance                      DSM
flatness                             DSM
NIR                                  NIR
NDVI                                 NIR + RGB
EVI                                  NIR + RGB
vegetation from RGB                  RGB
VARI                                 RGB
angular second moment (ASM)          RGB
energy                               RGB
dissimilarity                        RGB
homogeneity                          RGB

Figure 3.2: Input data of Linköping, Sweden: (a) RGB, (b) NIR, (c) DSM. The NIR image is a gray scale image with values in the interval [0, 255]. The DSM is a gray scale image expressing the actual height in meters as the intensity value.


Figure 3.3: Three different height features extracted from the DSM: (a) relative height within 60 meters, (b) relative height within 120 meters, (c) relative height to max/min within 60 meters. Figures 3.3a and 3.3b were computed according to Equation (3.1); Figure 3.3c was computed according to Equation (3.2). As can be seen in Figures 3.3a and 3.3b, varying the window size results in two different images. A greater window size reveals the height of larger objects compared to using a smaller window size.

Three relative-height features are extracted from the DSM: two that depend on the local mean using different window sizes and one that is scaled by the difference between the local maximum and minimum. The two features depending on the local mean are calculated using Equation (3.1) with window sizes of 60x60 and 120x120 meters. The max/min-scaled feature is calculated according to Equation (3.2).

$$h = \frac{h_{x,y} - h_m}{a} \tag{3.1}$$

$$h = \frac{h_{x,y} - h_m}{h_{max} - h_{min}} \tag{3.2}$$

In Equations (3.1) and (3.2), $h_m$ is the mean height in the local area and $a$ is a constant used to ensure the feature is within [−1, 1]. In this work $a = 100$ is used, since $h_{x,y} - h_m$ rarely exceeds 100 meters. The results of these features are visualized in Figure 3.3.
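A sketch of Equations (3.1) and (3.2) using local mean/max/min filters; the use of scipy.ndimage and the epsilon guarding completely flat areas are assumptions. The window size is given in pixels (a 60 m window corresponds to 120 pixels at the 0.5 m sampling distance described in Section 3.5).

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter, uniform_filter

def relative_height(dsm: np.ndarray, window_px: int, a: float = 100.0) -> np.ndarray:
    """Equation (3.1): height relative to the local mean, scaled by a."""
    dsm = dsm.astype(np.float64)
    h_m = uniform_filter(dsm, size=window_px)
    return (dsm - h_m) / a

def relative_height_maxmin(dsm: np.ndarray, window_px: int, eps: float = 1e-9) -> np.ndarray:
    """Equation (3.2): height relative to the local mean, scaled by the local
    max-min range; eps (an added assumption) guards completely flat areas."""
    dsm = dsm.astype(np.float64)
    h_m = uniform_filter(dsm, size=window_px)
    h_range = maximum_filter(dsm, size=window_px) - minimum_filter(dsm, size=window_px)
    return (dsm - h_m) / (h_range + eps)
```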

Flatness and height variance

The flatness and height variance features are extracted from the DSM using the correlation matrix of all height values within a certain neighborhood. The correlation matrix is calculated as described in Equation (3.3).

$$C = \frac{1}{N^2} \sum_{i=x-\frac{N}{2}}^{x+\frac{N}{2}} \sum_{j=y-\frac{N}{2}}^{y+\frac{N}{2}} \begin{bmatrix} x_{ij} - x_m \\ y_{ij} - y_m \\ z_{ij} - z_m \end{bmatrix} \begin{bmatrix} x_{ij} - x_m & y_{ij} - y_m & z_{ij} - z_m \end{bmatrix} \tag{3.3}$$

In Equation (3.3), $x$ and $y$ are the current position and $z$ is the local height at a certain position. The mean of each variable is subscripted with $m$.

Figure 3.4: Two height structure features extracted from the DSM: (a) flatness, (b) height variance. The flatness image is computed according to Equation (3.4) and the height variance image according to Equation (3.6). The flatness image has a low value for all flat surfaces, whether or not the surface is tilted. The height variance image has high values where the height of the DSM is shifting.

The flatness is then calculated, according to Bandyopadhyay et al. [1], using the eigenvalues of the correlation matrix as described in Equation (3.4). The flatness is low for any flat surface, regardless of the angle of the surface. Since rooftops are tilted flat surfaces, their flatness is still low, even though the height variance is high. Therefore, the flatness parameter would in an optimal case separate vegetation from buildings.

$$flatness = \frac{\lambda_0}{\lambda_0 + \lambda_1 + \lambda_2} \tag{3.4}$$

$$\lambda_0 \leq \lambda_1 \leq \lambda_2 \tag{3.5}$$

In Equation (3.4), $\lambda_n$ is an eigenvalue of the correlation matrix.

The calculated flatness feature is visualized in Figure 3.4a. As can be seen, the areas of the fields are generally brighter, indicating a higher value of flatness, than the forest areas. The height variance is equivalent to element [3, 3] of the correlation matrix $C$ from Equation (3.3).

$$height\ variance = \frac{1}{N^2} \sum_{i=x-\frac{N}{2}}^{x+\frac{N}{2}} \sum_{j=y-\frac{N}{2}}^{y+\frac{N}{2}} (z_{ij} - z_m)^2 \tag{3.6}$$

In Equation (3.6), $x$ and $y$ are the current position and $z$ is the local height at a certain position.
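The computation of Equations (3.3) to (3.6) for a single N x N neighborhood could look as follows; the loop over all pixels is omitted and the patch handling is an assumption for illustration.

```python
import numpy as np

def flatness_and_height_variance(patch_z: np.ndarray):
    """Flatness (Equation (3.4)) and height variance (Equation (3.6)) for one
    NxN DSM patch; x, y are pixel coordinates and z the heights, Equation (3.3)."""
    n = patch_z.shape[0]
    ys, xs = np.mgrid[0:n, 0:n]
    pts = np.stack([xs.ravel(), ys.ravel(), patch_z.ravel()], axis=0)
    centered = pts - pts.mean(axis=1, keepdims=True)
    C = centered @ centered.T / pts.shape[1]  # Equation (3.3)
    eigvals = np.linalg.eigvalsh(C)           # ascending: lambda0 <= lambda1 <= lambda2
    flatness = eigvals[0] / eigvals.sum()     # Equation (3.4)
    height_variance = C[2, 2]                 # element [3, 3] of C, Equation (3.6)
    return flatness, height_variance
```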

3.2.2 Color and texture features

Color and texture features are extracted from the RGB image. The raw pixel values of the RGB image are used as three independent features. For each color band, the mean intensity in a certain neighborhood, the variance in the same neighborhood and texture features computed using the gray level co-occurrence matrix (GLCM) are extracted.

Gray level co-occurrence matrix

In order to extract texture features from an image, a gray level co-occurrence matrix (GLCM) is used. Since [10] used the GLCM to classify roads with satisfying results, it was considered a necessary texture extractor. The GLCM contains information on how neighboring pixels relate to each other within a certain neighborhood. The GLCM is calculated by counting the number of pixel pairs with the same intensities within a block of pixels. A pixel pair is formed by two adjacent pixels or two pixels at a certain distance from each other. The GLCM has the same size as the number of gray levels squared; e.g., if the comparison is done using 255 gray levels, the size of the GLCM is 255x255. In this work the images are scaled to 64 gray levels, since this was not too computationally demanding. The GLCM consists of all existing pixel pairs: if one pixel has gray level N and the compared pixel has gray level M, then a one is added to the GLCM at position (M, N). Equation (3.7) describes the above as a sum over all existing pixel pairs. A more thorough explanation is given in [7].

$$glcm(M, N) = \sum_{x=x_{min}}^{x_{max}} \sum_{y=y_{min}}^{y_{max}} \big( (M == im(x, y)) \cdot (N == im(x, y+1)) \big) \tag{3.7}$$

In Equation (3.7), $im$ is a gray scale image and $(x, y)$ is the image position.

Various texture features are then extracted using the GLCM. The extracted features are shown in Equations (3.8) to (3.11). These are four features which are easily extracted from the GLCM and, since the impact of each feature is evaluated, the correlation between the features is neglected. ASM is sometimes referred to as energy and is calculated according to [7]. When calculating the GLCM, the pixel pairs are formed using four directions: 0, π/4, π/2 and 3π/4. The four resulting GLCM matrices are normalized by dividing each matrix by the total number of pixel pairs, then added together and divided by four to form one normalized GLCM. In Equations (3.8) to (3.11), $a_{i,j}$ is equal to the GLCM at position $(i, j)$.

$$asm = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i,j}^2 \tag{3.8}$$

$$energy = \sqrt{\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i,j}^2} \tag{3.9}$$

$$dissimilarity = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i,j} |i - j| \tag{3.10}$$

Figure 3.5: Texture features extracted from the GLCM as described by Equations (3.8) to (3.11): (a) ASM, (b) energy, (c) dissimilarity, (d) homogeneity. For these images a window size of 5x5 pixels was used.

$$homogeneity = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \frac{a_{i,j}}{1 + |i - j|} \tag{3.11}$$
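A sketch of how the GLCM features of Equations (3.8) to (3.11) could be computed with scikit-image, assuming the 64-level quantization and the four directions described above; the thesis does not state which implementation was used.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(block: np.ndarray) -> dict:
    """Texture features of Equations (3.8)-(3.11) for one gray-scale block
    already quantized to 64 levels (integer values 0..63)."""
    glcm = graycomatrix(block, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=64, normed=True)
    glcm = glcm.mean(axis=3, keepdims=True)  # average the four directions
    return {name: float(graycoprops(glcm, name)[0, 0])
            for name in ("ASM", "energy", "dissimilarity", "homogeneity")}
```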

3.2.3 Vegetation and NIR features

Vegetation and index features are extracted using both the NIR and RGB color bands. The Visible Atmospherically Resistant Index (VARI) and a ratio index (RI) between green and blue were used by Bandyopadhyay [1] as validation against NDVI. These two indices are here used as features for the ANN and can be seen in Figure 3.6. They are calculated according to Equations (3.12) and (3.14). NDVI and the Enhanced Vegetation Index (EVI) were calculated using the NIR color band alongside RGB. EVI is calculated according to Equation (3.13) and NDVI according to Equation (2.1) in Chapter 2. NDVI is within the interval [-1, 1], where 1 describes dense vegetation. Since $I_{color}$ is within [0, 255], the RI is normalized.

Figure 3.6: Index features extracted from NIR and RGB according to Equations (3.12) to (3.14) and Equation (2.1): (a) NDVI, (b) EVI, (c) ratio index between green and blue, (d) VARI.

EVI resembles NDVI but uses the blue color band to enhance the index. VARI is often used as a substitute for NDVI when NIR is not available.

$$RI = \frac{I_{green}}{I_{blue}} \tag{3.12}$$

$$EVI = 2.5 \cdot \frac{I_{NIR} - I_{red}}{I_{NIR} + 6 \cdot I_{red} - 7.5 \cdot I_{blue} + 1} \tag{3.13}$$

$$VARI = \frac{I_{green} - I_{red}}{I_{green} + I_{red} - I_{blue}} \tag{3.14}$$
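The three indices of Equations (3.12) to (3.14) can be computed per pixel; a minimal sketch assuming float band arrays follows, with an epsilon (an added assumption) to avoid division by zero.

```python
import numpy as np

def vegetation_indices(nir, red, green, blue, eps=1e-9):
    """RI, EVI and VARI according to Equations (3.12)-(3.14)."""
    ri = green / (blue + eps)                                   # Equation (3.12)
    evi = 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1)  # Equation (3.13)
    vari = (green - red) / (green + red - blue + eps)           # Equation (3.14)
    return ri, evi, vari
```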

3.3 Classification

The classifier is implemented as an ANN. The ANN is constructed with three layers. The number of nodes in the hidden (middle) layer is twice the number of nodes in the input layer, as suggested by [14]. This is a simple design that provides sufficient performance for the application described here. Also, training and evaluation were not too computationally demanding with this network design; therefore, no other network design was evaluated. There are 23 extracted features and four output classes: ground, man-made objects, vegetation and water. Thus, the network was designed with 23 input nodes, 46 hidden nodes and 4 output nodes, as can be seen in Figure 3.7. The classifier was developed to classify images per pixel and to use features computed within a neighborhood of the pixel of interest. To avoid overfitting, the data was divided into training and validation data. After each training epoch (training on all training samples once), the ANN was evaluated on the validation data. The ANN which performed best, with respect to the mean squared error (MSE) of the output layer compared to the correct output, was saved. This ensured that only the best ANN was stored and reduced the effect of overfitting on the training data. The training was designed to have two user input parameters: the learning rate ($\eta$) and the momentum ($\alpha$), see Equation (2.4). These parameters were tuned to fit the training for the used training data.

Figure 3.7: Schematic view of the layers within the neural network. If all features are used, the input layer has 23 nodes, the hidden layer 46 nodes and the output layer 4 nodes. In addition to these nodes, one bias node was used for both the input and the hidden layer. The output of a bias node was set to 1.

Before training, features are computed for all pixels/data points within the DSM and texture images. Then the data is separated evenly between training and validation. The ANN was fed with a set of hand-labeled ground-truth maps, which were only labeled for data points that the user considered certain to belong to a specific class. Uncertain areas, which are not labeled, are typically edges of buildings, shadows and shorelines. In these areas, the calculated DSM often differs from reality and is therefore not reliable as training data. The trained and validated ANN is then tested on a completely different area, to get a result from data which does not correlate with the training data at all. The testing is also done visually, to get a better understanding of the result of the training. These steps can be seen schematically in Figure 3.8.

To evaluate the training of the ANN, the mean squared error (MSE) between the calculated output and the target output is used. This gives a more general approximation of the quality of the output than simply counting how many points were classified correctly [14].

The network is implemented only as a three-layer ANN. Mostly, this is because a deeper neural network is more complex to implement and has a longer computational time during training. Using a three-layer network gives sufficient performance for this task. While a deeper network might increase the classification performance, the increase in computational requirements is hard to motivate.

To classify a pixel, each output node of the ANN is assigned to a specific class. For all classes except the "ground" class, the output node with the largest output determines the classification. If the "ground" class output node has the largest output, that output also needs to be at least 70% of the total output from all output nodes, otherwise the classification is left unset. More on this is found in Section 3.4.

3.3.1 Training and validation data

The ANN needs training data for each class. The training and validation data should contain roughly the same number of examples for each class [14]. If there is an imbalance in the training data, the ANN will learn this imbalance: the trained ANN will classify more data as belonging to the over-represented training class. To handle this imbalance, the number of examples used from each class was always the number of examples in the class with the fewest examples. For example, if the labeled data contained ten ground-labeled data points and five vegetation-labeled data points, the training used five from each class and neglected five ground-labeled data points.

The labeled data was separated into training data and validation data to reduce the effect of overfitting on the training data. The method used randomly selects a specified fraction of the labeled data as training data and the rest as validation data. This ensures that both the training and validation data are well spread throughout the labeled images.
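A sketch of the class balancing and the random training/validation split described above; function and parameter names are illustrative assumptions.

```python
import numpy as np

def balance_and_split(features, labels, train_fraction=0.5, seed=0):
    """Keep an equal number of samples per class (the size of the smallest
    class), then randomly split the kept samples into training and validation."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    per_class = min(np.sum(labels == c) for c in classes)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), per_class, replace=False)
        for c in classes])
    keep = rng.permutation(keep)
    n_train = int(train_fraction * keep.size)
    train, val = keep[:n_train], keep[n_train:]
    return (features[train], labels[train]), (features[val], labels[val])
```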

3.4 DTM generation

The DTM is generated from the output of the classifier and the DSM, as can be seen in Figure 3.1. Depending on which class a point is assigned to by the classifier, it is handled differently by the DTM generator. The generator consists of two different DSM-to-DTM filters: one for vegetation and one for man-made objects. The "ground" class is used to calculate the height for the other areas. Therefore, the classified ground points are filtered to only contain true ground points. For the classification of the other three classes, the maximum output of the ANN is used. The classification is then morphologically altered to make the DTM generator more robust. The parameters for the morphology are evaluated in Section 4.2.1.

Figure 3.8: Schematic view of the training of the neural network. The training (upper gray box) calculates features for the training area with the corresponding hand-labeled image. The feature object was separated point-wise into training and validation points. The ANN which performed best on the validation data was tested on a completely different test area, to get a reliable result of how good the trained ANN was.

3.4.1 Ground-point filter

Since the ground points are used to calculate the DTM height, it is of utmost importance that the ground mask only contains points describing the true ground. To add a confidence to the classification, the raw ANN output from the output layer is used. The raw output from the classifier needs to be at least 70% of the total output from all output nodes, i.e. it needs to fulfill Equation (3.16), for the point to be classified as a ground point. The confidence function can be seen in Equation (3.15) as a function of the sum of all output nodes.

$$confidence(o_i) = \frac{o_i}{\sum_{n}^{N} o_n} \tag{3.15}$$

$$confidence(o_i) > 0.7 \tag{3.16}$$

In Equation (3.15), $o_i$ is the output of node $i$ in the output layer. To reduce edge effects within the DSM, the ground mask is eroded by a certain distance in all directions in order to only contain true ground points.

The DSM is not a perfect model of the true height in all areas. Points near the edges of buildings or other high objects are less likely to correspond to reality. Also, points covered by shadows, often between two high buildings, are uncertain. To reduce the effect of miscalculated shadows, ground points with a gray level intensity lower than 70 are removed from the ground mask, i.e. when (R + G + B)/3 < 70.
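A sketch combining Equations (3.15) and (3.16) with the shadow rule and the erosion described above; the erosion distance in pixels and the output layout are assumed parameters.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def ground_mask(outputs: np.ndarray, rgb: np.ndarray,
                ground_idx: int = 0, erode_px: int = 3) -> np.ndarray:
    """Ground mask from the raw ANN outputs (assumed shape H x W x 4):
    the ground confidence of Equation (3.15) must exceed 0.7 (Equation
    (3.16)), dark pixels with (R + G + B)/3 < 70 are removed as possible
    shadows, and the mask is eroded; erode_px is an assumed distance."""
    confidence = outputs[..., ground_idx] / outputs.sum(axis=-1)
    mask = (confidence > 0.7) & (rgb.mean(axis=-1) >= 70)
    return binary_erosion(mask, iterations=erode_px)
```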

3.4.2 Man-made filter

The height of the DTM in areas of buildings and man-made objects is calculated by considering only the neighboring points which are included in the ground mask. A flat surface is linearly interpolated using these points and set as the DTM points in the building areas.

3.4.3 Vegetation filter

The most common flaw of DSM-to-DTM filters concerns large vegetated areas. Since the terrain height may vary within a forest, it is hard to know the actual terrain height. This is impossible for filters which only consider the DSM. The classification provides a way to estimate the actual terrain height.

To be able to keep the variation of the terrain in vegetated areas, the height of the vegetation relative to the ground is estimated from the DSM and subtracted from the DSM to obtain the DTM. For each vegetation point, the height difference between the point and the lowest certain ground point within 10 meters is calculated. The vegetation height map is linearly interpolated in all areas where the vegetation height could not be computed, due to long distances to any ground point. The vegetation height map is then subtracted from the DSM in the vegetated areas to obtain the DTM points. The result is low-pass filtered to remove the remaining effect of the varying vegetation height. A schematic view of the vegetation filter is presented in Figure 3.9.

The vegetation height map calculation assumes that the height of the vegetation cannot be negative or higher than 30 meters. This removes errors where grass is mislabeled as vegetation and is lower than the lowest classified ground point, or where the tree height becomes too large due to elevation of the ground.
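A sketch of the vegetation-height estimation, under the assumption of a 0.5 m sampling distance (so a 10 m radius corresponds to 20 pixels); marking missing values with NaN for the later interpolation step is also an assumption.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def vegetation_height(dsm, veg_mask, ground_mask, radius_px=20):
    """Vegetation height relative to the lowest certain ground point within
    the given radius, clamped to [0, 30] m as described above. Vegetation
    points without any ground neighbor become NaN and are interpolated later."""
    ground_h = np.where(ground_mask, dsm, np.inf)
    lowest_ground = minimum_filter(ground_h, size=2 * radius_px + 1)
    height = np.where(veg_mask & np.isfinite(lowest_ground),
                      dsm - lowest_ground, np.nan)
    return np.clip(height, 0.0, 30.0)
```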

3.4.4 Post processing

When the DTM height has been calculated for all building and vegetation areas, the DTM is post-processed to reduce artifacts in the filtered, or adjusted, areas. In the areas where the classifier is uncertain (the classification is unset), the height is interpolated using the calculated heights and the certain ground points. Afterwards, the DTM is low-pass filtered in all areas which have been changed from the DSM (points not included in the ground mask). This is done to reduce edge effects near classification borders and to reduce the influence of miscalculated heights. The DTM then goes through a correction step that checks if the DTM is higher than the DSM at any point. If this is the case, the height of the DTM point is set equal to the height of the DSM point.

3.5 Data

The training data consisted of three different maps of geographical areas: Rio de Janeiro, Brazil; Boden, Sweden; and Udaipur, India. The training and validation data were hand-labeled using the open source GNU Image Manipulation Program (also known as GIMP). The three areas were labeled into four classes: man-made objects, vegetation, water and ground/bare earth. The labeled pixels were pixels which truly belong to a specific class. Vegetation included trees, bushes and similar things higher than the ground surface (grass was not included in the vegetation class). Ground, or bare earth, were pixels where the height of the true ground is visible, like grass fields, roads on the ground (not bridges), mountain cliffs and other things that should be included in a DTM. The water class contained all kinds of water, like pool areas, lakes, rivers and sea water.

The three labeled images and the corresponding original images can be seen in Figure 3.10. The labeling was done using the colors red, green, blue and yellow, where red means man-made objects, green means vegetation, blue means water and yellow means ground. The number of labeled pixels for each class can be seen in Table 3.2. The table shows that if no data is duplicated into training, and if data is split evenly between training and validation, there are 1902403/2 · 4 ≈ 3.8 million pixels within the training data.

The labeling was done by only labeling pixels which certainly belong to a specific class. Therefore, pixels near edges of buildings, vegetation, water areas etc. were not labeled. This was done to ensure that no data was mislabeled, and to reduce the effect of errors within the DSM near edges. The labeling was done keeping a distance of approximately three meters to any edge.

All data samples were extracted from images which had a sampling distance of 0.5 meters and were of size ≈ 8100x8100 pixels, which roughly corresponds to 16 km².

Figure 3.9: Schematic view of the vegetation DSM-to-DTM filter: (a) color-labeled DSM, where yellow is ground, red is buildings or man-made objects and green is vegetation; (b) vegetation height map, where green is the measured height and gray is the estimated height; (c) resulting DTM for the vegetation areas. First, the vegetation height map is calculated using the classifications and the correlated DSM. The heights of all unknown values are calculated by interpolating the known values. The vegetation height map is subtracted from the DSM to generate the DTM.

Table 3.2: Number of labeled data patterns for each area.

area             ground    man-made   vegetation   water
Rio de Janeiro   2241830   548329     608188       5060718
Boden            1609235   891738     1425627      2505193
Udaipur          765049    462336     345416       427880
summation        4616114   1902403    2379231      7993791

Table 3.3: Number of labeled data patterns for each additional area; total is the sum of all labeled patterns.

area      ground    man-made   vegetation   water     total
Sydney    1330802   616321     215008       7009432   9171563
London    961665    1118367    196736       7420633   9697404
Dubai     2483534   804119     351317       66116     3705086
Makhwah   4529961   56888      36120        0         4622969

3.5.1 Additional data

To evaluate the trained ANN, four additional areas were labeled: Sydney, Australia; London, Great Britain; Dubai, United Arab Emirates; and Makhwah, Saudi Arabia. The evaluation was also done visually on areas of Linköping, Sweden; Tripoli, Libya; and Chicago, USA, which were also used to evaluate the resulting DTM.

3.6 Parameter optimization and evaluation

The tests and evaluation of the method were done in three steps. First, the parameters for the classification were evaluated and optimized. The classification parameters are the window size and the features used. Secondly, the morphological parameters for the DTM filter were optimized. Last, the set of parameters from the previous optimization steps was used to evaluate the resulting DTM generator. The evaluation was done in two ways: one visual comparison, and one using a reference DTM as ground truth and comparing the generated DTM model against this reference model.

The feature evaluation was done using a t-test. A t-test is a hypothesis test which checks the null hypothesis. The null hypothesis is described by H0 = H, where H0 is a reference model and H is the evaluated model. The ANN was trained multiple times with one of the features excluded and compared with an ANN trained with all features. This was done since there is a random selection of training and validation data samples within the training. This evaluation was done to see if any feature significantly worsens the result (a sketch of such a test is given after the figure captions below).

Figure 3.10: Hand-labeled images for training and validation data: (a, b) Rio de Janeiro, Brazil; (c, d) Boden, Sweden; (e, f) Udaipur, India. The labeled data uses yellow as ground, red as buildings, green as vegetation and blue as water. To the right are the corresponding original images.

Figure 3.11: Additional hand-labeled data for testing and training the final ANN: (a) Sydney, Australia; (b) London, Great Britain; (c) Dubai, United Arab Emirates; (d) Makhwah, Saudi Arabia.
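The per-feature significance test could be carried out with a standard two-sample t-test; a sketch with SciPy follows. The use of ttest_ind and the significance level are assumptions, since the thesis does not specify the exact test variant.

```python
from scipy import stats

def feature_is_significant(acc_all_features, acc_without_feature, alpha=0.05):
    """Two-sample t-test of H0: the accuracies with and without the feature
    come from the same distribution. Each list holds the accuracies of the
    repeated trainings (four per configuration in this work)."""
    t_stat, p_value = stats.ttest_ind(acc_all_features, acc_without_feature)
    return p_value < alpha, p_value
```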

The parameters from the parameter optimization were used when evaluating whether a second ANN might perform better if all labeled areas were used for training. The second ANN was trained using all seven areas mentioned in Section 3.5. Since all labeled areas were used in training, the evaluation was done visually. The ANN which resulted in the best classification was used in the following DTM filter optimization step.

The performance of the DTM generation algorithm was evaluated against both a robust hierarchical interpolation, as described by [2], and a reference model. The reference model is a LiDAR- and SAR-measured (Synthetic Aperture Radar) model which has been modified by hand to create a very accurate DTM. This model was created by the United States Geological Survey (USGS). The model used for the evaluation is a model of Newark, USA, belonging to the National Elevation Dataset (NED). The result of this evaluation is described in Chapter 4.

4 Results

This chapter presents the results of both training the classifier and evaluating the resulting DTM. The results in the DTM section build upon the resulting classifier.

4.1 Classification

This section evaluates features and parameters for the ANN. The section is divided into one feature evaluation section and one section on the parameters used in the learning.

4.1.1 Feature evaluation

This section presents the results of tests regarding features. First, the classification pipeline is evaluated with and without NIR data, as well as with a range of window sizes. Last, there is a test of feature dependency.

NDVI and NIR

This test evaluates the impact of using or not using nir data as input to the fea-ture extraction. All nir dependent feafea-tures were set to zero when training with-out nir. Both the ann trained with nir and the ann trained withwith-out nir were trained on the same areas. The nir dependent features can be seen in Table 3.1, where they are denoted nir in the right column. The results from the training are shown in Table 4.2, where "% correct" is the percent number of correct validation data pixels and mse is is calculated on the validation data according to Equation (2.5). The parameters used for this test are shown in Table 4.1. The number of epochs was set to 2130 to ensure that the training error no longer decreases, while

(42)

32 4 Results

Table 4.1:Parameters used for the nir test. number of epochs 2130

data extraction randomly 50/50

features of use all incl/excl. nir, ndvi and evi window size 5x5 pixels (2x2 meters)

learning rate 0.001 momentum 0.005

Table 4.2:Result of the validation during training from the nir test. % correct mse

with nir 98.04% 0.016400 without nir 97.36% 0.022541

the computational time is not longer than a weekend. The result from the eval-uation images are shown in Table 4.3. The result is presented as percent correct classified pixels. As can be seen, the London result is an outlier which is a result of a mis-labeled river within the image. To visualize the differences between the nirand no-nir classification two difference images were created, which can be seen in Figure 4.2. Figure 4.3 is a zoomed image of one area in the lower right corner in Figure 4.2 to point out the major differences when using or not using nirdata. The result of the test show that there is an improvement of using the nircolor spectrum. nir is considered important and is therefore used in further tests.

Window size

This test evaluates the result of using different window sizes. The features which depend on the neighborhood are: lowpass RGB, RGB variance, height variance, flatness, asm, energy, dissimilarity and homogeneity. The parameters used for the test can be seen in Table 4.4. The number of epochs was set to 200 to decrease the computational time, as compared to the nir test, and therefore a greater portion of labeled data was used for training. This was done to speed up the learning process. The learning rate was lowered and the momentum was set higher to stabilize the training process.

Table 4.3: Result of the ann on the evaluation data.

                with nir    without nir
    Sydney      94.06%      96.80%
    London      89.76%      24.16%
    Dubai       86.89%      88.62%
    Makhwah     70.15%      87.49%
    Overall:    87.49%      68.20%



(a) with nir (b) without nir

Figure 4.1: Comparison of using nir or not when classifying the area of Linköping. Red means man-made object, yellow means ground, green means vegetation and blue means water.

Table 4.4: Parameters used for the window size test.

    number of epochs    200
    data extraction     random 75/25
    features of use     all
    learning rate       0.0001
    momentum            0.5

The sizes tested lie within the interval [3, 17]. The upper limit was set considering the computational burden of larger window sizes. All tested sizes were odd numbers, to keep the current position as the middle sample. The results, in Table 4.5, show that the larger the window, within this range, the better the result.
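As an illustration of how such a neighborhood feature depends on the window, here is a sketch of the windowed variance computed with box filtering via the identity Var(x) = E[x^2] - E[x]^2; the other windowed features would be computed analogously. This is an assumed implementation, not the thesis code.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def window_variance(image: np.ndarray, size: int) -> np.ndarray:
    """Per-pixel variance over a size x size neighborhood (size odd),
    using Var(x) = E[x^2] - E[x]^2 with two box filters."""
    img = image.astype(np.float64)
    mean = uniform_filter(img, size=size)
    mean_of_squares = uniform_filter(img ** 2, size=size)
    return np.maximum(mean_of_squares - mean ** 2, 0.0)  # clip tiny negatives

# The odd sizes evaluated in this test, from 3x3 up to 17x17 pixels,
# applied here to a dummy image purely for illustration.
for size in (3, 5, 9, 13, 17):
    feature = window_variance(np.random.rand(64, 64), size)
```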

Impact of features

The feature impact test was done simply by setting one feature at a time to zero before training. The ann was trained using all other features unaltered. The parameters used can be seen in Table 4.6. The test was performed multiple times in order to get a more accurate result. A t-test was done to evaluate whether there was any significance in the result of removing one feature compared to not removing any features. For this evaluation, H0, as described in Section 3.6, is the result from using all features. The training was repeated four times for each excluded feature. The results from all four trained anns were used in the t-test for each model.



(a) Original RGB image of Linköping, Sweden.

(b) The classification using nir is colored in the areas where the classifications from the two classifiers differ.

(c) The classification without nir is colored in the areas where the classifications from the two classifiers differ.

Figure 4.2: Difference images for classifications using nir or not. The images are black where the classification in Figure 4.1a is equal to the classification in Figure 4.1b. In other areas they display the classification of the specific classifier. Let the two classification results from Figure 4.1 be images A and B. If A = B, the pixel is black in both images; otherwise the value from either A or B is used. This shows that where the two classifiers do not agree, the classifier without nir wrongly labels pixels as man-made objects more often than the other. As can be seen in the lower right quarter of (c), the classifier mis-labels both vegetation and ground as man-made objects (colored in red); a cut-out of this is shown in Figure 4.3.
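The masking rule described in the caption can be written compactly; the sketch below assumes A and B are per-pixel label maps and that class_colors maps each label to an RGB triple (both assumptions for illustration, including the label numbering).

```python
import numpy as np

def difference_image(a: np.ndarray, b: np.ndarray, class_colors: dict) -> np.ndarray:
    """Black where the label maps a and b agree; elsewhere the class color
    of a (pass b as the first argument for the other classifier's view)."""
    out = np.zeros(a.shape + (3,), dtype=np.uint8)  # black by default
    disagree = a != b
    for label, rgb in class_colors.items():
        out[disagree & (a == label)] = rgb
    return out

# Example palette matching Figure 4.1: man-made red, ground yellow,
# vegetation green, water blue (label numbering is an assumption).
COLORS = {0: (255, 0, 0), 1: (255, 255, 0), 2: (0, 128, 0), 3: (0, 0, 255)}
```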



(a) Original RGB image

(b) Difference of classification using the nir classification.

(c) Difference of classification using the no-nir classification.

Figure 4.3: Zoom of the area in the lower right corner of Figure 4.2. As can be seen, the ann trained without nir classifies more ground regions as man-made objects (colored red) than the ann trained with nir. If (c) is compared to the original rgb image, it can be seen that nearly no areas within this cut-out should be labeled as man-made objects. (b) and (c) indicate that the classification using nir labels more areas correctly than the classification without nir.



Table 4.5: Result of using varying window size.

    window size in pixels   % correct   mse
    3x3 (1x1 m)             95.77%      0.033777
    5x5 (2x2 m)             96.06%      0.030130
    9x9 (4x4 m)             96.70%      0.025145
    13x13 (6x6 m)           97.07%      0.022286
    17x17 (8x8 m)           97.41%      0.020351

Table 4.6: Parameters used for the feature evaluation test.

    number of epochs    100
    data extraction     random 50/50
    features of use     all except canceled
    learning rate       0.0005
    momentum            0.5
    window size         13x13 (6x6 m)

The results from training using all but one feature can be seen in Table 4.7. The results show that some features improve the result and that no feature significantly, at the 5% significance level, worsens the result.
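For reference, the significance test can be sketched as below, assuming four validation mse values per configuration; the thesis does not state which t-test variant was used, so an independent two-sample test is shown as one plausible choice.

```python
import numpy as np
from scipy import stats

def feature_significantly_worsens(mse_base: np.ndarray,
                                  mse_ablated: np.ndarray,
                                  alpha: float = 0.05) -> bool:
    """H0: removing the feature leaves the mean validation mse unchanged.
    Returns True only if the ablated runs differ significantly from the
    base runs AND are worse (higher mse) on average."""
    _, p_value = stats.ttest_ind(mse_ablated, mse_base)
    return p_value < alpha and float(mse_ablated.mean()) > float(mse_base.mean())
```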

4.1.2 Learning evaluation

This section describes how the training parameters have been evaluated. The test was done by first keeping the learning rate fixed and varying the momentum parameter. Then, the momentum was held fixed and the learning rate was varied. This was done for 200 training epochs, and then four parameter sets were evaluated for 1000 training epochs to get a better understanding of the influence of the parameters. The parameters used can be seen in Table 4.8. As can be seen in the plots (Figures 4.4 to 4.7), the use of a larger momentum increases the stability of the learning process. The use of an increased number of training epochs and a low learning rate also proved useful for the learning process, as compared to using a high learning rate and fewer training epochs. This set of learning parameters ensures stable learning even for a large number of training epochs.
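The two tuned parameters enter the standard backpropagation weight update with a momentum term; the sketch below shows the textbook rule, with the final values from Table 4.9 used as defaults, and is an illustration rather than the exact update used in the thesis.

```python
import numpy as np

def momentum_step(weights: np.ndarray, gradient: np.ndarray,
                  prev_update: np.ndarray,
                  learning_rate: float = 0.0005,
                  momentum: float = 0.5):
    """One weight update: dw(t) = -lr * grad + momentum * dw(t-1).
    A larger momentum smooths successive updates, consistent with the
    increased stability observed in Figures 4.4 to 4.7."""
    update = -learning_rate * gradient + momentum * prev_update
    return weights + update, update
```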

4.1.3 Final parameters

This section presents the result of using the optimized parameters from the previous tests, as compared to the initial parameters. Also, an ann trained using all images (including the additional test images) was evaluated against the ann that was only trained on the three initial training areas with optimized parameters. The parameters used can be seen in Table 4.9. The number of epochs was set to the maximum number that did not require a computational time longer than 60 hours, to ensure completion during a weekend.



Table 4.7: Result from the feature impact test. The base, maximum and minimum mse are highlighted with bold text.

    canceled feature                    % correct   mse      t-test of significant
                                                             difference @ 5% level
    none (base for t-test)              97.63       0.0198   -
    rgb (three features)                97.46       0.0210   yes
    low-pass rgb (three features)       97.15       0.0235   yes
    variance of rgb (three features)    97.56       0.0201   no
    relative height 60 m                97.63       0.0200   no
    relative height 120 m               97.48       0.0207   yes
    relative height max/min 60 m        97.43       0.0213   yes
    height variance                     97.49       0.0207   yes
    flatness                            97.58       0.0202   no
    nir                                 97.58       0.0202   yes
    ndvi                                97.56       0.0204   yes
    evi                                 97.71       0.0196   no
    vegetation from rgb                 97.61       0.0201   no
    vari                                97.66       0.0197   no
    asm                                 97.65       0.0198   no
    energy                              97.59       0.0200   no
    dissimilarity                       97.49       0.0208   yes
    homogeneity                         97.54       0.0206   yes

Table 4.8: Parameters used for the learning parameters evaluation.

    number of epochs    200 (1000)
    data extraction     random 75/25
    features of use     all
    learning rate       0.0001/varied
    momentum            varied/0.5



Figure 4.4: Plot of mean squared error on validation data using varying momentum.

Figure 4.5: Plot of mean squared error on validation data using varying learning rate.



Figure 4.6: Zoom of Figure 4.5 to analyze whether the learning is stable. As can be seen, the purple line struggles to lower the error, while the blue and green lines decrease more stably.

Figure 4.7: Plot of mean squared error on validation data using varying momentum and learning rate for 1000 training epochs.



Table 4.9: Final parameters from the above results, with the initial parameters within parentheses.

    number of epochs    3100
    features of use     all
    learning rate       0.0005 (0.001)
    momentum            0.5 (0.005)
    window size [px]    17x17 (3x3)

Table 4.10: Result of the final ann, using optimized features and parameters, on the test data. The London result for the ann trained on three areas with optimized parameters was caused by partial mis-classification of the river in the image.

    parameters               initial       optimized     optimized
    trained on               three areas   three areas   all seven areas
    training pixels          3804804       3804804       6356824
    validation data          97.84%        98.93%        99.13%
    mse on validation data   0.017900      0.009477      0.007701
    Sydney                   97.90%        99.08%        99.79%
    London                   95.90%        42.17%        99.56%
    Dubai                    89.11%        85.02%        99.68%
    Makhwah                  62.20%        90.43%        99.70%

The comparison of using optimized and unoptimized parameters can be seen in Table 4.10, and the comparison between using three or all images for training can be seen in Figures 4.8 to 4.10. The results show that the optimized parameters perform better than the unoptimized ones. The visual examination of Figures 4.8 to 4.10 suggests that there is an improvement from using all labeled areas for training and validation. In these figures, it can be seen that the ann trained on only three areas has far more mis-labeled clutter, while the ann trained using all images generates a more homogeneous classification. The ann using all areas (including additional data) is used henceforth for dtm filter optimization and evaluation.

4.2 DTM

This section presents the result of using the classifier trained above, i.e. the ann whose results are presented in Figures 4.8c, 4.9c and 4.10c. The results are compared with the hierarchical method [2], which only considers the dsm in the generation of the dtm.

4.2.1 Parameter optimization

This section visually evaluates the effect of eroding and dilating the classes from the resulting ann classification. This morphological operation includes or excludes areas at the border of a specific class by a certain distance.
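A sketch of this per-class morphology follows, assuming binary masks per class and an approximate ground resolution of 0.5 m/pixel (roughly consistent with the 5x5 pixels = 2x2 meters relation stated earlier, but still an assumption); scipy's binary morphology is used in place of whatever implementation the thesis relied on.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

PIXEL_SIZE_M = 0.5  # assumed ground resolution

def erode_dilate(mask: np.ndarray, erode_m: float, dilate_m: float) -> np.ndarray:
    """Shrink, then grow, a binary class mask by metric distances,
    converted to whole-pixel iterations."""
    out = mask.astype(bool)
    if erode_m > 0:
        out = binary_erosion(out, iterations=max(1, round(erode_m / PIXEL_SIZE_M)))
    if dilate_m > 0:
        out = binary_dilation(out, iterations=max(1, round(dilate_m / PIXEL_SIZE_M)))
    return out

# E.g. parameter set 3 (Table 4.13): man-made is eroded 1 m, then dilated 5 m.
```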



(a) original image

(b) three training areas

(c) all training areas

Figure 4.8: Comparison of using three or all areas for training on Linköping, Sweden.



(a) original image

(b) three training areas

(c) all training areas

Figure 4.9: Comparison of using three or all areas for training on Tripoli, Libya.



(a) original image

(b) three training areas

(c) all training areas

Figure 4.10: Comparison of using three or all areas for training on Chicago, USA.



Table 4.11: Parameter set 1 for post-processing the classification as input for the dtm generation.

                erode [m]   dilate [m]
    ground      2.5         0
    man-made    0.5         1.5
    vegetation  1           0
    water       0           0

Table 4.12: Parameter set 2 for post-processing the classification as input for the dtm generation.

                erode [m]   dilate [m]
    ground      5           0
    man-made    0.5         1.5
    vegetation  1           0
    water       0           0

Tables 4.11 to 4.13 display three different parameter settings, and the results of using these parameters are shown in Figures 4.12 and 4.14. The original dsm:s are shown with and without texture in Figures 4.11 and 4.13. Parameter set 3 was considered the best after visual evaluation of Figures 4.12 and 4.14.

4.2.2 Evaluation of the dtm result

This section compares the final dtm result with the hierarchical method [2]. The dtm generation is based on the parameters from Table 4.13, and all parameters for the final system setup can be seen in Table 4.14. The results are visually presented in Figures 4.15 to 4.17.

Newark reference dtm

An additional evaluation area, Newark, USA, was used to compare the result against a reference dtm. The test displays gray scale images of the dsm, the reference dtm, both of the generated dtm:s, and two difference images between the reference and the two generated dtm:s, all in Figures 4.18 to 4.19. The statistical results can be seen in Table 4.15.
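The difference images and summary statistics referenced here can be computed as in the sketch below, assuming the generated and reference dtm:s are co-registered height rasters in meters; the specific measures in Table 4.15 are not reproduced here, so mean error, standard deviation and rmse are shown as typical choices.

```python
import numpy as np

def dtm_error_stats(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Signed difference image and summary statistics against the reference."""
    diff = generated - reference
    return {
        "mean_error_m": float(np.mean(diff)),
        "std_m": float(np.std(diff)),
        "rmse_m": float(np.sqrt(np.mean(diff ** 2))),
    }
```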

Table 4.13: Parameter set 3 for post-processing the classification as input for the dtm generation.

                erode [m]   dilate [m]
    ground      2.5         0
    man-made    1           5
    vegetation  1           0
    water       0           0
