
Make it Meaningful: Semantic Segmentation of Three-Dimensional Urban Scene Models



Johan Lind

LiTH-ISY-EX–17/5103–SE

Supervisor: Hannes Ovrén

isy, Linköpings universitet

Mikael Jonsson

Spotscale AB

Examiner: Per-Erik Forssén

isy, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Abstract

Semantic segmentation of a scene aims to give meaning to the scene by dividing it into meaningful — semantic — parts. Understanding the scene is of great interest for all kinds of autonomous systems, but manual annotation is simply too time consuming, which is why there is a need for an alternative approach. This thesis investigates the possibility of automatically segmenting 3D-models of urban scenes, such as buildings, into a predetermined set of labels. The approach was to first acquire ground truth data by manually annotating five 3D-models of different urban scenes. The next step was to extract features from the 3D-models and evaluate which ones constitute a suitable feature space. Finally, three supervised learners were implemented and evaluated: k-Nearest Neighbour (knn), Support Vector Machine (svm) and Random Classification Forest (rcf). The classification was done point-wise, classifying each 3D point in the dense point cloud belonging to the model being classified.

The results showed that the most suitable feature space is not necessarily the one containing all features. The knn classifier got the highest average accuracy over all models, classifying 42.5% of the 3D points correctly. The rcf classifier managed to classify 66.7% of the points correctly in one of the models, but had worse performance for the rest of the models, resulting in a lower average accuracy compared to knn. In general, knn, svm and rcf seemed to have different benefits and drawbacks. knn is simple and intuitive but by far the slowest classifier when dealing with a large set of training data. svm and rcf are both fast but difficult to tune as there are more parameters to adjust. Whether the reason for the relatively low best accuracy was the lack of ground truth training data, the unbalanced validation models, or the capacity of the learners was never investigated due to the limited time span. However, this ought to be investigated in future studies.

Acknowledgements

First, I would like to thank all colleagues at Spotscale for continuously providing interesting and pleasant conversations during coffee breaks, and for all the help I have been given — especially from my supervisor Mikael Hägerström.

Secondly, I want to thank my examiner Per-Erik Forssén and my supervisor Hannes Ovrén at Linköpings universitet for the discussions, feedback and support.

Last but not least, I want to thank my family and friends for being supportive and showing such interest in my work. Everyone already knew the basics of buildings and houses — roof, wall, windows, door etc. — and while explaining the rest, all at least pretended to understand some of it.

Linköping, August 2017
Johan Lind


Contents

Notation

1 Introduction
1.1 Motivation
1.2 Goals
1.3 Approach
1.4 Limitations
1.5 Related Work

2 Theory
2.1 3D reconstruction
2.2 Features
2.2.1 Normal feature
2.2.2 Height feature
2.2.3 Colour features
2.2.4 Fast point feature histogram
2.3 Classification
2.3.1 k-Nearest Neighbour
2.3.2 Support vector machine
2.3.3 Random classification forest

3 Method
3.1 System overview
3.2 Data acquisition
3.2.1 The models
3.3 Feature extraction
3.4 Classification
3.4.1 k-Nearest Neighbour as benchmark
3.4.2 Random classification forest
3.4.3 Support vector machine

4 Result
4.1 Generating the ground truth
4.2 Feature space evaluation
4.3 Parameter tuning
4.3.1 Random classification forest
4.3.2 Support vector machine

5 Discussion
5.1 Approach
5.1.1 Generating the ground truth data
5.1.2 Choosing the training data
5.1.3 Evaluating feature spaces
5.1.4 Parameter tuning
5.2 Results
5.2.1 Generating the ground truth data
5.2.2 Feature space evaluation
5.2.3 Parameter tuning
5.3 Further work
5.3.1 Investigating the effect of more training data
5.3.2 Feature space evaluation
5.3.3 Using training data smarter and re-do parameter tuning

6 Conclusion

A Complementary result figures
A.1 Feature space evaluation
A.1.1 All features
A.1.2 Without the height feature
A.1.3 Without the normal feature
A.1.4 Without the Hue, Saturation and Value feature
A.1.5 Without the CIELab feature
A.1.6 Without the FPFH feature with radius 2 meters
A.1.7 Without the FPFH feature with radius 0.2 meters
A.2 Result of parameter tuning
A.2.1 Random Classification Forest
A.2.2 Support Vector Machine

Bibliography


Abbreviations

Abbreviation   Meaning
sfm            Structure from Motion
mvs            Multi-View Stereo
rgb            Red, Green, Blue
hsv            Hue, Saturation, Value
lab            CIEL*a*b Colour Space
pfh            Point Feature Histogram
spfh           Simplified Point Feature Histogram
fpfh           Fast Point Feature Histogram
svm            Support Vector Machine
knn            k-Nearest Neighbour
gui            Graphical User Interface
rcf            Random Classification Forest

1 Introduction

Spotscale AB is a company that reconstructs buildings and urban environments in 3D from drone imagery. Looking at a 3D model of a building, a human typically has the capability of understanding what it represents: What parts of the model belong to the roof of the building? What parts of the scene do not belong to the building at all but to vegetation? A question that comes to mind is whether a computer can be taught to recognize and differentiate the most common parts of buildings (e.g. walls, doors, roofs, windows) by generalizing from other building models used as training data.

1.1 Motivation

Semantic segmentation of a scene aims to give meaning to the scene by dividing it into semantic parts. Each part is given a label from a predetermined set of labels that can occur in the scene. This is a well-known computer vision problem and highly relevant in times of machine learning advancement. Understanding the scene is of great interest, not least for companies such as Spotscale AB, but also for all kinds of autonomous systems. Spotscale AB wishes to automatically identify, for example, all windows in a building. One use of that could be to easily replace them with other variants in the models. There might also be other benefits of a semantic understanding of the scene during the reconstruction of the 3D-models. For instance, having labelled 3D-points might ease the meshing problem of going from point cloud to mesh.

However, manual hand-labelling of 3D-models is simply too time consuming, which is why there is a need for an alternative approach to semantic segmentation of a 3D scene. This master thesis focuses on segmentation of urban scene models, such as buildings, provided by Spotscale AB. The classification is done on 3D points belonging to a dense point cloud. Since classification directly in 3D space is somewhat less explored than the common image classification, it seems interesting to investigate its possibilities.

1.2 Goals

The objective of this master thesis is to investigate the possibility of automatic segmentation of a 3D-model into a predetermined set of labels by means of state-of-the-art technology, such as machine learning methods. More precisely, Spotscale AB can provide both a dense rgb 3D-point cloud from their reconstruction pipeline and the complete 3D model, i.e. a textured mesh. The main goals are:

• To extract, evaluate and select suitable features from the input point cloud.
• To implement and evaluate different supervised machine learning methods using the selected feature space.

The segmentation is hence done outside the reconstruction pipeline, classifying the 3D points in the dense point cloud.

1.3 Approach

Along the way there are a number of sub-goals that have to be fulfilled:

• The machine learning methods to be investigated are of the supervised learning type. That is, there has to be ground truth data in order to train and evaluate the classifiers. Hence, as part of the master thesis, the first major task will be to acquire ground truth data from the 3D models. This will be done by manually hand-labelling the 3D models into the requested classes.

• The second major task will be to extract appropriate features from the points in the point clouds. These features will then be combined into feature vectors, so-called descriptors, spanning the feature space. An investigation of feature spaces suitable for the purpose of classification will be done.

• The third major task will be to create the classifiers and train them with the descriptors as input.

• Finally, when the classifiers are trained, the system will be evaluated on other 3D models, not used for training, that also have ground truth labels. This enables a quantitative evaluation of the classification result.

1.4 Limitations

The system should, to the widest possible extent, be generalizable and able to classify any urban scene model. However, since the manual annotation of the models is rather time consuming, it is not possible, at least not in this thesis work, to generate a large variety of labelled models. Furthermore, the set of labels is limited to a finite set and consequently the classifiers will only be able to learn those labels.

The feature space evaluation is limited to a fixed set of extracted features. That is, there might be other features that perform better but are not evaluated at all. Furthermore, due to lack of time and resources, the feature space evaluation is not done for all possible combinations of features.

In order to compare the different classifiers they should preferably be tuned such that each one of them performs at its best possible capacity. However, this requires optimal parameter tuning for each one of them, which is impossible to achieve given the time limitations and the fact that there are unlimited combinations of parameter values. The comparison will therefore be based on a limited set of parameter values and the possible trends that can be seen.

1.5 Related Work

Segmentation of scenes in 3D is not as common as segmentation in 2D, where large annotated training data sets already exist [22]. As a result, many articles [14, 17] address semantic segmentation of facades with images as input. Xie et al. discuss how current semantic segmentation methods seem limited by the lack of ground truth training data rather than by the capacity of the model, and refer to this as "the curse of dataset annotation". They propose to generate labelled 2D training data by manually doing the annotations in 3D and transferring them to 2D with the knowledge of structure from motion, sfm, etc. [26]. Richter et al. [20] suggest using synthetic data as training data as an alternative to the method in [26].

Further, there are many different approaches to the learning and classification step. Martinovic et al. [16] semantically segment facades completely in 3D using point-wise classification, by extracting features such as spin-images, different colour spaces, normals, height over the ground plane etc., and using them to train a Random Forest classifier. Kalogerakis et al. [13] do polygon-wise classification of 3D meshes into a predetermined set of labels. From each polygon in the mesh, they extract geometric features such as estimated multi-scale surface curvature of nearby polygons, principal component analysis of local shape, shape diameter, shape context, spin images, orientation and contextual labels, yielding a high-dimensional feature vector. By using a JointBoost classifier in combination with a conditional random field approach, the system can select the best features as input for the specific model. Adjacent polygons and their most probable class are also taken into account when annotating a specific polygon of the mesh, resulting in a minimization problem [13].

Rusu et al. [24] propose a method to locally describe the nearby geometry for a point in a point cloud. Within the local region of interest, the query point and its normal are compared to the other points within the region by calculating different scalar products of the normals as well as relative positions. The result is a histogram roughly describing the geometry that the points constitute.

2 Theory

This chapter aims to give a brief overview of the theory used when reconstructing 3D models. Further, it introduces the theory of the features and classifiers used in the method chapter, chapter 3.

2.1 3D reconstruction

3D reconstruction is the process of modeling real world objects by obtaining their shape and appearance. A reconstruction pipeline typically consists of image acquisition, structure from motion (sfm), multi-view stereo (mvs), and generation of a surface mesh [9]. In general, reconstructing 3D objects from images requires a relatively large set of images taken from different views of the object. The idea is that 3D points visible in two or more images can be estimated by means of triangulation, knowing the positions of the cameras.

Finding 2D correspondences in the images and triangulating the 3D points while simultaneously positioning the cameras in the same 3D space is known as structure from motion. This process typically generates a rather sparse 3D point cloud, since uncertainty in the 2D matching will introduce noise, which is why preferably only robust 2D feature points such as, for instance, SIFT [15] or ORB [21] are tracked.

However, what is desired is generally a dense point cloud – something that can be generated knowing the camera poses obtained from sfm. One way to generate a dense point cloud is by solving the mvs problem, where the goal is to assign a 3D point to each image point in each image.

Finally, there is often a meshing step in the reconstruction pipeline, going from a dense point cloud to a polygon mesh. There is a large variety of meshing techniques and the meshing problem is considered to be hard. It is possible that semantic meaning of the 3D points could ease and improve the meshing process.

2.2 Features

A feature is a numeric representation of a certain property. If a feature should describe, for example, a colour property, the feature could for instance be the red channel value of a 3D point with rgb triplets. In this work the input is a 3D rgb point cloud and the features are thus extracted from each 3D-point. The idea is that a combination of relevant features can be descriptive enough to separate 3D points belonging to objects with different characteristics.

Important characteristics are geometric properties, colour information and relative location. The features used in this thesis are therefore the following:

• Normal vector
• Hue, saturation and value (hsv)
• CIELab (lab)
• Height relative to the ground plane
• Fast Point Feature Histogram (fpfh).

2.2.1 Normal feature

The normal vector belonging to a surface is the vector perpendicular to that surface. For a 3D point cloud, one can estimate the normal of each 3D point by solving the least-squares plane-fitting problem in a local neighbourhood of that query point, as in [18]. However, the point cloud provided by Spotscale AB already contains estimated normals produced by their reconstruction pipeline.
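For clouds without precomputed normals, this plane-fitting estimate is readily available in PCL, one of the libraries used in this thesis. The following is a minimal sketch; the 0.3 m search radius is an assumed value for illustration, not one used in this work.

```cpp
#include <pcl/point_types.h>
#include <pcl/features/normal_estimation.h>
#include <pcl/search/kdtree.h>

// Estimate per-point normals by least-squares plane fitting
// in a local radius neighbourhood around each query point.
pcl::PointCloud<pcl::Normal>::Ptr
estimateNormals(const pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr& cloud)
{
    pcl::NormalEstimation<pcl::PointXYZRGB, pcl::Normal> ne;
    ne.setInputCloud(cloud);

    // KD-tree used for the radius searches.
    pcl::search::KdTree<pcl::PointXYZRGB>::Ptr tree(
        new pcl::search::KdTree<pcl::PointXYZRGB>());
    ne.setSearchMethod(tree);

    // Neighbourhood radius in metres (assumed value, for illustration).
    ne.setRadiusSearch(0.3);

    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>());
    ne.compute(*normals);
    return normals;
}
```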

2.2.2 Height feature

A ground plane, P, can be defined by manually selecting polygons in the mesh (lying on the ground of the model) and solving the least-squares plane-fitting problem for their positions. Given P of the model, the height, d_i, of each point y_i can be calculated as the signed distance from the point to the plane by representing the point and the plane in homogeneous coordinates and applying the scalar product [19],

$$d_i = \langle \mathrm{norm}_P(y_i), \mathrm{norm}_D(P) \rangle, \tag{2.1}$$

where $\langle a, b \rangle$ denotes the scalar product of $a$ and $b$, where $\mathrm{norm}_P(v)$ is the point normalization of a homogeneous point $v \in \mathbb{R}^n$, defined as

$$\mathrm{norm}_P \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} v_1/v_n \\ v_2/v_n \\ \vdots \\ 1 \end{pmatrix}, \quad v_n \neq 0, \tag{2.2}$$

and where $\mathrm{norm}_D(p)$ is the dual line normalization of a homogeneous plane $p \in \mathbb{R}^m$, defined as

$$\mathrm{norm}_D \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix} = \frac{-\operatorname{sign}(p_m)}{\sqrt{p_1^2 + \cdots + p_{m-1}^2}} \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix}. \tag{2.3}$$
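In Cartesian form, the same signed distance reduces to a scalar product with the unit plane normal plus an offset. A minimal Eigen sketch of that equivalent computation, with hypothetical names:

```cpp
#include <Eigen/Dense>

// Signed height of a point above the ground plane.
// The plane is given in Hessian normal form: n.dot(x) + d = 0,
// with n a unit normal pointing up, out of the ground.
double signedHeight(const Eigen::Vector3d& point,
                    const Eigen::Vector3d& n, double d)
{
    // Equivalent to the homogeneous scalar product of equation (2.1)
    // after both point and plane have been normalized.
    return n.dot(point) + d;
}
```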

2.2.3 Colour features

Describing colour is difficult since colour is subjective and personal [8]. As a consequence there are lots of different colour descriptors — colour spaces. A colour space can specify, create and visualize colour. Some colour spaces are linear while others are not, and a colour space may be either device dependent or device independent. A device independent colour space will always produce the same colour for the same parameter value no matter what equipment is used to visualize it.

The colour spaces used in this work were hsv and lab, which both can be obtained by conversion from the rgb colour space as in [8].

2.2.4 Fast point feature histogram

Fast point feature histogram (fpfh) [24] is a simplification of the point feature histogram (pfh) [25]. pfh is a descriptor that tries to encode the geometrical properties in a point's k-neighbourhood by generalizing the mean curvature around the point using a multi-dimensional histogram of values. The pfh of a point $p_q$ is based on the estimated surface normals of the surrounding 3D points enclosed within a radius $r$ from the point $p_q$, see figure 2.1. For every pair of points $p_i$ and $p_j$ ($i \neq j$) and their normals $n_i$ and $n_j$ in this neighbourhood, the relative difference is computed. This is done by defining a Darboux frame [25] ($u = n_i$, $v = (p_j - p_i) \times u$, $w = u \times v$) and computing the angular variations of $n_i$ and $n_j$ together with a distance, yielding four features:

$$f_0 = \langle v, n_j \rangle \tag{2.4}$$

$$f_1 = \lVert p_j - p_i \rVert \tag{2.5}$$

$$f_2 = \frac{\langle u, (p_j - p_i) \rangle}{\lVert p_j - p_i \rVert} \tag{2.6}$$

$$f_3 = \arctan\!\left(\langle w, n_j \rangle, \langle u, n_j \rangle\right) \tag{2.7}$$

Figure 2.1: An influence region of the pfh computations on a query point p_q (red). Green points are the k-neighbours enclosed within the circle (sphere in 3D) with radius r. All points are connected in a mesh.

The actual histogram $F_q$ for each point $p_q$ is obtained by calculating an index value based on the four features for all k points in the neighbourhood. The histogram has $3^4 = 81$ bins (3 subdivisions for each feature) and the bin in which a point-pair falls is increased by 1. When all point-pairs have been processed, each bin in the feature vector is normalized by the total number of point-pairs to achieve point density invariance.

Furthermore, in order to reduce the computational complexity of the pfh algorithm, the simplified fpfh was introduced. Instead of using all four features it uses only $f_0$, $f_2$ and $f_3$, see equations 2.4, 2.6 and 2.7. The feature histogram is simplified by simply concatenating the individual feature histograms, allowing more subdivisions. With 11 subdivisions the final feature histogram has $3 \times 11 = 33$ dimensions. The simplification proceeds by computing a Simplified Point Feature Histogram, spfh, for every query point $p_q$. The spfh of a point $p_q$ includes only the relationship between itself and its neighbours, see figure 2.2, not the relationships among the neighbours as in pfh.

Figure 2.2: An influence region of the spfh computations on a query point p_q (red). All points enclosed within radius r constitute a point-pair with the query point p_q.

Then, in a second step, the fpfh for a point $p$ is obtained by using the spfh values of the enclosed k neighbours to adjust the final histogram:

$$FPFH(p) = SPFH(p) + \frac{1}{k} \sum_{i=1}^{k} \frac{1}{w_i} \cdot SPFH(p_i) \tag{2.8}$$

where $w_i = \lVert p - p_i \rVert$. The dimensionality of the fpfh feature is 33.

2.3 Classification

Classification is the process of recognizing, differentiating and understanding objects. The automatic classification task is considered a typical machine learning task where the objective is to specify which of k categories input data belongs to [10]. The task can be solved with supervised learning methods that classify new unseen data based on a generalization learned from labelled training data.

2.3.1 k-Nearest Neighbour

One of the simpler classification techniques is called k-nearest neighbour (knn) [4]. Since each point has an associated descriptor vector consisting of M features, all points can be represented in the N-dimensional feature space, N ≥ M. Labelled training data is distributed in the feature space and new, unlabelled data will end up in the same feature space, see figure 2.3. The label of the new data is determined by simply looking at the labels of the k nearest neighbours and having them vote. The Euclidean distance is commonly used as the distance measure.

Figure 2.3: knn. Two classes of training data in a 2D feature space. Choosing k to be 3 will give the red query point a green label. If the number of neighbours, k, is set to 7 it will instead get a blue label.
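OpenCV's machine learning module, with which the classifiers in this thesis were implemented (see chapter 3), provides such a voting knn. A minimal sketch, where trainFeat (an N × M float matrix of descriptors) and trainLab (an N × 1 matrix of integer labels) are hypothetical names:

```cpp
#include <opencv2/ml.hpp>

// Train a k-nearest-neighbour classifier and classify new samples
// by majority vote among the k nearest training descriptors.
cv::Mat classifyKnn(const cv::Mat& trainFeat, const cv::Mat& trainLab,
                    const cv::Mat& testFeat, int k)
{
    cv::Ptr<cv::ml::KNearest> knn = cv::ml::KNearest::create();
    knn->setDefaultK(k);
    knn->train(trainFeat, cv::ml::ROW_SAMPLE, trainLab);

    // One predicted label per row in testFeat.
    cv::Mat predictions;
    knn->findNearest(testFeat, k, predictions);
    return predictions;
}
```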

2.3.2 Support vector machine

Another classification method is called support vector machine (svm) [12]. It is often used for the binary classification problem, i.e. when only 2 classes are present. There are, however, multi-class support vector machines as well. The most typical variants are the "one-against-one" approach and the "one-against-rest" approach.

In the binary case, let $v_i$, $i = 1, \ldots, k$, be a set of k training sample descriptors and $y$ the corresponding label vector,

$$y_i = \begin{cases} 1, & \text{if } v_i \text{ belongs to class 1} \\ -1, & \text{if } v_i \text{ belongs to class 2.} \end{cases} \tag{2.9}$$

The goal is to find the hyperplane $w^T v + b = 0$ which is able to separate the classes as

$$\begin{cases} w^T v_i + b \geq 1, & \text{if } y_i = 1 \\ w^T v_i + b \leq -1, & \text{if } y_i = -1, \end{cases} \tag{2.10}$$

while at the same time maximizing the margin between the decision boundary and the support vectors, see figure 2.4. For the so-called support vectors, the equations in (2.10) are satisfied with equality:

$$\begin{cases} w^T v_i + b = 1, & \text{if } y_i = 1 \\ w^T v_i + b = -1, & \text{if } y_i = -1. \end{cases} \tag{2.11}$$

Hence, maximizing the margin becomes equivalent to minimizing the Euclidean norm of the weight vector $w$, resulting in the following optimization problem:

$$\min_{w} \frac{1}{2} w^T w \tag{2.12}$$

subject to

$$y_i (w^T v_i + b) \geq 1 \tag{2.13}$$

for all $i = 1, \ldots, k$.

Figure 2.4: The goal is to find the hyperplane $w^T v + b = 0$ that maximizes the margin between the two classes' support vectors. Given the equations for the boundaries, the size of the margin becomes $2/\lVert w \rVert$.

Obviously, the data must be linearly separable for this to work, otherwise no solution — a hyperplane that perfectly separates the classes — will ever be found. However, if the data is not linearly separable it can be mapped to a higher-dimensional feature space, where it may in fact be linearly separable. Let $\phi$ be a mapping function to a higher dimensionality and let $K(v_i, v_j) = \langle \phi(v_i), \phi(v_j) \rangle$ be a kernel function that returns the scalar product of $v_i$ and $v_j$ in the higher-dimensional feature space.

Now, in order to prevent overfitting, slack variables $\xi_i$ are introduced as a trade-off between simpler decision boundaries and fitting the data exactly. Furthermore, an error penalty constant $C > 0$ is also introduced. It all reduces to the quadratic programming problem

$$\min_{w} \frac{1}{2} w^T w + C \sum_{i=1}^{k} \xi_i \tag{2.14}$$

where

$$\xi_i \geq 0, \quad i = 1, \ldots, k, \tag{2.15}$$

subject to

$$y_i \left( w^T \phi(v_i) + b \right) \geq 1 - \xi_i \tag{2.16}$$

for all $i = 1, \ldots, k$.

OpenCV's [1] svm implementation is based on the LibSVM implementation, which in turn uses the "one-against-one" approach for multi-class classification [3]. If n is the number of classes, then n(n − 1)/2 classifiers are constructed, each trained on data from two classes. Each binary classification acts as a vote and the final prediction for a data sample is the class that gets the maximum number of votes.

Parameters

The input parameters to the svm are the following:

• Kernel type
  – Linear
  – Radial Basis Function (RBF)
  – Polynomial
  – Sigmoid
  – Chi2
• Penalty constant C
• Kernel parameters.

According to [3], it is recommended to begin with the RBF kernel:

$$K(v_i, v_j) = e^{-\gamma \lVert v_i - v_j \rVert^2}, \quad \gamma > 0. \tag{2.17}$$

That leaves only two parameters to tune, C and γ. In this thesis RBF is the only kernel tested.

The parameter C, controlling the cost of misclassifications on the training data, will give low bias and high variance if set to a large value, and vice versa. For the kernel parameter γ, a high value will lead to high-bias, low-variance models, and vice versa.
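In OpenCV's ml module this corresponds to a C_SVC model with the RBF kernel; C and γ map directly onto setC and setGamma. A minimal sketch with hypothetical input names:

```cpp
#include <opencv2/ml.hpp>

// Train a multi-class C-SVM with an RBF kernel. OpenCV/LibSVM handles
// the one-against-one decomposition internally.
cv::Ptr<cv::ml::SVM> trainSvm(const cv::Mat& trainFeat, const cv::Mat& trainLab,
                              double C, double gamma)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);      // n-class classification
    svm->setKernel(cv::ml::SVM::RBF);      // K(vi, vj) = exp(-gamma*||vi-vj||^2)
    svm->setC(C);                          // misclassification penalty
    svm->setGamma(gamma);                  // kernel width
    svm->setTermCriteria(cv::TermCriteria(
        cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 1000, 1e-6));

    svm->train(trainFeat, cv::ml::ROW_SAMPLE, trainLab);
    return svm;
}
```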

2.3.3 Random classification forest

A decision tree consists of split nodes and final leaf nodes in a hierarchical fashion, see figure 2.5a. What makes it a decision tree is the fact that each split node stores a test function for the incoming data and that the leaves store an answer (a prediction). A random forest is an ensemble of many randomly trained decision trees, so-called weak predictors, resulting in one strong predictor, see figure 2.5b. This section is based on [5] and [2].

(a) Decision tree. (b) Random forest.

Figure 2.5: Figure (a) shows a decision tree with 3 split nodes and 4 leaf nodes, containing 4 different classification probabilities p(c|v). Figure (b) shows a random forest of N randomly trained trees during testing. A feature vector v falls through each tree, resulting in N classification probabilities p(c|v). The final output of the forest is the average of the classifications.

The training phase of the decision trees happens "off-line" by optimizing the parameters $\theta = (\phi, \psi, \tau)$ in the binary split node functions (one for each node j),

$$h(v, \theta_j) \in \{0, 1\}. \tag{2.18}$$

Here, $\psi$ determines the geometric primitive used to separate the data, which may be an axis-aligned hyperplane, a general surface etc. $\tau$ is the parameter that holds the thresholds for the inequalities used in the binary test, and $\phi$ is a filter function selecting features of choice from the entire feature vector $v$.

The optimization of $\theta$ is done by maximizing an information gain objective function:

$$\theta_j^* = \arg\max_{\theta_j} I_j \tag{2.19}$$

with

$$I_j = I(S_j, S_j^L, S_j^R, \theta_j). \tag{2.20}$$

Here, $S_j$ is the set of samples before the split, $S_j^L$ is the set of samples sent to the left child node and $S_j^R$ the set of samples sent to the right child node. Furthermore, the information gain at node j is defined as

$$I_j = H(S_j) - \sum_{i \in \{L, R\}} \frac{|S_j^i|}{|S_j|} H(S_j^i) \tag{2.21}$$

where $| \cdot |$ denotes the cardinality of a set — the number of elements present in the set — and where H is the Shannon entropy, defined as

$$H(S) = -\sum_{c \in C} p(c) \log p(c). \tag{2.22}$$

Here, $p(c)$ is the normalized histogram of labels of the training samples in S. Thus, the information gain can be seen as the difference between the entropy of the parent and the weighted sum of the entropies of the children.

During testing, see figure 2.5b, a feature vector v falls through a tree, t, by splitting right or left depending on the outcome of each split node function (2.18), finally ending up in a leaf and thus a class prediction. The prediction is represented as a stored posterior $p_t(c|v)$, where $c \in C$ and C is the set of all possible classes.

The random forest

The combination of multiple decision trees, trained independently and slightly differently, is known as a classification forest. Each tree, t, predicts a classification, $p_t(c|v)$, and the classification forest simply outputs the average of the tree classifications,

$$p(c|v) = \frac{1}{T} \sum_{t=1}^{T} p_t(c|v), \tag{2.23}$$

where T is the total number of trees in the forest.

There are different approaches to injecting randomness into the model. The general idea is that randomness reduces overfitting and increases generalization. According to [2], a combination of random input (bagging) and random feature selection can, for large data sets, produce a lower error rate than injecting only one type of randomness.

Bagging, or bootstrap aggregating, is the procedure of selecting only a subset $S_t$ of the whole training data, S, to constitute the training data for each individual tree, t, in the forest. By doing so the variance can be reduced and overfitting avoided.

Random feature selection simply means considering only a random subset $T_j$ of all features $T$ when optimizing the splitting parameter $\theta$ at each node j.

Parameters

When tuning a random forest there are 4 parameters that can be adjusted:

• Number of trees, T.
• Maximum depth, D, of a tree.
• Minimum number of samples, n, required to arrive at a node in order for the tree to perform a new split from that node.
• The amount of randomness, ρ.

According to [5], adding more trees, i.e. higher T, should increase the performance of the forest as it produces smoother class posteriors. The maximum depth parameter, D, also affects the posteriors. Deeper trees tend to increase the overall prediction confidence at the expense of overfitting tendencies, while too-shallow trees produce low-confidence predictions.

The minimum number of samples required to perform a new split, n, controls the forest's robustness to noise. Allowing very few samples to perform a new split may introduce noise. Finally, the amount of randomness, ρ, regulates the correlation among the individual trees. No randomness at all, ρ = |T| (the full feature set), will produce T identical trees, while a low value of ρ will instead yield a low prediction confidence.
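OpenCV's random forest implementation exposes these four knobs more or less directly; setActiveVarCount corresponds to the feature-subset size ρ. A minimal sketch with hypothetical names:

```cpp
#include <opencv2/ml.hpp>

// Train a random classification forest with T trees of maximum depth D,
// minimum split sample count n, and rho features considered per split.
cv::Ptr<cv::ml::RTrees> trainForest(const cv::Mat& trainFeat,
                                    const cv::Mat& trainLab,
                                    int T, int D, int n, int rho)
{
    cv::Ptr<cv::ml::RTrees> forest = cv::ml::RTrees::create();
    forest->setMaxDepth(D);            // maximum tree depth
    forest->setMinSampleCount(n);      // min samples required to split a node
    forest->setActiveVarCount(rho);    // random feature subset size per node

    // The forest stops growing after T trees.
    forest->setTermCriteria(cv::TermCriteria(cv::TermCriteria::MAX_ITER, T, 0));

    forest->train(trainFeat, cv::ml::ROW_SAMPLE, trainLab);
    return forest;
}
```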

3 Method

The system was written in the programming language C++ using the OpenCV [1], PCL [23] and Eigen [11] libraries. For the parameter tuning, a Python script was made to execute multiple consecutive runs with different parameter inputs.

Figure 3.1: System overview


3.1 System overview

The system consists of four main modules: data acquisition, feature extraction, training of the classifiers, and classification. An overview of the system can be seen in figure 3.1.

3.2 Data acquisition

With no labelled training data available, the first step was to gather ground truth data, i.e., to manually label the 3D models. Spotscale AB was able to provide a gui program that made it possible to annotate each polygon in a 3D model without spending too many laborious hours per model. With colours representing the different labels, the annotation was done by colouring each polygon with the correct colour. The output of the program was vectors, one for each label, containing polygon IDs. The set of labels to choose from when annotating the models was the following: Other/Unknown, Window, Wall, Roof, Chimney, Door, Ground, Grass, Tree, Bush, Stairs, Sign.

The next step was to transfer the labels of the polygons in the mesh to the dense point cloud. This was done simply by giving a point the label of the closest located polygon. Each 3D point in the point cloud was compared to each polygon in the mesh when computing the distance, since no optimized search method was known. The distance from a point to a polygon was calculated as in [6], formulating it as a minimization problem that finds the point on the triangle closest to the 3D point. Dealing with large models obviously resulted in long run times. However, optimizing for speed was not considered necessary since the program only had to be run once per model.
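The transfer can be sketched as a brute-force nearest-polygon search. The closest-point-on-triangle computation below follows the standard barycentric case analysis and stands in for the minimization of [6]; all names are illustrative, and none of this is the actual thesis code:

```cpp
#include <Eigen/Dense>
#include <limits>
#include <vector>

struct Triangle { Eigen::Vector3d a, b, c; int label; };

// Distance from p to triangle t via the standard closest-point-on-triangle
// case analysis (vertex, edge and face regions).
double distToTriangle(const Eigen::Vector3d& p, const Triangle& t)
{
    Eigen::Vector3d ab = t.b - t.a, ac = t.c - t.a, ap = p - t.a;
    double d1 = ab.dot(ap), d2 = ac.dot(ap);
    if (d1 <= 0.0 && d2 <= 0.0) return (p - t.a).norm();          // vertex a

    Eigen::Vector3d bp = p - t.b;
    double d3 = ab.dot(bp), d4 = ac.dot(bp);
    if (d3 >= 0.0 && d4 <= d3) return (p - t.b).norm();           // vertex b

    double vc = d1 * d4 - d3 * d2;
    if (vc <= 0.0 && d1 >= 0.0 && d3 <= 0.0)                      // edge ab
        return (p - (t.a + (d1 / (d1 - d3)) * ab)).norm();

    Eigen::Vector3d cp = p - t.c;
    double d5 = ab.dot(cp), d6 = ac.dot(cp);
    if (d6 >= 0.0 && d5 <= d6) return (p - t.c).norm();           // vertex c

    double vb = d5 * d2 - d1 * d6;
    if (vb <= 0.0 && d2 >= 0.0 && d6 <= 0.0)                      // edge ac
        return (p - (t.a + (d2 / (d2 - d6)) * ac)).norm();

    double va = d3 * d6 - d5 * d4;
    if (va <= 0.0 && d4 - d3 >= 0.0 && d5 - d6 >= 0.0)            // edge bc
        return (p - (t.b + ((d4 - d3) / ((d4 - d3) + (d5 - d6))) * (t.c - t.b))).norm();

    double denom = 1.0 / (va + vb + vc);                          // face interior
    return (p - (t.a + (vb * denom) * ab + (vc * denom) * ac)).norm();
}

// Give each 3D point the label of its closest polygon, O(N*M) as in the text.
std::vector<int> transferLabels(const std::vector<Eigen::Vector3d>& points,
                                const std::vector<Triangle>& mesh)
{
    std::vector<int> labels(points.size(), -1);
    for (std::size_t i = 0; i < points.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (const Triangle& tri : mesh) {
            double d = distToTriangle(points[i], tri);
            if (d < best) { best = d; labels[i] = tri.label; }
        }
    }
    return labels;
}
```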

3.2.1 The models

There were 5 different models used in this thesis work, all at the same scale. Each model has a textured mesh and a dense rgb point cloud. Figure 3.2 shows the dense point cloud of each model. All models are of buildings/houses in Sweden.

Nkpg01

The first model is called Nkpg01 and its dense rgb point cloud consists of about 9 million 3D points. In the model there are two large buildings made of concrete, coloured beige/white. The roof is some kind of black sheet metal. The following labels are present in the model: window, ground, roof, wall, door, bush, chimney, sign, stairs and other.


Nkpg02

The second model is from the same city as Nkpg01 and is therefore rather similar. It has about 1 million 3D points in its rgb point cloud and the following labels are present in the model: window, ground, roof, wall, door, chimney and other.

KvPolisen

The third model has about 3 million 3D points. It is a dark brown building made of bricks with a flat roof. On top of the roof there is some sort of storage room shaped like a box. There is vegetation in the shape of trees and bushes surrounding the building. The labels present in the model are: window, ground, roof, wall, door, tree, bush, grass, chimney, stairs and other.

Persby

The fourth model is a red wooden summerhouse located in the Swedish countryside. The roof is black roofing tile and the surrounding ground is a green lawn with two large trees. It is thus rather different from the inner-city models. It has about 12 million 3D points in its point cloud. The following labels are present: window, ground, roof, wall, door, tree, bush, grass, chimney, stairs and other.

Vasallen

The last model, Vasallen, is the largest model in terms of the area it covers. It has about 12 million 3D points. It is a yellowish/beige concrete building with red roofing tiles. The following labels are present: window, ground, roof, wall, door, tree, bush, grass, chimney, stairs and other.

(a) Dense rgb point cloud of Nkpg01
(b) Dense rgb point cloud of Nkpg02
(c) Dense rgb point cloud of KvPolisen
(d) Dense rgb point cloud of Persby
(e) Dense rgb point cloud of Vasallen

Figure 3.2: The rgb point clouds of the models.

3.3 Feature extraction

The feature extraction was done point-wise and the result represented as an N × M matrix, where N is the number of points and M the number of feature dimensions. The following features were extracted: height (h), normal (n), lab, hsv and two different fpfh with radius 2 and 0.2 meters, respectively; see section 2.2. Hence, using all features, the descriptor for each point i in the cloud has 76 elements:

$$d_i = [\,\underbrace{h_i}_{1\times1}\ \underbrace{n_i}_{1\times3}\ \underbrace{lab_i}_{1\times3}\ \underbrace{hsv_i}_{1\times3}\ \underbrace{fpfh_i^{r=2}}_{1\times33}\ \underbrace{fpfh_i^{r=0.2}}_{1\times33}\,] \in \mathbb{R}^{1\times76}. \tag{3.1}$$

After all features were extracted, each dimension in the feature matrix was variance normalized. That is, every dimension was scaled by its variance over the N points in order to have a balanced magnitude of importance across the dimensions — regardless of the unit of measure.
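A minimal sketch of this normalization step with OpenCV matrices (an illustration of the description above, not the thesis code). Dividing each column by its standard deviation instead of its variance is the more common convention; the text is followed literally here:

```cpp
#include <opencv2/core.hpp>

// Scale every feature dimension (column) of the N x M feature matrix
// by its variance over the N points, as described in section 3.3.
void varianceNormalize(cv::Mat& features)  // CV_32F, N x M
{
    for (int j = 0; j < features.cols; ++j) {
        cv::Mat col = features.col(j);
        cv::Scalar mean, stddev;
        cv::meanStdDev(col, mean, stddev);

        double var = stddev[0] * stddev[0];
        if (var > 0.0)
            col /= var;  // dividing by stddev[0] instead gives unit variance
    }
}
```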

3.4 Classification

Three different classifiers were implemented, trained and tested. They were all implemented using OpenCV's [1] machine learning module. The first classifier was knn, see section 2.3.1, and it was primarily used as a benchmark for evaluating different feature spaces, which is one of the major tasks in this thesis. The best performing feature spaces could then be used when training and testing the other, more advanced classifiers: Support Vector Machine, svm, and Random Classification Forest, rcf.

All classifications made throughout this thesis were cross-validated by always using training data from 4 models while classifying on the fifth model. The amount of training data was restricted to a fixed size of 1 million training descriptors and the labels in each model were simplified such that there were only 7 labels: all labels occurring in all models. Furthermore, the training data had to be balanced, which is why an equal number of descriptors of each label was used to constitute the training data. Subject to these requirements, the 1 million samples finally chosen to constitute the training data were otherwise picked randomly from the feature matrices of the training models.

3.4.1 k-Nearest Neighbour as benchmark

In general there are no features that are suitable for all kinds of classification problems. What makes a good feature is how well it separates the classes, which depends on the specific type of classification problem. A subset of the available features might actually separate the classes better than the whole set of features. It is also preferable to keep the dimension of the feature space as low as possible, without decreasing performance, in order to reduce unnecessary complexity.

A finite set of feature vectors, generated from a wide variety of models, can be used to train a knn classifier. The performance can be validated on a validation model, M_val, by extracting the same type of features from M_val and using them as input to the classifier. This process can be repeated with different subsets of features (by removing some of the feature elements from the feature vectors) while observing how the performance on the validation model differs. Testing different feature subsets and comparing their performances makes it possible to find a good feature space for the specific classification purpose; in this case classifying urban scene models.

At first, all features were used and the descriptors looked as in equation 3.1. Performing model-wise cross validation — using training data from 4 models while classifying on the fifth, then rotating which model becomes the validation model — resulted in 5 knn runs for each feature space. After the feature space containing all features had been evaluated, the next step was to repeat the procedure but with one of the 6 features removed (hsv, for example). Since it is possible to remove 1 feature from 6 in 6 different ways, this resulted in another 6 × 5 = 30 knn runs and a total of 30 + 5 = 35 knn classifications. The next step would have been to remove 2 features, then 3 features and so on, but since this would result in another 280 knn runs, it was decided, due to lack of time, to settle for evaluating the feature space containing all features and the feature spaces with one feature removed.
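Removing one feature from the descriptors amounts to dropping a contiguous block of columns from the N × 76 matrix of equation 3.1. A minimal sketch; the column offsets follow from that equation and the function name is made up:

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Drop one contiguous feature block [begin, end) from an N x 76
// descriptor matrix, e.g. columns [10, 43) for fpfh with r = 2,
// given the layout of equation 3.1:
// h(1) | n(3) | lab(3) | hsv(3) | fpfh_r2(33) | fpfh_r02(33).
cv::Mat dropFeatureBlock(const cv::Mat& desc, int begin, int end)
{
    std::vector<cv::Mat> parts;
    if (begin > 0)       parts.push_back(desc.colRange(0, begin));
    if (end < desc.cols) parts.push_back(desc.colRange(end, desc.cols));

    cv::Mat out;
    cv::hconcat(parts, out);
    return out;
}
```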

3.4.2 Random classification forest

Based on the result of the feature space evaluation, see section 4.2, an adequate feature space was decided upon: all features but fpfh with radius 2 meters. An rcf classifier was implemented, taking the following parameters:

• Number of trees, T.

• Maximum depth, D, of a tree.

• Minimum number of samples, n, required to arrive at a node in order for the tree to perform a new split from that node.

• The size of the feature subset, ρ.

According to theory, see section 2.3.3, the performance should increase with the number of trees T — the more the better. However, more trees slow down the run time linearly. Therefore T was kept constant throughout the testing of the other parameters, choosing the relatively low number T = 7 in order to keep the run time down. With only 3 parameters left to tune, a 3D grid of possible parameter values was constructed:

• D = [3, 5, 7, 9, 11, 13, 15, 17, 19]
• n = [2000, 1500, 1000, 750, 500, 250, 150, 50]
• ρ = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31]

Testing all parameter combinations for all validation models could give a hint of which parameters are suitable. Once a suitable combination is found for a certain model, it can be fine-tuned in that local neighbourhood.

Choosing 10 combinations at random and re-running them for different values of T could possibly confirm the theory that performance increases with the number of trees, see table 3.1.

Table 3.1: Random triplets for increasing tree size evaluation

             min samples (n)   max depth (D)   number of features (ρ)
triplet 1    1000              9               31
triplet 2    50                13              13
triplet 3    2000              5               9
triplet 4    1000              19              3
triplet 5    1500              9               5
triplet 6    750               7               25
triplet 7    250               17              19
triplet 8    500               11              7
triplet 9    150               15              15
triplet 10   150               13              9

They were all run for each value of T = [7, 14, 21, 28, 35, 42, 49, 56, 63, 70].

3.4.3 Support vector machine

The svm multi-classifier was implemented, taking the following parameters:

• Kernel type, set to RBF
• Penalty constant C
• Kernel parameter γ

In a similar way as for rcf, a parameter grid search was set up for the svm parameters:

• C = [2^−5, 2^−3, 2^−1, 2^1, 2^3, 2^5, 2^7, 2^9, 2^11, 2^13, 2^15]
• γ = [2^−15, …]

4 Result

4.1 Generating the ground truth

Figure 4.1 shows the mesh of the Nkpg01 model before and after manual label annotation, and figure 4.2 shows the result after transferring the labels from the mesh to its dense point cloud.

(a) The mesh of the Nkpg01 model
(b) The Nkpg01 mesh after manual labelling

Figure 4.1: Nkpg01 before and after manual label annotation

Figure 4.2: The result of transferring the labels from the mesh to the point cloud of the Nkpg01 model. Note that the figure shows the simplified version of the labels, where only 7 labels are present. The label Sign, for instance, has been converted to the label Wall. Furthermore, figure 4.1b lacks a consistent colour scheme, which is why the colours representing the labels differ from figure to figure. The correct colour legend is given in table 4.1.

Table 4.1: Colour legend of labels

Label    Other/Unknown   Wall      Window   Roof     Chimney   Door   Ground
Colour   Black           Magenta   Blue     Yellow   White     Cyan   Red

4.2 Feature space evaluation

The result of the feature space evaluation can be seen in table 4.2. The table shows the classification accuracy, defined as in equation 4.1, after the knn classification of each model with the different feature spaces. The visual results of the classifications can be seen in Appendix A.

$$\text{accuracy} = \frac{\text{number of correctly classified points}}{\text{total number of points}} \cdot 100 \tag{4.1}$$

Table 4.2: Feature space performance, showing classification accuracy (%). The best result for each dataset is marked with *.

Dataset              All Features   w/o Height   w/o Normal   w/o HSV   w/o Lab   w/o FPFH r=2   w/o FPFH r=0.2
Nkpg01               38.9           29.4         30.1         39.9      41.4      47.6*          38.5
Nkpg02               52.0           32.2         49.3         51.8      51.0      58.2*          49.1
KvPolisen            36.7           33.8         33.4         36.3      35.6      41.2*          38.3
Persby               42.0           41.3         37.3         44.4      44.6*     25.5           40.0
Vasallen             26.1           20.0         20.7         30.1      28.9      40.2*          26.0
Average all models   39.1           31.3         34.2         40.5      40.3      42.5*          38.4

As can be seen, the feature space without the fpfh_{r=2} feature gave the highest accuracy on average and for all datasets except Persby, which got its highest accuracy using the feature space without the lab feature.

4.3 Parameter tuning

The visual results of the classifications with the highest accuracy after parameter tuning are given below. The result of each parameter setup is shown in Appendix A.

4.3.1 Random classification forest

Table 4.3 shows the parameter setup that gave the highest accuracy for each model. The result of the grid search (each parameter setup) for each model can be seen in section A.2.1.

Table 4.3: The parameters that gave the highest accuracy

                          n      D    ρ    T   accuracy
Nkpg01                    50     9    15   7   52.80
Nkpg02                    2000   3    29   7   66.67
KvPolisen                 1000   5    3    7   40.27
Persby                    50     19   3    7   19.25
Vasallen                  2000   13   11   7   38.61
Average over all models   150    17   5    7   35.51

The visual result of the classification is given below in figure 4.3.

(a) rcf classification of Nkpg01 with highest accuracy
(b) Ground truth of Nkpg01
(c) rcf classification of Nkpg02 with highest accuracy
(d) Ground truth of Nkpg02
(e) rcf classification of KvPolisen with highest accuracy
(f) Ground truth of KvPolisen
(g) rcf classification of Persby with highest accuracy
(h) Ground truth of Persby
(i) rcf classification of Vasallen with highest accuracy
(j) Ground truth of Vasallen

Figure 4.3: Ground truth compared with the best rcf classifications, based on the accuracy measure, after parameter tuning.


Increasing the number of trees

Figure 4.4 shows the result of increasing the number of trees for the 10 constant triplets of table 3.1, for each model.

(a) Nkpg01
(b) Nkpg02
(c) KvPolisen
(d) Persby
(e) Vasallen

Figure 4.4: Result of increasing the number of trees for the 10 random parameter triplets.


4.3.2 Support vector machine

The result of the grid search for each model is found in section A.2.2. Table 4.4 shows the parameter setup that gave the highest accuracy for each model.

Table 4.4: The parameters that gave the highest accuracy classifying with svm

                          C      γ       accuracy
Nkpg01                    2^−3   2       47.44
Nkpg02                    2^9    2^−15   52.74
KvPolisen                 2^−5   2^3     54.69
Persby                    2^5    2^−15   61.10
Vasallen                  2^7    2^−13   57.06
Average over all models   2^−3   2^−9    34.45

Figure 4.5 shows the visual result of the classifications that gave the highest accuracy after parameter tuning.

(a) svm classification of Nkpg01 with highest accuracy
(b) Ground truth of Nkpg01
(c) svm classification of Nkpg02 with highest accuracy
(d) Ground truth of Nkpg02
(e) svm classification of KvPolisen with highest accuracy
(f) Ground truth of KvPolisen
(g) svm classification of Persby with highest accuracy
(h) Ground truth of Persby
(i) svm classification of Vasallen with highest accuracy
(j) Ground truth of Vasallen

Figure 4.5: Ground truth compared with the best svm classifications, based on the accuracy measure, after parameter tuning.

5 Discussion

This chapter discusses the choice of method, the approach, the results and possible further work.

5.1 Approach

5.1.1 Generating the ground truth data

Early in the planning process it was decided to do point-wise classification. With no ground truth data available, the first major task was to acquire correctly labelled point clouds to be used as training data. The approach chosen was to manually annotate each polygon of the mesh of the 3D model and then to transfer the labels to the point cloud by simply giving a point the label of the closest located polygon. The drawback of the method is the limited precision. A polygon may cover an area including multiple labels, see figure 5.1, finally resulting in mislabelled 3D points. However, given the limited time frame of the thesis and the fact that no alternative method could fulfil the demand, the approach seemed motivated.


Figure 5.1: Manual labelling process. There is no guarantee that each polygon only covers the area of one label.

5.1.2 Choosing the training data

Before training any classifier, the training data is picked out as described in section 3.4. Thus, all 7 labels occurred in each model, none of the training samples came from the validation model, and the size of the training data was consistently constant for all classifications. Furthermore, the training data was forced to be balanced, consisting of equally many samples of each label. However, since the samples were picked randomly, as long as they fulfilled those criteria, there was no guarantee that equally many samples were picked from each of the four training models. In practice, this also means that there is some randomness involved — depending on which training points are actually chosen — in how good the classifier will turn out. Furthermore, the method is rather wasteful as it only uses a small subset of the available training data, just in order to balance it. Section 5.3.3 discusses an alternative way of using the training data.

5.1.3 Evaluating feature spaces

Since there is an infinite number of possible feature spaces, where only the imagination is the limit, the feature space evaluation had to be strictly limited due to the time aspect. Hence, the approach was restricted to evaluating the full set of features extracted from each model, see equation 3.1, as well as all possible subsets with one feature removed. In total, that resulted in 7 different feature spaces of various dimensions. With more time, all possible feature space subsets should have been evaluated.


The performance of the knn classifier is very dependent on how well the features are able to separate the different labels in the feature space. Furthermore, the knn classifier is one of the simplest classifiers, with only one parameter, which was another reason why it seemed suitable for this purpose. The most obvious drawback of using knn is the run time, since it does not generalize over the data in advance — as svm and rcf do — but searches through the stored training data each time a sample is classified. The more training data, the greater the time complexity. Each feature space was evaluated on each model, using a constant number of samples from the other models as training data, thus performing cross-validation in order to reduce the risk of overfitting.

5.1.4 Parameter tuning

In order to compare the classifiers, some parameter tuning was considered necessary. A grid search testing all combinations of parameter setups within the grid boundary was computed. This resulted in many combinations, especially for the rcf classifier, which had 3 parameters to tune, and this in turn resulted in a time-consuming process. The benefit of using grid search is that it is rather intuitive. Blindly looking at the accuracy for the validation model did not consider the possibility that the classifiers were overtrained. An accuracy measure also for the training data could give a hint of whether overfitting has occurred. More about this in section 5.3.3.

5.2 Results

5.2.1 Generating the ground truth data

As can be seen in figures 4.1 and 4.2, the ground truth acquisition turned out as expected. Not perfect, but with a vast majority of correctly labelled points. The polygon annotation could of course have been performed even more thoroughly; however, the precision of the method would still be limited by the size of the polygons, inescapably leading to a small percentage of incorrectly labelled points in the point cloud.

5.2.2 Feature space evaluation

With limited possibilities to parallelise the feature space evaluations, due to lack of access to multiple computers, they took a long time to finish. Interestingly, the results presented in table 4.2 suggest that using all features may not be the best alternative. For all data sets except Persby, the knn classifier performs better without the fpfh feature with radius 2 meters. For Persby, the best performing feature space is the subset without the lab feature. Removing the height feature results in worse performance for every data set, suggesting that the height feature is an important one. Another observation based on the accuracy results is that the performance is about the same or slightly better when using only one of the colour space representations, hsv or lab. Finally, it is worth mentioning that the feature space that performed best on average over all models was the one without the fpfh feature with radius 2 meters, which is why only that feature space was used in the investigations of the classifiers.

5.2.3 Parameter tuning

From table 4.3 in section 4.3.1 and the figures in section A.2.1 in appendix A, one can observe an apparent variation among the parameter combinations that performed best for each individual data set. The variation is possibly caused by overfitting to the chosen training data subset, which, in turn, may not be a good representation of the whole training data. Even though model-wise cross-validation was done during the parameter tuning, the subset of samples from the training models that constituted the training data may have looked rather different when tuning parameters for Nkpg02 and for Persby, for example. Consequently, only looking at the accuracy of the validation models (Nkpg02 in one case, Persby in another) may have forced the tuning of the parameters into a bad generalization of the training data. Since overfitting happens when the gap between the training error and the test error is too large, it can be identified by also looking at the training accuracy [10]. As mentioned in section 5.1.4, the training accuracy was never calculated and the comparison was hence never done. However, when it comes to the rcf classifier, it is the tree depth parameter D that regulates the amount of overfitting, where a large D tends to overfit and a small D tends to underfit according to [5]. In general, the best parameter combination is the one that generalizes best, which would be the one with the highest accuracy on average over all models. Looking at table 4.3, the tree depth value for the best parameter combination on average is relatively large (D = 17), but whether a tree depth of 17 is too large is difficult to say from its value alone, since the optimal tree depth is a function of the problem complexity [5].

Studying the visual result for Nkpg02 using the best performing parameters, see figure 4.3c, one can notice that the only labels present in the classification are roof, wall, door, ground and other. In other words, no points were classified as window or chimney. This may be the result of a very shallow tree depth (D = 3). However, this particular rcf model apparently gave the highest classification performance, even though the labels chimney and window were completely missing, which may suggest that a different evaluation measure would have been appropriate. More about this in section 5.3.

According to theory, see section 2.3.3, adding more trees to a forest should only increase the performance of the classifier. In figure 4.4, the varying performance of each constant parameter triplet is shown for an increasing number of trees. For most of the triplets the performance stays about the same going from 7 trees to 70. Surprisingly, some triplets suddenly drop in performance for some values of the number of trees. The trend when increasing the number of trees seems somewhat erratic, not in accordance with theory. One possible explanation for this behaviour is the randomness introduced when picking out the training data samples, as discussed in section 5.1.2 above. Consequently, the one million training samples may not be the exact same samples when evaluating, for example, 7 trees and 14 trees. However, the fact that a slightly different subset of training samples affects the validation accuracy so drastically is yet another indication that overfitting has occurred.

The result after svm parameter tuning is seen in table 4.4. The optimal parameters for the different data sets are totally different. In the same way as for rcf, it is likely that these variations among the optimal parameters are the effect of not using the training samples in an efficient way, in combination with overfitting to the training data. An alternative approach to avoid this is discussed in section 5.3.3. Looking only at the accuracy result for Persby, 61% correctly classified points seems incredibly good — especially since Persby was the hardest data set to classify well. However, looking at figure 4.5, one can see that a relatively high accuracy does not necessarily result in a good visual result. Yet again, this suggests that an alternative evaluation method might have been appropriate, depending on what is asked for. A further interesting observation is the rather good visual result for Vasallen in figure 4.5i. For Vasallen, the svm classifier gave both a better accuracy result and a better visual result than the rcf classifier did.

5.3 Further work

5.3.1 Investigating the effect of more training data

Throughout this thesis work, each classification — knn, rcf and svm — has been executed using a fixed amount of training data: one million training samples. An interesting investigation would be to evaluate the effect of increasing the amount of training data. This could be done by using a simple learner such as knn, running a number of classifications with varying amounts of training data and seeing if there is any trend, performance wise, when adding more training data. Another interesting approach would be to evaluate the effect of increasing the variety of the training data — having low variety when the training data is constituted from only one model compared to a higher variety when constituted from several models. These investigations could perhaps give a hint of how much training data — from how many models — would be needed to reach a certain classification accuracy.

5.3.2 Feature space evaluation

A wider feature space evaluation would be needed in order to draw any conclusions about which feature space is suitable for classifying urban scene models. It would be interesting to introduce new features, but also to evaluate all subsets of the feature vector in equation 3.1 — even the subsets containing only 5, 4, 3, 2 and 1 features. Furthermore, one might also investigate thoroughly whether two fpfh features of different radii work well in synergy with each other.
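A minimal sketch of how such an exhaustive subset evaluation could be organized, assuming a hypothetical column layout of the point-wise feature matrix (the slices below, including the 33-dimensional fpfh histograms, are illustrative and may differ from the layout actually used in this work):

```python
from itertools import combinations
import numpy as np

# Hypothetical column layout of the point-wise feature matrix.
feature_slices = {
    "height":    slice(0, 1),
    "normal":    slice(1, 4),
    "hsv":       slice(4, 7),
    "lab":       slice(7, 10),
    "fpfh_r2.0": slice(10, 43),   # fpfh is 33-dimensional
    "fpfh_r0.2": slice(43, 76),
}

def feature_subsets(names):
    """Yield every non-empty subset of the named features."""
    for k in range(1, len(names) + 1):
        yield from combinations(names, k)

for subset in feature_subsets(list(feature_slices)):
    cols = np.r_[tuple(feature_slices[name] for name in subset)]
    # X_sub = X[:, cols]  # train and validate a classifier on this subspace
    print(subset, "->", len(cols), "dimensions")
```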

5.3.3 Using training data more efficiently and re-doing parameter tuning

As can be seen in the results, the parameter tuning did not turn out well, and whether grid search is the best approach for this purpose is questionable. There are other methods for finding optimal parameters [7], which would be interesting to investigate as alternatives to grid search.
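One possible alternative is random search, in which parameter values are sampled from distributions rather than evaluated on a fixed grid, often covering the parameter space better for the same budget. A minimal sketch, assuming scikit-learn and SciPy and using synthetic stand-in data:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the point features and labels.
X, y = make_classification(n_samples=2_000, n_features=30, n_informative=10,
                           n_classes=7, random_state=0)

# Sample (C, gamma) from log-uniform distributions instead of a fixed grid.
search = RandomizedSearchCV(SVC(kernel="rbf"),
                            param_distributions={"C": loguniform(1e-2, 1e3),
                                                 "gamma": loguniform(1e-4, 1e1)},
                            n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, f"cv accuracy: {search.best_score_:.3f}")
```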

However, the grid search method itself is probably not the issue. The training data was always one million samples, even though it came from 4 models with a total of at least 23 million samples — depending on which model was classified. Instead of wastefully selecting only one million samples from the whole training data volume, all samples should have been used in a k-fold cross-validation. A k-fold cross-validation divides the training data into k subsets and lets one subset constitute the validation data while the classifier is trained on the other k − 1 subsets. The classifier is then validated on the samples in the validation subset before the roles are rotated and the procedure restarts [10]. In this way the risk of overfitting is minimized, and the most promising parameters (the ones that, on average over the k iterations, got the highest accuracy) could be tested on the test data — the fifth model, called the evaluation model.
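A minimal sketch of this scheme, assuming scikit-learn; StratifiedKFold additionally keeps the label proportions similar across folds, and synthetic data stands in for the full training pool:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the full pool of annotated training samples.
X, y = make_classification(n_samples=10_000, n_features=30,
                           n_informative=10, n_classes=7, random_state=0)

# Every sample is used both for fitting and for validation across the folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=70,
                                                random_state=0),
                         X, y, cv=cv)
print(f"mean accuracy over {cv.get_n_splits()} folds: {scores.mean():.3f}")
```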

The problem with using all training data is that the data is most likely unbalanced. This could, however, be compensated for by using weights that are inversely proportional to the number of samples belonging to each label. That is, a sample belonging to a common label gets a low weight, while samples with uncommon labels get higher weights.
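A minimal sketch of such inverse-frequency weighting on toy labels; the formula is the same one scikit-learn applies when class_weight="balanced" is used:

```python
import numpy as np

# Toy labels; 5 = ground is deliberately over-represented.
y = np.array([5, 5, 5, 5, 5, 0, 0, 2, 4])
classes, counts = np.unique(y, return_counts=True)

# weight(c) = n_samples / (n_classes * count(c))
class_weights = len(y) / (len(classes) * counts)
sample_weight = class_weights[np.searchsorted(classes, y)]

print(dict(zip(classes.tolist(), np.round(class_weights, 2))))
# {0: 1.12, 2: 2.25, 4: 2.25, 5: 0.45} -- rare labels weigh more
```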

Weights could also be applied to the validation data and the test model. During the parameter tuning, the training data was balanced, but that was not the case for the evaluation models. A theory that could perhaps explain the results seen in figures 4.5a and 4.5g is that the evaluation models (Nkpg01 and Persby in this case) are unbalanced — resulting in a biased evaluation of the best parameters for those models. As it turned out, in both models the classifier basically outputs the majority of points as the label Ground (red), which apparently is enough to get a high accuracy, due to the unbalanced labels in the models.
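A minimal sketch, assuming scikit-learn, of an evaluation measure that takes this imbalance into account: balanced accuracy is the mean of the per-class recalls, so a classifier that outputs Ground for nearly every point no longer scores well.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([5] * 90 + [0] * 5 + [2] * 5)  # 90% of points are ground
y_pred = np.full_like(y_true, 5)                 # classifier outputs only ground

print(f"plain accuracy:    {np.mean(y_true == y_pred):.2f}")                # 0.90
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")  # 0.33
```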

In future work one should re-do the parameter tuning, use the training data less wastefully as described above, and take the unbalanced evaluation models into account as well.


6 Conclusion

This master's thesis has included three major tasks: acquiring ground truth data, evaluating feature spaces, and evaluating three different classifiers. In general, the method used for acquiring the training data worked well but was still very time consuming. As a consequence, the rest of the results were based only on the five 3D models that were annotated. Moreover, the models that were used are all models of houses and other buildings in Sweden. In other words, the extent of this work is simply not enough to draw any general conclusions. However, based on the data at hand, the feature space evaluation suggested that using as many features as possible is not necessarily the best option, and that the best performing feature space among those evaluated had the following features: height, normal, hsv, lab, and fpfh with radius 0.2 m.

Comparing the different classifiers is difficult. The knn classifier is intuitive and worked well as a benchmark for the feature space evaluation. On the other hand, it is by far the slowest classifier when dealing with a large set of training data, since it does not generalize over the data in advance but searches through the training data each time a new sample is classified. The rcf classifier seems promising even though the accuracy was never higher than 66.7%. Whether this had to do with the lack of ground truth training data — "the curse of annotation" — or with overfitting during the parameter tuning is difficult to say. The same goes for the svm classifier. With only two parameters to tune, compared to the four parameters of the rcf, the svm parameter tuning took less time. Furthermore, looking at the best average accuracy over all models for a given parameter set-up, the svm and rcf had about the same accuracy, around 35%. Somewhat surprisingly, the knn had an average accuracy over all five models of 42.5%, thereby being the best one in that sense.


A Complementary result figures

This appendix contains results from the parameter tuning of both the rcf classifier and the svm classifier, and complementary figures from the feature space evaluation.


A.1 Feature space evaluation

A.1.1 All features

[Figure (images not reproduced): knn classification using all features; two views each of Nkpg01, Nkpg02, KvPolisen, Persby and Vasallen.]


A.1.2 Without the height feature

Figure A.2 (images not reproduced): knn classification without the height feature; two views each of Nkpg01, KvPolisen, Persby and Vasallen. Images of Nkpg02 are unfortunately missing.


A.1.3 Without the normal feature

[Figure (images not reproduced): knn classification without the normal feature; panels show Nkpg01, Nkpg02, KvPolisen, Persby and Vasallen.]


A.1.4 Without the Hue, Saturation and Value feature

[Figure (images not reproduced): knn classification without the hsv feature; panels show Nkpg01, KvPolisen, Nkpg02, Persby and Vasallen.]


A.1.5 Without the CIELab feature

[Figure (images not reproduced): knn classification without the CIELab feature; panels show Nkpg01, KvPolisen, Nkpg02, Persby and Vasallen.]


A.1.6 Without the FPFH feature with radius 2 meters

[Figure (images not reproduced): knn classification without the fpfh feature with radius 2 meters; two views each of Nkpg01, Nkpg02, KvPolisen, Persby and Vasallen.]


A.1.7 Without the FPFH feature with radius 0.2 meters

[Figure (images not reproduced): knn classification without the fpfh feature with radius 0.2 meters; two views each of Nkpg01, Nkpg02, KvPolisen, Persby and Vasallen.]


A.2 Result of parameter tuning

A.2.1 Random Classification Forest

Nkpg01

[Figure (images not reproduced): rcf parameter tuning result for Nkpg01; panels (a)–(h) show the minimum number of samples set to 50, 150, 250, 500, 750, 1000, 1500 and 2000.]


Nkpg02

[Figure (images not reproduced): rcf parameter tuning result for Nkpg02; panels (a)–(h) show the minimum number of samples set to 50, 150, 250, 500, 750, 1000, 1500 and 2000.]


KvPolisen

[Figure (images not reproduced): rcf parameter tuning result for KvPolisen; panels (a)–(h) show the minimum number of samples set to 50, 150, 250, 500, 750, 1000, 1500 and 2000.]


Persby

[Figure (images not reproduced): rcf parameter tuning result for Persby; panels (a)–(h) show the minimum number of samples set to 50, 150, 250, 500, 750, 1000, 1500 and 2000.]


Vasallen

[Figure (images not reproduced): rcf parameter tuning result for Vasallen; panels (a)–(h) show the minimum number of samples set to 50, 150, 250, 500, 750, 1000, 1500 and 2000.]


A.2.2 Support Vector Machine

Nkpg01

[Figure (image not reproduced): svm parameter tuning result for Nkpg01.]

Nkpg02

Figure A.14 (image not reproduced): svm parameter tuning result for Nkpg02.

KvPolisen

[Figure (image not reproduced): svm parameter tuning result for KvPolisen.]

Persby

Figure A.16 (image not reproduced): svm parameter tuning result for Persby.

Vasallen

[Figure (image not reproduced): svm parameter tuning result for Vasallen.]

References
