
Classification of tree species from 3D point clouds using convolutional neural networks

Master in engineering physics

July 7, 2020

Author: Marcus Wiklander
Supervisors: Petter Lindgren, Johan Holmgren, Henrik Persson


Contents

1 Introduction
  1.1 Background
  1.2 Purpose and goal
2 Theory
  2.1 Light detection and ranging, LiDAR
  2.2 Backprojection
  2.3 Plane projections and rotations
  2.4 Machine learning
  2.5 Deep learning and neural networks
    2.5.1 Training
    2.5.2 Training data, validation data and test data
    2.5.3 Overfitting
  2.6 Convolutional neural networks
    2.6.1 Pooling layers
    2.6.2 Hyperparameters
3 Method
  3.1 Data set
  3.2 Backprojection
  3.3 Annotation of data
  3.4 Preparation of data
  3.5 Testing for best network architecture
  3.6 Final testing
4 Results
  4.1 Annotation of data
  4.2 Classification - Neural network
5 Discussion
  5.1 Retrieval of training data
  5.2 Comparison of architectures
  5.3 Error analysis
  5.4 Improvements
  5.5 Conclusion
References
A Appendix

1 Introduction

1.1 Background

There is great interest in being able to classify individual tree species, preferably from data covering large areas. When planning harvesting and estimating forest property values, the distribution of tree species is an important factor to take into account, both economically and with respect to biodiversity.[3] Since manually estimating the tree species distribution out in the field is tedious labour, a method for doing this using data from airborne laser scanning (ALS) would be very advantageous and reduce costly field work.

The Swedish University of Agricultural Sciences (SLU) is currently working on a project together with the company Svenska Cellulosa AB (SCA), in which the aim is to further develop the automation of forest inventory. The project involves laser scanners and cameras attached to a helicopter, collecting data at relatively low altitudes. This data, collected by the company Visimind AB, will be used in this project to develop a method capable of classifying individual tree species from 3D point clouds.

In order to automate forest inventory, the collected data has to be of high quality so that important forestry variables can be determined. Dense 3D point clouds are necessary for automatic detection of individual trees, exact measurements of tree heights and classification of tree species. With precise results from sample forests, the same variables can be estimated on a larger scale for large forest areas.

Different tree species have different uses due to differences in their properties. Spruce and pine are two dominating tree species for construction purposes and are also used for furniture and paper production. Broadleaf trees like birch and aspen are mainly used for furniture and paper.[4] The price of individual tree species depends on the demand and on the purpose for which they are used.

The presence of different tree species is important for the biodiversity of a forest, which in turn is key to a healthy forest ecosystem. A single tree species can host up to 1100 other species, such as fungi, lichens and various invertebrates.[5] Sweden is party to the United Nations' convention on biodiversity, meaning that the country is obligated to ensure the persistence of forest biodiversity.[6]


Features commonly used for classification are the shape of the tree canopy, the height distribution of the tree and the intensity of the returned pulses (for a scanned point). To obtain a feature for the shape of the canopy, an ellipsoid can be fitted to the data points, and its dimensions used in different ways as classification variables. The way the points are distributed also gives information about the tree species; for example, pine trees have a high point density at the top of the tree, while a spruce is denser at the lower part.[7][1] In this project, instead of hand-crafting specific features of the trees, the classification is done with machine learning, more specifically artificial neural networks.

An artificial neural network is a computational model composed of interconnected processing units called neurons or nodes, which uses sample data (input) to learn to perform certain tasks or to generate predictions (output) without being explicitly programmed how to do so. Neural networks are especially good at finding complex patterns or visual structures without rules or previous knowledge about the data.[9] Finding unique patterns for different tree species is therefore, in theory, a fitting task for a neural network. In this project, convolutional neural networks were used for classification. A convolutional neural network is a type of artificial neural network that is particularly good at analysing images and finding local visual patterns.[10]

1.2 Purpose and goal

2 Theory

2.1 Light detection and ranging, LiDAR

LiDAR is an optical measuring technique for determining the x, y and z coordinates of a target with high accuracy. A laser scanner system illuminates the target with laser light and measures the returning reflection with a sensor. 3D coordinates of the target can then be computed from the speed of light and the laser return time. Additional information about the position and orientation of the sensor, from a GPS and an inertial navigation system (INS), gives the 3D coordinates a geographical reference.[11]

2.2 Backprojection

Backprojection is the process of projecting a 3D world point back onto the image plane of a camera. The first step in the process is relating the reference frames of the 3D point and the camera using the extrinsic camera parameters (the location and orientation of the camera reference frame with respect to the world reference frame):[12]

$$P_C = R(P_W - T) \tag{1}$$

where $P_C$ and $P_W$ are the coordinates of the point in the camera and the world reference frames, and $R$ and $T$ are the rotation matrix and the translation vector between the reference frames. More specifically,

$$T = \begin{pmatrix} T_x \\ T_y \\ T_z \end{pmatrix}, \qquad R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \tag{2}$$

To get coordinates in the image plane, which lies along the $Z_C$-axis, the following relation is used:

$$x = f\,\frac{X_C}{Z_C}, \tag{3}$$

and with $X_C$ and $Z_C$ from equation 1 we get

$$x = f\,\frac{R_1^T(P_W - T)}{R_3^T(P_W - T)}, \tag{4}$$

where $R_i^T$ denotes the $i$-th row of $R$.

One final transformation, from image coordinates to image pixel coordinates, is attained with

$$x_{px} = -x/s_x + x_0, \tag{5}$$

where $x_0$ is the x coordinate of the image's principal point and $s_x$ is the effective size of the pixels in the horizontal direction. With equation 4, the image pixel coordinate becomes

$$x_{px} = x_0 - f_{s_x}\,\frac{R_1^T(P_W - T)}{R_3^T(P_W - T)}, \tag{6}$$

where $f_{s_x} = f/s_x$. The same relations hold for $y_{px}$ with the corresponding y-values.[12]
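To make equations 1-6 concrete, the following is a minimal Python sketch of backprojecting a single world point to pixel coordinates. The camera parameters (R, T, f, s_x, s_y, x_0, y_0) are hypothetical illustration values, not the calibration used in this project.

```python
import numpy as np

def backproject(P_w, R, T, f, s_x, s_y, x_0, y_0):
    """Project a 3D world point to image pixel coordinates, following eqs. (1)-(6)."""
    P_c = R @ (P_w - T)            # world -> camera reference frame, eq. (1)
    x = f * P_c[0] / P_c[2]        # image plane coordinates, eqs. (3)-(4)
    y = f * P_c[1] / P_c[2]
    x_px = x_0 - x / s_x           # image plane -> pixel coordinates, eqs. (5)-(6)
    y_px = y_0 - y / s_y
    return x_px, y_px

# Hypothetical camera looking straight at a point 100 m away
R = np.eye(3)                      # no rotation between the frames
T = np.zeros(3)                    # no translation
print(backproject(np.array([1.0, 2.0, 100.0]), R, T,
                  f=0.05, s_x=5e-6, s_y=5e-6, x_0=2000, y_0=1500))
```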

2.3 Plane projections and rotations

Projecting 3D points onto the axis planes is done by setting each point's value to zero for the axis perpendicular to the plane of choice. For projections onto the XY-plane, the Z values are set to zero. A visualization can be observed in figure 1.

Figure 1 – Four points in the left image being projected onto the XY-plane in the right image, by setting z values equal to 0.

Rotating a point counterclockwise around a certain axis by an angle θ is done by multiplying the point with the rotation matrix for that axis.[13] For a rotation around the Z-axis the rotation matrix looks as follows:

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

and the rotated point becomes

$$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix}. \tag{7}$$

For a rotation of $\theta = 45^\circ$ we have

$$\sin 45^\circ = \cos 45^\circ = \frac{1}{\sqrt{2}}, \tag{8}$$

and equations 8 and 7 give

$$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \frac{x-y}{\sqrt{2}} \\ \frac{x+y}{\sqrt{2}} \\ z \end{pmatrix}. \tag{9}$$
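Both operations are easy to apply to a whole point cloud at once. The following numpy sketch mirrors the steps above, including the rotate-then-project view described in section 3.3; the random point cloud is a stand-in for a segmented tree.

```python
import numpy as np

points = np.random.rand(1000, 3)          # stand-in point cloud, rows are (x, y, z)

# Projection onto the XZ-plane: zero the coordinate of the perpendicular (Y) axis
xz_proj = points.copy()
xz_proj[:, 1] = 0.0

# Rotation matrix for a 45 degree counterclockwise rotation around the Z-axis, eq. (7)
theta = np.deg2rad(45.0)
R_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])

# Rotate every point, then project onto the XZ-plane
rotated = points @ R_z.T
rotated_xz = rotated.copy()
rotated_xz[:, 1] = 0.0
```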

2.4 Machine learning

In classical programming, rules are typically defined by the user and applied to data to produce answers to a problem. Machine learning, in contrast, uses input data and answers to produce rules; these rules can in turn be applied to new data to obtain new answers.[10] A machine-learning system uses relevant examples, so-called training data, in a mathematical algorithm to find statistical structures in the data which can be used to make decisions or predictions about the desired outputs. This process is called "learning" and involves the transformation of training data into more useful representations. Coordinate changes, linear projections and translations are examples of such transformations, and exactly how they should be applied to give the best possible prediction is what the system learns. The transformations in traditional (shallow) machine learning tend to be restricted to a predefined set of operations, the so-called hypothesis space. Deep learning, a subset of machine learning, is not restricted in the same way, but instead learns abstract representations through numerous layers of an artificial neural network.[10][14]

2.5 Deep learning and neural networks


Figure 2 – A simple neural network with layers of four neurons, each fully connected (dense layers) by weights to the next layer. The input data enters at the first layer and the final output comes from the last layer of neurons.

2.5.1 Training

The weights of a network are what transform input data into new representations, layer by layer, to finally output a prediction in the last layer. The values of the weights are adjusted during training through the use of training data with known outputs, so-called labeled data. During training, predictions computed by the network are compared to the true answers and a loss score is calculated. The loss score is a measure of how far the network's predictions are from the true output, and it is used to adjust the weights. An optimization algorithm, the optimizer, is used to transform the loss score into weight adjustments. The optimal adjustment of the weights is computed using the so-called backpropagation algorithm. The process of neural network training is visualized in figure 3.[10][15] Since all weights in previous layers influence the final output, optimizing the value of every weight becomes a complex problem. Backpropagation is a method for minimizing the loss score by computing the gradient of the loss function with respect to the network's coefficients. These coefficients (weights) are all differentiable, making it possible to find the gradient of the loss and adjust the weights. By repeatedly taking small steps against the gradient direction in every training iteration, the loss score decreases and the overall result of the network improves.[10][15]
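In a high-level library, the loss function and optimizer are specified once and the predict-compare-backpropagate loop of figure 3 runs inside the fit call. Below is a hedged Keras-style sketch with hypothetical layer sizes, not the networks used in this project:

```python
from tensorflow import keras

# Hypothetical architecture; training data (x_train, y_train) assumed already loaded
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    keras.layers.Dense(3, activation="softmax"),   # e.g. three tree species
])

# The optimizer turns the loss score into weight adjustments via backpropagation
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Each training iteration: predict, compute loss, backpropagate, update weights
# model.fit(x_train, y_train, epochs=10, batch_size=32)
```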

2.5.2 Training data, validation data and test data


Figure 3 – A simple illustration of how the training of a network looks. Input data X is transformed by the weights in each layer. Predictions are made based on the new representations of X and are compared with the correct answers, producing a loss score. The optimizer uses the loss score to adjust the weights for the next training iteration, using backpropagation.

optimize accuracy and loss for the validation data set. Even though the validation set is not used in training, the network often becomes biased towards it, because the user tends to tweak the network to optimize validation scores. This is why a third data set is necessary: test data, used as a final unbiased evaluation of the network. Test data is often what is used to compare results from different networks.[16][17]

2.5.3 Overfitting

During training, the network learns representations of the training data to achieve better and better results for loss and accuracy. Before the network is able to find relevant representations and achieve good results, it is said to be underfit. As training proceeds, there often comes a point at which the network's performance on validation data peaks and starts to degrade, while its performance on training data continues to improve. When this occurs, the network has started to adjust its algorithm based on representations of specific details found only in the training data. This is called overfitting. The aim is to get a network that is good at generalization (i.e. predicting correct results for unseen data); therefore, methods for reducing overfitting are of high importance. The most obvious way of doing so is to gather more training data. Besides this, two very common methods for avoiding overfitting are data augmentation and regularization.


task. This makes data augmentation a useful tool. Essentially, additional simulated data is created from already existing training data to produce a larger training dataset. The augmentation consists of transformations of the data such as rotations, flips, translations and scaling, all used to construct new data in order to improve the generalization error.

Another way of improving the network's generalization ability is to use regularization methods. One common method is so-called dropout regularization, which works by randomly removing (dropping out) a number of layer outputs during training by setting them to zero. This reduces the learning of insignificant patterns and makes the network less likely to overfit the data.

Both of the mentioned methods for reducing overfitting are applied only during training and not used for validation or testing of the network.
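In Keras-style code, both techniques amount to a few lines. A minimal sketch with illustrative augmentation parameters and dropout rate (not the values used in this project):

```python
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data augmentation: simulated new samples from rotations, flips and shifts
augmenter = ImageDataGenerator(rotation_range=15,
                               horizontal_flip=True,
                               width_shift_range=0.1)

# Dropout regularization: randomly zero 50 % of the layer outputs during training
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(64, 64, 1)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),          # active only during training
    keras.layers.Dense(3, activation="softmax"),
])
# Training would then draw augmented batches, e.g.:
# model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=10)
```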

2.6 Convolutional neural networks

A convolutional neural network, CNN or convnet, is a type of deep learning model known especially for its practical applications in computer vision. A network is said to be convolutional if it uses a mathematical operation called convolution in at least one layer. One of the big advantages of convnets is that each layer in the network learns local patterns, in contrast to regular dense layers, which learn global patterns. A specific pattern learnt by a convnet layer can later be recognized in other locations of an image, making the network translation invariant. In contrast, when a specific pattern learned by a regular dense layer appears in a different position, a dense-layer-based network would need training data containing the same pattern in the new position to recognize it. This gives convnets an advantage, since less training data is needed to learn these kinds of representations.[10][16]


Figure 4 – A filter (blue, 3x3) convolving over a 6x6 image of the digit four. The image displays two values computed for the output feature map.

2.6.1 Pooling layers

In convolutional neural networks, the layers of the architecture are often structured in sequences, or blocks, of layers. These sequences often involve one or two convolutional layers and one so-called pooling layer. The pooling layers are used to reduce the number of parameters by downsampling the feature maps. A max-pooling layer extracts windows from the feature maps and outputs only the maximum value. Typically, these windows are 2x2 in size with no overlap between the extracted windows, which reduces each feature dimension (but not the channels) by a factor of 2.[10]
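A minimal numpy sketch of the two operations: one 3x3 filter sliding over a 6x6 image (as in figure 4, implemented as cross-correlation, the convention CNN libraries use), followed by 2x2 max pooling:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small filter over a 2D image, no padding ('valid' output size)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling, halving each feature dimension."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(6, 6)            # 6x6 input as in figure 4
kernel = np.random.rand(3, 3)           # one 3x3 filter
fmap = conv2d(image, kernel)            # 4x4 feature map
pooled = max_pool(fmap)                 # 2x2 after max pooling
```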

2.6.2 Hyperparameters

network. Performing a parametric sweep for the hyperparameters is a more useful way of finding a high-performing network. This is a process in which different values for chosen hyperparameters are tested, in order to output the model with the highest performance. When designing convolutional neural networks, common hyperparameters used for parametric sweeps are:

• Number of layers - Determines the depth of the network. Too few layers might result in underfitting, while too many can result in overfitting. The more layers used, the more detailed the information that can be learned (training-set-specific details when overfitting).

• Number of filters - The number of filters convolving over the input data. Can be seen as how many types of features/patterns to extract from the input data.

• Filter size - The size of the convolving filters. Big filters collect global information from the input data, while small filters collect local patterns.

• Number of neurons in dense layers - Has a similar effect to the number of filters and number of layers, controlling how detailed the information learned from previous layers is.

• Dropout rate - The fraction of the output features of a layer to drop out during training.

3 Method

3.1 Data set

The data used in this project are photographic images and LiDAR data (point clouds) of forest owned by SCA, mostly consisting of spruce, pine and birch trees. A helicopter equipped with two cameras and a laser scanner collected data at an altitude of 70 meters above the ground. This is considered a low flight altitude, placing the laser scanner in a favorable position to collect highly detailed data of the trees' structure, branches and stems. The relatively high resolution of the scanner at this altitude (625 points/m²) results in dense point clouds with good possibilities for extensive analysis of the scanned forest. The point clouds of the forests were segmented into individual 2D polygons by researchers at SLU, using a tree crown density model.[18] This resulted in one 3D point cloud for each individual tree.

The cameras attached to the helicopter captured images from two different directions: nadir (straight down) and 45 degrees from the nadir direction (in the direction of the helicopter's flight path). These two collections of images will be referred to as nadir images and overview images (from the camera with a 45 degree angle). The helicopter was also equipped with an INS and a GPS, providing information about the orientation and position of the helicopter for every image captured. By georeferencing the data, backprojection could be performed to connect the images to the point clouds and thus utilize all available information about the trees in the forest.

In the classification, all scanned trees were divided into three classes: spruce, pine and deciduous trees. Deciduous trees were grouped together as one class because there were not enough trees of each deciduous species to create reliable training data from. That class consisted mostly of birch, but some aspen, alder and other uncommon tree species might also have been included.

3.2 Backprojection


The point cloud of each tree was backprojected onto the 2D images as red pixels, with a rectangle showing the boundaries of the projection. The image was then cropped in such a way that the surrounding trees were still visible, giving visual information about neighbouring trees. A visualisation of the backprojections for one tree can be seen in figure 5.

Figure 5 – Backprojection of a spruce from its point cloud onto a series of nadir images captured with 1.6 s intervals. Each red pixel corresponds to a backprojected 3D point framed by a red rectangle with corners corresponding to lowest and highest x and y pixel values.

3.3 Annotation of data


overview and facilitating fast and informed decisions for the labels. The images from the nadir backprojection (90 degree angle) and the overview backprojection (45 degree angle) were displayed for the user, together with 2D projections of the point cloud. A visualisation of the annotation program can be seen in figure 6.

Figure 6 – The layout of the annotation program, consisting of six images. Images a) to d) are 2D projections of the tree's point cloud. The last two images, e) and f), are backprojected images from the nadir and overview images respectively. The buttons represent the different species and useful commands such as skip, undo and next page of images. For better visualization of a certain image, fullscreen mode of that image is activated when it is clicked.

Four projections were created from the point cloud and displayed for the user. Three of them are projections onto the axis planes XY, XZ and YZ. For the last projection, the point cloud was first rotated 45 degrees around the Z-axis and then projected onto the XZ-plane. All projections containing the Z-axis were used as separate input data for the neural network.

3.4 Preparation of data


dividing each pixel value by the highest pixel value of the entire image. Lower pixel values in the input data reduce the risk of the network becoming unstable during training due to high weight values.[19]

Since the trees varied in size, a maximum height and width were set for the projections in order to get input data of the same size. These values were set so that the images would not be too big, which meant that some very tall or wide trees would not entirely fit. Such trees were however very uncommon, and it was therefore a higher priority to lower the resolution, since this reduces computational time.
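A hedged sketch of these preparation steps, with illustrative target dimensions (the actual maximum height and width used in the project are not restated here):

```python
import numpy as np

MAX_H, MAX_W = 128, 64                  # illustrative fixed input size

def prepare(projection):
    """Normalize pixel values and force a projection image to a fixed size."""
    img = projection / projection.max()         # scale pixel values to [0, 1]
    img = img[:MAX_H, :MAX_W]                   # crop trees that are too tall/wide
    padded = np.zeros((MAX_H, MAX_W))
    padded[:img.shape[0], :img.shape[1]] = img  # zero-pad smaller trees
    return padded

fixed = prepare(np.random.rand(200, 80))        # stand-in projection image
```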

3.5 Testing for best network architecture

To find the best architecture for the network, different models were methodically tested with different parameters. Since there were many adjustable variables and no obvious way of knowing how they would interact, this was a challenging undertaking. The aim was not to find a perfect network architecture, but a network performing as well as possible.

The first method explored was inspired by the article A guide to an efficient way to build neural network architectures - Part II: Hyper-parameter selection and tuning for Convolutional Neural Networks using Hyperas on Fashion-MNIST, written by Shashank Ramesh.[20] It involves the Python library Hyperas, a useful tool for optimizing neural networks by performing parametric sweeps over chosen hyperparameters. As an initial test, the commonly known basic networks VGG and LeNet and a custom network (Conv-Pool-Conv-Pool) were tested to see what kind of network architecture performed best.[21][22] LeNet performed best in the initial test and was chosen for further testing with Hyperas. Four hyperparameters were chosen for the parametric sweep: the number of channels of the convolutional layers, the number of layer sequences, the dropout rates and the number of neurons in the dense layers. Based on the validation accuracy, the best values for each hyperparameter were chosen and used for the final adaptation of the network.
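The sketch below illustrates the idea of such a parametric sweep with plain Python loops over a hypothetical search space. The thesis used Hyperas for this, so this is a simplified stand-in, and x_train/y_train/x_val/y_val are assumed to be the prepared projection images and labels.

```python
from itertools import product
from tensorflow import keras

# Hypothetical search space; not the exact values tested in the project
filters_opts = [16, 32, 64]
dropout_opts = [0.25, 0.5]
dense_opts = [64, 128]

best = (None, 0.0)
for filters, dropout, dense in product(filters_opts, dropout_opts, dense_opts):
    model = keras.Sequential([
        keras.layers.Conv2D(filters, 3, activation="relu", input_shape=(128, 64, 1)),
        keras.layers.MaxPooling2D(2),
        keras.layers.Flatten(),
        keras.layers.Dense(dense, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile("rmsprop", "categorical_crossentropy", metrics=["accuracy"])
    hist = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     epochs=20, verbose=0)          # assumes data is loaded
    val_acc = max(hist.history["val_accuracy"])
    if val_acc > best[1]:
        best = ((filters, dropout, dense), val_acc)  # keep best validation score
```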


3.6 Final testing

4 Results

4.1 Annotation of data

The procedure of backprojecting proved to be more complicated than anticipated. Nadir images were accurately backprojected from the point clouds, while overview images had an unpredictable translation, making it difficult to identify the tree of interest. This error is displayed in the appendix, figure 7.

The number of annotated trees for each species can be observed in table 1.

Table 1 – The number of trees annotated for each species.

Tree species        Number of annotated trees
Spruces                       758
Pines                         572
Deciduous trees               691
Total                        2021

It was not possible to evaluate annotation errors, since there was no available data on the true distribution of tree species in the scanned areas.

4.2 Classification - Neural network

The final depth of the LeNet adaptation was 6 layers, the shallowest tested, with 3 convolutional layers and 3 dense layers. The architecture can be observed in the appendix, figure 8, with the values of the dropout layers and dense layers in figure 9.

The ResNet adaptation was 28 layers deep, consisting of 12 ResNet sequences/blocks corresponding to 27 convolutional layers and one dense layer. Part of the deep architecture of the ResNet adaptation can be observed in figure 10 (the first two sequences and the last sequence are displayed).

Accuracy and loss from training, validation and testing for both networks can be seen in table 2.

Table 2 – Accuracy and loss from training, validation and testing for the tested networks. Accuracy and loss for the training data were calculated as the mean over all batches at the epoch where the best validation score was attained.

Network Score Training Validation Test


Tables 3 and 4 present a closer look at how the LeNet and ResNet adaptations performed for each individual tree species, in the form of confusion matrices. Recall (also known as sensitivity) is the ratio between the true positive predictions and the total number of trees of a certain species. Precision is the ratio between the true positive predictions and the total number of positive predictions for a certain species.

Table 3 – Confusion matrix showing precision, recall and accuracy for the implemented LeNet network.

n = 202                    Predicted class
Actual class       Spruce   Pine   Deciduous   Recall
Spruce               76       0        2        97 %
Pine                  1      49        2        94 %
Deciduous             0       0       72       100 %
Precision           99 %   100 %     95 %

Accuracy: 98 %

Table 4 – Confusion matrix with precision, recall and accuracy for the implemented ResNet V1 network.

n = 202                    Predicted class
Actual class       Spruce   Pine   Deciduous   Recall
Spruce               77       0        1        99 %
Pine                  1      50        1        96 %
Deciduous             1       1       70        97 %
Precision           97 %    98 %     97 %

Accuracy: 98 %

To compare the recall and precision of the two models, their corresponding F1 scores were computed; they can be observed in table 5. The F1 score is the harmonic mean of recall and precision and gives a better measure of the overall performance on the different classes.

Table 5 – The LeNet adaptation's and the ResNet adaptation's F1 score for each species.

                   F1-score (%)
Species        LeNet adp.   ResNet adp.
Spruce             98            98
Pine               97            97
Birch              97            97
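Given a confusion matrix, recall, precision, F1 and accuracy follow directly. A small numpy check using the LeNet matrix from table 3 reproduces the reported values:

```python
import numpy as np

# Confusion matrix from table 3 (rows: actual class, columns: predicted class)
cm = np.array([[76,  0,  2],    # spruce
               [ 1, 49,  2],    # pine
               [ 0,  0, 72]])   # deciduous

tp = np.diag(cm)                       # true positives per class
recall = tp / cm.sum(axis=1)           # true positives / actual class totals
precision = tp / cm.sum(axis=0)        # true positives / predicted class totals
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = tp.sum() / cm.sum()

print(np.round(recall, 2), np.round(precision, 2),
      np.round(f1, 2), round(float(accuracy), 2))
```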


obtained for the LeNet adaptation.

5 Discussion

5.1 Retrieval of training data

The process of getting the backprojection to work was very time-consuming, which limited the amount of time spent on testing and analysing different neural network classifiers. Ultimately, only nadir images were successfully backprojected; however, in most cases they provided enough information to identify the correct tree species. The backprojections seen in the overview images were incorrectly projected onto pixels located in close proximity to the actual tree of interest, an error which in dense forests made it extremely difficult to determine which tree was being backprojected. However, the overview images still contributed information about which tree species were present in the vicinity of the tree of interest.

Another difficulty in the process of labeling trees was knowing when to include or exclude a tree in the training data. Reasons for excluding trees were: not being able to determine the species; several trees showing in one point cloud; the point cloud not representing a tree; the point cloud only representing part of a tree; or the appearance of the point cloud being highly divergent from other projections of the same species (to the degree of being unrecognizable). This way of subjectively determining which trees to exclude and include causes a bias which affects the training data's ability to mirror the true population, and thus also the network's ability to accurately classify unbiased data from a forest. However, this error cannot be seen in the results of this project, since the validation data and test data are affected by the same bias.

5.2 Comparison of architectures

The two architectures achieved very similar results; accuracy, loss and F1 score had essentially the same values for both. However, the fact that the LeNet adaptation performed slightly better on the loss score and has a simpler architecture with fewer trainable parameters (lower training and classification times) makes it the better architecture. Looking at the confusion matrices in tables 3 and 4, we see that the same number of trees is incorrectly classified by both networks (5 trees). With more test data, the differences between the two models would probably be more apparent. One could argue that with more training and test data the ResNet adaptation might have higher potential with further modifications and testing, due to its high benchmark performances. Even though the ResNet adaptation had considerably fewer trainable parameters, the GPU ran out of memory more often for this network.

5.3 Error analysis


various appearances depending on height, age and subspecies, and if all variations of pine were not included in the training data, this would result in worse generalization abilities of the network. Another source of error, already mentioned in the report, was the problems with manually labeling training data.

5.4 Improvements

The most obvious improvement needed in this project would be a properly working backprojection algorithm for the overview images. This would not only speed up the annotation process, but also reduce the number of disregarded trees. The overview images were of high resolution, and the camera angle made it very clear what kinds of trees were present in the images. The nadir images, on the other hand, were of lower resolution, and the straight-down view made the interpretation of tree species more difficult for the user, compared to using the overview images.

Another improvement would have been to perform the training of the neural network on a computer with higher capacity. For especially deep networks with large numbers of trainable parameters, the GPU ran out of memory during training. This limited the networks that could be used in this project to relatively simple ones. Better computers were available, but due to lack of time, getting access to them and installing the necessary software was not prioritized. Luckily, reasonably non-complex networks proved to be sufficient for the task. Initially, I thought that a particularly deep network would be necessary to achieve high accuracy, especially since the differences between pine and spruce seemed small and difficult to distinguish. However, it is not evident that a better-performing machine would have yielded better results through deeper and more advanced networks.

There are some potential improvements regarding the choice of networks and input data as well. In this project, the 3D point clouds were projected onto a few 2D planes, preserving some information from the third dimension while still allowing for 2D networks. By instead using 3D data as input (the point clouds themselves or 3D voxels of the point clouds), little or no information from the third dimension would be lost. To achieve reasonable results from training with these kinds of inputs, large amounts of computational power would be needed. An alternative could be to keep the input data as 2D projections but increase the number of projections (for example using the XY-plane projection to get more information about girth).


5.5 Conclusion


References

[1] J. Holmgren, Å. Persson, U. Söderman (2008), "Species identification of individual trees by combining high resolution LiDAR data with multi-spectral images", International Journal of Remote Sensing, 29:5, 1537-1552, DOI: 10.1080/01431160701736471.

[2] H. Hamraz, N. Jacobs, M. Contreras, C. Clark (2019), "Deep learning for conifer/deciduous classification of airborne LiDAR 3D point clouds representing individual trees", ISPRS Journal of Photogrammetry and Remote Sensing, DOI: 10.1016/j.isprsjprs.2019.10.011.

[3] Sveaskog, "Planering", https://www.sveaskog.se/om-sveaskog/var-verksamhet/produktion-och-skotsel/planering/, downloaded 2019-10-28.

[4] SkogsSverige, "Användning av olika träslag", https://www.skogssverige.se/tra/fakta-om-tra/anvandning-av-olika-traslag, downloaded 2019-09-30.

[5] T. Carlberg, "Rapport visar betydelsen av träd och andra växter för biologisk mångfald", https://www.artdatabanken.se/arter-och-natur/Dagens-natur/rapport-visar-betydelsen-av-trad-och-andra-vaxter-for-biologisk-mangfald/, downloaded 2019-09-30.

[6] SkogsSverige, "Biologisk mångfald i skogen", https://www.skogssverige.se/klimat-miljo/biologisk-mangfald-i-skogen, downloaded 2019-09-30.

[7] E. Lindberg, L. Eysn, M. Hollaus, J. Holmgren, N. Pfeifer (2014), "Delineation of Tree Crowns and Tree Species Classification From Full-Waveform Airborne Laser Scanning Data Using 3-D Ellipsoidal Clustering", DOI: 10.1109/JSTARS.2014.2331276.

[8] A. Arvid (2018), "Using multispectral ALS for tree species identification", http://urn.kb.se/resolve?urn=urn:nbn:se:slu:epsilon-s-10194.

[9] Nvidia, "Deep learning", https://developer.nvidia.com/deep-learning, downloaded 2019-10-23.

[10] F. Chollet, Deep Learning with Python, Manning Publications, 2017.

[11] ArcGIS Desktop, "What is lidar data?", https://desktop.arcgis.com/en/arcmap/10.3/manage-data/las-dataset/what-is-lidar-data-.htm, downloaded 2020-04-26.

[12] G. Bebis, "Geometric Camera Parameters", University of Nevada, Reno, https://www.cse.unr.edu/~bebis/CS791E/Notes/CameraParameters.pdf, downloaded 2020-05-23.

[14] A. Oppermann, "Artificial Intelligence vs. Machine Learning vs. Deep Learning", https://www.deeplearning-academy.com/p/ai-wiki-machine-learning-vs-deep-learning, downloaded 2020-05-11.

[15] N. Ketkar, Deep Learning with Python: A Hands-on Introduction, Apress, 2017.

[16] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, 2016, ISBN 0262035618.

[17] J. Brownlee, "What is the Difference Between Test and Validation Datasets?", https://machinelearningmastery.com/difference-test-validation-datasets/, downloaded 2020-05-13.

[18] J. Holmgren, E. Lindberg (2019), "Tree crown segmentation based on a tree crown density model derived from Airborne Laser Scanning", DOI: 10.1080/2150704X.2019.1658237.

[19] J. Brownlee, "How to use Data Scaling to Improve Deep Learning Model Stability and Performance", https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/, downloaded 2020-05-25.

[20] S. Ramesh, "A guide to an efficient way to build neural network architectures - Part II: Hyper-parameter selection and tuning for Convolutional Neural Networks using Hyperas on Fashion-MNIST", 2018, https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-ii-hyper-parameter-42efca01e5d7, downloaded 2020-05-25.

[21] K. Simonyan, A. Zisserman (2014), "Very Deep Convolutional Networks for Large-Scale Image Recognition", https://arxiv.org/abs/1409.1556v6.

[22] Y. LeCun (1998), "LeNet-5, convolutional neural networks", http://yann.lecun.com/exdb/lenet/.

A Appendix

Figure 7 – Example of the translation error in the overview-image backprojections.
Figure 8 – Architecture of the LeNet adaptation.
Figure 9 – Values of the dropout layers and dense layers of the LeNet adaptation.
Figure 10 – Part of the ResNet adaptation's architecture (first two sequences and last sequence).
