
Engineering Degree Project

Tree trunk image classifier

- Image classification of trees using Collaboratory, Keras and TensorFlow.


Abstract

In the forestry industry, tree trunks are currently classified manually. The objective of this thesis is to answer whether this classification can be automated using modern computer hardware and machine-learning-based image classification of tree trunks. Based on results from controlled experiments, the report concludes that it is possible to achieve an accuracy above 90% across the genera Birch, Pine and Spruce with a classification time per tree shorter than 500 milliseconds. The report further compares these results against previous research and concludes that better results are probable.

Keywords: Barknet, TRUNK12, image classification, tree classification.


Preface

Special thanks are given to the creators of Barknet [8] for initiating the publication of an open dataset, enabling open research and making this project possible. I would further like to thank the subreddit r/computervision for support when no support could be found. I would also like to thank my family for believing in me and making it possible for me to study at university. Lastly, I would like to thank Jonas Lundberg for helping me focus on depth and on solving the success criteria rather than future possibilities.


Contents

1 Introduction
  1.1 Background
    1.1.1 Computer Vision & Machine Learning, Terms & Origin
    1.1.3 Technologies, image resizing and augmentation
  1.2 Related work
  1.3 Problem formulation
  1.4 Motivation
  1.5 Objectives
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline
2 Method
  2.1 Reliability and Validity
  2.2 Ethical considerations
3 Implementation
  3.1 Splitting datasets
  3.2 Cropping
  3.3 Color-space RGB to BW/Grayscale
  3.4 Rudimental ML-algorithm implementation
  3.5 CNN
4 Results
  4.1 The datasets
  4.2 Algorithms performance
  4.3 Additional experiments
    4.3.1 Effect of cropping
    4.3.2 Effect of Augmentation
    4.3.3 Effect of unique trees and unique images
    4.3.4 Effect of learning-rates and number of validation-steps
    4.3.5 Effect of locking layers
    4.3.6 Effect of class-merging
    4.3.7 Barknet, results in context
5 Analysis and discussion
6 Conclusion
  6.2 Future work and lessons learnt
    6.2.1 Future work and objectives
    6.2.2 Data-Collection
References
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5
Appendix 6
Appendix 7
Appendix 8
Appendix 9


1 Introduction

There exists an interest in the forestry industry to automate the classification of trees. It is therefore the aim of this report to evaluate the potential of automated tree classification for classifying birch, pine and spruce, using the existing datasets TRUNK12 [18], sampled in Appendix 3 and visualized in Figure 4.1.4, and Barknet [8], presented in Section 4.1 and sampled in Appendix 1.

The report concludes that although it is possible to achieve an accuracy above 95% on a set of merged sub-species of birch, pine and spruce, this potential is largely dependent on context, such as the classes of interest, the likeness between classes (e.g. sugar maple and red maple), and the techniques used for resizing and processing the images, such as cropping and downsampling. This context can be exemplified by the conclusion that models can be trained on individual species and the predictions merged into genus (e.g. red pine into "Pinus") with no major difference compared to merging the classes into genus before training.

The report also establishes a baseline model that classifies images of 20 species of trees using the machine-learning algorithm Convolutional Neural Network. This model was built with support from TensorFlow and Keras. The baseline used rudimental image scaling and achieved a test accuracy of 77%. The report ends by listing suitable objectives for future improvements and research.

1.1 Background

The employer of this graduation thesis is Dasa Control Systems AB, who develop hardware and software for Rottne Industri AB, a manufacturer of forestry machines. It would be interesting for both Rottne and Dasa if the manual classification done by harvester operators could be replaced with computer-automated classification.

This report presents a solution by using image classification to classify images of trees.

Image classification is in this report understood as a subfield of image understanding, which in turn is a subfield of computer vision, an umbrella term for different processes of gathering, analyzing and processing images or videos into models that a computer can act upon.

This report covers many terms from the fields of statistics, computer vision, machine learning and dendrology, as well as several French acronyms. In what follows, the concepts used throughout this report are presented and introduced.


1.1.1 Computer Vision & Machine Learning, Terms & Origin

Computer vision is, as aforementioned, an umbrella term, and the same applies to Machine Learning. This report only touches a small subsection of these areas, where the two intersect, mainly focusing on different algorithms for classification and on Convolutional Neural Networks (CNN) in particular.

Confusion matrix: Note that confusion matrices in this report are structured in line with Scikit-Learn [1], with rows representing the actual classes and columns representing the predicted classes. The diagonal represents Y-pred = Y-true, i.e. the true positives (TP).
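As a minimal sketch of this convention (the class names below are dummy examples, not data from the project):

    from sklearn.metrics import confusion_matrix

    y_true = ["spruce", "pine", "spruce", "birch"]
    y_pred = ["spruce", "spruce", "spruce", "birch"]
    labels = ["birch", "pine", "spruce"]
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    # cm[i][j] = number of samples whose actual class is labels[i]
    # and whose predicted class is labels[j]; the diagonal holds the TPs.
    print(cm)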

Table 1.1.1.1 describes crucial terms for understanding a classification model.

Class     In this project, the species that are to be classified.
X         Input (to be predicted or trained upon).
Y-true    The original classes corresponding to each input.
Y-pred    The classes predicted from the input.

Table 1.1.1.1: Terms for understanding a classification model.

Table 1.1.1.2 exemplifies terms that are used both on matrix level and on class level, here exemplified at the class level for "spruce".

Support               Number of spruces in the original class (class-level support = TP + FN).
True Positive (TP)    Number of spruces correctly predicted as spruce.
False Negative (FN)   Number of actual spruces not predicted as spruce.
False Positive (FP)   Number of trees incorrectly classified as spruce.
True Negative (TN)    Total support of all trees minus the support of spruce and its False Positives.

Table 1.1.1.2: Terms in the context of the class "spruce".

Table 1.1.1.3 shows the equations most frequently used; note the difference between a class-level "accuracy" and the matrix-wide accuracy. For this reason, F1-score was chosen when discussing class-level accuracy in this report, a choice further discussed below.

Equation    Formula
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)
F1-Score    2 * Precision * Recall / (Precision + Recall)
Accuracy    (TP + TN) / Total Support

Table 1.1.1.3: Equations used.

In layman's terms:

Recall = the fraction of trees correctly recalled to their original species/class.

Precision = the fraction of trees predicted as a class that actually belong to that class.

On class level, recall and precision can be contradictory: it is possible to assign all samples to a single class and get 100% recall for that class, but the precision would be worse, since the number of False Positives then equals the total support of all the other classes. In the opposite case one can recall 0 trees and get 0/0 = undefined precision and 0% recall. To make the measures more balanced, this report uses the F1-score on class level, combining precision and recall into something one could call a "class-wise weighted accuracy". There exists criticism against the F1-measure/score [30] since it treats recall and precision as equally important, but in this case they were deemed equally important.
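As an illustrative sketch, the class-level measures above can be computed directly from such a confusion matrix; class_metrics below is a hypothetical helper written for this report's conventions, not code from the project:

    import numpy as np

    def class_metrics(cm, i):
        # Per-class precision, recall and F1 from a confusion matrix cm
        # (rows = actual classes, columns = predicted), for class index i.
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp   # other classes predicted as class i
        fn = cm[i, :].sum() - tp   # class i predicted as other classes
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1

    cm = np.array([[5, 1], [2, 4]])   # dummy 2-class confusion matrix
    print(class_metrics(cm, 0))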

Underfitting and overfitting are two important concepts when describing statistical models and their ability to generalize to/fit/model data, here explained from the perspective of machine learning:

Underfitting: Underfitting occurs when a model, due to too few parameters, an incorrect learning rate or other reasons (e.g. bad feature extraction), does not fit (correctly describe) the underlying data.

Overfitting: Overfitting occurs when a model fits the training set too closely, including its noise and outliers, and therefore fails to generalize.

Put differently: underfitting is the inability to remember anything but a very small, abstract subset of the information the model was trained on, while overfitting is the inability to transfer what was learnt during training onto reality, i.e. the test set.

Hyperparameter: A parameter of a learning algorithm that is set before training to optimize the training time and accuracy of the model. Hyperparameters can be changed between training epochs or steps, as defined by a scheduler set before training. In this project all hyperparameters were set statically before training.

Loss: As explained by J. Brownlee in [2]: "As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss function, that can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation". In this project cross-entropy was used as the loss function.
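As a minimal illustration of cross-entropy for a single sample (the numbers are dummy values, assuming a one-hot label and a softmax output):

    import numpy as np

    y_true = np.array([0.0, 1.0, 0.0])       # one-hot label: actual class is index 1
    y_pred = np.array([0.1, 0.7, 0.2])       # softmax output of the model
    loss = -np.sum(y_true * np.log(y_pred))  # = -log(0.7), roughly 0.357
    print(loss)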

Feature extraction: The process of reducing an input into something optimized for machine learning. As explained by Z. C. Horn et al. in [3]: "Images are high dimensional and require feature extractors to produce lower dimension representation [...] thereby mitigating the curse of dimensionality".

ANN: Artificial Neural Network, a system of interconnected nodes named after the inspiration it takes from the neurons of the human brain. It is best explained using graph theory, which is beyond the scope of this report.

SVM: Support Vector Machine, a kernel-based algorithm for creating models primarily aimed at binary classification. It can be used in other contexts, and more detail on its parameters is provided later in the report.

MLP: Multilayer Perceptron is a class of feedforward Artificial Neural Networks (ANNs).

Decision Tree: The Decision Tree classifier is based on the structure of a decision tree, with leaves representing class labels, internal nodes representing attributes, and edges representing decisions.

KNN: K-Nearest Neighbors, a non-parametric, instance-based algorithm for classifying individual data points based on their "K" nearest neighbors.

CNN: Convolutional Neural Networks are a class of deep neural networks that employ convolution in at least one of their layers. Like an MLP, a CNN consists of at least an input layer, several hidden layers and an output layer, but unlike in an MLP at least one of these layers is convolutional, with the input layer always being convolutional [4]. These convolutions mean that, unlike less advanced algorithms, a CNN can handle feature extraction itself; but unlike less advanced feature extractors, a CNN needs to be trained [3].

Transfer learning: To reduce the need for large datasets when training a convolutional network, one can reuse the architecture and weights from pre-trained networks.

VGG19: A 19-layer variant of the Visual Geometry Group (VGG) architecture. VGG19 consists of 16 convolution layers, 3 fully connected layers, 5 MaxPool layers and 1 SoftMax layer.

This project used Resnet34 and VGG19, both pre-trained on ImageNet. Resnet34 was loaded from [6] and VGG19 was loaded from [7], with the number of trainable parameters described in Table 1.1.2.1.

            0 locked layers   5 locked layers
Resnet34    23,272,221        23,247,316
VGG19       54,011,564        53,898,988

Table 1.1.2.1: Number of trainable parameters of Resnet34 and VGG19.

1.1.3 Technologies, image resizing and augmentation

To make it easier for the feature extractor to load and process an image, and to possibly help it retain information, Carpentier, Giguère & Gaudreault [8] combined the two basic concepts for image dimension reduction: downsampling/resizing and cropping.

Cropping: Taking a set of pixels from a part of the image and creating a new image representation from this subsection of the original image.

Resizing/image-scaling/downsampling: Used interchangeably in this report to reference the technique of reducing the size/number of pixels of an image using resampling filters such as nearest-neighbor or Lanczos resampling.

This project used the Pillow implementation as inherited by Keras [9] for downsampling, with its default nearest-neighbor filter [10].
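A minimal sketch of this downsampling, assuming a local file name (note that newer Pillow versions expose the filter as Image.Resampling.NEAREST):

    from PIL import Image

    img = Image.open("trunk.jpg")                          # e.g. a 3000x4000 px image
    small = img.resize((256, 256), resample=Image.NEAREST) # nearest-neighbor filter
    small.save("trunk_256.png")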

Image augmentation: The technique of creating a larger or more diverse dataset by applying functions such as rotation, horizontal flipping, noise, color saturation or brightness/contrast changes to existing images in the original dataset, resulting in artificial images. Augmentation is used foremost with the intention of reducing overfitting, e.g. by making dark images brighter, letting the model train on all possible cases in small or unbalanced datasets.

TensorFlow: An open-source library for numerical computation developed by Google [11, p. 376], with features such as GPU support, distributed computing and core support for Keras. In this project TensorFlow is used primarily for handling GPU allocation and for running Keras.


Keras: An open-source, high-level Deep Learning API that can run on top of TensorFlow, Microsoft Cognitive Toolkit or Theano [11, p. xvi]; Keras models can also be translated into solutions for Android, iOS, front-end JavaScript, servers and embedded systems [12].

Colab: Officially "Colaboratory", a.k.a. "Google Colab", a free cloud service for running Jupyter Notebooks in VM environments hosting GPUs, TPUs and Linux (Ubuntu 18.04) [13].

1.2 Related work

There exist decades of work in the fields of Artificial Intelligence (AI) and Machine Learning (ML); the applications of this work to the field of forestry have however remained somewhat limited. A possible reason is the lack of data to work with, compared to e.g. fashion, where open, preprocessed datasets exist [14]. The field has however started to expand in recent years, with e.g. [8, 16, 17] all published in 2018.

In [8] a dataset named Barknet is presented, along with how it was gathered and trained against, creating a CNN model with Resnet34 architecture pre-trained on ImageNet. It further presents how the best performing model achieves 97.81% accuracy by performing majority voting on an image-to-tree basis, using all available single-cropped images representing each individual tree; this majority vote thereby classified each individual tree. Reference is also made to an external report [15] that lets "experts" classify images in a dataset with lower resolution than the Barknet dataset. These experts scored accuracies of 56.6% and 77.8% respectively, compared to the model's 69.7% accuracy. It is also stated in [8] that it is more important to provide additional individual trees than additional images [8, Fig. 6].

Bressane et al. present in [16] the performance of different algorithms for classifying images of bark taken at 5 cm from the trunk, and the feature extraction used to achieve these performances. It concludes that among the best performing algorithms are SVM, KNN and Probabilistic Neural Networks, all scoring test accuracies above 89% (CNN is not even in the top 5, at an accuracy of 78.5%).

Lastly, Bressane et al. [17] suggest an increase in accuracy when analyzing a combination of both leaf and bark rather than analyzing bark and leaf separately. This is however not relevant in this report, since it was deemed un-robust and impractical to use in forestry machines. It could, however, be of interest for future research.

1.3 Problem formulation

The problem that this project aims to handle is the ability to correctly classify images of birch, pine and spruce with an accuracy above 90%, evaluated using F1-score. As a consequence, the project further aims to evaluate the ability of image classifiers to generalize when classifying unique yet similar classes of trees, and how future research should be conducted to achieve more robust results.

1.4 Motivation

As stated in the background, the employer of this graduation thesis is Dasa Control Systems AB, who develop hardware and software for Rottne Industri AB, a manufacturer of forestry machines. It would be interesting for both Rottne and Dasa if the manual classification done by harvester operators could be replaced with computer-automated classification.

Further affected groups span the entire forestry industry, from nursery gardens through forest inventorying and harvesting to lumberyards and sawmills. This, together with the lack of actual implementations, was deemed more than enough motivation for initiating basic research on the potential of such an implementation.

1.5 Objectives

O1 Find a suitable algorithm for image classification
O2 Evaluate datasets
O3 Evaluate methods and approaches against success criteria

The first objective (O1) will test different algorithms and techniques for image classification, specifically: SVM, MLP, Decision Tree, KNN and CNN.

Objective two (O2) will evaluate the datasets TRUNK12 [18] and Barknet [8]; evaluation of different subsets of these datasets may also be performed. One hypothesis is that TRUNK12 contains multiple images per individual tree, while Barknet allows separation based on tree-trunk identity rather than just image identity.

Lastly, objective three (O3) is to evaluate methods and approaches for achieving the specified success criteria: an accuracy above 90%, evaluated using F1-score, on the genera Birch, Pine and Spruce, with a classification time no longer than 500 ms per tree. The methods focused on are class-merging, augmentation, and cropping vs downsampling.


1.6 Scope/Limitation

This project will focus on achieving an accuracy above 90% on birch, pine and spruce without using advanced feature extraction (unless provided by the algorithm, as in CNN). It will further focus more on hyperparameters than on architecture.

1.7 Target group

The primary target group for this report is software developers with limited experience in machine learning, without excluding scientists in the field of machine learning interested in a baseline for further research. The report further tries to be accessible to a wider public interested in the potential of the field.

1.8 Outline

This report is structured as follows. First the objectives and background required for understanding the results are presented, as is done above. Thereafter the method chosen for solving these objectives is presented, discussed and criticized in the method chapter. Once the method is established, the implementation details behind the gathering of the results are presented in the chapter "Implementation". Results are then presented, along with some discussion regarding unscientific assumptions made, in the chapter "Results". After the results have been presented they are discussed, analyzed and put into context against previous work in the field in the chapter "Analysis and Discussion".

Finally the results are evaluated against the objectives in the "Conclusion" chapter, and lessons learnt and suggestions regarding future research are presented in the subsection "Future work and lessons learnt".


2 Method

This project used controlled experiments for handling the objectives specified in Section 1.5, with the independent variables dataset, algorithm, method of downsampling and learning rate, and the dependent variables accuracy, F1-score and classification time.

Details surrounding the independent variables and the splitting into test, validation and training sets are given in Chapter 3.

O1 was handled by implementing different algorithms for classification: SVM, MLP, Decision Tree, KNN and CNN, with parameters specified in Sections 3.4 and 3.5. CNN with architecture and weights loaded from VGG19 was found to give the best values of the dependent variables when referenced against the success criteria of O3.

O2 was handled by observing patterns between the test accuracies of TRUNK12 and Barknetx20.

Based on these results, further experiments were then made on other aspects, such as downsampling instead of cropping and merging classes into genera, that were deemed to have the potential of satisfying the success criteria in O3.

2.1 Reliability and Validity

All accuracies in this report can be reproduced within a range of ±2 percentage points. It is however important to include the following considerations, especially regarding time, before reproducing the experiments or implementing improvements:

Due to the method of implementation, no online-learning algorithm was used for the rudimental ML-algorithms, enabling the validation set to act as a held-out test set. Some confusion arose at the start of evaluating the datasets using CNN models, where intermingling the validation and test sets caused invalid results; this however only affected the decision to abandon augmentation. In hindsight, augmentation should only be used after establishing a stable baseline. Neither is the learning rate of 10^-5 scientifically proven as optimal, for reasons illustrated in Section 4.3.4; it is however shown to be the highest reasonable learning rate allowing convergence, as implied by inductive reasoning (despite instability, 10^-4 will never converge).

Further issues with reliability and validity are caused by not creating feature extractors and not using advanced learning algorithms for the rudimental ML-algorithms, making the results of these experiments more or less scientifically invalid. The results of these experiments are only included in the report to encourage future research.

Remarks can be made against using Keras for measuring the time taken to process images and train models via its time per step. It can be argued that these measures are unreliable, since they do not separate loading images, processing images and calculating predictions, and since, as experienced, they vary greatly depending on whether images are loaded from Google Drive or from internal memory: Google Colab takes over 650 ms per step on the first epoch, while steps on successive epochs require only around 100 ms. For this reason, time-stability was evaluated against what is achievable on different hardware, with the result that 100 ms corresponds to what is achievable when images are preloaded into the virtual machine. This is why the times referenced in this report were never measured on the first execution.

Execution time is also greatly affected by the hardware, especially the processing unit used for running the models, which is why consideration was made to include this information. Note also that training times are not part of any success criterion.

Regarding library implementations, it is true that they may change over time, but within the time scope of this project using libraries was deemed the most reliable and valid method; considerations were also made to describe the technologies and algorithms used by the libraries.

Lastly and most importantly, the performances measured were not taken as an average of e.g. the 5 best trained models, resulting in somewhat unreliable accuracies. This can be seen in models trained on large datasets with many classes, where accuracy can shift by ±2 percentage points. The most unreliable accuracies were gathered from models whose validation loss did not fully converge; this however only applies to the experiments surrounding cropping and unoptimized learning rates.

2.2 Ethical considerations

No personal or sensitive data is used in the experiments presented in this report, wherefore no ethical considerations are necessary.


3 Implementation

Figure 3.1: Course of implementation.

Figure 3.1 establishes an outline of the course of implementation. First the datasets were divided into test, training and validation sets as described in Section 3.1. This separation allowed evaluating which of the algorithms presented in Sections 3.4 and 3.5 best suits the project, based on test accuracy. Then crude experiments with hyperparameters and image processing were conducted, resulting in e.g. excluding the cropping used at the start of the project in favor of the nearest-neighbor downsampling using Keras specified in Section 1.1.3.

After the algorithm and image processing were established, further experiments focused on barknetx3 and barknetx9, presented in Sections 3.1 and 4.1, to reduce the size of the problem and to focus work on the success criteria.


3.1 Splitting datasets

Firstly, three species of trees were excluded from the Barknet dataset (Barknetx23), thereby creating Barknetx20 as explained in Section 4.1. After this exclusion, and based on commonly accepted practice, 20% of the original datasets Barknetx20 and TRUNK12 was held out as a test set; the remaining 80% was further divided at the ratio 80:20 into a training and a validation set, resulting in three separate subsets of each original dataset.

Later in the process, Barknetx20 was re-separated based on the identity of the trees, keeping the mentioned subsets separated on the basis of tree identity rather than image identity. Barknetx20 was also further separated into barknetx9 and barknetx3, where barknetx9 contains all the species of birch, pine and spruce in barknetx20, and barknetx3 is the classes of barknetx9 merged into the three genera.
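A sketch of such an identity-grouped split, assuming scikit-learn's GroupShuffleSplit and dummy placeholder arrays; this illustrates the idea, not the project's exact code:

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(1)
    paths = np.array(["img_%d.jpg" % i for i in range(1000)])  # dummy file names
    labels = rng.integers(0, 20, size=1000)                    # 20 classes (barknetx20)
    tree_ids = rng.integers(0, 200, size=1000)                 # ~5 images per tree

    # Hold out 20% as a test set, grouped on tree identity so that all
    # images of one tree land in the same subset.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=1)
    trainval_idx, test_idx = next(gss.split(paths, labels, groups=tree_ids))

    # Split the remaining 80% at the ratio 80:20 into training and validation.
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=2)
    tr, va = next(gss2.split(paths[trainval_idx], labels[trainval_idx],
                             groups=tree_ids[trainval_idx]))
    train_idx, val_idx = trainval_idx[tr], trainval_idx[va]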

3.2 Cropping

The first step was to use TRUNK12 for evaluating which algorithms were feasible within the limited time frame. For this there was a need to reduce the size of the images, for which cropping was chosen, using PIL.Image.Image.crop [27] from the Python Imaging Library (PIL/Pillow). Either the center pixels or n random parts of the image were chosen, and these crops were then saved un-augmented to memory.
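A sketch of the two cropping strategies, assuming dummy file names:

    import random
    from PIL import Image

    img = Image.open("trunk.jpg")
    w, h = img.size
    s = 256

    # Central crop: a 256x256 box around the image center.
    cx, cy = w // 2, h // 2
    img.crop((cx - s // 2, cy - s // 2,
              cx + s // 2, cy + s // 2)).save("trunk_center.png")

    # n random 256x256 crops from anywhere in the image.
    for i in range(3):
        x, y = random.randint(0, w - s), random.randint(0, h - s)
        img.crop((x, y, x + s, y + s)).save("trunk_random_%d.png" % i)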

3.3 Color-space RGB to BW/Grayscale

The ITU-R 601-2 luma transform, implemented in PIL.Image.convert [19], was used to convert the images from RGB to grayscale, further reducing the dimensionality before training the non-CNN based algorithms.
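A minimal sketch of this conversion; the file names are assumptions:

    from PIL import Image

    # convert("L") applies the ITU-R 601-2 luma transform:
    # L = 0.299 R + 0.587 G + 0.114 B
    gray = Image.open("trunk_center.png").convert("L")
    gray.save("trunk_center_gray.png")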

3.4 Rudimental ML-algorithm implementation

Table 3.4.1 specifies the rudimental algorithms and the hyperparameters used in Section 4.2. These hyperparameters were not fine-tuned.

Since only rudimental batch-based learning algorithms were used, these experiments were only conducted on TRUNK12, for the simple reasons of fitting inside the 16 GB RAM of the computer used and of the experiments soon being abandoned in favor of CNN. Listed below are the library implementations and parameters used for achieving the performance later referenced in Section 4.2.

Algorithm      Implementation
SVM            sklearn.svm.SVC(kernel='rbf', C=2)
KNN            sklearn.neighbors.KNeighborsClassifier(n_neighbors=7)
Decision Tree  sklearn.tree.DecisionTreeClassifier(max_depth=2, min_samples_split=3)
MLP            sklearn.neural_network.MLPClassifier(solver='lbfgs', max_iter=1000, alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)

Table 3.4.1: Rudimental ML-algorithms and their implementations.
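As a usage sketch, any of these constructors can be fitted on flattened grayscale crops; the data below is random dummy data standing in for real images:

    import numpy as np
    from sklearn.svm import SVC

    # Dummy data standing in for flattened 256x256 grayscale crops.
    X = np.random.rand(40, 256 * 256)
    y = np.random.randint(0, 12, size=40)   # 12 classes, as in TRUNK12

    clf = SVC(kernel="rbf", C=2)
    clf.fit(X[:32], y[:32])
    print(clf.score(X[32:], y[32:]))        # mean accuracy on the held-out part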

3.5 CNN

The CNN was implemented with Keras, excluding the three fully connected top layers, which allows other target sizes than the default 224x224 [20], and loading a pretrained network from either VGG19 or Resnet34. VGG19 pre-trained on ImageNet was loaded using keras.applications [7], while Resnet34 pre-trained on ImageNet was loaded using a library called "segmentation-tools" [6]. Thereafter a given number of layers were locked, for the experiments in Section 4.3.5, using the following for-loop:

for layer in model.layers[:num_locked_layers]:
    layer.trainable = False

The output layers were then constructed using classes from [23]. The architecture of these output layers may have worsened performance; however, since the experiments achieved the success criteria with the layers included, no experiments were made without them. Researching the effect of skipping the layers below is encouraged, but should be conducted after a baseline in accordance with [8] has been established, as discussed in Section 6.2:

x = model.output
x = Flatten()(x)
x = Dense(1024, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(384, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(96, activation="relu")(x)
x = Dropout(0.5)(x)
predictions = Dense(num_classes, activation="softmax")(x)

The model was then constructed and compiled using the following code, with the class Model loaded from [24] and optimizers from [26]. An optimizer, a metric and a loss are required by Keras [28] in order to use the function .fit, which handles the training loop, also referenced as the "model trainer" in this report.


Categorical cross-entropy was chosen as the loss function, based on commonly accepted practice when creating a multiclass image classifier. Accuracy was chosen as the metric because accuracy is the prime focus of this project. NAdam (Nesterov-accelerated Adaptive Moment Estimation) [29] was chosen as the optimizer, partly because it appeared optimal in crude preliminary tests; this research is however insignificant, and if time allows after establishing a baseline, future research is encouraged to compare the effect of using different optimizers.

model_final = Model(inputs=model.input, outputs=predictions)
model_final.compile(loss="categorical_crossentropy",
                    optimizer=optimizers.Nadam(learning_rate=learning_rate),
                    metrics=["accuracy"])

Further important inclusions are the callbacks "CSVLogger" and "ModelCheckpoint" from [25], using save_freq='epoch' when training/fitting the model. Finally, when evaluating the predictions, each tree was classified as the class with the highest probability among the returned probabilities, using numpy.argmax():

y_pred1 = model_final.predict(test_set, steps=int(nb_samples / batch_size),
                              batch_size=batch_size, verbose=1)
y_pred = np.argmax(y_pred1, axis=1)
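A sketch of how those callbacks could be wired into the fit call; the file names, epoch count and generator variables (train_set, val_set, validation_steps) are assumptions, not the project's exact code:

    from tensorflow.keras.callbacks import CSVLogger, ModelCheckpoint

    callbacks = [
        CSVLogger("training_log.csv"),
        ModelCheckpoint("weights_epoch_{epoch:02d}.h5", save_freq="epoch"),
    ]
    # model_final.fit(train_set, epochs=50, validation_data=val_set,
    #                 validation_steps=validation_steps, callbacks=callbacks)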


4 Results

In this chapter, the results of the experiments handling the objectives are presented.

First, the datasets are presented in Section 4.1. Then the rudimental algorithms presented in Section 3.4 are compared against the Convolutional Neural Networks presented in Section 3.5, establishing CNN with VGG19 architecture pretrained on ImageNet as the optimal algorithm for the purposes of this project, thereby answering Objective 1 regarding suitable algorithms.

Lastly, the results from additional experiments for satisfying the success criteria are presented in Section 4.3, along with some experiments handling Objective 2 regarding evaluating the datasets.

4.1 The datasets

Table 4.1.1 color-codes birch, pine and spruce; the table also translates the French acronyms used in the labeling of Barknet into Latin and English. Red rows represent the classes ERB, PEG and PID, which were held out of the project due to few samples, as discussed in [8] and evident in Figure 4.1.1 and the detailed Figure 4.1.2.


Label   Latin                   English                 Number of samples/images
BOJ     Betula alleghaniensis   Yellow birch            1255
BOP     Betula papyrifera       White birch             1285
CHR     Quercus rubra           Northern red oak        2724
EPB     Picea glauca            White spruce            596
EPN     Picea mariana           Black spruce            885
EPO     Picea abies             Norway spruce           1324
EPR     Picea rubens            Red spruce              740
ERB     Acer platanoides        Norway maple            70
ERR     Acer rubrum             Red maple               1676
ERS     Acer saccharum          Sugar maple             1911
FRA     Fraxinus americana      White ash               1472
HEG     Fagus grandifolia       American beech          840
MEL     Larix laricina          Tamarack                1874
ORA     Ulmus americana         American elm            767
OSV     Ostrya virginiana       American hophornbeam    612
PEG     Populus grandidentata   Big-tooth aspen         64
PET     Populus tremuloides     Quaking aspen           1037
PIB     Pinus strobus           Eastern white pine      1023
PID     Pinus rigida            Pitch pine              123
PIR     Pinus resinosa          Red pine                596
PRU     Tsuga canadensis        Eastern hemlock         986
SAB     Abies balsamea          Balsam fir              922
THO     Thuja occidentalis      Northern white cedar    746

Table 4.1.1: Acronyms and their translations for barknetx23. Trees marked red (ERB, PEG, PID) are held out of barknetx20; trees merged into Birch in barknetx3 are marked cyan, those merged into Pine green, and those merged into Spruce yellow.


Figure 4.1.1: Number of images/samples per class in barknetx23 (details in Figure 4.1.2).


Figure 4.1.3 illustrates the distribution of resolutions in the Barknet dataset. The distribution in image height is probably caused foremost by the use of different cameras, while the distribution in width is caused by the cropping used in response to the semi-fixed distances and varying Diameter at Breast Height (DBH), discussed in detail in Chapter 5 and in [8]. Most tree images are around 3000, 4000 or 5000 pixels high and between 1000 and 3000 pixels wide. This can be compared to TRUNK12, presented in Figure 4.1.4, where all images are 4000 pixels high and 3000 pixels wide.

Figure 4.1.3: Number of images/samples by width and by height respectively, visualized with histograms, for barknetx23.


Figure 4.1.4: Number of images/samples per class in TRUNK12; note that all images in TRUNK12 have the resolution 3000x4000 px.

4.2 Algorithms performance

This section presents the results of the rudimental algorithms Decision Tree, KNN, SVM and MLP, along with the more advanced Convolutional Networks with the pretrained architectures VGG19 and ResNet34, with weights trained on ImageNet, using 1000 training epochs and a learning rate of 10^-5.

The test accuracies in Table 4.2.1 single out CNN with the VGG19 architecture and ImageNet-pretrained weights as the optimal algorithm. Table 4.2.1 further shows the classification time per tree ("Time/tree") and establishes that the success criterion of 500 ms allows usage of CNN on the hardware specified by the following abbreviations, achieving classification times of ca 80-90 ms per tree:


SC = single-core, single-thread Intel i7-8750H @ 4.10 GHz, as SciKit-learn does not support multi-threading.

CG = Colab graphics card, one of the following: NVIDIA K80, P100, P4, T4 or V100 GPU [21].

Algorithm                 Accuracy   Time/tree     Train time
SVM (128x128)             20%        5.5 ms (SC)   2.8 s
SVM                       26.3%      20 ms (SC)    13.3 s
MLP                       9.1%       0.3 ms (SC)   43 s
Resnet34                  49%        78 ms (CG)    some hours
VGG19                     69%        86 ms (CG)    some hours
Decision Tree (512x512)   16%        0.4 ms (SC)   16.9 s
KNN (512x512)             11%        128 ms (SC)   8.7 s
SVM (512x512)             23%        87 ms (SC)    50 s

Table 4.2.1: Performance of the different algorithms on central-cropped images of TRUNK12.

The resolutions specified inside the parentheses in Table 4.2.1 are those differing from the default resolution of 256x256 pixels. That SVM without a feature extractor performs worse on 512x512 px crops than on 256x256 px crops is probably because it underfits, being given too much input to train on and too few parameters to fit the data. This is however no definitive result of anything, except that the increased number of pixels increases the time required to process images and make classifications by more than 350%. Inconclusive experiments were also made with different resolutions using CNN architectures; they are left out of the table precisely because they were inconclusive. In these experiments, the training and classification time of the CNN architectures increased by more than 28% when increasing the resolution from 256x256 to 512x512 pixels, and the required memory doubled. There were some signs of accuracy improving, but by no more than 2 percentage points, and since this increase could not be reliably replicated, these are referenced as possible signs rather than results. Although 28% slower is far less than 350% slower (probably because of the feature extraction of CNN), the gain in accuracy is too small to recommend research into turning these results from inconclusive to conclusive, especially before establishing the baseline discussed in Section 6.2.1 Objective 1.

The rudimental algorithms in Table 4.2.1 achieved lower accuracies than the ones in [16], for reasons discussed in Chapters 5 and 6 of this report (feature extraction and image pretext). The classification times of the rudimental algorithms in Table 4.2.1 are promising if one manages to achieve the results in [16], where the algorithms here referenced as "rudimental" achieved higher accuracy than the CNN algorithms.

VGG19 and Resnet34 were in this experiment trained using a learning rate of 10^-5 over 1000 epochs and tested against central crops of RGB images from TRUNK12, as this achieved better performance than grayscale. The rudimental algorithms were trained and tested against grayscale central crops from TRUNK12, because this achieved better accuracy and shorter execution times. Feature extraction and objectives for future research regarding improving the rudimental algorithms are given in Section 6.2.1 Objective 5.

A final note regarding Table 4.2.1: one can argue that insufficient effort was spent on the rudimental algorithms, as the 11% accuracy of KNN was achieved by classifying all trees as spruce, this being the class with the largest support, as a consequence of Figure 4.1.3. More effort was spent on SVM before concluding that it was futile without the proper feature extraction recommended by Objective 5 in Section 6.2.1.

4.3 Additional experiments

The experiments in Sections 4.1 and 4.2 did not achieve the success criteria defined in Objective 3 of Section 1.5: a classification time per tree below 500 ms and an accuracy above 90% across the genera Betula, Pinus and Picea, commonly known as birch, pine and spruce. As established in Section 4.2, CNN with VGG19 architecture pretrained on ImageNet is the algorithm and architecture with the greatest potential of satisfying the accuracy criterion, which is why it is used for all experiments in this subchapter.

The classification-time criterion is established as satisfied in Table 4.2.1, where 86 ms using CNN with VGG19 architecture is well below 500 ms; the additional experiments in this subchapter were however required for satisfying the accuracy criterion. The first of these experiments compared cropping against downsampling, the results of which are presented in Section 4.3.1. When it was established that downsampling had a higher potential than cropping, experiments were made estimating the effect of augmentation, presented in Section 4.3.2.

After the effect of augmentation is established to require more research, Section 4.3.3 presents the effect of training a model on images of trees and using the model to classify unseen images, rather than training a model on images of trees and then using that model to classify unseen trees. After establishing the effect of classifying unseen trees rather than unseen images, Section 4.3.4 verifies the learning rates 10^-5 and 10^-6 to be usable learning rates.

The locking of layers and its effect on training time is established in Section 4.3.5, which recommends future research to establish the correlation between over-/underfitting and locking parameters from pretrained architectures.

In Section 4.3.6 the effect of merging multiple classes of trees into their genus is evaluated, establishing a baseline satisfying the success criterion of an accuracy above 90%. Lastly, in Section 4.3.7 the results of this project are compared against [8], with an accuracy 8 percentage points lower than the worst accuracy achieved in [8].

4.3.1 Effect of cropping

Figures 4.3.1.1 and 4.3.1.2 show the effect of cropping respectively downsampling images from the TRUNK12 dataset, images previously established in Section 4.1 to be of the resolution 4000x3000 pixels. The cropping used is defined in Section 3.2 as PIL.Image.Image.crop [27], while the downsampling was established in Section 1.1.3 to be PIL Image.resize [10] using nearest-neighbor downsampling.


Figure 4.3.1.1: Image cropped to the resolution 256x256 px from a pine in the TRUNK12 dataset.

Figure 4.3.1.2: Image downsampled to the resolution 256x256 px from a pine in the TRUNK12 dataset.

The data in Table 4.3.1.1 was gathered from the TRUNK12 dataset with a test set of unique images, not unique trees, using CNN with the VGG19 architecture, trained over 50 training epochs. Based on this data, where downsampling achieved a test accuracy 39 percentage points higher than any of the cropping methods, downsampling was chosen as the image-reduction method for the remaining experiments.

The data in Table 4.3.1.1 may also suggest that a lot of information is lost when using only 256x256 pixels, which is 0.5% of an image with the resolution 4000x3000 pixels, since using 3 parts of 256x256 pixels, which is 1.6% of the image, improves the accuracy by 20 percentage points. This is part of the hypothesis of why [8] first downsamples, to shrink the image, and thereafter crops, to lower the dimension without squeezing the image. The downsampling in this project distorted some images in the Barknet dataset from an aspect ratio of 5:1 (4000x500 pixels) to 1:1, compared with the distortion from 4:3 to 1:1 for TRUNK12.

It is however unclear whether the lower accuracy of this project compared to [8], presented in Section 4.3.7, is caused by the distortion of the images or by the fact that downsampling trains on the entire specific tree, whereas training on crops focuses on features of trees.

                 Training Accuracy   Validation Accuracy   Test Accuracy
Central crop     19%                 29%                   30%
3x random        59%                 53%                   53%
Downsample       79%                 98%                   92%

Table 4.3.1.1: Results from training the above-specified CNN model with VGG19 architecture on a single crop from the center of the image, on 3 random crops from anywhere in the image, and on downsampled images, all taken from TRUNK12.

4.3.2 Effect of Augmentation

The difference between Figure 4.3.2.1 and Figure 4.3.2.2 is generated by the augmentation used in producing Figure 4.3.2.2. This augmentation was generated using the Keras ImageDataGenerator [21], set to use the augmentation arguments specified in Table 4.3.2.1.


Parameter            Argument
shear_range          0.2
zoom_range           0.2
horizontal_flip      True
fill_mode            "nearest"
width_shift_range    0.3
height_shift_range   0.3
rotation_range       30

Table 4.3.2.1: Augmentation arguments.
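A sketch of how these arguments map onto the Keras ImageDataGenerator [21]; the directory path and target size in the commented flow_from_directory call are assumptions, while the batch size of eight matches Section 4.3.4:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # The augmentation arguments from Table 4.3.2.1.
    train_datagen = ImageDataGenerator(
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode="nearest",
        width_shift_range=0.3,
        height_shift_range=0.3,
        rotation_range=30,
    )
    # train_set = train_datagen.flow_from_directory(
    #     "data/train", target_size=(256, 256), batch_size=8,
    #     class_mode="categorical")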

Figure 4.3.2.1: Un-augmented image of a white birch from the Barknet dataset; one can also observe the squeezing inflicted by resizing a tall image into a square image.


Figure 4.3.2.2: Augmented image of a white-birch from the Barknet-dataset.

The results in Table 4.3.2.2 were gathered from CNN models using the previously established VGG19 architecture pretrained on ImageNet.

The models were trained on three different versions of the frequently used validation set, which contains 4,000 images instead of the 14,000 images of the entire barknetx20. This validation set is established to contain unique trees and is defined as "barknetx20_UQval". The three versions are:

1. Augmented using the settings defined above, corresponding to Figure 4.3.2.2.

2. Augmented using only horizontal flipping.

3. Completely un-augmented (only downsampled, as previously established).

These models were tested against a test set of un-augmented unique trees, resulting in the indecisive results of Table 4.3.2.2.

Although the only pattern that can be observed in Table 4.3.2.2 is a suggestion of improved test accuracy when using horizontal flipping, it is strongly encouraged to use augmentation mindfully and to conduct complementary experiments after establishing a baseline resembling [8].


                       Training Accuracy   Validation Accuracy   Test Accuracy
No augmentation        97.2%               89%                   65%
Only horizontal flip   96.8%               90%                   68%
Full augmentation      94.8%               83%                   64%

Table 4.3.2.2: Effect of augmentation.

4.3.3 Effect of unique trees and unique images

Considerations were initially not made to separate trees rather than images, partly because, as suggested by Table 4.3.3.1, TRUNK12 did not allow this separation into unique trees.

One sign that TRUNK12 does not contain unique trees is that Barknet achieved an accuracy only 2 percentage points lower than TRUNK12 using the same CNN VGG19 architecture pretrained on ImageNet and the same number of training epochs and learning rate, and that this test accuracy later dropped to 17 percentage points lower than TRUNK12 when Barknet was separated based on tree-trunk identity instead of image identity (creating barknetx20_UQ).

These signs are insufficient for drawing any conclusions; it is therefore recommended in Section 6.2.1 to further research the phenomenon by taking 1 image each of 12 unique tree trunks for each of the species in TRUNK12, possibly observing the same pattern of dropped test accuracy as the one between barknetx20 and barknetx20_UQ.

The conclusion that image classification of unseen images is not the same thing as image classification of unseen tree trunks meant that all accuracies comparing the results of this project against [8], and against the success criterion of accuracy above 90%, were taken from experiments using unseen images of unseen (unique) tree trunks rather than unseen images of seen tree trunks. By unseen is meant not seen by the model during training.


Dataset                     Train Acc   Val Acc   Test Acc
TRUNK12                     99%         99%       96%
Barknetx20, unique images   99%         85%       94%
Barknetx20, unique trees    99.6%       86%       79%

Table 4.3.3.1: Accuracies (Acc) of the different datasets using learning rate 10^-5 and 100 epochs.

4.3.4 Effect of learning-rates and number of validation-steps

For understanding the results of this and the following sections, it is important to remember that one training epoch, also referenced as "epoch", references the iterative process of passing the entire dataset through the model trainer, or subsets representing it, as specified by the number of training steps and chosen with an equal/random distribution across the classes, as handled by the Keras ImageDataGenerator [21] mentioned in Section 4.3.2.

It is hard to process all images in one batch, as noticed with the rudimental ML-algorithms used in Section 4.2. Instead one can divide the training into batches using keras.Model [24], which handles the gradient descent specified by the optimizer, and ImageDataGenerator, which handles the loading and processing of images. These batches make it possible to divide the training into a number of batches of a specified batch size; these batches are called "training steps".

In this project the validation set was used for validating the performance of each epoch's gradient descent (gradients are in this project updated on each training step). This validation requires specifying a specific number of validation steps.

One of the confusions of this project was caused by an invalid choice of the number of validation steps (the number of training steps was never incorrect), which resulted in unstable results. The choice of validation steps affects the validity of the validation accuracy because the number of validation steps should have been set to 1/8 of the number of validation samples (eight being the batch size used): with a validation set of 3,620 images, roughly 450-460 validation steps are required for the model trainer to validate the model against all available images.
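As a quick check of this arithmetic:

    import math

    # Number of validation steps needed to cover the whole validation set
    # with the batch size of eight used in this project.
    batch_size = 8
    validation_samples = 3620
    validation_steps = math.ceil(validation_samples / batch_size)
    print(validation_steps)   # 453, within the "roughly 450-460" range above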


Figure 4.3.4.1 shows the result of using 40 instead of 460 validation steps, allowing the model trainer to validate against only roughly 9% of the actual validation set. Since this subset is randomly chosen, it still models the performance of the model on unseen images, as can be seen by comparing the red learning curve, representing the validation accuracy for the learning rate 10^-5 in Figure 4.3.4.1, against the yellow one in Figure 4.3.4.2. It is however important to note that, even though it models the performance across the different learning curves, Figure 4.3.4.1 shows a noticeably less stable result than the proper number of validation steps used in Figure 4.3.4.2.

Figure 4.3.4.2 excludes the learning rates 10^-4 and 10^-7, since 10^-7 did not converge within 50 epochs, compared to the 10 epochs of the learning rate 10^-5 and the 30 epochs of 10^-6, while 10^-4 converged at a validation accuracy of 50%, compared with the validation accuracy of 85-90% at which the learning rates 10^-5 and 10^-6 converged.

Further research on the optimal learning rate and its effect on test accuracy is recommended, but the speed at which the model converges with the learning rate 10^-5 is deemed satisfactory, since no difference except the required training time was noticed between 50 epochs with learning rate 10^-5 and 100 epochs with learning rate 10^-6.

Figure 4.3.4.1: Accuracy of different learning rates for the dataset barknetx3 over 50 epochs (Test = Validation).
