
Detecting Spruce Bark Beetle Infestations with Satellite Imagery

PER EMIL HAMMARLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Detecting Spruce Bark Beetle Infestations with Satellite Imagery

PER EMIL HAMMARLUND

Master's Programme, Machine Learning

Date: June 30, 2020

Supervisor: Mårten Björkman / Fredrik Linde / Simon Johansson / Magnus Jutterström

Examiner: Danica Kragic Jensfelt

School of Electrical Engineering and Computer Science

Host company: Sogeti Sverige AB

Swedish title: Detektering av Barkborre Angripningar med satellit Bilder


Abstract

Sveaskog is Sweden's largest forest owner, owning 14 percent of the Swedish forest lands. Recently, due to warmer and drier summers as a consequence of climate change, spruce bark beetles have caused damage at a massive scale.

In 2019, Sogeti developed a promising first product for monitoring the vitality of large areas through the use of Sentinel-2 data, comparing images from the same month between two years. To take the detection of bark beetle infestations to the next stage of development, supervised learning was used. Models were trained on a time series of Sentinel-2 data in conjunction with national land cover data, ground level humidity data, and height data to predict segmentations that represented infestations. Target segmentations were created by clustering GPS points and thresholding a rasterized representation of the generated clusters. In total, four different model architectures were tested (LSTM, GRU, 3D convolutional, and logistic regression) and then evaluated both quantitatively and qualitatively on a test set. It was found that the GRU-based model was best able to identify bark beetle infestations.


Sammanfattning

Sveaskog is Sweden's largest forest owner, with a total land area of 14 percent of the Swedish forest land. Recently, due to warmer and drier summers as a consequence of climate change, the spruce bark beetle has caused damage to the forest on an enormous scale. In 2019, Sogeti developed a promising first result for monitoring the forest at a large scale by comparing the vitality of the forest in Sentinel-2 data from the same month between two years. The results from this first product were promising. To develop the product further, supervised learning was used. Models were trained on a time series of Sentinel-2 data in combination with national land cover data, ground level humidity data, and height data to predict segmentations representing forest that has been infested by the spruce bark beetle. The target segmentations that the models were trained against were created by clustering harvester data and filtering the rasterized representation of the generated clusters. In total, four different models were trained (LSTM, GRU, 3D convolutional, and logistic regression), which were then evaluated on a final test set both quantitatively and qualitatively. From this, the conclusion was drawn that the GRU-based model was best at identifying forest that has been infested by the spruce bark beetle.


Contents

1 Introduction
  1.1 Background
  1.2 Aim and Objective
  1.3 Delimitations
  1.4 Methodology
  1.5 Research Question and Hypothesis
  1.6 Benefits, Ethics and Sustainability
  1.7 Definitions

2 Theory
  2.1 Background
    2.1.1 Spruce Bark Beetles
    2.1.2 Data sources
    2.1.3 Feed-forward neural networks
    2.1.4 Overfitting and how it can be overcome
    2.1.5 Convolutional Neural Networks
    2.1.6 Recurrent Neural Networks
  2.2 Related Work
    2.2.1 Remote Sensing
    2.2.2 Biomedical Image segmentation

3 Methods
  3.1 Software packages and hardware
    3.1.1 Software packages
    3.1.2 Hardware
  3.2 Data preprocessing
    3.2.1 Sentinel 2 data
    3.2.2 Land cover, humidity, and height data
    3.2.3 Harvester data
    3.2.4 The trial and error of creating a trainable dataset
  3.3 Model selection
    3.3.1 Recurrent model selection
    3.3.2 3D convolutional model selection
  3.4 Experimental setup
    3.4.1 Dataset
    3.4.2 Training considerations
    3.4.3 Early stopping and model saving
    3.4.4 Calibration of model outputs

4 Results
  4.1 Quantitative results
    4.1.1 Sensitivity, Specificity, Balanced Accuracy, Precision, F1, and Support
    4.1.2 Confusion Matrices
  4.2 Qualitative results
    4.2.1 Training-sized regions
    4.2.2 Full 5 × 5 km region

5 Discussion
  5.1 Preparation of dataset
  5.2 Model performance evaluation
  5.3 Conclusion
  5.4 Future work

Bibliography

A Results from earlier iterations of dataset
  A.1 First iteration result
  A.2 Second iteration results

B Grid search results

1 Introduction

Chapter 1 lays out the introduction of the degree project. First, the background information necessary for the reader to understand the problem at hand is presented. Then the aim, delimitations, and chosen methodology are presented. Finally, the research question as well as the hypothesis are presented, concluding the introductory chapter.

1.1 Background

Bark beetles pose a threat to the vitality of the forest and can create damage on a massive scale. According to a report from the Swedish University of Agricultural Sciences written by Åke Lindelöw, bark beetles killed approximately 2–3 million m³ of trees in Götaland in 2018, the largest destruction of forest in a single year [1].

In order to maintain a healthy forest, it is crucial that infestations can be identified and remedied before they spread, causing more damage.

Bark beetle infestations are difficult to find due to the impracticality of current conventional methods:

• Performing searches at ground level is time-consuming

• Using aerial photos is costly when the interest is in how the forest changes over time

• Drones are not practical for monitoring entire forest regions


• Solely using the RGB color space is not sufficient since much of the information that indicates where infestations could be located is found in the infrared spectrum.

To address this problem, satellite data is well suited for training a model using machine learning, due to its rapid availability, relatively high and consistent temporal frequency, and coverage of large areas. It will not be sufficient on its own, though, since changes in vitality that can be seen in satellite data could have several causes, including harvesting, forest fires, etc. Therefore, it is necessary to combine the available satellite data with other data sources such as ground level humidity, frost and ice formation, terrain roughness, and metadata such as location and time. When combined, there will be sufficient information to train a model using machine learning for the detection of infestations.

Over the course of 2018–2019, harvesters have collected data from trees that have shown signs of being infested. When this annotation was performed, the GPS position, date and time, volume of the log with and without bark, and log id were registered. More than 100 000 trees were annotated in this manner. This set of annotated trees will be the ground truth data for creating the training and test datasets.

The main challenge of this degree project will be the creation of a dataset that makes it possible to train models using machine learning to identify spruce bark beetle infestations.

1.2 Aim and Objective

The aim of this degree project is to be able to identify spruce bark beetle infestations. The objective is to train models using logistic regression, recurrent neural networks, and 3D convolutional neural networks, and compare how they perform both qualitatively and quantitatively.

1.3 Delimitations

Due to the constraint of only being able to train using the available 100 000 annotated trees as ground truth for where infestations are located, the problem will not be addressed on a general level and is limited to regions containing trainable data.


1.4 Methodology

Work will begin with a literature study and familiarization with the data and tools that will be used in the degree project. In order to perform a controlled experiment, all methods will be compared on the same test set. The empirical data gathered from these tests will be the basis for the evaluation of the chosen models.

1.5 Research Question and Hypothesis

Research Question:

Of logistic regression, recurrent neural networks in the form of both LSTMs and GRUs, and 3D convolutional neural networks, which model is best able to identify spruce bark beetle infestations, both quantitatively and qualitatively?

Hypothesis:

It is possible to identify where bark beetle infestations are located using machine learning.

1.6 Benefits, Ethics and Sustainability

As explained earlier in section 1.1, spruce bark beetles damage forest at a massive scale, and conventional methods that rely on drones, physically inspecting forest, or flying over selected regions are costly, making the monitoring of large regions impractical. Using satellite data in conjunction with data on background conditions makes the monitoring of entire forests feasible, and could be key in identifying where forest should be cleared to mitigate the damage caused by spruce bark beetles.

The monitoring of forest vitality is a fairly innocent endeavour, but there is an interesting ethical question with respect to sustainability. Are we humans justified in trying to reduce the population of spruce bark beetles? If warm and dry summers are leading to greater bark beetle populations, is this not simply a consequence of natural selection? Personally, I would say that the reduction of spruce bark beetles is justified, since other species rely on trees for their survival, and if we let the spruce bark beetle population grow, there might be larger ecological consequences. Moreover, the hotter and drier summers are a consequence of climate change, so in trying to mitigate bark beetle infestations we are mitigating a consequence of our own ecological impact.

1.7 Definitions

Definition 1 (Sensitivity)

The proportion of true positives that are correctly identified, where TP is the number of True Positives and FN is the number of False Negatives [2]:

$$\text{sensitivity} = \frac{TP}{TP + FN}.$$

Definition 2 (Specificity)

The proportion of true negatives that are correctly identified, where TN is the number of True Negatives and FP is the number of False Positives [2]:

$$\text{specificity} = \frac{TN}{TN + FP}.$$

Definition 3 (Balanced accuracy)

The average of sensitivity and specificity [3]:

$$\text{balanced accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2},$$

which gives a balanced measure of accuracy when the number of positives and the number of negatives in the dataset is unbalanced.

Definition 4 (Precision)

The number of true positives divided by the sum of true positives and false positives, where TP is the number of True Positives and FP is the number of False Positives [4]:

$$\text{precision} = \frac{TP}{TP + FP}.$$

Definition 5 (F1)

The harmonic mean of precision and sensitivity [4]:

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}.$$
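As a concrete illustration of definitions 1–5, the following NumPy sketch computes all five metrics from binary label arrays. It is an illustrative example rather than code from the thesis; the function name and input format are chosen for this example, and both classes are assumed to occur so that no denominator is zero.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute the metrics of definitions 1-5 for binary labels (0 or 1)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.sum(y_pred & y_true)    # true positives
    tn = np.sum(~y_pred & ~y_true)  # true negatives
    fp = np.sum(y_pred & ~y_true)   # false positives
    fn = np.sum(~y_pred & y_true)   # false negatives

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    return {"sensitivity": sensitivity, "specificity": specificity,
            "balanced accuracy": balanced_accuracy,
            "precision": precision, "F1": f1}
```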


2 Theory

2.1 Background

This chapter begins by presenting relevant background information that describes the problem and how it is addressed. It then continues with a quick overview of the theoretical background behind feed-forward, convolutional, and recurrent neural networks. Finally, related work is presented from both remote sensing and biomedical image segmentation.

2.1.1 Spruce Bark Beetles

Spruce bark beetles are members of the bark beetle family. They are approximately four millimeters in length and reproduce most easily in stressed or windthrown trees. After the extremely dry summer of 2018, large numbers of trees became stressed, leading to a decrease in resilience and beneficial conditions for spruce bark beetles. Bark beetles communicate via scents. The reproductive process starts with a male bark beetle detecting a stressed tree. The male bark beetle starts to drill itself into the tree, and the tree tries to defend itself by drowning the bark beetle in sap. The bark beetle then uses the tree sap to produce a pheromone that it releases, signaling other bark beetles to come to the tree. Bark beetles that are passing by detect the scent and start to gather around the tree, and the male bark beetles start to drill into the tree. The tree continues to try to drown the attacking bark beetles, but eventually it is not able to produce enough sap to defend itself. Once the bark beetles have overwhelmed the tree, the males release a pheromone signaling the female bark beetles to come closer, and the females enter the mating chambers that the males have prepared.


Figure 2.1: A spruce bark beetle, image collected from [5]

Each female bark beetle then digs a straight, approximately 10 cm long tunnel and lays her eggs in the sides of the tunnel. The larvae eat outwards from the tunnel and pupate after 5–6 weeks. The new bark beetles hatch a week later and stay feeding off the tree for a few more weeks, until their exoskeletons have gained enough strength for them to leave the tree. The whole process takes approximately 8–10 weeks from the initial swarming [6].

2.1.2 Data sources

Harvester data

In order to train a model using supervised learning, some form of ground truth label is necessary. Here the ground truth comes in the form of spatiotemporal points (i.e. points in both time and space) created by forest harvesters annotating trees that have shown signs of having been infested by bark beetles. Each registered point corresponds to a tree, and each point contains a GPS position, a date, the quantity of the tree that could be used for timber, and the quantity of the tree that could be used for the production of pulp (material for the production of paper). The GPS position was used to annotate where infestations are located, the date was used to annotate when the infestation happened, and the quantities of pulp and timber wood were used to determine the severity of the infestation. The GPS position is registered from the center of the harvester, so the actual position of the tree can be anywhere within a radius that corresponds to the maximum length of the harvester's end effector. A sample of the harvester data can be seen in figure 2.2.


Figure 2.2: Visualization of a sample of the harvester data

Sentinel 2 data

Sentinel-2 data is produced by the European Space Agency's Copernicus Sentinel-2 mission. It is comprised of a constellation of two satellites phased 180° from each other [7].

To observe how the forest's vitality changes over time, it is necessary to have a data source with a relatively high temporal frequency (i.e. short time between images over the same area) that is able to cover large areas. Satellite data from Sentinel-2 is particularly useful in this respect, with a temporal resolution of 5 days, a swath of 290 km, and a spatial resolution of 10 m (for the bands used in this thesis) [8]. In total there are 13 spectral bands in the multispectral images produced by Sentinel-2; the full list of spectral bands can be seen in table 2.1.

National land cover data

National land cover data ("Nationella Marktäckedata" in Swedish) is a full land cover mapping of Sweden. The current version was created from a mapping performed over the course of 2017 to 2019. It is comprised of 25 different thematic classes at three different hierarchical levels [9]. National land cover data is useful since it allows regions that are not forest to be disregarded, and it can provide useful contextual information for the classifier producing the segmentation. A full visualization of the land cover classes can be seen in figure 2.3, and a visualization of a sample of the national land cover data can be seen in figure 2.4a.


Band number   Band name                             Spatial resolution (m)
Band 1        Coastal aerosol                       60
Band 2        Blue                                  10
Band 3        Green                                 10
Band 4        Red                                   10
Band 5        Vegetation red edge                   20
Band 6        Vegetation red edge                   20
Band 7        Vegetation red edge                   20
Band 8        Near Infrared (NIR)                   10
Band 8a       Narrow NIR                            20
Band 9        Water vapour                          60
Band 10       Short Wave Infrared (SWIR) – Cirrus   60
Band 11       SWIR                                  20
Band 12       SWIR                                  20

Table 2.1: The available bands in Sentinel-2 data

Height data

To further complement the national land cover data, object height data was also used. Height data indicates the age of the trees in a given region, which gives a measure of the biomass that bark beetles can feed off. There are four different object height data products available from the Swedish Environmental Protection Agency:

1. Object heights between 0.5 and 5 meters.

2. Object heights between 5 and 45 meters.

3. Density of objects with heights between 0.5 and 5 meters.

4. Density of objects with heights between 5 and 45 meters.

Object heights between 0.5 and 5 meters indicate the occurrence of low objects such as bushes, while object heights above 5 meters indicate tall objects such as forest. For this project, the second product was used. The full class labeling of the selected height data can be seen in table 2.2, and a visualization of a sample of the height data can be seen in figure 2.4b [10, 11].


National land cover data classes:

1. Forest
   1.1 Forest not on wetlands (1.1.1 Pine, 1.1.2 Spruce, 1.1.3 Mixed coniferous, 1.1.4 Deciduous coniferous, 1.1.5 Common deciduous, 1.1.6 Broadleaf, 1.1.7 Common deciduous and broadleaf, 1.1.8 Temporarily not forest)
   1.2 Forest on wetlands (1.2.1 Pine, 1.2.2 Spruce, 1.2.3 Mixed coniferous, 1.2.4 Deciduous coniferous, 1.2.5 Common deciduous, 1.2.6 Broadleaf, 1.2.7 Common deciduous and broadleaf, 1.2.8 Temporarily not forest)
2. Open wetlands
3. Arable land
4. Other open land (4.1 Without vegetation, 4.2 With vegetation)
5. Exploited land (5.1 Building, 5.2 Other, 5.3 Road)
6. Water (6.1 Lakes and watercourses, 6.2 Sea)

Figure 2.3: Visualization of land cover classes from the national land cover data [9]

Value        10     15      20      25      30      45     255
Height (m)   5–10   10–15   15–20   20–25   25–30   > 30   nodata

Table 2.2: Class encoding of object height data 5–45 meters [10, 11]

Ground level humidity data

Finally, ground level humidity data was also included. Ground level humidity data is useful since soil moisture effects the type of vegetation that thrives in a given area [12]. The ground level humidity data used in this thesis is provided by the Swedish environmental protection agency and comes in the form of a soil wetness index. The soil wetness index is scale encoded to values from 0-240 where 240 is the most wet and 0 is the driest [13]. A visualization of the ground level humidity index can be seen in figure 2.4c.


Figure 2.4: Visualization of a sample of (a) the national land cover data, (b) the height data, and (c) the ground level humidity data

2.1.3 Feed-forward neural networks

This section starts with an explanation of one of the simplest forms of neural network; it then continues with k-layered neural networks and concludes with an overview of how to overcome the issues that arise when using deeper neural networks.

Multi-class Logistic Regression

Logistic regression is the simplest form of neural network: given an n-dimensional input vector x, it computes a prediction y using a single layer. The forward pass is performed by computing the linear scores

$$s = Wx + b,$$

applying the softmax

$$y_i = \frac{e^{s_i}}{\sum_{j} e^{s_j}},$$

and taking the class with the highest predicted probability, $\tilde{C} = \arg\max_i y_i$. Figure 2.5 illustrates this computation for a 4-class problem.

The predictions produced by the model are then compared to the ground truth ȳ, where ȳ is a one-hot encoded vector:

$$\bar{y}_i = \begin{cases} 1 & \text{if } i \text{ is the ground truth class} \\ 0 & \text{if } i \text{ is not the ground truth class} \end{cases} \qquad (2.1)$$

Now, let D denote a set of labeled training data with N training examples, $D = \{x_i, \bar{y}_i\}_{i=0}^{N-1}$, and let l be the loss function (cross-entropy) between the two probability density vectors ȳ and y, $l = -\bar{y}^T \log(y)$. To find the parameter settings W and b that maximize the accuracy at which the network predicts the correct class label $\tilde{C}$, the objective is to optimize:


Figure 2.5: Logistic regression for a 4-class problem. The variables x0–x6 are the scalars that make up the observation x, s0–s3 are the linear scores for classes 0–3, y0–y3 are the final predicted probabilities for classes 0–3, and C̃ is the class with the highest predicted probability.

$$\arg\min_{W,b} \left\{ \frac{1}{|D|} \sum_{i=0}^{N-1} l(\bar{y}_i, y_i) \right\} = \arg\min_{W,b} \left\{ \frac{1}{|D|} \sum_{i=0}^{N-1} -\bar{y}_i^T \log(y_i) \right\} \qquad (2.2)$$

$$= \arg\min_{W,b} \left\{ \frac{1}{|D|} \sum_{i=0}^{N-1} -\bar{y}_i^T \log\left( \frac{e^{W x_i + b}}{\sum_{j} \left(e^{W x_i + b}\right)_j} \right) \right\}$$

In practice, the optimization is performed in mini-batches. Mini-batches are small subsets of the dataset D; training is performed by approximating the gradients $g_b = \frac{\partial l}{\partial b}$ and $g_W = \frac{\partial l}{\partial W}$ in each training step using one mini-batch at a time. The approximated gradients are then used to update the parameters of the network:

$$b \leftarrow b - \alpha \frac{\partial l}{\partial b}, \qquad W \leftarrow W - \alpha \frac{\partial l}{\partial W}, \qquad (2.3)$$

where α is the learning rate, the size of each step that the model takes when optimizing towards the minimum of the loss function l [14].
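To make the forward pass and the mini-batch update concrete, the following is a minimal NumPy sketch of equations (2.2)–(2.3). It is not the thesis implementation; the shapes, learning rate, and names are illustrative, and it uses the standard softmax/cross-entropy gradient ∂l/∂s = y − ȳ.

```python
import numpy as np

def softmax(s):
    # Subtract the row max for numerical stability; softmax is shift-invariant.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def minibatch_step(W, b, X, Y_onehot, alpha=0.1):
    """One gradient step of multi-class logistic regression on a
    mini-batch X (m x n) with one-hot targets Y_onehot (m x k)."""
    m = X.shape[0]
    Y_hat = softmax(X @ W.T + b)    # forward pass: s = Wx + b, y = softmax(s)
    G = (Y_hat - Y_onehot) / m      # d(cross-entropy)/ds, averaged over the batch
    W -= alpha * G.T @ X            # W <- W - alpha * dl/dW
    b -= alpha * G.sum(axis=0)      # b <- b - alpha * dl/db
    loss = -np.mean(np.sum(Y_onehot * np.log(Y_hat + 1e-12), axis=1))
    return W, b, loss
```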


Going deeper

Figure 2.6: Simple 4-layered neural network for 2-class classification with ReLU

While a single-layered neural network is able to perform well in a variety of applications, it is often beneficial to go deeper. K-layered neural networks extract features using multiple layers by letting the outputs of the previous layer be the inputs of the next layer. Each layer is computed in the same manner as in the single-layered model: the linear scoring function $s_i = W_i x_i + b_i$ is computed on the layer's input $x_i$, the activation function is applied (three typical activation functions can be seen in figure 2.7), and the output of the i'th layer is then used as the input for the (i+1)'th layer.

Figure 2.7: Typical activation functions (Sigmoid, tanh, ReLU)

Two main issues are created when going deeper: 1. Since the gradients for the parameters of the earlier layers are computed by the chain rule, as the number of layers grows, so does the number of multiplications back to the earlier layers, leading to vanishing and exploding gradients. 2. The input to each layer changes as the parameters of the previous layers are adjusted, slowing down training and making the parameter initialization sensitive. To address this, Ioffe and Szegedy [15] introduced batch normalization. Batch normalization works by computing the mean μ_B and variance σ²_B of a mini-batch; the outputs of the linear scoring function s_i are then normalized, and scaled and shifted to maintain the expressivity of the current layer. This process can be seen in algorithm 1.

Algorithm 1 Batch normalization applied to a mini-batch

1:  ▷ s is the output of the linear scoring function in the current layer
2:  ▷ γ and β are learned parameters for maintaining the expressivity of the current layer
3:  function batch_norm(s, γ, β)
4:      ▷ Compute the mini-batch mean and variance
5:      μ_B ← (1/m) Σ_{i=0}^{m−1} s_i
6:      σ²_B ← (1/m) Σ_{i=0}^{m−1} (s_i − μ_B)²
7:      ▷ Normalize (ε is a small constant for avoiding division by zero)
8:      ŝ_i ← (s_i − μ_B) / √(σ²_B + ε)
9:      ▷ Scale and shift
10:     y_i ← γ ŝ_i + β
11:     return y
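A direct NumPy translation of algorithm 1 might look as follows. This sketch covers the training-time computation only; a complete implementation would also track running averages of μ_B and σ²_B for use at inference time.

```python
import numpy as np

def batch_norm(s, gamma, beta, eps=1e-5):
    """Algorithm 1 applied to a mini-batch s of shape (m, features)."""
    mu = s.mean(axis=0)                    # mini-batch mean
    var = s.var(axis=0)                    # mini-batch variance
    s_hat = (s - mu) / np.sqrt(var + eps)  # normalize
    return gamma * s_hat + beta            # scale and shift
```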

To further address the issue of vanishing and exploding gradients, it is also important to have a proper initialization of the model weights, enabling the propagation of information through the network. If the weights are too small, the signal will grow smaller for each layer and eventually vanish; if the weights are too large, the signal will grow ever larger as it is propagated through the network, leading to exploding values. To address this, Glorot and Bengio [16] proposed what is known as Xavier (or Glorot) initialization, and He et al. [17] proposed He initialization. Xavier initialization initializes the weights of a given layer from a distribution with mean 0 and variance 1/N, where N is the number of input neurons to the current layer. He initialization is similar to Xavier initialization in that it also adapts the variance of the distribution that model weights are sampled from, with the difference that the variance is larger by a factor of 2 (i.e. 2/N). Both Xavier and He initialization use a similar theoretical basis for deriving an appropriate variance for weight initialization, reasoning that the signal should have the same variance before and after passing through each layer of the network.
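As a sketch, the two schemes differ only in the variance of the sampling distribution. Gaussian sampling is assumed here; uniform variants with the same variance are also common.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Weights drawn with variance 1/N, N = number of input neurons.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def he_init(n_in, n_out):
    # As Xavier, but with variance 2/N, suited to ReLU activations.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```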

2.1.4 Overfitting and how it can be overcome

Overfitting occurs when the model becomes so closely fitted to the training data that it no longer generalizes well to unseen data. Figure 2.8 illustrates overfitting. In the following subsections, three approaches for reducing the amount of overfitting are explained.

Figure 2.8: An illustration of overfitting. The black line is the desired decision boundary, the green line is the current decision boundary as a consequence of overfitting, and the red and blue dots represent datapoints from each class. Image collected from Chabacano [18]

L2 Regularization and Dropout

On a high level, L2 regularization is a method that controls the flexibility of a model by penalizing large model weights. By making the cost of using large weights high, the model will stay within a range of configurations that is less prone to overfitting [19]. Regularization can take other forms as well, such as randomly dropping nodes from the network during training, a technique known as dropout. Dropout is a regularization technique specifically made for neural networks. When using dropout, nodes are randomly and temporarily removed, along with their ingoing and outgoing connections, from the network during training. This prevents nodes from co-adapting too much, leading to better generalization. Dropout was found by Srivastava et al. [20] to significantly reduce overfitting and give major improvements over other regularization techniques. A visualization of dropout can be seen in figure 2.9.

Figure 2.9: A visualization of dropout. Image collected from Srivastava et al. [20]
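Since Keras is used for training in this thesis (see section 3.1.1), the following hypothetical snippet shows where an L2 penalty and dropout attach in a Keras model. The layer sizes and rates are illustrative only, not those of the thesis models.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical model: 26 input features, two output classes.
model = keras.Sequential([
    keras.Input(shape=(26,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on large weights
    layers.Dropout(0.5),  # randomly drop half of the activations during training
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```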

Data augmentation

When the amount of training data is limited, the model will not be able to learn invariances, leading to worse generalization. In an ideal world, it would be best to collect more data and create a training dataset that better represents each class, but this is often not an option. Instead, one can "simulate" more training data than what is actually available by creating observations that are variations of the original training data, while not destroying the features that the model is supposed to learn. This can include operations such as mirroring and applying affine transformations to images, reversing sequences, and adding white noise to observations.
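A minimal sketch of two such operations (mirroring and white noise) applied on the fly with NumPy; the noise scale is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Randomly mirror an image and add white noise: one way of simulating
    more training data without destroying the features to be learned."""
    if rng.random() < 0.5:
        image = np.flip(image, axis=-1)                      # horizontal mirror
    image = image + rng.normal(0.0, 0.01, image.shape)       # additive white noise
    return image
```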

2.1.5 Convolutional Neural Networks

Ian Goodfellow describes convolutional neural networks as "a form of neural network made for processing data with grid-like topology" [14]. They achieve this ability through the use of two layer types: convolutional layers and pooling layers. In the following subsections, these layers are explained.

Convolutional layers

Figure 2.10: A convolutional operation in the two-dimensional discrete case with zero padding (worked example: 0·1 + 0·0 + 0·(−1) + 0·2 + 13·0 + 95·(−2) + 0·1 + 93·0 + 95·(−1) = −285)

Convolutional layers make use of a mathematical operation known as a convolution. When performing a convolution, a filter f(a) slides over all values of another function g(a) within a domain to produce a new function h(x). This process of sliding a filter over a function is known as convolving. A one-dimensional convolution in the continuous case takes the form [14]:

$$h(x) = (f * g)(x) = \int f(a)\, g(x - a)\, da,$$

and in the discrete one-dimensional form:

$$h(x) = (f * g)(x) = \sum_{\forall a} f(a)\, g(x - a).$$

The convolutional operation can easily be extended to two dimensions; this takes the continuous form [14]:

$$h(x, y) = (f * g)(x, y) = \iint f(a, b)\, g(x - a, y - b)\, da\, db,$$

and the discrete form:

$$h(x, y) = (f * g)(x, y) = \sum_{\forall a} \sum_{\forall b} f(a, b)\, g(x - a, y - b).$$

Convolutional layers learn the values of n filters f, where each filter has a kernel size and each filter produces a feature channel. The kernel size is the size of the window of the filter f, and feature channels are convolved representations of the input. These feature channels are in turn fed into an activation function to produce the final output. Typically, a convolutional layer consists of three stages: 1. a number of convolutions are performed in parallel, 2. the activations from the convolutional operations are fed into a nonlinear activation function, and 3. a pooling layer is used to modify the output of the convolutional layer further [14].
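The discrete two-dimensional convolution above can be computed directly, if inefficiently, with two loops. The following NumPy sketch uses zero padding as in figure 2.10 and flips the kernel to match the mathematical definition (deep learning frameworks typically implement the unflipped variant, cross-correlation). It assumes odd kernel dimensions.

```python
import numpy as np

def conv2d(g, f):
    """Discrete 2D convolution h = f * g with zero padding; g is the input
    image, f the filter (both 2D arrays, f with odd dimensions)."""
    kh, kw = f.shape
    ph, pw = kh // 2, kw // 2
    g_pad = np.pad(g, ((ph, ph), (pw, pw)))        # zero padding
    h = np.zeros(g.shape, dtype=float)
    for y in range(g.shape[0]):
        for x in range(g.shape[1]):
            window = g_pad[y:y + kh, x:x + kw]
            # Flipped kernel realizes the sum of f(a, b) g(x - a, y - b).
            h[y, x] = np.sum(f[::-1, ::-1] * window)
    return h
```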

Pooling layers

Pooling layers are a form of dimensionality reduction that extracts features; two common pooling operations can be seen in figure 2.11. To the left, in figure 2.11a, average pooling can be seen: the dimensionality is reduced by computing the average of each n × n window. To the right, in figure 2.11b, max pooling can be seen: the maximum value is extracted from each window.

Figure 2.11: Visualization of pooling layers with kernel size 2 × 2 and stride 2: (a) average pooling, (b) max pooling
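Both pooling operations in figure 2.11 can be expressed as one reshape followed by a reduction. A small NumPy sketch, assuming non-overlapping windows (stride equal to the kernel size):

```python
import numpy as np

def pool2d(x, k=2, op=np.max):
    """Non-overlapping k x k pooling with stride k over a 2D array;
    pass op=np.mean for average pooling, op=np.max for max pooling."""
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]             # drop ragged edges
    blocks = x.reshape(h // k, k, w // k, k)  # group into k x k windows
    return op(blocks, axis=(1, 3))            # reduce each window
```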

2.1.6 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a family of neural networks that are used for processing sequences of data. There are a number of design patterns that an RNN can have, including but not limited to the following:

1. Sequence to vector models

2. Vector to sequence models

3. Sequence to sequence models

Sequence to vector models are used when the objective is to classify a sequence of data or find a vector representation of a sequence of data. Vector to sequence models are used for generating sequences of data from vectors; one example is image-to-sentence models. Sequence to sequence models are used when the objective is to generate a new sequence from an observed sequence; a typical application is language translation. In this thesis, the problem is finding a class from an observed sequence of images; therefore, the first type of RNN, sequence to vector models, is of interest.

Figure 2.12: A vanilla RNN for sequence to vector classification

RNNs achieve their ability to process arbitrarily long sequences of data by using shared parameters over time. A vanilla RNN for sequence to vector classification can be seen in figure 2.12. The forward propagation for the RNN begins with a specification of the initial state h^(0); then the following set of equations is used for t ∈ {1, · · · , τ} [14]:

$$a^{(t)} = W h^{(t-1)} + U x^{(t)}, \qquad (2.4)$$

$$h^{(t)} = \tanh\left(a^{(t)}\right), \qquad (2.5)$$

in which τ is the number of time steps in the sequence, W is the weight matrix for the previous time step's hidden state h^(t−1), U is the weight matrix for the current observation x^(t), and a^(t) is the tentative value for the current time step's hidden state. Finally, for the final element in the sequence:

$$o = c + V h^{(\tau)}, \qquad (2.6)$$

$$\hat{y} = \mathrm{SoftMax}(o), \qquad (2.7)$$

where c is the bias, V is the weight matrix for the final time step's hidden state h^(τ), and ŷ is the predicted probability for each class. This gives the loss function $L = -\log p(y \mid \{x^{(1)}, \cdots, x^{(\tau)}\})$. The computation of the gradient with respect to the loss function is computationally expensive due to the need for both a forward pass and a backward pass that cannot be parallelized, owing to the inherently sequential form of the data. The backward pass is performed by propagating the loss backwards through time, a process known as Backpropagation Through Time (BPTT). In theory, RNNs should be able to learn dependencies of arbitrary length. However, in practice this is not the case: since gradients are propagated through time, as the length of the sequences grows, gradients vanish and explode [14]. In the following subsections, two nodes that were made to address the problem of vanishing and exploding gradients are explained.
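First, though, the full sequence-to-vector forward pass of equations (2.4)–(2.7) fits in a few lines of NumPy. This is an illustrative sketch, with the bias in (2.4) omitted as in the text.

```python
import numpy as np

def rnn_sequence_to_vector(xs, h0, W, U, V, c):
    """Forward pass of equations 2.4-2.7: fold a sequence of observations
    xs into a hidden state, then classify the final hidden state."""
    h = h0
    for x in xs:                  # t = 1, ..., tau
        a = W @ h + U @ x         # (2.4) tentative hidden state
        h = np.tanh(a)            # (2.5) hidden state
    o = c + V @ h                 # (2.6) class scores from final hidden state
    e = np.exp(o - o.max())
    return e / e.sum()            # (2.7) softmax over class scores
```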

Long Short Term Memory

Figure 2.13: An LSTM node; ⊙ is the element-wise product. The visualization is based on the figures from Olah [21]

Long Short Term Memory (LSTM) addresses the issue of vanishing and exploding gradients for long-term dependencies by introducing an internal self-loop through the use of a cell state C_t. This internal cell state is regulated between time steps through the use of "gates", additional internal hidden units that use element-wise products and sigmoid activation layers to determine what information gets to flow through time. An LSTM node contains three gates: 1. a forget gate f_t, 2. an input gate i_t, and 3. an output gate o_t [14, 22]. The flow of computation for the variation of the LSTM node used in the scope of this project is visualized in figure 2.13.

The forget gate f_t decides what information is kept from the previous cell state C_{t−1}, the input gate i_t controls what new information should be introduced into the cell state C_t, and the output gate o_t regulates what the hidden state h_t for the current time step should be. All gates take the previous time step's hidden state h_{t−1} and the current observation x_t as input and are computed with the following set of equations:

$$f_t = \sigma\left(x_t U^f + h_{t-1} W^f\right), \qquad (2.8)$$

$$i_t = \sigma\left(x_t U^i + h_{t-1} W^i\right), \qquad (2.9)$$

$$o_t = \sigma\left(x_t U^o + h_{t-1} W^o\right), \qquad (2.10)$$

in which U^f, U^i, and U^o are the weight matrices for the current observation x_t, and W^f, W^i, and W^o are the weight matrices for the previous time step's hidden state h_{t−1}. A tentative cell state C̃_t is also computed in much the same manner, with the exception of the use of a tanh activation function instead of a sigmoid activation function:

$$\tilde{C}_t = \tanh\left(x_t U^g + h_{t-1} W^g\right), \qquad (2.11)$$

where once again U^g is the weight matrix for the current observation x_t, and W^g is the weight matrix for the previous time step's hidden state h_{t−1}. The tentative cell state C̃_t is then combined with the previous time step's cell state C_{t−1}, the input gate i_t, and the forget gate f_t to compute the current time step's cell state C_t (⊙ is the element-wise product):

$$C_t = \tanh\left(f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\right), \qquad (2.12)$$

and the hidden state for the current time step h_t is then finally computed:

$$h_t = C_t \odot o_t. \qquad (2.13)$$
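One step of this LSTM variant, written out in NumPy with a row-vector convention (x_t and h_{t−1} as row vectors, multiplied from the left); the parameter container p is an illustrative choice, not from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One step of equations 2.8-2.13; p holds the weight matrices
    Uf/Wf, Ui/Wi, Uo/Wo, Ug/Wg as NumPy arrays."""
    f = sigmoid(x_t @ p["Uf"] + h_prev @ p["Wf"])        # forget gate (2.8)
    i = sigmoid(x_t @ p["Ui"] + h_prev @ p["Wi"])        # input gate (2.9)
    o = sigmoid(x_t @ p["Uo"] + h_prev @ p["Wo"])        # output gate (2.10)
    C_tilde = np.tanh(x_t @ p["Ug"] + h_prev @ p["Wg"])  # tentative cell state (2.11)
    C = np.tanh(f * C_prev + i * C_tilde)  # cell state (2.12), as in the thesis variant
    h = C * o                              # hidden state (2.13)
    return h, C
```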


Figure 2.14: A GRU node; ⊙ is the element-wise product. The visualization is based on the figures from Olah [21]

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) can be seen as a simplified version of an LSTM, with the main simplification being the use of a single gate for controlling the flow of information from the current and previous time steps. There are two gates in the GRU cell, a reset gate r_t and an update gate z_t. Both gates are computed from the current time step's observation x_t and the previous time step's hidden state h_{t−1}:

$$r_t = \sigma\left(x_t U^r + h_{t-1} W^r\right), \qquad (2.14)$$

$$z_t = \sigma\left(x_t U^z + h_{t-1} W^z\right), \qquad (2.15)$$

where U^r and U^z are the weight matrices for the current observation x_t, and W^r and W^z are the weight matrices for the previous time step's hidden state h_{t−1}. As can be seen in figure 2.14, the reset gate r_t is then combined with the previous time step's hidden state h_{t−1} to compute the tentative hidden state h̃_t:

$$\tilde{h}_t = \tanh\left(x_t U^{\tilde{h}} + r_t \odot W^{\tilde{h}} h_{t-1}\right), \qquad (2.16)$$

in which U^h̃ and W^h̃ are the weight matrices for the current time step's observation x_t and the previous time step's hidden state h_{t−1}, respectively. Finally, the tentative hidden state h̃_t is combined with the previous time step's hidden state h_{t−1} and the update gate z_t to compute the hidden state for the current time step h_t [14, 23]:

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t. \qquad (2.17)$$


As before, this is one of a number of different variations of the GRU node, but the version explained here is the one used in the scope of this thesis.
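The corresponding single step of the GRU variant in equations (2.14)–(2.17), in the same row-vector NumPy convention as the LSTM sketch above; the parameter container p is again an illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One step of equations 2.14-2.17; p holds Ur/Wr, Uz/Wz, Uh/Wh."""
    r = sigmoid(x_t @ p["Ur"] + h_prev @ p["Wr"])              # reset gate (2.14)
    z = sigmoid(x_t @ p["Uz"] + h_prev @ p["Wz"])              # update gate (2.15)
    h_tilde = np.tanh(x_t @ p["Uh"] + r * (h_prev @ p["Wh"]))  # tentative hidden state (2.16)
    return z * h_prev + (1 - z) * h_tilde                      # hidden state (2.17)
```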


2.2 Related Work

This chapter consists of two main sections: the first gives a glance into the field of remote sensing, and the second gives a short overview of some related problems in biomedical image segmentation, a relevant field since the spread of a spruce bark beetle infestation can, in a sense, be seen as the spread of a disease.

2.2.1 Remote Sensing

Deep Recurrent Neural Networks for Winter Vegetation Quality Mapping via Multitemporal SAR Sentinel-1

The Sentinel-1 mission is the first of five missions of the European Space Agency's Copernicus initiative. Sentinel-1 data is produced by two polar-orbiting satellites performing C-band synthetic aperture radar imaging, enabling them to acquire images regardless of weather [24].

In this paper, Minh et al. [25] present a method for winter vegetation quality mapping through the use of two different types of Deep Recurrent Neural Networks (DRNNs) and compare the performance of their DRNNs to that of Random Forest (RF) and Support Vector Machines (SVMs). Though one is able to vectorize the temporal aspect of the satellite data and then perform classifications with RF or SVMs on the resulting feature vectors, the more traditional machine learning models lack the explicit ability to learn the temporal aspect of the satellite data that DRNNs have. Minh et al. [25] show that this ability yields a consistent improvement over RF and SVMs for winter vegetation classification.

To be able to train the models, 13 Sentinel-1 (S1) radar images were acquired. From these 13 images, vertical transmit horizontal receive (VH) and vertical transmit vertical receive (VV) images were calculated, giving a total of 26 images. The ground truth data was gathered by surveying 194 land plots.

The images were then preprocessed by first selecting a master image and then coregistering all images while taking the Terrain Observation with Progressive Scan (TOPS) mode of the master image into account. Five backscatter images were generated and calibrated, and the time series SAR S1 dataset was improved by using the temporal filtering introduced by [26], so that the noise was reduced as much as possible while retaining the structural information. Finally, the dataset was orthorectified by creating a simulated SAR image and using it to coregister the two image sets.

The DRNN was constructed by stacking five units together. Through the use of multiple units, the model is able to extract high-level non-linear temporal dependencies. Finally, a softmax layer was stacked on top of the last recurrent layer to perform the final multi-class prediction. In total, two DRNNs were constructed using this high-level architecture: one built using Long Short Term Memory (LSTM) and another built using Gated Recurrent Units (GRU). For validation, all models were trained using five-fold cross validation. The results from the experiments can be seen in table 2.3.

Classifier   F-measure   Accuracy   Kappa
RF           91.77%      91.79%     0.897
SVM          91.22%      91.79%     0.890
LSTM         98.83%      98.83%     0.985
GRU          99.05%      99.05%     0.988

Table 2.3: Observed performance for winter vegetation quality mapping

3D convolutional Neural Networks for Crop Classification with Multi- Temporal Remote Sensing Images

In the work presented here by Ji et al. [27], 3D convolutions are compared to 2D convolutions, Support Vector Machines (SVMs), k-Nearest-Neighbors (kNN), and Principal Component Analysis (PCA) + kNN.

2D CNNs lack the ability to extract information from the temporal dimension, since 2D convolutions collapse the temporal dimension through averaging. To overcome this, Kussul et al. [28] introduced two different CNNs: one 2D CNN for spatial feature learning and a 1D CNN for spectral feature learning. However, this strategy requires additional empirical processing to combine the learned features. Conversely, 3D CNNs are able to naturally process spatio-temporal data without the need for additional empirical processing.

3D CNNs are, at least in theory, able to extract both spatial and temporal features in a more elegant manner than 2D CNNs. To demonstrate this, Ji et al. [27] develop a 3D CNN framework to extract information from multi-temporal satellite images and compare it to a 2D CNN, SVMs, kNN, and PCA + kNN. Due to the limited availability of well-labeled samples, they use a semi-automatic, semi-supervised learning strategy. At the time of writing their paper, to the best of their knowledge, 3D CNNs had not been applied to crop classification of multi-temporal multi-spectral satellite images before.

Inspired by the neural network structure developed by Oxford's Visual Geometry Group (VGGNet), Ji et al. [27] use the same high-level structure to build their 3D CNN. The first three layers consist of convolution-pooling concatenations, with 32, 32, and 64 neurons. In order to best preserve the temporal information, the temporal dimension remains uncompressed until the first fully connected layer, through the use of a 1 × 2 × 2 average pooling strategy. In addition to the 3D CNN, a 2D CNN was also trained with the same structure as that of the 3D CNN.

For testing, two GF2 (Gaofen 2) multi-temporal images were used, acquired in June, July, August, and September of 2015, and in May, June, July, September, and October of 2016, respectively, as well as one GF1 (Gaofen 1) multi-temporal image from April, May, June, July, and September of 2014. The GF2 images had sizes of 1417 × 2652 and 1111 × 293 pixels with a spatial resolution of 4 m, and the GF1 image was 5400 × 6500 pixels with a 15 m spatial resolution. Both the GF1 and GF2 images were pre-processed using quick atmospheric correction (QUAC) and geometric rectification. Handcrafted shape files of the GF2 images were used as a reference for pixelwise classification. 400 training samples and 2 000 testing samples were selected from the 2015 shape file, and 50 training samples and 1 500 testing samples were selected from the 2016 shape file.

In testing, the input size, convolutional kernel size, number of layers, and pooling strategy were each tuned by averaging the results of 10 test runs per setting. From this experimental setup, a kernel size of 3 × 3 × 3 was found to be best, a result consistent with the results found in Ji et al. [29]. A patch size of 8 × 8 was found to be best for both the GF1 and GF2 data; larger patch sizes led to mixed-pixel problems. Three convolutional layers performed slightly better than two- and four-layer networks, and deeper networks were not considered due to the image space having been reduced to one after three 2× poolings. Average pooling performed slightly better than max pooling. The 2D CNN was tuned in the same manner as the 3D CNN to ensure comparability. Finally, all models were tested on the test set, and the 3D CNN was found to be the best performing model.

2.2.2 Biomedical Image segmentation

The growth of a bark beetle infestation behaves in a manner similar to that of a cancer, so much of the inspiration for the construction of the models in this thesis can be gathered from biomedical image segmentation.

Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Here, Ciresan et al. [30] present a method for the automatic segmentation of neuronal structures in stacks of electron microscopy images. The 3D structure is represented by a 512 × 512 × 30 image stack, and the ground truth segmentation is represented by another 512 × 512 × 30 image stack, where neuron membranes have value 1 and all other pixels have value 0.

To achieve this, the raw intensity values of a window of width w centered on the pixel p are given to the Deep Convolutional Neural Network as input. The classifier then decides whether the current pixel is membrane or non-membrane. If the window overlaps with the image border, the missing pixels are synthesized by mirroring pixels from the current window. The network architecture for the network with w = 95 can be seen in table 2.4.

To train the classifier, all available slices in the image stack are used. For each slice, all membrane pixels are used as positive examples, and the same number of non-membrane pixels is selected at random from the slice. On average, there are 50 000 membrane pixels per image slice, so in total there are approximately three million training examples, evenly balanced between the two classes. Furthermore, since the neuronal structures are not affected by their orientation, the data is augmented at the beginning of each epoch by randomly mirroring and/or rotating the images by ±90°. The input data was also further manipulated using foveation and non-uniform sampling.

Foveation is a technique where the center of the image remains undistorted, but as the pixels move away from the image center, a blur is increasingly applied.

Non-uniform sampling is performed by letting center pixels map directly to the network neurons, but as the distance to the center increases, the source pixels are sampled with decreasing frequencies. Combining foveation and non-uniform subsampling enables one to exploit the data at multiple resolutions.

Layer   Type              Maps and neurons             Kernel size
0       Input             1 map of 95 × 95 neurons
1       Convolutional     48 maps of 92 × 92 neurons   4 × 4
2       Max pooling       48 maps of 46 × 46 neurons   2 × 2
3       Convolutional     48 maps of 42 × 42 neurons   5 × 5
4       Max pooling       48 maps of 21 × 21 neurons   2 × 2
5       Convolutional     48 maps of 18 × 18 neurons   4 × 4
6       Max pooling       48 maps of 9 × 9 neurons     2 × 2
7       Convolutional     48 maps of 6 × 6 neurons     4 × 4
8       Max pooling       48 maps of 3 × 3 neurons     2 × 2
9       Fully connected   200 neurons                  1 × 1
10      Fully connected   2 neurons                    1 × 1

Table 2.4: The network architecture for w = 95 in [30]

Because the network is trained on a dataset with equal representation of each class, the network outputs cannot be directly interpreted as probabilities. Since the base rate of membranes is much smaller in reality, the model tends to severely overestimate the probability of a membrane. To overcome this issue, a polynomial post-processing function is applied to the network outputs. After performing this calibration, the outputs from the networks are spatially smoothed using a 2-pixel-radius median filter.

Ciresan et al. [30] observed that there was variability in the outputs produced by different networks despite being trained on the same dataset. This suggests that the classifiers have large variance and low bias. Therefore, the outputs of the models were averaged, giving a sort of ensemble-based classification.

The final segmentation is evaluated on three metrics:

1. Rand error: defined as 1 − F_rand, where F_rand represents the F1 score of the Rand index, which measures the accuracy at which pixels are associated with their respective neurons.

2. Warping error: a segmentation measure that is designed to account for topological disagreements.

3. Pixel error: 1 − F_pixel, where F_pixel is the F1 pixel similarity.

In total, four networks were trained: three with w = 65 and one with w = 95. All networks except for one w = 65 network were trained using foveation and non-uniform sampling. The results can be seen in table 2.5.

Rand error   Warping error   Pixel error
48 · 10⁻³    434 · 10⁻⁶      60 · 10⁻³

Table 2.5: Results from [30]


3 Methods

In this chapter, the methods used for implementation and testing are explained. First, in section 3.1, the chosen software packages are explained and justified, and the hardware used for training the models is specified. Then, in section 3.2, the steps taken to preprocess the data for training are explained in detail. Next, in section 3.3, the method used for selecting the recurrent and 3D convolutional models is explained. Finally, in section 3.4, the chapter concludes with an explanation of the experimental setup.

3.1 Software packages and hardware

To enable reproducibility, this section specifies the software packages and hardware that were used to complete the degree project.

3.1.1 Software packages

Arcgis

ArcGIS is a geographic information system (GIS) used for collecting and managing geographic data, creating professional maps, and performing spatial analysis [31]. ArcGIS was primarily used for visualizing results and converting GeoJSON-formatted data [32] to shapefiles [33].

Arcpy

Arcpy is a Python package for performing geographic data analysis, data conversion, data management, and map automation [34]. Arcpy was used for reading the harvester data and for clipping out regions of interest from the national land cover data, height data, and ground level humidity data.


GDAL

GDAL is a translator library for geospatial raster and vector data [35]. GDAL was used for reading and writing geospatial raster data and creating the target rasters.

NumPy

NumPy is a Python package for handling N-dimensional arrays [36]. NumPy was used for performing elementary computations and for augmenting the training data in an online fashion.

Matplotlib

Matplotlib is a Python package for creating visualizations [37]. Matplotlib was the primary tool used for creating the visualizations seen in this thesis.

SciPy

The SciPy library is an open-source Python package for scientific computing [38]. SciPy was used to solve a wide variety of problems, ranging from polygonizing clusters to histogram matching and interpolation.

Scikit-image

Scikit-image is a library that contains a myriad of algorithms for image processing [39]. Scikit-image was used for performing image translations with sub-pixel precision.

Xarray

Xarray is an open-source Python package for labeled multidimensional arrays. Xarray introduces labels in the form of dimensions, coordinates, and attributes to produce a pandas-like experience for development with multi-dimensional data [40]. The Xarray package was used for handling geospatial raster data and for the imputation of missing satellite data.

Pandas

Pandas is an open-source library for data analysis and manipulation [41]. Pandas was used for handling the harvester data's metadata, such as timestamps, product use, etc.

Geopandas

Geopandas is a pandas-like library for working with geospatial data in Python [42]. Geopandas was used for working with cluster metadata, such as determining when the time series of satellite data should stop, based on the average timestamp of a cluster.

PyProj

PyProj is a Python package for performing cartographic projection transformations [43]. PyProj was used for converting geospatial data between coordinate systems.

Keras

Keras is a high-level neural network API written in Python [44]. Keras was used to train all models.

3.1.2 Hardware

Swedish Space Data Lab

The Swedish Space Data Lab is a cooperation between the Swedish National Space Agency, RISE, Luleå University of Technology and AI Innovation of Sweden. The project aims to increase the use of data from space to further develop society and industry for the benefit of the planet. The Swedish Space Data Lab will achieve this goal by creating a national infrastructure for the exploitation of space data that will in turn foster a national knowledge and data hub for working with earth observation data with AI-based analysis [45].

The Swedish Space Data Lab is accessed through a Jupyter Lab login, and users have access to an Intel(R) Xeon(R) E5-2620 v3 @ 2.40 GHz processor, a GeForce GTX 1080 Ti GPU, and 256 GB of RAM [45].

3.2 Data preprocessing

Figure 3.1: Overview of data preprocessing. Sentinel-2 data is processed (NDVI, alignment, histogram equalization, imputation of missing data) and cropped to regions of interest with infestations, producing an (N, T, 1, Y, X) input tensor; the land cover map, ground level humidity, and height map are cropped into an (N, 3, Y, X) tensor; and the harvester data is clustered, rasterized, and thresholded into the target segmentations.
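The first Sentinel-2 processing step in figure 3.1 is an NDVI computation. As a sketch, NDVI is the standard normalized difference of the NIR and red bands (bands 8 and 4 in table 2.1); the epsilon guard is an implementation choice for this example, not taken from the thesis.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index from Sentinel-2 band 8 (NIR)
    and band 4 (red); eps guards against division by zero."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + eps)
```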
