
Detecting Spruce Bark Beetle Infestations with Satellite Imagery

PER EMIL HAMMARLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Detecting Spruce Bark Beetle Infestations with Satellite Imagery

PER EMIL HAMMARLUND

Master's Programme, Machine Learning

Date: June 30, 2020

Supervisor: Mårten Björkman / Fredrik Linde / Simon Johansson / Magnus Jutterström

Examiner: Danica Kragic Jensfelt

School of Electrical Engineering and Computer Science

Host company: Sogeti Sverige AB

Swedish title: Detektering av Barkborre Angripningar med satellit Bilder


Abstract

Sveaskog is Sweden's largest forest owner, owning 14 percent of the Swedish forest lands. Recently, due to warmer and drier summers as a consequence of climate change, spruce bark beetles have caused damage at a massive scale.

In 2019, Sogeti developed a promising first product for monitoring the vitality of large areas through the use of Sentinel-2 data, comparing images from the same month between two years. To take the detection of bark beetle infestations to the next stage of development, supervised learning was used. Models were trained on a time series of Sentinel-2 data in conjunction with national land cover data, ground level humidity data, and height data to predict segmentations that represented infestations. Target segmentations were created by clustering GPS points and thresholding a rasterized representation of the generated clusters. In total, four different model architectures were tested (LSTM, GRU, 3D convolutional, and logistic regression) and then evaluated both quantitatively and qualitatively on a test set. It was found that the GRU-based model was best able to identify bark beetle infestations.


Sammanfattning

Sveaskog is Sweden's largest forest owner, with a total land area of 14 percent of the Swedish forest land. Recently, due to warmer and drier summers as a consequence of climate change, the spruce bark beetle has caused damage to the forest on an enormous scale. In 2019, Sogeti developed a promising first result for monitoring the forest at a large scale by comparing the vitality of the forest in Sentinel-2 data from the same month between two years. The results from this first product were promising. To develop the product further, supervised learning was used. Models were trained on a time series of Sentinel-2 data in combination with national land cover data, ground level humidity data, and height data to predict segmentations representing forest that has been infested by the spruce bark beetle. The target segmentations that the models were trained against were created by clustering harvester data and filtering the rasterized representation of the generated clusters. In total, four different models were trained (LSTM, GRU, 3D convolutional, and logistic regression), which were then evaluated on a final test set both quantitatively and qualitatively. From this, the conclusion was drawn that the GRU-based model was best at identifying forest that has been infested by the spruce bark beetle.


Contents

1 Introduction
  1.1 Background
  1.2 Aim and Objective
  1.3 Delimitations
  1.4 Methodology
  1.5 Research Question and Hypothesis
  1.6 Benefits, Ethics and Sustainability
  1.7 Definitions

2 Theory
  2.1 Background
    2.1.1 Spruce Bark Beetles
    2.1.2 Data sources
    2.1.3 Feed-forward neural networks
    2.1.4 Overfitting and how it can be overcome
    2.1.5 Convolutional Neural Networks
    2.1.6 Recurrent Neural Networks
  2.2 Related Work
    2.2.1 Remote Sensing
    2.2.2 Biomedical Image segmentation

3 Methods
  3.1 Software packages and hardware
    3.1.1 Software packages
    3.1.2 Hardware
  3.2 Data preprocessing
    3.2.1 Sentinel 2 data
    3.2.2 Land cover, humidity, and height data
    3.2.3 Harvester data
    3.2.4 The trial and error of creating a trainable dataset
  3.3 Model selection
    3.3.1 Recurrent model selection
    3.3.2 3D convolutional model selection
  3.4 Experimental setup
    3.4.1 Dataset
    3.4.2 Training considerations
    3.4.3 Early stopping and model saving
    3.4.4 Calibration of model outputs

4 Results
  4.1 Quantitative results
    4.1.1 Sensitivity, Specificity, Balanced Accuracy, Precision, F1, and Support
    4.1.2 Confusion Matrices
  4.2 Qualitative results
    4.2.1 Training-sized regions
    4.2.2 Full 5 × 5 km region

5 Discussion
  5.1 Preparation of dataset
  5.2 Model performance evaluation
  5.3 Conclusion
  5.4 Future work

Bibliography

A Results from earlier iterations of dataset
  A.1 First iteration result
  A.2 Second iteration results

B Grid search results

1 Introduction

Chapter 1 lays out the introduction of the degree project. First, the background information necessary for the reader to understand the problem at hand is presented. Then the aim, delimitations, and chosen methodology are presented. Finally, the research question as well as the hypothesis are presented, concluding the introductory chapter.

1.1 Background

Bark beetles pose a threat to the vitality of the forest and can create damage on a massive scale. According to a report from the Swedish University of Agricultural Sciences written by Åke Lindelöw, bark beetles killed approximately 2–3 million m³ of trees in Götaland in 2018, the largest destruction of forest in a single year [1].

In order to maintain a healthy forest, it is crucial that infestations can be identified and remedied before they spread, causing more damage.

Bark beetle infestations are difficult to find due to the impracticality of current conventional methods:

• Performing searches at ground level is time-consuming

• Using aerial photos is costly when the interest is in how the forest changes over time

• Drones are not practical for monitoring entire forest regions


• Solely using the RGB color space is not sufficient since much of the information that indicates where infestations could be located is found in the infrared spectrum.

To address this problem, satellite data is well suited for training a model using machine learning, due to its rapid availability, relatively high and consistent temporal frequency, and coverage of large areas. It will not be sufficient on its own, though, since changes in vitality that can be seen in satellite data could have several causes, including harvesting, forest fires, etc. Therefore, it is necessary to combine the available satellite data with other data sources such as ground level humidity, frost and ice formation, terrain roughness, and metadata such as location and time. When combined, there will be sufficient information to train a model using machine learning for the detection of infestations.

Over the course of 2018–2019, harvesters have collected data from trees that have shown signs of being infested. When this annotation was performed, the GPS position, date and time, volume of the log with and without bark, and log id were registered. More than 100 000 trees were annotated in this manner. This set of annotated trees will be the ground truth data for creating the training and test datasets.

The main challenge of this degree project will be the creation of a dataset that makes it possible to train models using machine learning to identify spruce bark beetle infestations.

1.2 Aim and Objective

The aim of this degree project is to be able to identify spruce bark beetle infestations. The objective is to train models using logistic regression, recurrent neural networks, and 3D convolutional neural networks, and compare how they perform both qualitatively and quantitatively.

1.3 Delimitations

Due to the constraint of only being able to train using the available 100 000 annotated trees as ground truth for where infestations are located, the problem will not be addressed on a general level and is limited to regions containing trainable data.


1.4 Methodology

Work will begin with a literature study and familiarization with the data and tools that will be used in the degree project. In order to perform a controlled experiment, all methods will be compared on the same test set. The empirical data gathered from these tests will be the basis for the evaluation of the chosen models.

1.5 Research Question and Hypothesis

Research Question:

Of logistic regression, recurrent neural networks in the form of both LSTMs and GRUs, and 3D convolutional neural networks, which model is best able to identify spruce bark beetle infestations, both quantitatively and qualitatively?

Hypothesis:

It is possible to identify where bark beetle infestations are located using machine learning.

1.6 Benefits, Ethics and Sustainability

As explained earlier in section 1.1, spruce bark beetles damage forest at a massive scale, and conventional methods that rely on drones, physically inspecting forest, or flying over selected regions are costly, making the monitoring of large regions impractical. Using satellite data in conjunction with data on background conditions makes the monitoring of entire forests feasible, and could be key in identifying where forest should be cleared to mitigate the damage caused by spruce bark beetles.

The monitoring of forest vitality is a fairly innocent endeavour, but there is an interesting ethical question with respect to sustainability. Are we humans justified in trying to reduce the population of spruce bark beetles? If warm and dry summers are leading to greater bark beetle populations, is this not simply a consequence of natural selection? Personally, I would say that the reduction of spruce bark beetles is justified, since other species rely on trees for their survival, and if we let the spruce bark beetle population grow, there might be larger ecological consequences. Moreover, the hotter and drier summers are a consequence of climate change, so in trying to mitigate bark beetle infestations we are mitigating a consequence of our own ecological impact.

1.7 Definitions

Definition 1 (Sensitivity)

The proportion of true positives that are correctly identified, where TP is the number of True Positives and FN is the number of False Negatives [2]:

$$\text{sensitivity} = \frac{TP}{TP + FN}.$$

Definition 2 (Specificity)

The proportion of true negatives that are correctly identified, where TN is the number of True Negatives and FP is the number of False Positives [2]:

$$\text{specificity} = \frac{TN}{TN + FP}.$$

Definition 3 (Balanced accuracy)

The average of sensitivity and specificity [3]:

$$\text{balanced accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2},$$

which gives a balanced measure of accuracy when the number of positives and the number of negatives in the dataset is unbalanced.

Definition 4 (Precision)

The number of true positives divided by the sum of true positives and false positives, where TP is the number of True Positives and FP is the number of False Positives [4]:

$$\text{precision} = \frac{TP}{TP + FP}.$$

Definition 5 (F1)

The harmonic mean of precision and sensitivity [4]:

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}.$$
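As a concrete illustration of definitions 1–5, the following NumPy sketch computes all five metrics from binary label arrays. It is an illustrative example rather than code from the thesis; the function name and input format are chosen for this example, and both classes are assumed to occur so that no denominator is zero.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute the metrics of definitions 1-5 for binary labels (0 or 1)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.sum(y_pred & y_true)    # true positives
    tn = np.sum(~y_pred & ~y_true)  # true negatives
    fp = np.sum(y_pred & ~y_true)   # false positives
    fn = np.sum(~y_pred & y_true)   # false negatives

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    return {"sensitivity": sensitivity, "specificity": specificity,
            "balanced accuracy": balanced_accuracy,
            "precision": precision, "F1": f1}
```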


2 Theory

2.1 Background

This chapter begins by presenting relevant background information that describes the problem and how it is addressed. It then continues with a quick overview of the theoretical background behind feed-forward, convolutional, and recurrent neural networks. Finally, related work is presented from both remote sensing and biomedical image segmentation.

2.1.1 Spruce Bark Beetles

Spruce bark beetles are members of the bark beetle family. They are approximately four millimeters in length and reproduce most easily in stressed or windthrown trees. After the extremely dry summer of 2018, large numbers of trees became stressed, leading to a decrease in resilience and beneficial conditions for spruce bark beetles. Bark beetles communicate via scents. The reproductive process starts with a male bark beetle detecting a stressed tree. The male bark beetle starts to drill itself into the tree, and the tree tries to defend itself by drowning the bark beetle in sap. The bark beetle then uses the tree sap to produce a pheromone that it releases, signaling other bark beetles to come to the tree. Bark beetles that are passing by detect the scent and start to gather around the tree, and the male bark beetles start to drill into the tree. The tree continues to try to drown the attacking bark beetles, but eventually it is not able to produce enough sap to defend itself. Once the bark beetles have overwhelmed the tree, the males release a pheromone signaling the female bark beetles to come closer, and the females enter the mating chambers that the males have prepared.


Figure 2.1: A spruce bark beetle, image collected from [5]

Each female bark beetle then digs a straight, approximately 10 cm long tunnel and lays her eggs in the sides of the tunnel. The larvae eat outwards from the tunnel and pupate after 5–6 weeks. The new bark beetles hatch a week later and stay feeding off the tree for a few more weeks, until their exoskeletons have gained enough strength for them to leave the tree. The whole process takes approximately 8–10 weeks from the initial swarming [6].

2.1.2 Data sources

Harvester data

In order to train a model using supervised learning, some form of ground truth label is necessary. Here the ground truth comes in the form of spatiotemporal points (i.e. points in both time and space) created by forest harvesters annotating trees that have shown signs of having been infested by bark beetles. Each registered point corresponds to a tree, and each point contains a GPS position, a date, the quantity of the tree that could be used for timber, and the quantity of the tree that could be used for the production of pulp (material for the production of paper). The GPS position was used to annotate where infestations are located, the date was used to annotate when the infestation happened, and the quantities of pulp and timber wood were used to determine the severity of the infestation. The GPS position is registered from the center of the harvester, so the actual position of the tree can be anywhere within a radius that corresponds to the maximum length of the harvester's end effector. A sample of the harvester data can be seen in figure 2.2.


Figure 2.2: Visualization of a sample of the harvester data

Sentinel 2 data

Sentinel-2 data is produced by the European Space Agency's Copernicus Sentinel-2 mission. It is comprised of a constellation of two satellites phased 180° from each other [7].

To observe how the forest's vitality changes over time, it is necessary to have a data source with a relatively high temporal frequency (i.e. short time between images over the same area) that is able to cover large areas. Satellite data from Sentinel-2 is particularly useful in this respect, with a temporal resolution of 5 days, a swath of 290 km, and a spatial resolution of 10 m (for the bands used in this thesis) [8]. In total there are 13 spectral bands in the multispectral images produced by Sentinel-2; the full list of spectral bands can be seen in table 2.1.

National land cover data

National land cover data ("Nationella Marktäckedata" in Swedish) is a full land cover mapping of Sweden. The current version was created from a mapping performed over the course of 2017 to 2019. It is comprised of 25 different thematic classes at three different hierarchical levels [9]. National land cover data is useful since it allows regions that are not forest to be disregarded, and it can provide useful contextual information for the classifier producing the segmentation. A full visualization of the land cover classes can be seen in figure 2.3, and a visualization of a sample of the national land cover data can be seen in figure 2.4a.


Band number   Band name                             Spatial resolution (m)
Band 1        Coastal aerosol                       60
Band 2        Blue                                  10
Band 3        Green                                 10
Band 4        Red                                   10
Band 5        Vegetation red edge                   20
Band 6        Vegetation red edge                   20
Band 7        Vegetation red edge                   20
Band 8        Near Infrared (NIR)                   10
Band 8a       Narrow NIR                            20
Band 9        Water vapour                          60
Band 10       Short Wave Infrared (SWIR) – Cirrus   60
Band 11       SWIR                                  20
Band 12       SWIR                                  20

Table 2.1: The available bands in Sentinel-2 data

Height data

To further complement the national land cover data, object height data was also used. Height data indicates the age of the trees in a given region, which gives a measure of the biomass that bark beetles can feed off. There are four different object height data products available from the Swedish Environmental Protection Agency:

1. Object heights between 0.5 and 5 meters.

2. Object heights between 5 and 45 meters.

3. Density of objects with heights between 0.5 and 5 meters.

4. Density of objects with heights between 5 and 45 meters.

Object heights between 0.5 and 5 meters indicate the occurrence of low objects such as bushes, while object heights above 5 meters indicate tall objects such as forest. For this project, the second product was used. The full class labeling of the selected height data can be seen in table 2.2, and a visualization of a sample of the height data can be seen in figure 2.4b [10, 11].


National land cover data classes:

1. Forest
   1.1 Forest not on wetlands (1.1.1 Pine, 1.1.2 Spruce, 1.1.3 Mixed coniferous, 1.1.4 Deciduous coniferous, 1.1.5 Common deciduous, 1.1.6 Broadleaf, 1.1.7 Common deciduous and broadleaf, 1.1.8 Temporarily not forest)
   1.2 Forest on wetlands (1.2.1 Pine, 1.2.2 Spruce, 1.2.3 Mixed coniferous, 1.2.4 Deciduous coniferous, 1.2.5 Common deciduous, 1.2.6 Broadleaf, 1.2.7 Common deciduous and broadleaf, 1.2.8 Temporarily not forest)
2. Open wetlands
3. Arable land
4. Other open land (4.1 Without vegetation, 4.2 With vegetation)
5. Exploited land (5.1 Building, 5.2 Other, 5.3 Road)
6. Water (6.1 Lakes and watercourses, 6.2 Sea)

Figure 2.3: Visualization of land cover classes from the national land cover data [9]

Value        10     15      20      25      30      45     255
Height (m)   5–10   10–15   15–20   20–25   25–30   > 30   nodata

Table 2.2: Class encoding of object height data 5–45 meters [10, 11]

Ground level humidity data

Finally, ground level humidity data was also included. Ground level humidity data is useful since soil moisture effects the type of vegetation that thrives in a given area [12]. The ground level humidity data used in this thesis is provided by the Swedish environmental protection agency and comes in the form of a soil wetness index. The soil wetness index is scale encoded to values from 0-240 where 240 is the most wet and 0 is the driest [13]. A visualization of the ground level humidity index can be seen in figure 2.4c.


Figure 2.4: Visualization of a sample of (a) the national land cover data, (b) the height data, and (c) the ground level humidity data

2.1.3 Feed-forward neural networks

This section starts with an explanation of one of the simplest forms of neural network; it then continues with k-layered neural networks and concludes with an overview of how to overcome the issues that arise when using deeper neural networks.

Multi-class Logistic Regression

Logistic regression is the simplest form of neural network: given an n-dimensional input vector x, it computes a prediction y using a single layer. The forward pass is performed by computing the linear scores

$$s = Wx + b,$$

applying the softmax

$$y_i = \frac{e^{s_i}}{\sum_{j} e^{s_j}},$$

and taking the class with the highest predicted probability, $\tilde{C} = \arg\max_i y_i$. Figure 2.5 illustrates this computation for a 4-class problem.

The predictions produced by the model are then compared to the ground truth ȳ, where ȳ is a one-hot encoded vector:

$$\bar{y}_i = \begin{cases} 1 & \text{if } i \text{ is the ground truth class} \\ 0 & \text{if } i \text{ is not the ground truth class} \end{cases} \qquad (2.1)$$

Now, let D denote a set of labeled training data with N training examples, $D = \{x_i, \bar{y}_i\}_{i=0}^{N-1}$, and let l be the loss function (cross-entropy) between the two probability density vectors ȳ and y, $l = -\bar{y}^T \log(y)$. To find the parameter settings W and b that maximize the accuracy at which the network predicts the correct class label $\tilde{C}$, the objective is to optimize:


Figure 2.5: Logistic regression for a 4-class problem. The variables x0–x6 are the scalars that make up the observation x, s0–s3 are the linear scores for classes 0–3, y0–y3 are the final predicted probabilities for classes 0–3, and C̃ is the class with the highest predicted probability.

$$\arg\min_{W,b} \left\{ \frac{1}{|D|} \sum_{i=0}^{N-1} l(\bar{y}_i, y_i) \right\} = \arg\min_{W,b} \left\{ \frac{1}{|D|} \sum_{i=0}^{N-1} -\bar{y}_i^T \log(y_i) \right\} \qquad (2.2)$$

$$= \arg\min_{W,b} \left\{ \frac{1}{|D|} \sum_{i=0}^{N-1} -\bar{y}_i^T \log\left( \frac{e^{W x_i + b}}{\sum_{j} \left(e^{W x_i + b}\right)_j} \right) \right\}$$

In practice, the optimization is performed in mini-batches. Mini-batches are small subsets of the dataset D; training is performed by approximating the gradients $g_b = \frac{\partial l}{\partial b}$ and $g_W = \frac{\partial l}{\partial W}$ in each training step using one mini-batch at a time. The approximated gradients are then used to update the parameters of the network:

$$b \leftarrow b - \alpha \frac{\partial l}{\partial b}, \qquad W \leftarrow W - \alpha \frac{\partial l}{\partial W}, \qquad (2.3)$$

where α is the learning rate, the size of each step that the model takes when optimizing towards the minimum of the loss function l [14].
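To make the forward pass and the mini-batch update concrete, the following is a minimal NumPy sketch of equations (2.2)–(2.3). It is not the thesis implementation; the shapes, learning rate, and names are illustrative, and it uses the standard softmax/cross-entropy gradient ∂l/∂s = y − ȳ.

```python
import numpy as np

def softmax(s):
    # Subtract the row max for numerical stability; softmax is shift-invariant.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def minibatch_step(W, b, X, Y_onehot, alpha=0.1):
    """One gradient step of multi-class logistic regression on a
    mini-batch X (m x n) with one-hot targets Y_onehot (m x k)."""
    m = X.shape[0]
    Y_hat = softmax(X @ W.T + b)    # forward pass: s = Wx + b, y = softmax(s)
    G = (Y_hat - Y_onehot) / m      # d(cross-entropy)/ds, averaged over the batch
    W -= alpha * G.T @ X            # W <- W - alpha * dl/dW
    b -= alpha * G.sum(axis=0)      # b <- b - alpha * dl/db
    loss = -np.mean(np.sum(Y_onehot * np.log(Y_hat + 1e-12), axis=1))
    return W, b, loss
```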


Going deeper

Figure 2.6: Simple 4-layered neural network for 2-class classification with ReLU

While a single-layered neural network is able to perform well in a variety of applications, it is often beneficial to go deeper. K-layered neural networks extract features using multiple layers by letting the outputs of the previous layer be the inputs of the next layer. Each layer is computed in the same manner as in the single-layered model: the linear scoring function $s_i = W_i x_i + b_i$ is computed on the layer's input $x_i$, the activation function is applied (three typical activation functions can be seen in figure 2.7), and the output of the i'th layer is then used as the input for the (i+1)'th layer.

Figure 2.7: Typical activation functions (Sigmoid, tanh, ReLU)

Two main issues are created when going deeper: 1. Since the gradients for the parameters of the earlier layers are computed by the chain rule, as the number of layers grows, so does the number of multiplications back to the earlier layers, leading to vanishing and exploding gradients. 2. The input to each layer changes as the parameters of the previous layers are adjusted, slowing down training and making the parameter initialization sensitive. To address this, Ioffe and Szegedy [15] introduced batch normalization. Batch normalization works by computing the mean μ_B and variance σ²_B of a mini-batch; the outputs of the linear scoring function s_i are then normalized, and scaled and shifted to maintain the expressivity of the current layer. This process can be seen in algorithm 1.

Algorithm 1 Batch normalization applied to a mini-batch

1:  ▷ s is the output of the linear scoring function in the current layer
2:  ▷ γ and β are learned parameters for maintaining the expressivity of the current layer
3:  function batch_norm(s, γ, β)
4:      ▷ Compute the mini-batch mean and variance
5:      μ_B ← (1/m) Σ_{i=0}^{m−1} s_i
6:      σ²_B ← (1/m) Σ_{i=0}^{m−1} (s_i − μ_B)²
7:      ▷ Normalize (ε is a small constant for avoiding division by zero)
8:      ŝ_i ← (s_i − μ_B) / √(σ²_B + ε)
9:      ▷ Scale and shift
10:     y_i ← γ ŝ_i + β
11:     return y
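A direct NumPy translation of algorithm 1 might look as follows. This sketch covers the training-time computation only; a complete implementation would also track running averages of μ_B and σ²_B for use at inference time.

```python
import numpy as np

def batch_norm(s, gamma, beta, eps=1e-5):
    """Algorithm 1 applied to a mini-batch s of shape (m, features)."""
    mu = s.mean(axis=0)                    # mini-batch mean
    var = s.var(axis=0)                    # mini-batch variance
    s_hat = (s - mu) / np.sqrt(var + eps)  # normalize
    return gamma * s_hat + beta            # scale and shift
```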

To further address the issue of vanishing and exploding gradients, it is also important to have a proper initialization of the model weights, enabling the propagation of information through the network. If the weights are too small, the signal will grow smaller for each layer and eventually vanish; if the weights are too large, the signal will grow ever larger as it is propagated through the network, leading to exploding values. To address this, Glorot and Bengio [16] proposed what is known as Xavier (or Glorot) initialization, and He et al. [17] proposed He initialization. Xavier initialization initializes the weights of a given layer from a distribution with mean 0 and variance 1/N, where N is the number of input neurons to the current layer. He initialization is similar to Xavier initialization in that it also adapts the variance of the distribution that model weights are sampled from, with the difference that the variance is larger by a factor of 2 (i.e. 2/N). Both Xavier and He initialization use a similar theoretical basis for deriving an appropriate variance for weight initialization, reasoning that the signal should have the same variance before and after passing through each layer of the network.
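As a sketch, the two schemes differ only in the variance of the sampling distribution. Gaussian sampling is assumed here; uniform variants with the same variance are also common.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Weights drawn with variance 1/N, N = number of input neurons.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def he_init(n_in, n_out):
    # As Xavier, but with variance 2/N, suited to ReLU activations.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```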

2.1.4 Overfitting and how it can be overcome

Overfitting occurs when the model becomes so closely fitted to the training data that it no longer generalizes well to unseen data. Figure 2.8 illustrates overfitting. In the following subsections, three approaches for reducing the amount of overfitting are explained.

Figure 2.8: An illustration of overfitting. The black line is the desired decision boundary, the green line is the current decision boundary as a consequence of overfitting, and the red and blue dots represent datapoints from each class. Image collected from Chabacano [18]

L2 Regularization and Dropout

On a high level, L2 regularization is a method that controls the flexibility of a model by penalizing large model weights. By making the cost of using large weights high, the model will stay within a range of configurations that is less prone to overfitting [19]. Regularization can take other forms as well, such as randomly dropping nodes from the network during training, a technique known as dropout. Dropout is a regularization technique specifically made for neural networks. When using dropout, nodes are randomly and temporarily removed, along with their ingoing and outgoing connections, from the network during training. This prevents nodes from co-adapting too much, leading to better generalization. Dropout was found by Srivastava et al. [20] to significantly reduce overfitting and give major improvements over other regularization techniques. A visualization of dropout can be seen in figure 2.9.

Figure 2.9: A visualization of dropout. Image collected from Srivastava et al. [20]
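Since Keras is used for training in this thesis (see section 3.1.1), the following hypothetical snippet shows where an L2 penalty and dropout attach in a Keras model. The layer sizes and rates are illustrative only, not those of the thesis models.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical model: 26 input features, two output classes.
model = keras.Sequential([
    keras.Input(shape=(26,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on large weights
    layers.Dropout(0.5),  # randomly drop half of the activations during training
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```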

Data augmentation

When the amount of training data is limited, the model will not be able to learn invariances, leading to worse generalization. In an ideal world, it would be best to collect more data and create a training dataset that better represents each class, but this is often not an option. Instead, one can "simulate" more training data than what is actually available by creating observations that are variations of the original training data, while not destroying the features that the model is supposed to learn. This can include operations such as mirroring and applying affine transformations to images, reversing sequences, and adding white noise to observations.
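A minimal sketch of two such operations (mirroring and white noise) applied on the fly with NumPy; the noise scale is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Randomly mirror an image and add white noise: one way of simulating
    more training data without destroying the features to be learned."""
    if rng.random() < 0.5:
        image = np.flip(image, axis=-1)                      # horizontal mirror
    image = image + rng.normal(0.0, 0.01, image.shape)       # additive white noise
    return image
```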

2.1.5 Convolutional Neural Networks

Ian Goodfellow describes convolutional neural networks as "a form of neural network made for processing data with grid-like topology" [14]. They achieve this ability through the use of two layer types: convolutional layers and pooling layers. In the following subsections, these layers are explained.

Convolutional layers

Figure 2.10: A convolutional operation in the two-dimensional discrete case with zero padding (worked example: 0·1 + 0·0 + 0·(−1) + 0·2 + 13·0 + 95·(−2) + 0·1 + 93·0 + 95·(−1) = −285)

Convolutional layers make use of a mathematical operation known as a convolution. When performing a convolution, a filter f(a) slides over all values of another function g(a) within a domain to produce a new function h(x). This process of sliding a filter over a function is known as convolving. A one-dimensional convolution in the continuous case takes the form [14]:

$$h(x) = (f * g)(x) = \int f(a)\, g(x - a)\, da,$$

and in the discrete one-dimensional form:

$$h(x) = (f * g)(x) = \sum_{\forall a} f(a)\, g(x - a).$$

The convolutional operation can easily be extended to two dimensions; this takes the continuous form [14]:

$$h(x, y) = (f * g)(x, y) = \iint f(a, b)\, g(x - a, y - b)\, da\, db,$$

and the discrete form:

$$h(x, y) = (f * g)(x, y) = \sum_{\forall a} \sum_{\forall b} f(a, b)\, g(x - a, y - b).$$

Convolutional layers learn the values of n filters f, where each filter has a kernel size and each filter produces a feature channel. The kernel size is the size of the window of the filter f, and feature channels are convolved representations of the input. These feature channels are in turn fed into an activation function to produce the final output. Typically, a convolutional layer consists of three stages: 1. a number of convolutions are performed in parallel, 2. the activations from the convolutional operations are fed into a nonlinear activation function, and 3. a pooling layer is used to modify the output of the convolutional layer further [14].
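The discrete two-dimensional convolution above can be computed directly, if inefficiently, with two loops. The following NumPy sketch uses zero padding as in figure 2.10 and flips the kernel to match the mathematical definition (deep learning frameworks typically implement the unflipped variant, cross-correlation). It assumes odd kernel dimensions.

```python
import numpy as np

def conv2d(g, f):
    """Discrete 2D convolution h = f * g with zero padding; g is the input
    image, f the filter (both 2D arrays, f with odd dimensions)."""
    kh, kw = f.shape
    ph, pw = kh // 2, kw // 2
    g_pad = np.pad(g, ((ph, ph), (pw, pw)))        # zero padding
    h = np.zeros(g.shape, dtype=float)
    for y in range(g.shape[0]):
        for x in range(g.shape[1]):
            window = g_pad[y:y + kh, x:x + kw]
            # Flipped kernel realizes the sum of f(a, b) g(x - a, y - b).
            h[y, x] = np.sum(f[::-1, ::-1] * window)
    return h
```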

Pooling layers

Pooling layers are a form of dimensionality reduction that extracts features; two common pooling operations can be seen in figure 2.11. To the left, in figure 2.11a, average pooling can be seen: the dimensionality is reduced by computing the average of each n × n window. To the right, in figure 2.11b, max pooling can be seen: the maximum value is extracted from each window.

Figure 2.11: Visualization of pooling layers with kernel size 2 × 2 and stride 2: (a) average pooling, (b) max pooling
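Both pooling operations in figure 2.11 can be expressed as one reshape followed by a reduction. A small NumPy sketch, assuming non-overlapping windows (stride equal to the kernel size):

```python
import numpy as np

def pool2d(x, k=2, op=np.max):
    """Non-overlapping k x k pooling with stride k over a 2D array;
    pass op=np.mean for average pooling, op=np.max for max pooling."""
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]             # drop ragged edges
    blocks = x.reshape(h // k, k, w // k, k)  # group into k x k windows
    return op(blocks, axis=(1, 3))            # reduce each window
```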

2.1.6 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a family of neural networks that are used for processing sequences of data. There are a number of design patterns that an RNN can have, including but not limited to the following:

1. Sequence to vector models

2. Vector to sequence models

3. Sequence to sequence models

Sequence to vector models are used when the objective is to classify a sequence of data or find a vector representation of a sequence of data. Vector to sequence models are used for generating sequences of data from vectors; one example is image-to-sentence models. Sequence to sequence models are used when the objective is to generate a new sequence from an observed sequence; a typical application is language translation. In this thesis, the problem is finding a class from an observed sequence of images; therefore, the first type of RNN, sequence to vector models, is of interest.

Figure 2.12: A vanilla RNN for sequence to vector classification

RNNs achieve their ability to process arbitrarily long sequences of data by using shared parameters over time. A vanilla RNN for sequence to vector classification can be seen in figure 2.12. The forward propagation for the RNN begins with a specification of the initial state h^(0); then the following set of equations is used for t ∈ {1, · · · , τ} [14]:

$$a^{(t)} = W h^{(t-1)} + U x^{(t)}, \qquad (2.4)$$

$$h^{(t)} = \tanh\left(a^{(t)}\right), \qquad (2.5)$$

in which τ is the number of time steps in the sequence, W is the weight matrix for the previous time step's hidden state h^(t−1), U is the weight matrix for the current observation x^(t), and a^(t) is the tentative value for the current time step's hidden state. Finally, for the final element in the sequence:

$$o = c + V h^{(\tau)}, \qquad (2.6)$$

$$\hat{y} = \mathrm{SoftMax}(o), \qquad (2.7)$$

where c is the bias, V is the weight matrix for the final time step's hidden state h^(τ), and ŷ is the predicted probability for each class. This gives the loss function $L = -\log p(y \mid \{x^{(1)}, \cdots, x^{(\tau)}\})$. The computation of the gradient with respect to the loss function is computationally expensive due to the need for both a forward pass and a backward pass that cannot be parallelized, owing to the inherently sequential form of the data. The backward pass is performed by propagating the loss backwards through time, a process known as Backpropagation Through Time (BPTT). In theory, RNNs should be able to learn dependencies of arbitrary length. However, in practice this is not the case: since gradients are propagated through time, as the length of the sequences grows, gradients vanish and explode [14]. In the following subsections, two nodes that were made to address the problem of vanishing and exploding gradients are explained.
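First, though, the full sequence-to-vector forward pass of equations (2.4)–(2.7) fits in a few lines of NumPy. This is an illustrative sketch, with the bias in (2.4) omitted as in the text.

```python
import numpy as np

def rnn_sequence_to_vector(xs, h0, W, U, V, c):
    """Forward pass of equations 2.4-2.7: fold a sequence of observations
    xs into a hidden state, then classify the final hidden state."""
    h = h0
    for x in xs:                  # t = 1, ..., tau
        a = W @ h + U @ x         # (2.4) tentative hidden state
        h = np.tanh(a)            # (2.5) hidden state
    o = c + V @ h                 # (2.6) class scores from final hidden state
    e = np.exp(o - o.max())
    return e / e.sum()            # (2.7) softmax over class scores
```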

Long Short Term Memory

Figure 2.13: An LSTM node; ⊙ is the element-wise product. The visualization is based on the figures from Olah [21]

Long Short Term Memory (LSTM) addresses the issue of vanishing and exploding gradients for long-term dependencies by introducing an internal self-loop through the use of a cell state C_t. This internal cell state is regulated between time steps through the use of "gates", additional internal hidden units that use element-wise products and sigmoid activation layers to determine what information gets to flow through time. An LSTM node contains three gates: 1. a forget gate f_t, 2. an input gate i_t, and 3. an output gate o_t [14, 22]. The flow of computation for the variation of the LSTM node used in the scope of this project is visualized in figure 2.13.

The forget gate f_t decides what information is kept from the previous cell state C_{t−1}, the input gate i_t controls what new information should be introduced into the cell state C_t, and the output gate o_t regulates what the hidden state h_t for the current time step should be. All gates take the previous time step's hidden state h_{t−1} and the current observation x_t as input and are computed with the following set of equations:

$$f_t = \sigma\left(x_t U^f + h_{t-1} W^f\right), \qquad (2.8)$$

$$i_t = \sigma\left(x_t U^i + h_{t-1} W^i\right), \qquad (2.9)$$

$$o_t = \sigma\left(x_t U^o + h_{t-1} W^o\right), \qquad (2.10)$$

in which U^f, U^i, and U^o are the weight matrices for the current observation x_t, and W^f, W^i, and W^o are the weight matrices for the previous time step's hidden state h_{t−1}. A tentative cell state C̃_t is also computed in much the same manner, with the exception of the use of a tanh activation function instead of a sigmoid activation function:

$$\tilde{C}_t = \tanh\left(x_t U^g + h_{t-1} W^g\right), \qquad (2.11)$$

where once again U^g is the weight matrix for the current observation x_t, and W^g is the weight matrix for the previous time step's hidden state h_{t−1}. The tentative cell state C̃_t is then combined with the previous time step's cell state C_{t−1}, the input gate i_t, and the forget gate f_t to compute the current time step's cell state C_t (⊙ is the element-wise product):

$$C_t = \tanh\left(f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\right), \qquad (2.12)$$

and the hidden state for the current time step h_t is then finally computed:

$$h_t = C_t \odot o_t. \qquad (2.13)$$
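One step of this LSTM variant, written out in NumPy with a row-vector convention (x_t and h_{t−1} as row vectors, multiplied from the left); the parameter container p is an illustrative choice, not from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One step of equations 2.8-2.13; p holds the weight matrices
    Uf/Wf, Ui/Wi, Uo/Wo, Ug/Wg as NumPy arrays."""
    f = sigmoid(x_t @ p["Uf"] + h_prev @ p["Wf"])        # forget gate (2.8)
    i = sigmoid(x_t @ p["Ui"] + h_prev @ p["Wi"])        # input gate (2.9)
    o = sigmoid(x_t @ p["Uo"] + h_prev @ p["Wo"])        # output gate (2.10)
    C_tilde = np.tanh(x_t @ p["Ug"] + h_prev @ p["Wg"])  # tentative cell state (2.11)
    C = np.tanh(f * C_prev + i * C_tilde)  # cell state (2.12), as in the thesis variant
    h = C * o                              # hidden state (2.13)
    return h, C
```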


Figure 2.14: A GRU node; ⊙ is the element-wise product. The visualization is based on the figures from Olah [21]

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) can be seen as a simplified version of an LSTM, with the main simplification being the use of a single gate for controlling the flow of information from the current and previous time steps. There are two gates in the GRU cell, a reset gate r_t and an update gate z_t. Both gates are computed from the current time step's observation x_t and the previous time step's hidden state h_{t−1}:

$$r_t = \sigma\left(x_t U^r + h_{t-1} W^r\right), \qquad (2.14)$$

$$z_t = \sigma\left(x_t U^z + h_{t-1} W^z\right), \qquad (2.15)$$

where U^r and U^z are the weight matrices for the current observation x_t, and W^r and W^z are the weight matrices for the previous time step's hidden state h_{t−1}. As can be seen in figure 2.14, the reset gate r_t is then combined with the previous time step's hidden state h_{t−1} to compute the tentative hidden state h̃_t:

$$\tilde{h}_t = \tanh\left(x_t U^{\tilde{h}} + r_t \odot W^{\tilde{h}} h_{t-1}\right), \qquad (2.16)$$

in which U^h̃ and W^h̃ are the weight matrices for the current time step's observation x_t and the previous time step's hidden state h_{t−1}, respectively. Finally, the tentative hidden state h̃_t is combined with the previous time step's hidden state h_{t−1} and the update gate z_t to compute the hidden state for the current time step h_t [14, 23]:

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t. \qquad (2.17)$$


As before, this is one of a number of different variations of the GRU node, but the version explained here is the one used in the scope of this thesis.
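The corresponding single step of the GRU variant in equations (2.14)–(2.17), in the same row-vector NumPy convention as the LSTM sketch above; the parameter container p is again an illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One step of equations 2.14-2.17; p holds Ur/Wr, Uz/Wz, Uh/Wh."""
    r = sigmoid(x_t @ p["Ur"] + h_prev @ p["Wr"])              # reset gate (2.14)
    z = sigmoid(x_t @ p["Uz"] + h_prev @ p["Wz"])              # update gate (2.15)
    h_tilde = np.tanh(x_t @ p["Uh"] + r * (h_prev @ p["Wh"]))  # tentative hidden state (2.16)
    return z * h_prev + (1 - z) * h_tilde                      # hidden state (2.17)
```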


2.2 Related Work

This chapter consists of two main sections: the first gives a glance into the field of remote sensing, and the second gives a short overview of some related problems in biomedical image segmentation, a relevant field since the spread of a spruce bark beetle infestation can, in a sense, be seen as the spread of a disease.

2.2.1 Remote Sensing

Deep Recurrent Neural Networks for Winter Vegetation Quality Mapping via Multitemporal SAR Sentinel-1

The Sentinel-1 mission is the first of five missions of the European Space Agency's Copernicus initiative. Sentinel-1 data is produced by two polar-orbiting satellites performing C-band synthetic aperture radar imaging, enabling them to acquire images regardless of weather [24].

In this paper, Minh et al. [25] present a method for winter vegetation quality mapping through the use of two different types of Deep Recurrent Neural Networks (DRNNs) and compare the performance of their DRNNs to that of Random Forest (RF) and Support Vector Machines (SVMs). Though one is able to vectorize the temporal aspect of the satellite data and then perform classifications with RF or SVMs on the resulting feature vectors, the more traditional machine learning models lack the explicit ability to learn the temporal aspect of the satellite data that DRNNs have. Minh et al. [25] show that this ability yields a consistent improvement over RF and SVMs for winter vegetation classification.

To be able to train the models, 13 Sentinel-1 (S1) radar images were acquired. From these 13 images, vertical transmit horizontal receive (VH) and vertical transmit vertical receive (VV) images were calculated, giving a total of 26 images. The ground truth data was gathered by surveying 194 land plots.

The images were then preprocessed by first selecting a master image and then coregistering all images while taking the Terrain Observation with Progressive Scan (TOPS) mode of the master image into account. Five backscatter images were generated and calibrated, and the time series SAR S1 dataset was improved by using the temporal filtering introduced by [26], so that the noise was reduced as much as possible while retaining the structural information. Finally, the dataset was orthorectified by creating a simulated SAR image and using it to coregister the two image sets.

The DRNN was constructed by stacking five units together. Through the use of multiple units, the model is able to extract high-level non-linear temporal dependencies. Finally, a softmax layer was stacked on top of the last recurrent layer to perform the final multi-class prediction. In total, two DRNNs were constructed using this high-level architecture: one built using Long Short Term Memory (LSTM) and another built using Gated Recurrent Units (GRU). For validation, all models were trained using five-fold cross validation. The results from the experiments can be seen in table 2.3.

Classifier   F-measure   Accuracy   Kappa
RF           91.77%      91.79%     0.897
SVM          91.22%      91.79%     0.890
LSTM         98.83%      98.83%     0.985
GRU          99.05%      99.05%     0.988

Table 2.3: Observed performance for winter vegetation quality mapping

3D convolutional Neural Networks for Crop Classification with Multi- Temporal Remote Sensing Images

In the work presented here by Ji et al. [27], 3D convolutions are compared to 2D convolutions, Support Vector Machines (SVMs), k-Nearest-Neighbors (kNN), and Principal Component Analysis (PCA) + kNN.

2D CNNs lack the ability to extract information from the temporal dimension, since 2D convolutions collapse the temporal dimension through averaging. To overcome this, Kussul et al. [28] introduced two different CNNs: one 2D CNN for spatial feature learning and a 1D CNN for spectral feature learning. However, this strategy requires additional empirical processing to combine the learned features. Conversely, 3D CNNs are able to naturally process spatio-temporal data without the need for additional empirical processing.

3D CNNs are, at least in theory, able to extract both spatial and temporal features in a more elegant manner than 2D CNNs. To demonstrate this, Ji et al. [27] develop a 3D CNN framework to extract information from multi-temporal satellite images and compare it to a 2D CNN, SVMs, kNN, and PCA + kNN. Due to the limited availability of well-labeled samples, they use a semi-automatic, semi-supervised learning strategy. At the time of writing their paper, to the best of their knowledge, 3D CNNs had not been applied to crop classification of multi-temporal multi-spectral satellite images before.

Inspired by the neural network structure developed by Oxford's Visual Geometry Group (VGGNet), Ji et al. [27] use the same high-level structure to build their 3D CNN. The first three layers consist of convolution-pooling concatenations, with 32, 32, and 64 neurons. In order to best preserve the temporal information, the temporal dimension remains uncompressed until the first fully connected layer, through the use of a 1 × 2 × 2 average pooling strategy. In addition to the 3D CNN, a 2D CNN was also trained with the same structure as that of the 3D CNN.

For testing, two GF2 (Gaofen 2) multi-temporal images were used, acquired in June, July, August, and September of 2015, and in May, June, July, September, and October of 2016, respectively, as well as one GF1 (Gaofen 1) multi-temporal image from April, May, June, July, and September of 2014. The GF2 images had sizes of 1417 × 2652 and 1111 × 293 pixels with a spatial resolution of 4 m, and the GF1 image was 5400 × 6500 pixels with a 15 m spatial resolution. Both the GF1 and GF2 images were pre-processed using quick atmospheric correction (QUAC) and geometric rectification. Handcrafted shape files of the GF2 images were used as a reference for pixelwise classification. 400 training samples and 2 000 testing samples were selected from the 2015 shape file, and 50 training samples and 1 500 testing samples were selected from the 2016 shape file.

In testing, the input size, convolutional kernel size, number of layers, and pooling strategy were each tuned by averaging the results of 10 test runs per setting. From this experimental setup, a kernel size of 3 × 3 × 3 was found to be best, a result consistent with the results found in Ji et al. [29]. A patch size of 8 × 8 was found to be best for both the GF1 and GF2 data; larger patch sizes led to mixed-pixel problems. Three convolutional layers performed slightly better than two- and four-layer networks, and deeper networks were not considered due to the image space having been reduced to one after three 2× poolings. Average pooling performed slightly better than max pooling. The 2D CNN was tuned in the same manner as the 3D CNN to ensure comparability. Finally, all models were tested on the test set, and the 3D CNN was found to be the best performing model.

2.2.2 Biomedical Image segmentation

The growth of a bark beetle infestation behaves in a manner similar to that of a cancer, so much of the inspiration for the construction of the models in this thesis can be gathered from biomedical image segmentation.

Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Here, Ciresan et al. [30] present a method for the automatic segmentation of neuronal structures in stacks of electron microscopy images. The 3D structure is represented by a 512 × 512 × 30 image stack, and the ground truth segmentation is represented by another 512 × 512 × 30 image stack, where neuron membranes have value 1 and all other pixels have value 0.

To achieve this, the raw intensity values of a window of width w centered on the pixel p are given to the Deep Convolutional Neural Network as input. The classifier then decides whether the current pixel is membrane or non-membrane. If the window overlaps with the image border, the missing pixels are synthesized by mirroring pixels from the current window. The network architecture for the network with w = 95 can be seen in table 2.4.

To train the classifier, all available slices in the image stack are used. For each slice, all membrane pixels are used as positive examples, and the same number of non-membrane pixels is selected at random from the slice. On average, there are 50 000 membrane pixels per image slice, so in total there are approximately three million training examples, evenly balanced between the two classes. Furthermore, since the neuronal structures are not affected by their orientation, the data is augmented at the beginning of each epoch by randomly mirroring and/or rotating the images by ±90°. The input data was also further manipulated using foveation and non-uniform sampling.

Foveation is a technique where the center of the image remains undistorted, but as the pixels move away from the image center, a blur is increasingly applied.

Non-uniform sampling is performed by letting center pixels map directly to the network neurons, but as the distance to the center increases, the source pixels are sampled with decreasing frequencies. Combining foveation and non-uniform subsampling enables one to exploit the data at multiple resolutions.

Layer   Type              Maps and neurons             Kernel size
0       Input             1 map of 95 × 95 neurons
1       Convolutional     48 maps of 92 × 92 neurons   4 × 4
2       Max pooling       48 maps of 46 × 46 neurons   2 × 2
3       Convolutional     48 maps of 42 × 42 neurons   5 × 5
4       Max pooling       48 maps of 21 × 21 neurons   2 × 2
5       Convolutional     48 maps of 18 × 18 neurons   4 × 4
6       Max pooling       48 maps of 9 × 9 neurons     2 × 2
7       Convolutional     48 maps of 6 × 6 neurons     4 × 4
8       Max pooling       48 maps of 3 × 3 neurons     2 × 2
9       Fully connected   200 neurons                  1 × 1
10      Fully connected   2 neurons                    1 × 1

Table 2.4: The network architecture for w = 95 in [30]

Because the network is trained on a dataset with equal representation of each class, the network outputs cannot be directly interpreted as probabilities. Since the base rate of membranes is much smaller in reality, the model tends to severely overestimate the probability of a membrane. To overcome this issue, a polynomial post-processing function is applied to the network outputs. After performing this calibration, the outputs from the networks are spatially smoothed using a 2-pixel-radius median filter.

Ciresan et al. [30] observed that there was variability in the outputs produced by different networks despite being trained on the same dataset. This suggests that the classifiers have large variance and low bias. Therefore, the outputs of the models were averaged, giving a sort of ensemble-based classification.

The final segmentation is evaluated on three metrics:

1. Rand error: defined as 1 − F_rand, where F_rand represents the F1 score of the Rand index, which measures the accuracy at which pixels are associated with their respective neurons.

2. Warping error: a segmentation measure that is designed to account for topological disagreements.

3. Pixel error: 1 − F_pixel, where F_pixel is the F1 pixel similarity.

In total, four networks were trained: three with w = 65 and one with w = 95. All networks except for one w = 65 network were trained using foveation and non-uniform sampling. The results can be seen in table 2.5.

Rand error   Warping error   Pixel error
48 · 10⁻³    434 · 10⁻⁶      60 · 10⁻³

Table 2.5: Results from [30]


3 Methods

In this chapter, the methods used for implementation and testing are explained. First, in section 3.1, the chosen software packages are explained and justified, and the hardware used for training the models is specified. Then, in section 3.2, the steps taken to preprocess the data for training are explained in detail. Next, in section 3.3, the method used for selecting the recurrent and 3D convolutional models is explained. Finally, in section 3.4, the chapter concludes with an explanation of the experimental setup.

3.1 Software packages and hardware

To enable reproducibility, this section specifies the software packages and hardware that were used to complete the degree project.

3.1.1 Software packages

Arcgis

ArcGIS is a geographic information system (GIS) used for collecting and managing geographic data, creating professional maps, and performing spatial analysis [31]. ArcGIS was primarily used for visualizing results and converting GeoJSON-formatted data [32] to shapefiles [33].

Arcpy

Arcpy is a Python package for performing geographic data analysis, data conversion, data management, and map automation [34]. Arcpy was used for reading the harvester data and for clipping out regions of interest from the national land cover data, height data, and ground level humidity data.


GDAL

GDAL is a translator library for geospatial raster and vector data [35]. GDAL was used for reading and writing geospatial raster data and creating the target rasters.

NumPy

NumPy is a Python package for handling N-dimensional arrays [36]. NumPy was used for performing elementary computations and for augmenting the training data in an online fashion.

Matplotlib

Matplotlib is a Python package for creating visualizations [37]. Matplotlib was the primary tool used for creating the visualizations seen in this thesis.

SciPy

The SciPy library is an open-source Python package for scientific computing [38]. SciPy was used to solve a wide variety of problems, ranging from polygonizing clusters to histogram matching and interpolation.

Scikit-image

Scikit-image is a library that contains a myriad of algorithms for image processing [39]. Scikit-image was used for performing image translations with sub-pixel precision.

Xarray

Xarray is an open-source Python package for labeled multidimensional arrays. Xarray introduces labels in the form of dimensions, coordinates, and attributes to produce a pandas-like experience for development with multi-dimensional data [40]. The Xarray package was used for handling geospatial raster data and for the imputation of missing satellite data.

Pandas

Pandas is an open-source library for data analysis and manipulation [41]. Pandas was used for handling the harvester data's metadata, such as timestamps, product use, etc.

Geopandas

Geopandas is a pandas-like library for working with geospatial data in Python [42]. Geopandas was used for working with cluster metadata, such as determining when the time series of satellite data should stop, based on the average timestamp of a cluster.

PyProj

PyProj is a Python package for performing cartographic projection transformations [43]. PyProj was used for converting geospatial data between coordinate systems.

Keras

Keras is a high-level neural network API written in Python [44]. Keras was used to train all models.

3.1.2 Hardware

Swedish Space Data Lab

The Swedish Space Data Lab is a cooperation between the Swedish National Space Agency, RISE, Luleå University of Technology and AI Innovation of Sweden. The project aims to increase the use of data from space to further develop society and industry for the benefit of the planet. The Swedish Space Data Lab will achieve this goal by creating a national infrastructure for the exploitation of space data that will in turn foster a national knowledge and data hub for working with earth observation data with AI-based analysis [45].

The Swedish Space Data Lab is accessed through a Jupyter Lab login, and users have access to an Intel(R) Xeon(R) E5-2620 v3 @ 2.40 GHz processor, a GeForce GTX 1080 Ti GPU, and 256 GB of RAM [45].

3.2 Data preprocessing

Figure 3.1: Overview of data preprocessing. Sentinel-2 data is processed (NDVI, alignment, histogram equalization, imputation of missing data) and cropped to regions of interest with infestations, producing an (N, T, 1, Y, X) input tensor; the land cover map, ground level humidity, and height map are cropped into an (N, 3, Y, X) tensor; and the harvester data is clustered, rasterized, and thresholded into the target segmentations.
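The first Sentinel-2 processing step in figure 3.1 is an NDVI computation. As a sketch, NDVI is the standard normalized difference of the NIR and red bands (bands 8 and 4 in table 2.1); the epsilon guard is an implementation choice for this example, not taken from the thesis.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index from Sentinel-2 band 8 (NIR)
    and band 4 (red); eps guards against division by zero."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + eps)
```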
