Semi-Supervised Methods for Classification of Hyperspectral Images with Deep Learning


DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020


OSCAR ÖRNBERG

KTH

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: November 23, 2020
Supervisors: Matteo Matteucci, Ruibo Tu
Examiner: Hedvig Kjellström
School of Electrical Engineering and Computer Science
Swedish title: Semi-Vägledda Metoder inom Klassificering av Hyperspektrala Bilder med Djupinlärning


Abstract

Hyperspectral images (HSI) can reveal more patterns than regular images. The dimensionality is high, with a wider spectrum for each pixel. Few labeled datasets exist, while unlabeled data is abundant. This makes semi-supervised learning well suited for HSI classification. Leveraging new research in deep learning and semi-supervised methods, two models called FixMatch and Mean Teacher were adapted to gauge the effectiveness of consistency regularization methods for semi-supervised learning on HSI classification.

Traditional machine learning methods such as SVM, Random Forest and XGBoost were compared, together with two semi-supervised machine learning methods, TSVM and QN-S3VM, as baselines. The semi-supervised deep learning models were tested with two networks, a 3D CNN and a 1D CNN.

To enable the use of consistency regularization, several new data augmentation methods were adapted to HSI data. Current methods are few and most rely on labeled data, which is not available in this setting. The data augmentation methods presented proved useful and were incorporated into an automatic augmentation scheme.

The accuracy of the baseline and semi-supervised methods showed that the SVM was best in all cases. Neither semi-supervised method showed consistently better performance than its supervised equivalent.


Sammanfattning (Summary)

Hyperspectral images (HSI) can reveal more patterns than regular images. The dimensionality is high, with a wider spectrum for each pixel. Few labeled datasets exist, while raw data is abundant. This makes semi-supervised learning well suited to HSI classification. Drawing on new findings in deep learning and semi-supervised methods, two models called FixMatch and Mean Teacher were adapted to measure the effectiveness of consistency regularization methods in semi-supervised learning for HSI classification.

Traditional machine learning methods such as SVM, Random Forest and XGBoost were compared together with two semi-supervised machine learning methods, TSVM and QN-S3VM, as baselines. The semi-supervised deep learning methods were tested with two different networks, a 3D CNN and a 1D CNN.

To make consistency regularization possible, several new data augmentation methods were adapted to HSI data. Current methods are few and rely on the data being labeled, which is not the case in this scenario. The data augmentation methods presented proved useful and were incorporated into an automatic augmentation system.

The accuracy of the baselines and the semi-supervised methods showed that the SVM was best in all cases. Neither of the semi-supervised methods showed consistently better results than its supervised counterpart.


I would like to extend my thanks to my supervisor Prof. Matteo Matteucci and Ph.D. Francesco Lattari. Your help and guidance in difficult times proved essential in showing me the right direction and supporting my work. Even though COVID-19 threw a spanner in the works, the situation worked out with your help. I would also like to thank Politecnico di Milano and DEIB for receiving me as a visiting student.

Further, I am very grateful to my examiner Hedvig Kjellström and supervisor Ruibo Tu at KTH for your great organization of the thesis process and your support during our meetings. Thank you both for being there and helping at each step of the process.

Thank you to the members of the thesis group, Jade Cock and Sri Datta Budaraju, for the feedback and inspiration you both gave during the course.

Finally, I would like to express my appreciation to my girlfriend and my family for their unending support and encouragement. Without you, none of this would have been possible.


Chapter 1 Introduction

In the last decade, the amount of spectral, spatial and temporal information that sensors are able to gather has sharply increased [1]. Hyperspectral imaging (HSI) combines spectroscopy and imaging to capture an almost continuous spectrum of light. HSI produces an image cube with 2D spatial information, the image features, and 1D spectral information, the spectral bands. In Fig. 1.1, the right figure shows the spatial information and the left figure shows the spectral information.

Figure 1.1: Hyperspectral cube with spectral signature [2].

The x-y plane is similar to a regular image, but instead of only red, green and blue colours, a wide range of the radiation's wavelengths is used. This results in a spectrum of colours for each pixel. Unlike multispectral images, HSI has an almost continuous spectrum instead of discrete, separated bands. Moreover, it captures more than just visible colour, and includes information towards the infrared and ultraviolet parts of the spectrum.

The HSI technique has been applied in several fields, such as remote sensing for agriculture, geological surveys, and environmental observations for classification. Classification of hyperspectral images is generally pixel based.

Each pixel is classified and the results are then combined back into an image, so the term is interchangeable with semantic segmentation as used in some other fields. In other fields, such as food processing and surveillance, HSI has also seen a rise in use recently [3]. It is very useful in many of these fields since it captures information that normal images cannot. The wavelengths not visible to the human eye can reveal imperfections in food, whether a certain area in a satellite image consists of corn or wheat, and information about materials that cannot be seen by humans.
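As a minimal sketch of the pixel-based procedure, the following example flattens a toy image cube into pixel spectra, classifies each spectrum and folds the labels back into an image. The cube values and the rule in `classify_pixel` are hypothetical stand-ins for real data and a trained classifier, and only the Python standard library is used.

```python
# Toy hyperspectral cube: height x width x bands (values are made up).
cube = [
    [[0.1, 0.2, 0.9], [0.1, 0.3, 0.8]],
    [[0.7, 0.6, 0.1], [0.8, 0.5, 0.2]],
]


def classify_pixel(spectrum):
    """Hypothetical stand-in for a trained classifier.

    Returns class 1 if the last band is the strongest, else class 0.
    """
    return 1 if spectrum[-1] == max(spectrum) else 0


def classify_cube(cube):
    """Classify every pixel spectrum and reassemble the 2D label map."""
    height, width = len(cube), len(cube[0])
    # Flatten the two spatial dimensions into a list of spectra.
    pixels = [cube[r][c] for r in range(height) for c in range(width)]
    labels = [classify_pixel(p) for p in pixels]
    # Fold the flat label list back into the image layout.
    return [labels[r * width:(r + 1) * width] for r in range(height)]


label_map = classify_cube(cube)
```

The classifier only ever sees one spectrum at a time, which is why a purely pixel-based approach cannot exploit spatial context.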

Because of the large number of spectral bands, HSI has high dimensionality. This makes it difficult to analyze the image and to craft models for interpretation. However, since the amount of information in each image is huge, it becomes possible to detect otherwise "invisible" features. Due to this high dimensionality, most traditional classification techniques applied to multispectral images cannot be used [2].

The main problem facing HSI analysis is the high dimensionality of the data, but noise and spectral mixing can also cause problems. Spectral mixing results from different classes being combined into one pixel because of insufficient resolution. Different classes, for example soil and vegetation, can be present in the area of one pixel. This mixes their spectral characteristics, making classification difficult. Fig. 1.2 shows the spectral mixing problem. Two pixels consist of more than one class, which makes the spectrum to the right a mixture between the two.
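To illustrate spectral mixing, the sketch below combines two pure spectra according to the fraction of the pixel area each class covers. The linear mixing rule is a common simplification assumed here for illustration, not a model taken from this thesis, and the reflectance values and the function name `mix_spectra` are made up.

```python
def mix_spectra(spectra, fractions):
    """Linearly mix pure class spectra by their area fractions in one pixel."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("area fractions must sum to 1")
    n_bands = len(spectra[0])
    return [
        sum(f * s[band] for f, s in zip(fractions, spectra))
        for band in range(n_bands)
    ]


# Made-up reflectance values for three spectral bands.
soil = [0.30, 0.35, 0.40]
vegetation = [0.05, 0.10, 0.50]

# A pixel whose area is covered half by soil and half by vegetation.
mixed = mix_spectra([soil, vegetation], [0.5, 0.5])
```

Under this assumption every band of the mixed spectrum lies between the values of the pure spectra, so the pixel matches neither class signature exactly.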

Adding to this, HSI is notoriously hard to label and very few public, labeled datasets exist. It is very difficult for humans to inspect the data and assign each pixel to the correct class, for example water or soil.

This plays a role in the Hughes phenomenon, which states that, to ensure statistical significance when classifying each class, the number of labeled samples needed scales with the dimensionality of the data. Having few labeled samples and high dimensionality makes this a key problem for HSI as well.

Figure 1.2: The concept of a hyperspectral image, including the problem of mixed pixels. The spatial domain can be seen to the left with the landscape, and the spectral domain of three pixels to the right [4].

Supervised, unsupervised and semi-supervised techniques have all been applied to HSI classification, with both traditional machine learning and deep learning methods. Interest in the field has increased during the last decade. The amount of data that HSI sensors can gather has sharply increased, making more data available to research and industry, though the lack of labeled data remains [1]. To reduce the amount of human work needed to analyze the gathered data, much focus has been placed on automatic analysis and classification of HSI data. This has led to many papers on HSI classification methods. These works have previously focused mostly on traditional machine learning methods (section 1.1), but in recent years there has been a shift toward deep learning methods (section 1.2).

1.1 Traditional Methods

Many traditional machine learning techniques have been applied to HSI classification, both supervised and unsupervised. Melgani and Bruzzone [5] applied Support Vector Machines (SVM) to HSI classification, and the method has since been widely used as a supervised method. The authors' SVM implementation was also compared to other traditional methods, such as k-nearest neighbours, and it outperformed them all. An SVM can only take into account the spectral information of each pixel and cannot factor in the spatial relations of the image cube. That is, it cannot learn that in satellite images, one pixel of forest is probably close to another forest pixel. Even though this removes much information from the classifier, the SVM is still an easy-to-use and widely used method with good performance.

Several unsupervised methods have also been applied. Component analysis methods such as principal component analysis (PCA) reduce the dimension of the input. This is very useful in HSI because of the high dimensionality. PCA is commonly used to reduce the problem to fewer dimensions and partially overcome the dimensionality issue [1]. Beyond these common decomposition methods, a plethora of other methods have been used to train unsupervised classifiers on HSI.

1.2 Deep Learning Methods

Deep learning (DL) has shown great promise in many fields, including image classification, object detection and natural language processing. The ability to capture more complicated behaviour and to reduce the number of hand-crafted features in the models has been alluring for remote sensing scientists working with HSI.

Thus, many have turned to deep learning for solutions to problems like high dimensionality [6].

The DL methods used in HSI classification are mainly convolutional neural networks (CNN), autoencoders, deep belief networks, generative adversarial networks and recurrent neural networks [3]. Since HSI data has both a spatial (the x-y plane) and a spectral (the z-direction) domain, different networks can focus on different domains as input data. Newer methods have also been applied where both domains are used; Li et al. [6] call these methods spatial-spectral feature networks. Furthermore, methods can apply preprocessing to the input data or postprocessing to the output data. This is similar to what has been done with traditional methods, such as applying PCA to the input data to reduce its dimensions. Methods that do no pre- or postprocessing can be called integrated networks.

Integrated spatial-spectral feature networks have been implemented by several researchers, among them Hamida et al. [7] and Luo et al. [8], using 3D CNNs to handle the three dimensions of the image. Thus, the early stages of the process already handle both the spatial and spectral information directly. All information is used and does not need to be analyzed by different networks, which reduces the number of parameters otherwise needed. This is preferable in HSI because fewer parameters require fewer labeled samples for training, and labeled samples are scarce.

Unsupervised attempts have also been made and have shown comparatively good performance, even compared with some supervised methods. These models do not need labeled data, but instead rely on finding patterns among the samples.

This is usually more difficult, since the lack of labels reduces the information we have about the data. Several unsupervised works, such as Nalepa et al. [9] and Mou et al. [10], have applied different techniques using integrated networks without pre- or postprocessing steps. Still, most work has been in the supervised domain. Since HSI lacks labeled data, unsupervised methods remain interesting because they do not need any labels.

1.3 Semi-Supervised Learning

Semi-supervised learning is the combination of methods using both labeled and unlabeled data for machine learning. It is common to call it a combination of supervised and unsupervised learning. This enables training schemes which can use a small fraction of labeled data and supplement it with a larger amount of unlabeled data [11]. When labeled data is difficult to acquire, but unlabeled data exists in abundance, using semi-supervised learning can be useful. This requires that the amount of unlabeled data is much higher than the labeled data, since it carries less information. The field of HSI fits into this description.

Unlabeled data is abundant, since acquiring the images is not an issue. Labeling the acquired images takes time and money, since they are generally very difficult to label. Thus, semi-supervised learning methods have been used, and continue to be researched, as a way to address the Hughes phenomenon within HSI.

In recent years the amount of work on semi-supervised learning in fields other than HSI has grown. Many new approaches and methods have been proposed, and they have outperformed both supervised and unsupervised methods. New methods such as FixMatch [12] and Mean Teacher [13] have achieved state-of-the-art results on popular image classification datasets such as ImageNet. The HSI field has also seen new work on semi-supervised learning, which has proven useful there as well. Many different approaches have been tried, for example pseudo-labeling [14] and co-training [15].

1.4 Contribution

Based on the recent advances in semi-supervised learning and their possible usefulness for HSI classification, this thesis investigates how well modern semi-supervised methods from other fields of image classification perform for HSI classification. Two methods based on consistency regularization, which have not previously been applied to HSI, are tested to see whether they perform better than regular, supervised deep learning methods as well as traditional methods such as SVMs and decision trees.

1.5 Ethical Considerations

This thesis will only deal with data that is open source and available for research. This has been provided by research groups at diﬀerent universities to further research in this area. Thus, the data used in this work will not be subject to any ethical consideration.

In general, the area of hyperspectral image classification has not faced many ethical discussions. Instead, ethical considerations arise in the general area of classification using AI, which is subject to several ethical considerations in different applications. When computers perform classification automatically, things can go wrong. The main concerns are bias introduced to the models through the data, and the lack of explainability of the models. Using semi-supervised methods can amplify the inherent or inherited biases of the data or models: semi-supervised methods can extrapolate existing biases and apply them to new data, which is then used to reinforce the biases further. The models used also lack explainability regarding why a certain class was predicted. This can cause problems where decisions cannot be explained, and thus we do not fully understand the decision making of the model.

The purpose of many works on HSI classification revolves around accurate prediction of classes, often agricultural land as in the most popular datasets. This can hopefully be used by farmers to reduce the amount of pesticide, fertilizer and water used in farming. With increased yield, food could hopefully become cheaper and available to more people. The downside is that the systems in which these works could be used are expensive, and their use might be restricted to richer countries. The use of satellites or UAVs is not accessible in many parts of the world where an increased yield of food is needed. This could widen the gaps in wealth and prosperity between countries whose farmers can afford AI-assisted systems and those whose farmers cannot.


Chapter 2 Related Works

In this chapter, recent attempts to apply different semi-supervised techniques within HSI classification are reviewed. Recent advances in semi-supervised learning in fields other than HSI are also reviewed to gauge the state of the art of semi-supervised methods.

2.1 Semi-Supervised Learning in Hyperspectral Imaging

Several works have investigated using both labeled and unlabeled data to train deep learning classifiers for hyperspectral imaging. He et al. [16] use GANs with a 3D bilateral filter for the volumetric input of the image cube. The 3D bilateral filter extracts the spatial-spectral features, which are then fed to the GAN. Both labeled and unlabeled samples can be used to train the GAN to generate realistic samples. These generated samples can then be included in the training set as labeled samples and used to train the classifier. Testing showed that the 3D bilateral filter had only a limited impact. The GAN classifier showed promise, but it is a complex structure with several network parameters that can be difficult to tune.

Kemker and Kanan [17] used unlabeled datasets to first train a spatial-spectral feature extractor using stacked multi-loss autoencoders, which can then be used to train a classifier with labeled data. Kemker et al. [18] then improved the idea by replacing the SVM classifier with a semi-supervised multi-layer perceptron for classification. They showed that using datasets from separate sensors also improved performance, even though the datasets differ in the number of spectral bands, the wavelengths they capture and the spatial resolution.

Wu et al. [14] used pseudo-labeling to gather more data for training a convolutional recurrent neural network (CRNN). The pseudo-labeling approach uses both labeled and unlabeled data, and a Dirichlet Process Mixture Model (DPMM) is used to cluster the potential classes. The clustering was extended with must-link and cannot-link constraints between samples: pixels within the same super-pixel have to belong to the same class, while samples with different labels cannot belong to the same cluster. DPMM clustering works well for this problem since it does not need the number of clusters to be specified. Thus, unknown classes present in the unlabeled dataset will not corrupt the pseudo-labels. The CRNN treats each pixel as a spectral sequence and is trained on the whole new dataset obtained after creating the pseudo-labels by clustering. The last layer is then replaced and retrained with only the labeled data. Because the CRNN sees each pixel only as a spectral sequence, it does not factor in the spatial relationship between samples.

Kemker et al. [15] introduce a co-training scheme for HSI classification. They use a regular ResNet structure with two "views", one for spectral and one for spatial information. Co-training requires two independent views of the data, and in the HSI case this can be assumed to hold since the spatial and spectral information is largely independent. The two networks are first trained on labeled data for classification and are then used to predict the labels of unlabeled data. If the confidence is high enough, the new sample is fed to the other "view" for training. To decide the confidence for the unlabeled spectral samples, the distance to the hierarchical representation of the most similar labeled sample is used. For the spatial samples, they compare with the second-order neighbourhood around the pixel for spatial consistency. The spectral samples are taken as 3x3x(number of spectral bands) inputs, while the spatial samples are reduced with PCA to three spectral dimensions, after which a 27x27 patch is used. With larger sample sizes, the number of iterations needed to achieve good classification results increases, and in turn the computation time increases drastically.

Other ways to overcome the issue of few labeled samples are data augmentation and transfer learning. Kemker et al. [19] propose using synthetically generated images for pre-training. They create a new, publicly available dataset of synthetic multispectral images (MSI) that is used to pre-train deep networks. Several networks are trained on the synthetic dataset, and the weights are then transferred to another network which is fine-tuned on real, labeled data.

Their findings show that pre-training on synthetic datasets with a structure similar to MSI is preferable to pre-training on other large datasets like ImageNet, which does not have the same structure as the MSI datasets. MSI data is more similar to HSI than regular images are, which suggests that this direction might be adaptable to HSI data as well.

Nalepa et al. [20] propose a new data augmentation scheme which uses PCA to generate new samples from the principal components. Noise is injected into the principal components of a sample, and the result is projected back into the original space to create a new, similar sample. They also introduce online augmentation for HSI: inference is done on a sample together with several synthetic samples created from it, and a voting scheme over all of these samples is used to find the right label. Any data augmentation scheme can be used to create the new samples used in the inference.

2.2 Semi-Supervised Learning

Semi-supervised learning has also been applied to improve performance in other classification, recognition and detection tasks. Engelen and Hoos [21] describe wrapper methods as some of the oldest and most widely known algorithms for semi-supervised learning. These algorithms include families such as self-training, co-training and boosting. They use supervised base learners with both labeled and unlabeled data to iteratively improve performance. They are particularly useful since they can use virtually any base learner. Several wrapper methods have been combined in recent years with strong regularization techniques to give state-of-the-art performance on well-known datasets.

Since over-fitting and lack of generalization can become a problem when deep models increase in size and scope, regularization becomes very important. This also holds for semi-supervised learning methods, which additionally face the issue of performance degradation when unlabeled samples are introduced, since these can cause bias in the model, for example when using wrapper methods. Such methods can propagate initial bias since they use a supervised base learner to extend the knowledge with the unlabeled data.

Zhang et al. [22] state that the parameter counts of many recent models have grown in accordance with the size of the datasets, resulting in poorer generalization. Data augmentation has been a key part of increasing generalization, acting as regularization by introducing virtual samples. The authors introduce a new data augmentation scheme to increase generalization, called MixUp. This scheme creates new samples by linear interpolation between samples ($x$) and their targets ($y$),

$$\hat{x} = \lambda x_i + (1 - \lambda) x_j,$$

$$\hat{y} = \lambda y_i + (1 - \lambda) y_j,$$

where the samples are randomly drawn from the training data and $\lambda \in [0, 1]$.

The authors believe that MixUp implicitly controls model complexity, but a good theoretical understanding of the "sweet spot" of this bias-variance trade-off is still lacking. The use of such techniques in semi-supervised learning has increased lately, with several methods using consistency regularization to improve performance. Consistency regularization builds on the idea that a model's predictions should be invariant to small noise applied to the input samples.
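The MixUp interpolation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in this thesis: the function names are hypothetical, targets are assumed to be one-hot vectors, and the value `alpha=0.2` for the Beta distribution from which the MixUp authors draw the mixing weight is only an example setting.

```python
import random


def mixup(x_i, y_i, x_j, y_j, lam):
    """Return the convex combination of two samples and of their targets."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_hat = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_hat = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_hat, y_hat


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha is illustrative."""
    return rng.betavariate(alpha, alpha)


# Two made-up samples with one-hot targets, mixed with a fixed weight.
x_hat, y_hat = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [3.0, 4.0, 0.0], [0.0, 1.0],
    lam=0.75,
)
```

The mixed target is a soft label, here 75% of the first class and 25% of the second, which is what makes the virtual sample usable for training.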

Xie et al. [23] introduce a method that uses consistency regularization with semi-supervised learning to increase performance and generalization. The method utilizes data augmentation and minimizes the difference between the predictions on real and augmented samples. The authors investigate different augmentation techniques to see the effect of strong augmentation of the data on performance. Using their method, unsupervised data augmentation (UDA), several tasks showed better performance with fewer labeled samples than supervised methods.

Other methods also tap into consistency regularization to improve performance. Berthelot et al. [24] look at the big picture and combine several dominant approaches to semi-supervised learning to create their model, MixMatch. Labeled and unlabeled datasets are combined by making predictions on augmented unlabeled samples after training on the labeled dataset.

The predictions made on the augmented versions of the same unlabeled sample are averaged and sharpened to give a pseudo-label. Further training can then be done after increasing the amount of data further with a version of the data augmentation technique MixUp. The authors showed that this holistic approach often reduced the error rate by a factor of two or more.
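The averaging and sharpening step can be sketched as follows. The example assumes temperature sharpening, where each probability is raised to the power 1/T and the result is renormalized; the function names, the temperature of 0.5 and the prediction values are illustrative assumptions.

```python
def average_predictions(predictions):
    """Average class probabilities over several augmented views of one sample."""
    n_views = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / n_views for k in range(n_classes)]


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a distribution by raising it to the power 1/T."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Made-up predictions for two augmented versions of one unlabeled sample.
views = [[0.6, 0.4], [0.8, 0.2]]
averaged = average_predictions(views)
pseudo_label = sharpen(averaged)
```

Sharpening pushes the averaged distribution towards its most likely class while keeping it a valid probability distribution.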

Sohn et al. [12] take inspiration from MixMatch and UDA to create another semi-supervised learning approach with self-training and consistency regularization, which they call FixMatch. To ensure consistency, the method is fed one weakly and one strongly augmented version of the same image. If the prediction on the weakly augmented version is confident enough, it is converted to a one-hot pseudo-label, and the loss is then computed between the prediction on the strongly augmented version and this label. This enables the method to use unlabeled images for training as well.
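A minimal sketch of this unlabeled loss for a single sample is given below. It assumes the model outputs are already class probabilities; the function name, the confidence threshold of 0.95 and the probability values are illustrative assumptions, and the augmentations themselves are outside the scope of the example.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy against the pseudo-label from the weakly augmented view.

    The sample contributes nothing if the weak prediction is not confident.
    """
    confidence = max(weak_probs)
    if confidence < threshold:
        return 0.0
    # One-hot pseudo-label: index of the most confident class.
    pseudo_class = weak_probs.index(confidence)
    # Clamp to avoid log(0) for degenerate predictions.
    return -math.log(max(strong_probs[pseudo_class], 1e-12))


confident_loss = fixmatch_unlabeled_loss([0.97, 0.02, 0.01], [0.5, 0.3, 0.2])
ignored_loss = fixmatch_unlabeled_loss([0.60, 0.30, 0.10], [0.5, 0.3, 0.2])
```

The threshold acts as a filter: early in training few unlabeled samples pass it, and more contribute as the model grows confident.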

The pseudo-labeling approach, present for example in Wu et al. [14] and Babakhin et al. [25], has also been investigated more closely as a method and expanded upon. For example, Arazo et al. [26] propose a new method for training a network with pseudo-labeling that overcomes confirmation bias. Soft pseudo-labels are used instead of one-hot encoding. Since the predictions made might be wrong, MixUp data augmentation is used on the predicted labels to reduce the introduced confirmation bias. Pairing this idea with a minimum number of labeled samples per mini-batch shows that pseudo-labeling can outperform some consistency regularization methods.

Xie et al. [27] also show that pseudo-labeling and self-training can achieve state-of-the-art performance. The authors introduce a method they call Noisy Student. The semi-supervised learning method consists of one teacher and one student network. The teacher network is trained on labeled samples and is then used to infer pseudo-labels on unlabeled data. Noise is injected into this new dataset, and the student network is trained on it combined with the labeled dataset. The student network needs to be at least as large as the teacher so that it can learn better from larger datasets, and since noise is injected the student network needs to generalize better. Iterating this process a few times, replacing the teacher with the student each time, achieved state-of-the-art results on ImageNet. The regular dataset was extended with a large, unlabeled dataset on which the teacher inferred pseudo-labels.

Another method that uses a similar teacher-student setup is Mean Teacher [13]. Both labeled and unlabeled samples are used by combining a supervised loss with an unsupervised loss consisting of the mean squared error between the predictions of the teacher and student networks. The authors draw on previous self-ensembling methods [28]. Self-ensembling builds on the idea that an average of several predictions is better than a single one, but reduces the ensemble of networks to variants of one network. Temporal Ensembling averages the predictions made at different points of the training process to give a better estimate [28]. Mean Teacher extends this to two models, where the teacher is an average of the student at different points of the training process.
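The two ingredients of Mean Teacher can be sketched as follows, assuming that the averaging of the student is an exponential moving average of its weights. The function names, the flat weight lists and the decay values are illustrative assumptions; a real implementation would operate on the parameter tensors of a network.

```python
def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step towards the student weight."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher predictions."""
    n_classes = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n_classes


# Made-up weights; a low decay is used so the effect is visible in one step.
teacher = ema_update([0.0, 1.0], [1.0, 0.0], decay=0.5)

# Made-up predictions of both networks on the same unlabeled sample.
loss = consistency_loss([0.9, 0.1], [0.7, 0.3])
```

Only the student is updated by gradient descent; the teacher changes solely through the averaging step, which is what makes its predictions more stable targets.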


Chapter 3 Background

This chapter deals with some background information about the techniques used in this thesis. First, two families of traditional machine learning methods are presented in 3.1 and 3.2. Then some background on deep learning networks (section 3.3), data augmentation (section 3.4) and semi-supervised learning (section 3.5) is given.

3.1 Decision Trees

Decision trees are popular machine learning tools that use a tree structure to perform classification or regression. They are simple and quite effective in many tasks [1]. Decision trees make binary splits, going one way or the other based on the value of a certain feature, which creates a branching structure.

At the end of the branches are the leaves, where the model makes a classification. For example, a simple decision tree could take the age of a person as the feature. If the age is over 18, the tree classifies the person as an adult; otherwise, it classifies the person as a child. One advantage is that decision trees are white box models (in contrast to black box models), so the user can understand why a decision was made. In this thesis two decision tree models, Random Forest and XGBoost, were used as baseline methods to compare the deep models against.

3.1.1 Random Forest

Random Forest is a decision tree model that uses ensemble learning, in the form of bagging, to reduce the overfitting to training data that is a problem for regular decision trees. Ensemble learning means that several classifiers are used instead of one; the average result of all of them is usually better than the result from a single one. This increases the bias of the model slightly, but performs much better than standard decision trees. In addition to regular bagging, which draws a random set of samples with replacement for each tree in the ensemble, Random Forest uses attribute bagging: random subsets of the features are chosen when training each tree, to reduce the correlation between the trees in the model.
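The two sampling steps can be sketched as below. The example only draws the row and feature indices each tree would be trained on and does not build any trees; the function names and sizes are hypothetical. For simplicity one feature subset is drawn per tree, whereas common implementations draw a fresh subset at every split.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Regular bagging: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def feature_subset(n_features, subset_size, rng):
    """Attribute bagging: draw feature indices without replacement."""
    return sorted(rng.sample(range(n_features), subset_size))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples, n_features, n_trees = 6, 10, 3

# One sampling plan per tree in the ensemble.
plan = [
    {
        "rows": bootstrap_indices(n_samples, rng),
        "features": feature_subset(n_features, 3, rng),
    }
    for _ in range(n_trees)
]
```

Because each tree sees different rows and different features, the errors of the individual trees are less correlated and tend to average out.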

3.1.2 XGBoost

XGBoost is a decision tree model using gradient boosting. It has been applied in many machine learning competitions with great results and is thus very popular. Gradient boosting uses an ensemble of several weak classiﬁers, like Random Forest, but improves the model at each step by observing the residual error between the prediction and the true class. Thus, XGBoost has a loss function which can be used to improve the predictor.

3.2 Support Vector Machine

Support vector machines (SVMs) are a staple among machine learning tools for prediction. An SVM is a non-probabilistic binary classifier, choosing the class of a sample based on the value of the decision function. SVMs fit a decision boundary to the dataset by maximizing the margin, which is the distance between the boundary and the support vectors. The support vectors are the samples closest to the boundary on the positive and negative sides (classes $+1$ and $-1$). The decision function that defines the boundary can be expressed as

$$f(x) = w \cdot x + b, \tag{3.1}$$

and the sign of this function decides the class. The original optimization problem separates the two classes by as wide a margin as possible. This is not always possible, since the samples might be too mixed, which gives a non-separable problem. Slack variables can therefore be introduced in the objective function so that the problem can still be solved. The cost function to minimize thus becomes,

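$$\min_{w,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \tag{3.2a}$$

$$\text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1,\dots,N \tag{3.2b}$$

$$\xi_i \ge 0, \quad i = 1,\dots,N. \tag{3.2c}$$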

This takes care of non-linearly separable cases.

Another way to handle non-linear cases is to introduce a non-linear decision function. The classification does not need to be linear; the samples can be mapped to another feature space, and kernels such as a Radial Basis Function can be used. This would normally introduce many more calculations, but the kernel trick can be used for faster computations. To summarize the kernel trick, for certain feature mappings the dual problem only involves the inner product of the samples in the transformed space, and this inner product can be computed directly with a kernel function K without carrying out the mapping explicitly. Thus the dual problem to solve instead becomes,

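$$\max_{\alpha} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \tag{3.3a}$$

$$\text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0 \tag{3.3b}$$

$$0 \le \alpha_i \le C, \quad i = 1,\dots,N. \tag{3.3c}$$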

Since the classification is binary, multiclass classification has to be done by constructing several classifiers. Several SVMs can be created and treated in either a one-against-one (OAO) or a one-against-all (OAA) scheme. With M classes, the OAO setup creates M(M − 1)/2 SVMs, each of which decides which of two classes the sample belongs to. Here the winner takes it all: the class that receives the most votes from all the SVMs is the predicted class. The OAA scheme has M SVMs, each of which decides whether the sample belongs to one class or not. The SVM that outputs the highest confidence in its class, i.e. the decision function with the highest value, gives the final prediction.
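The two decision rules can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this thesis; the class names, the pairwise winners and the decision values are made-up placeholders standing in for the outputs of already trained binary SVMs.

```python
from collections import Counter
from itertools import combinations


def predict_oao(pairwise_winners):
    """One-against-one: each binary SVM votes for one class; most votes wins."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]


def predict_oaa(decision_values):
    """One-against-all: pick the class whose SVM has the largest decision value."""
    return max(decision_values, key=decision_values.get)


classes = ["soil", "grass", "asphalt"]      # hypothetical class names, M = 3
pairs = list(combinations(classes, 2))      # M(M - 1)/2 = 3 binary problems

# Hypothetical outputs of the trained binary classifiers for a single sample.
pairwise_winners = {("soil", "grass"): "grass",
                    ("soil", "asphalt"): "soil",
                    ("grass", "asphalt"): "grass"}
decision_values = {"soil": -0.4, "grass": 1.3, "asphalt": -1.1}

print(len(pairs), predict_oao(pairwise_winners), predict_oaa(decision_values))
```

In this toy case both schemes agree on the same class, but they can disagree in general since OAO counts votes while OAA compares decision values.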

3.2.1 Transductive SVM

Transductive support vector machines (TSVMs) are an extension of the traditional SVM to include unlabeled samples. This semi-supervised approach uses a transduction method where unlabeled samples are given pseudo-labels and then used in an iterative training fashion. Another slack constraint is added for the unlabeled samples, which have been given a pseudo-label based on the current SVM's prediction.

Bruzzone et al. [29] implemented a version for remote sensing problems which includes a more extensive way to choose the unlabeled samples that are given a label. The method takes a fixed amount of samples from the unlabeled data, based on either the initial support vectors or a threshold, and gives them pseudo-labels. The samples that give the most information, i.e. those closest to the margins, are used to re-train the SVMs. Iteratively, more samples from the unlabeled data are taken and used to train the SVMs. When the boundary has changed, the pseudo-labels of these samples can also change.

3.2.2 Quasi-Newton Semi-Supervised SVM

Quasi-Newton Semi-Supervised SVM (QN-S3VM) uses a quasi-Newton optimization solver to more easily solve the difficult optimization problem that a traditional S3VM poses. Here the unlabeled samples are taken into the optimization problem, and the optimization is done over both the decision function and the classification of the unlabeled samples. This can be formulated as a non-convex and non-differentiable optimization problem, which is difficult to solve. With a quasi-Newton optimization framework, the only thing that needs to be supplied is the gradient. The loss in the original problem can be substituted with a differentiable surrogate, whose gradient is then used to solve the problem.

Gieseke et al. [30] provide a framework to solve this non-convex problem with computational accelerations for both sparse and non-sparse data.

3.3 Convolutional Neural Networks

Several neural networks and methods exist for deep learning. Convolutional neural networks (CNNs) take inspiration from the way the visual cortex works.

Instead of using whole layers of weights that take the shape of the data, kernels are used by convolving them with the input data. This is the main difference between CNNs and regular neural networks. Fewer weights are needed when using several kernels and striding over the input. The networks can be made much deeper without having too many weights that need to be fitted. The architecture of a typical CNN can be seen in Fig. 3.1.

Figure 3.1: Spectral classifier based on a CNN for HSI classification [1]

These types of networks have been experimented with for a long time. LeCun et al. [31] applied gradients with back-propagation to learn handwritten numbers, thus making the learning process automatic. This is the basis for most modern computer vision using deep learning. The method was constrained by the computational power of the time and was restricted to low-resolution, simple images. With the rise of cheaper and more available computational resources, as well as the ability to use GPUs to accelerate learning, more capable methods were developed. For example, AlexNet was developed with more layers than before, but accelerated by training with GPUs [32].

3.4 Data Augmentation

Deep learning relies on large amounts of data for training. This large amount is needed to avoid overfitting [33]. In many applications and areas it is both difficult and expensive to gather large amounts of data.

In regular image classification, one of the techniques used to reduce the overfitting problem is data augmentation. It attacks the problem at its root by increasing the number of samples through augmentation. This means that existing samples are changed in different ways to create new samples. For example, for regular images a deep learning network has to overcome differences in lighting conditions, viewpoint and more. Data augmentation techniques can thus distort and change the available samples so that they show other lighting conditions by making the image brighter, or another viewpoint by flipping the image. Thus, data augmentation creates more samples, leading to a larger dataset, and also supplies the training set with different conditions that can be useful to learn.

3.5 Semi-Supervised Learning

Semi-supervised learning is a mixture of supervised and unsupervised learning. It enables the use of both labeled and unlabeled samples for training to better utilize all of the data available. See Fig. 3.2 for a basic example where unlabeled samples can help reach the optimal decision boundary.

Figure 3.2: Basic example of binary classiﬁcation in presence of unlabeled samples (dots). Using only labeled samples (cross and triangle) the supervised decision boundary is not optimal. Using all samples the optimal decision boundary could be found [21].

The data can thus be divided into two parts. Let $X = \{(x_b, p_b) : b \in (1, \dots, B)\}$ be a batch of $B$ samples, where $x_b$ are the training samples and $p_b$ the labels, in this case one-hot encoded. Let $U = \{u_b : b \in (1, \dots, \mu B)\}$ be a batch of $\mu B$ unlabeled samples, where $\mu$ is a hyperparameter that sets the relative size of the unlabeled data. The aim of a semi-supervised system is then to infer the predicted class distribution $p_m(y \mid x)$ based on both $X$ and $U$.

To do this some assumptions have to be made. Two common assumptions in semi-supervised learning are the smoothness and cluster assumptions [11]. The smoothness assumption is also present in supervised learning, stating that if two points x1, x2 are close, then so should their corresponding outputs y1, y2 be. This is extended for semi-supervised learning by factoring in the density of the data points. The semi-supervised smoothness assumption states that the prediction function is smoother in high-density regions than in low-density regions:

Semi-supervised smoothness assumption: If two points x1, x2 in a high-density region are close, then so should be the corresponding outputs y1, y2.

The cluster assumption is similar but instead treats the class of the data.

It is an early form of assumptions made for semi-supervised learning and it states:

Cluster assumption: If points are in the same cluster, they are likely to be of the same class.

These assumptions are necessary for semi-supervised algorithms to function. Labeled and unlabeled data have to be connected, and the smoothness and cluster assumptions make it possible to use both labeled and unlabeled data in semi-supervised methods. Decision boundaries can be assumed to go through low-density regions, since dividing a high-density region (a cluster) would violate both the smoothness and the cluster assumption, see for example Fig. 3.3. Different methods use these assumptions differently, but the assumptions underlie all of them. Three important approaches to semi-supervised learning are presented here:

Figure 3.3: Illustration of the semi-supervised smoothness assumption. The cluster assumption is also present in the form of the low-density region where the decision boundary should pass [21].

• Consistency regularization - where predictions on one sample distorted in diﬀerent ways should be the same.

• Pseudo-labeling - where a conﬁdent prediction on a sample can generate a label to be used for further training.

• Self-ensembling - combining subsets of a network to reach an ensemble prediction.

3.5.1 Consistency Regularization

The main idea of consistency regularization is that the predictions on a sample distorted in two different ways should be the same. It was presented by Sajjadi et al. [34], who combine a regular supervised loss function with an unsupervised loss function on the unlabeled samples:

$$\sum_{b=1}^{\mu B} \left\| p_m(y \mid \alpha(u_b)) - p_m(y \mid \alpha(u_b)) \right\|^2 \tag{3.4}$$

where $\alpha(\cdot)$ is a weak augmentation. Since the augmentation is stochastic, as is the prediction, the two values of $p_m(y \mid \alpha(u_b))$ will not be the same. Since this method relies on the distortion of samples, data augmentation plays a large role. Data augmentation distorts, or augments, the sample in ways that increase the robustness of a model. It also makes it possible to use unlabeled samples. A label is not needed, since the main idea is only that the predictions should be the same. With the use of labeled data to "kick-start" the model, it is possible to make predictions on unlabeled samples and minimize the difference between these predictions on the same sample.
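The loss in Eq. 3.4 can be illustrated with a small sketch. It is written in plain Python under stated assumptions: `weak_augment` is a stand-in augmentation (small multiplicative noise per band) and `toy_model` is a hypothetical two-class predictor; neither is taken from Sajjadi et al. [34] or from the networks used later in this thesis.

```python
import random


def weak_augment(spectrum, rng, scale=0.02):
    """Stand-in weak augmentation: small multiplicative noise per band."""
    return [v * (1.0 + rng.uniform(-scale, scale)) for v in spectrum]


def consistency_loss(model, unlabeled_batch, rng):
    """Sum over the batch of squared L2 distances between two predictions."""
    total = 0.0
    for u in unlabeled_batch:
        p1 = model(weak_augment(u, rng))   # first stochastic view
        p2 = model(weak_augment(u, rng))   # second stochastic view
        total += sum((a - b) ** 2 for a, b in zip(p1, p2))
    return total


def toy_model(spectrum):
    """Hypothetical two-class predictor based on the mean band value."""
    mean = sum(spectrum) / len(spectrum)
    p = min(max(mean, 0.0), 1.0)
    return [p, 1.0 - p]


rng = random.Random(0)
batch = [[0.2, 0.4, 0.6], [0.7, 0.8, 0.9]]
loss = consistency_loss(toy_model, batch, rng)
print(loss)
```

No labels enter the computation, which is what allows the loss to be evaluated on unlabeled samples.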

3.5.2 Pseudo-Labeling

Pseudo-labeling relies on the idea that if we have a predictor, we can use this model to predict the label of unlabeled samples. Thus we can get artiﬁcial, or pseudo-, labels on the unlabeled dataset. This idea was introduced long ago by Scudder [35], but recently it was re-introduced for image classiﬁcation [36].

The predictor is given an unlabeled sample, and if the prediction, $q_b = p_m(y \mid u_b)$, is over a certain threshold it is taken as the "truth". The loss function with cross-entropy loss $H$ can be stated as:

$$\frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{1}(\max(q_b) > \tau)\, H(\hat{q}_b, q_b) \tag{3.5}$$

where $\hat{q}_b = \arg\max(q_b)$ and $\tau$ is the threshold hyperparameter.
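As a worked illustration of Eq. 3.5, the sketch below evaluates the thresholded pseudo-label loss on two made-up predictions. The threshold value 0.95 and the prediction vectors are assumptions chosen for the example only.

```python
import math


def pseudo_label_loss(predictions, tau=0.95):
    """Mean over the batch of 1[max(q_b) > tau] * H(argmax(q_b), q_b)."""
    total = 0.0
    for q in predictions:
        confidence = max(q)
        if confidence > tau:
            # Cross-entropy against a one-hot label reduces to -log of the
            # probability assigned to the pseudo-labelled class.
            total += -math.log(confidence)
    return total / len(predictions)


preds = [[0.97, 0.02, 0.01],   # confident: contributes -log(0.97)
         [0.50, 0.30, 0.20]]   # below the threshold: masked out
loss = pseudo_label_loss(preds, tau=0.95)
print(loss)
```

Only the first prediction passes the threshold, so the loss equals -log(0.97) divided by the batch size of two.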

3.5.3 Self-Ensembling

The ensemble approach works with the idea that multiple averaged predictions give a better result than a single one. The idea dates back a long time; recent implementations and extensions to self-ensembling are the Π and Temporal Ensembling models [28]. This recent approach builds upon other works, like dropout [37], where the ensemble can be seen as multiple subsets of one network instead of many networks. These ideas are combined with consistency regularization and pseudo-labeling.

Self-ensembling is similar to pseudo-labeling in that it uses a network to supply a label that can be used. But instead of applying a threshold and giving a one-hot label, a soft label is used. The prediction is used in full, the reasoning being that it gives more information than a hard, one-hot label. Temporal Ensembling uses the network at different stages of the training period as the ensemble. The predictions made at different time steps are averaged to form the final output.


Chapter 4 Method

4.1 Data

The data structure of HSI is one of the things that makes classification a hard problem to solve. The dimensionality is high and labeled data is generally scarce. The main benchmark datasets consist of only one image each. Thus each dataset contains very little data, and even fewer samples that are labeled.

4.1.1 Validation

Nalepa et al. [38] explain a problem that many current researchers in the HSI area face when validating new methods: many results are overly optimistic.

Methods that consider a patch larger than one pixel, typically many CNN methods, can use the spatial relations, which is beneficial. Many sampling methods randomly sample training, validation and testing sets from the same image, so a training pixel might be right next to a testing pixel. If a method considers a patch larger than one pixel, large parts of the testing set might be included in the patches that are used during training, see Fig. 4.1. This information leak gives the trained networks information about the test set, resulting in overly optimistic results. The authors show that this is the case for recent research and introduce a sampling scheme to reduce these validation problems.

Three prominent datasets, among them Salinas and Pavia University, are also divided and presented for use to obtain better validation results. Several disjoint patches are extracted in different folds to create a dataset suitable for cross-validation. By extracting disjoint patches, no data that is available for testing can be seen during training. One thing to note is that not all classes are represented in each training fold. The authors justify this by saying that this scenario is somewhat realistic and that in real-life scenarios some rare classes might not show up in the training data. This artefact can be taken into account when analyzing the results by knowing that in some folds some classes are missing.

Figure 4.1: Training (ti) and test (φi) pixels randomly drawn might be spatially close in the dataset, resulting in an information leak (the red-shadowed pixels) [38].

4.1.2 Datasets

The two datasets that are used in this thesis are Salinas and Pavia University. More detail about the datasets is given below.

• Salinas Valley: This dataset was captured over the Salinas Valley in California, USA. The dataset consists of one 512x217 pixel image with a spatial resolution of 3.7 meters. It was captured by the AVIRIS sensor and contains 224 spectral bands (20 bands are dominated by water absorption and are removed beforehand, i.e. the dataset used has 204 spectral bands). The image is over a rural area with 16 different classes of vegetation. The dataset presented by Nalepa et al. contains five folds with patches of size 22x10 pixels, see Fig. 4.3.

• Pavia University: This dataset was captured over Pavia University in Lombardy, Italy. The dataset consists of one 610x340 pixel image with a spatial resolution of 1.3 meters. It was captured by the ROSIS sensor and contains 115 spectral bands (12 bands are dominated by water absorption and are removed beforehand, i.e. the dataset that is used has 103 spectral bands). The image is over an urban area with 9 different classes of ground material. See Fig. 4.2 for the image and ground truth. The dataset presented by Nalepa et al. contains five folds with patches of size 30x17 pixels.

Figure 4.2: Pavia University dataset. a) Three-band false colour composite with the hypercube. b) Ground truth. c) Legend of classes [1].

The AVIRIS sensor captures wavelengths in a span of 400 to 2450 nm with 10 nm bandwidth. The ROSIS sensor captures wavelengths in a span of 430 to 850 nm with 4 nm bandwidth.

When sampling data for training, two different schemes are used. To test the capabilities of the different methods with a low amount of labeled data, one scheme samples a maximum of 40 samples per class and leaves the rest as unlabeled data. The other scheme uses all the available labeled samples as labeled data and the samples that have no label as unlabeled data.

When testing the models, each is trained and tested twice on each fold, and all results are averaged to get a good performance index. This gives a robust result when testing and comparing the methods, since different folds can give very different results.

Moreover, the Pavia University dataset has an almost identical twin dataset. This dataset is called Pavia Centre; it was captured by the same sensor over a similar area and contains the same classes. The difference is that one more band was removed during processing, so it only has 102 spectral bands.

Figure 4.3: Salinas dataset with the 5 folds of the validation split from Nalepa et al. [38]. a) The RGB color composition of the dataset. b) Ground truth of the dataset. c)-g) The five folds of the validation split; the black rectangles are the training set while the rest compose the test set.

Figure 4.4: Pavia dataset with the 5 folds of the validation split from Nalepa et al. [38]. a) The RGB color composition of the dataset. b) Ground truth of the dataset. c)-g) The five folds of the validation split; the black rectangles are the training set while the rest compose the test set.

Because of the similarity, this dataset can be used as extra unlabeled data. It has been shown that datasets that differ much more than the two Pavia versions can improve performance when used as unlabeled data for semi-supervised training [18]. Thus, the Pavia Centre dataset can be used as unlabeled data for training on the Pavia University dataset. To make the dimensions match, the last spectral band is duplicated to increase the number of spectral bands to 103. This dataset is larger than the others, with a size of 1096x715 pixels, resulting in a total of 783640 extra pixels.
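The band duplication can be sketched as below. This is only an illustration on a single placeholder spectrum stored as a Python list; the function name `pad_last_band` is hypothetical and the actual data loading in this work is not shown.

```python
def pad_last_band(pixel, target_bands):
    """Repeat the last spectral value until the pixel has target_bands values."""
    if len(pixel) > target_bands:
        raise ValueError("pixel already has more bands than the target")
    return pixel + [pixel[-1]] * (target_bands - len(pixel))


centre_pixel = [float(i) for i in range(102)]   # placeholder 102-band spectrum
padded = pad_last_band(centre_pixel, 103)

print(len(padded), padded[-2:])                 # 103 [101.0, 101.0]
print(1096 * 715)                               # 783640 pixels in Pavia Centre
```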

The amount of labeled and unlabeled data will differ for the sampling schemes and datasets. It will also differ from fold to fold because of the validation method used. The resulting split between labeled and unlabeled data is important for the methods and is shown in Tab. 4.1.

| Dataset   | Fixed: labeled | Fixed: unlabeled | All: labeled | All: unlabeled |
|-----------|----------------|------------------|--------------|----------------|
| Salinas   | 451            | 4451             | 4656         | 246            |
| Pavia     | 240            | 2505             | 2607         | 138            |
| Pavia U+C | 240            | 786145           | 2607         | 783778         |

Table 4.1: Amount of labeled and unlabeled samples for the two sampling schemes (Fixed and All). The numbers vary between folds; these are for the first fold of both datasets.

4.2 DeepHyperX

Audebert, Saux and Lefèvre [39] present a software toolbox for experimentation with different methods for classification of HSI. This toolbox is used as a foundation in this work and customized to better suit the needs of this thesis. It is based on Python 3.5+ and PyTorch.

4.3 Data augmentation

Data augmentation is the process of changing the data by some transformation that keeps the core features intact while creating a new sample. It is mainly used in two cases. The most common case is to increase the amount of data available by augmenting it. Examples include flipping an image or adding Gaussian blur. These examples keep the samples similar to the original so that they stay in the same distribution, but give a slightly new view of them that can be used during training. The second case where data augmentation can be used is consistency regularization. It relies on the assumption that the predictions on augmented versions of a sample should be the same.

Consistency regularization was first proposed by Sajjadi et al. [34] as an unsupervised loss function that reduces the difference between the predictions on two augmented samples.

Many augmentation techniques rely on stochastic processes where different augmentation transformations are used randomly, to give a larger pool of possible samples. Because of the stochastic nature, one sample that is augmented two times can be used as input to the loss function. That is, one sample can be augmented by the same function two times; since the augmentation is stochastic this will generate two different outputs which can be used in the loss function.

Data augmentation has generally not been used much in the HSI space, as noted by Nalepa et al. [20], and several common data augmentation techniques that have been used in other works are not always applicable to HSI since the data differs. Thus, these augmentation techniques have to be modified and tested to see if they work for HSI as well. In works like FixMatch, which uses data augmentation for consistency regularization [12], automated augmentation methods like RandAugment [40] and CutOut [41] are used. RandAugment uses several off-the-shelf augmentation techniques for regular images. These include methods like solarization and histogram equalization. These are not directly applicable to HSI, and other methods have to be tested instead. The literature that exists on data augmentation techniques for HSI mostly relies on having the labels present for each sample. Recent works are from Wang et al. [42], inspired by MixUp [22], and from Slavkovikj et al. [43] and Li et al. [44], who create new augmentation techniques which rely on labels.

To perform augmentations on unlabeled samples other techniques have to be used, similar to the ones from RandAugment and related methods. New methods and adaptations of existing methods for regular images were instead applied to see their effect. They were tested by applying each data augmentation with a probability of 50% to each data sample drawn during training and recording the resulting accuracy under cross-validation. The network trained used 5x5 patches of data as input, and it was these patches that were augmented online during training.

4.3.1 CutOut

To test how the CutOut method would work for HSI, two tests were conducted. The approach is very similar to tests carried out under the name random occlusion data augmentation by Haut et al. [45]. The authors focused on larger patches, even though they stated that they used smaller ones than others. One part of their success could therefore still be attributed to information leak from testing pixels being present during random sampling of the training set. Looking past this, their results are still useful and show that CutOut, or random occlusion, could be beneficial for HSI classification as a data augmentation technique.

The methods used in this thesis use smaller patches than the ones by Haut et al. [45], so the technique needs to be verified to work in this setting.

Setting certain pixels or parts of the spectral band to zero was investigated as an alternative in the HSI space. Removing patches of pixels, as in the original CutOut method, worked only when individual pixels were removed. Removing larger patches when only observing a small 5x5 patch gave bad results. Focusing instead on the spectral domain, randomly setting a band of spectral values to zero for each pixel was tested. The number of values in each band and the number of bands set to zero were varied. Removing spectral bands gave poor results and did not work as an augmentation technique. Thus, only a CutOut method which randomly removes one pixel of the 5x5 patch was implemented.
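A minimal sketch of the single-pixel CutOut on a 5x5 patch is given below. It assumes the patch is stored as nested lists indexed as `patch[row][col][band]` and that any pixel, including the centre one, may be removed; both are assumptions made for the example rather than details confirmed above.

```python
import random


def cutout_one_pixel(patch, rng):
    """Return a copy of the patch with all bands of one random pixel set to zero."""
    out = [[list(pixel) for pixel in row] for row in patch]
    r = rng.randrange(len(out))
    c = rng.randrange(len(out[0]))
    out[r][c] = [0.0] * len(out[r][c])
    return out


rng = random.Random(0)
patch = [[[1.0] * 4 for _ in range(5)] for _ in range(5)]   # 5x5 pixels, 4 bands
augmented = cutout_one_pixel(patch, rng)

zeroed = sum(1 for row in augmented for px in row if all(v == 0.0 for v in px))
print(zeroed)   # 1
```

The input patch is copied first so that the original sample stays available for other augmentations.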

4.3.2 Spatial combinations

Since labels are not present, but there is a quite strong spatial correlation between samples, the linear combination of samples from the same class can be adapted into a spatial combination of nearby samples. In this case samples from the direct neighbourhood were combined with random weights drawn from a uniform distribution as,

$$y_{\text{new}} = \frac{1}{\sum_i \alpha_i}\sum_i \alpha_i y_i, \tag{4.1}$$

with all samples $y_i$ drawn from the direct neighbourhood around the sample to be augmented and $\alpha_i \in (0, 1)$. Samples that had no data were not included in the augmentation.
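Eq. 4.1 can be sketched as a normalized weighted average over neighbouring spectra. In this illustration each spectrum is a Python list and a pixel without data is represented by `None`; that representation and the function name are assumptions made for the example.

```python
import random


def spatial_combination(neighbours, rng):
    """Weighted average of neighbouring spectra with weights drawn from U(0, 1)."""
    valid = [s for s in neighbours if s is not None]    # skip pixels with no data
    weights = [rng.uniform(0.0, 1.0) for _ in valid]
    total = sum(weights)
    n_bands = len(valid[0])
    return [sum(w * s[k] for w, s in zip(weights, valid)) / total
            for k in range(n_bands)]


rng = random.Random(0)
neighbourhood = [[1.0, 2.0, 3.0], None, [3.0, 4.0, 5.0]]   # None marks a no-data pixel
y_new = spatial_combination(neighbourhood, rng)
print(y_new)
```

Because the weights are normalized by their sum, every band of the new sample lies between the smallest and largest value of that band in the neighbourhood.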

4.3.3 Band combinations

Instead of combining whole spectra, and thus taking a sort of average of them, another method was also tested. It consists of splicing together different parts of the spectra of nearby pixels. With the neighbourhood size and the length of the slices to splice as parameters, this augmentation technique keeps the original spectrum values, but combines different parts from different pixels to form a new sample.

4.3.4 Spectral averaging

Instead of focusing on the spatial domain, the spectral domain can also be distorted to get new samples. Common methods here include adding noise to the spectral dimension by multiplying the spectrum with a noise constant and adding a constant term. Extending this idea and incorporating other ideas like posterization, where the number of bits per image is reduced, two other data augmentation methods were tested. First, a sort of moving average scheme was implemented. For each pixel, each value of the spectrum is averaged with its closest values. Second, a method was implemented where the average value of a range of the spectrum is set as the value for all the values in that range.
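The two spectral averaging methods can be sketched on a single spectrum as follows. The window size, the slice width and the handling of the spectrum edges (shrinking the window) are assumptions for the example; the text above does not specify them.

```python
def moving_average(spectrum, window=3):
    """Replace each band value by the mean of the values within the window."""
    half = window // 2
    out = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out


def spectral_mean(spectrum, width=2):
    """Set every value in each consecutive slice of bands to the slice mean."""
    out = []
    for start in range(0, len(spectrum), width):
        chunk = spectrum[start:start + width]
        out.extend([sum(chunk) / len(chunk)] * len(chunk))
    return out


spectrum = [1.0, 3.0, 5.0, 7.0]
smoothed = moving_average(spectrum)    # [2.0, 3.0, 5.0, 6.0]
pixelated = spectral_mean(spectrum)    # [2.0, 2.0, 6.0, 6.0]
print(smoothed, pixelated)
```

The first function smooths the spectral curve, while the second produces a step-like curve, comparable to pixelating an image along the spectral axis.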

4.3.5 Spectral shift

Another augmentation technique tested for HSI was to shift the spectrum. The augmentation shifts the spectrum values along the spectral axis by an amount decided by the magnitude of the augmentation.

This should give robustness to errors in assigning values to the right band. Since the spectrum is reasonably continuous, a small perturbation should also not disturb the output too much.
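A small sketch of a forward spectral shift is shown below. How the vacated bands at the start of the spectrum are filled is not stated above; repeating the first value is an assumption made here so that the spectrum keeps its length.

```python
def spectral_shift(spectrum, magnitude):
    """Shift values forward along the spectral axis, repeating the first value."""
    if magnitude <= 0:
        return list(spectrum)
    magnitude = min(magnitude, len(spectrum))
    return [spectrum[0]] * magnitude + list(spectrum[:len(spectrum) - magnitude])


shifted = spectral_shift([1.0, 2.0, 3.0, 4.0, 5.0], 2)
print(shifted)   # [1.0, 1.0, 1.0, 2.0, 3.0]
```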

4.3.6 Random Augmentation

To apply the augmentation methods presented, a random augmentation scheme like RandAugment by Cubuk et al. [40] was implemented. The automated augmentation strategy randomly samples a fixed number of augmentation techniques, n, from a chosen pool, each with a random magnitude, m. At each draw of a sample where the augmentations are called, there is a 50% chance that an augmentation is used. Before this augmentation each sample is flipped randomly if the patch is larger than one pixel, and after the augmentation CutOut is applied in accordance with the original implementation [40]. The suggested methods presented above were used to create different augmentation pools for each case. This selection was based on what previous research suggested as optimal. The total pool of augmentations to choose from is listed in Tab. 4.2.
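The sampling logic of such a scheme can be sketched as below. This is a simplified illustration, not the implementation used here: the random flip and the final CutOut step are left out, the pool contains only two hypothetical operations (`identity` and `gain`), and the magnitude is drawn as an integer between 1 and `max_magnitude`.

```python
import random


def identity(spectrum, magnitude):
    """Pool entry that leaves the sample unchanged."""
    return list(spectrum)


def gain(spectrum, magnitude):
    """Hypothetical pool entry: scale all bands by 1% per magnitude step."""
    return [v * (1.0 + 0.01 * magnitude) for v in spectrum]


def random_augment(spectrum, pool, n, max_magnitude, rng, p=0.5):
    """With probability p, apply n randomly chosen ops with random magnitudes."""
    if rng.random() >= p:
        return list(spectrum)               # no augmentation for this draw
    out = list(spectrum)
    for op in rng.choices(pool, k=n):       # sample n ops with replacement
        out = op(out, rng.randint(1, max_magnitude))
    return out


rng = random.Random(0)
sample = [0.2, 0.4, 0.6]
augmented = random_augment(sample, [identity, gain], n=2, max_magnitude=10, rng=rng)
print(augmented)
```

Setting `p=0.5` mirrors the 50% chance described above; the other parameter values are placeholders.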

4.4 Traditional Methods

Traditional methods should always be tested alongside newer methods to give a good baseline performance to compare against. Common methods applied to HSI classification in other works are, for example, SVMs and Random Forest. XGBoost is another popular decision tree method akin to Random Forest. These three methods are applied to the classification problem as supervised baselines. To compare against more than supervised methods, two semi-supervised SVM methods are also applied.

| Method              | Description                                                           |
|---------------------|-----------------------------------------------------------------------|
| Radiation Noise     | Gaussian noise with a small bias                                      |
| Spatial Combination | Combination of spatially close pixels                                 |
| Spectral Mean       | "Pixelating" the spectral curve by setting slices to their mean value |
| Moving Average      | Averaging the spectral values by their closest neighbours             |
| Spectral Shift      | Shifting the spectral values by a set magnitude forward               |
| Band Combination    | Splicing together close pixels                                        |
| Identity            | No change to the sample                                               |

Table 4.2: Total available augmentation pool.

4.4.1 Supervised

Using the sklearn library, both a traditional SVM and a Random Forest were implemented. The SVM was created with sklearn.svm.SVC, and a one-versus-all decision strategy is applied to handle the multi-class problem. Since the problem can be non-linear, a radial basis function kernel was used. The hyperparameters were found by a small grid search. Similarly, the hyperparameters for Random Forest were found by searching a small space.

The XGBoost classifier was implemented using the XGBoost library. Based on the finer grid search for good parameters made by Samat et al. [46], a rough grid search was made for this application. The dart booster was chosen as it proved the best, and similar methods have been presented in other works [1].

4.4.2 Semi-Supervised

The two semi-supervised methods used were a TSVM and a QN-S3VM. The TSVM follows Bruzzone et al. [29] and was implemented using the same SVM instance as the supervised version, except that the weights were extended to differ between labeled and unlabeled samples. The QN-S3VM uses the Python implementation supplied by Gieseke et al. [30], adapted for a multi-class problem. The OAA classification method is used by creating as many classifiers as there are classes with training samples. This is the same as for the TSVM.


4.5 Semi-Supervised Scheme

Enabling the use of unlabeled samples in HSI is paramount to increase the available data that can be used for training deep networks. Semi-supervised schemes can help utilize all the data to increase performance. In this thesis the semi-supervised FixMatch and Mean Teacher will be used to test the eﬀect of semi-supervised schemes on the HSI classiﬁcation problem. FixMatch utilizes pseudo-labeling with consistency regularization to enable the use of unlabeled data, while Mean Teacher uses a self-ensembling approach with consistency regularization.

4.5.1 FixMatch

FixMatch utilizes two loss functions, one for the labeled data and one for the unlabeled data [12]. It utilizes consistency regularization and pseudo-labeling.

The supervised loss is a regular cross-entropy loss function on weakly augmented labeled data:

$$l_s = \frac{1}{B}\sum_{b=1}^{B} H(p_b, p_m(y \mid \alpha(x_b))). \tag{4.2}$$

The unlabeled loss function uses two versions of the unlabeled sample, one that is strongly augmented ($A(\cdot)$) and one that is weakly augmented ($\alpha(\cdot)$).

The weakly augmented sample is used to make a prediction of the label: $q_b = p_m(y \mid \alpha(u_b))$. Only a one-hot encoding of the label is considered, so the most probable label, $\hat{q}_b = \arg\max(q_b)$, is taken as the pseudo-label. The loss is then constructed as a cross-entropy loss with the prediction on the strongly augmented sample:

$$l_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{1}(\max(q_b) \ge \tau)\, H(\hat{q}_b, p_m(y \mid A(u_b))) \tag{4.3}$$

where $\tau$ is the same threshold hyperparameter as before, which decides whether a pseudo-label is kept or not. The final loss is then the sum of the two functions, with the unlabeled loss weighted, as $l_s + \lambda l_u$. $\lambda$ is kept at one in the experiments based on initial tests. Fig. 4.5 shows an unlabeled sample; it is weakly augmented in the upper row and the prediction is made. The prediction becomes the pseudo-label if it is over the dotted threshold. The bottom row shows how the sample is strongly augmented and the loss is calculated as the cross-entropy between the prediction on this strongly augmented version and the pseudo-label.
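The combination of Eqs. 4.2 and 4.3 can be illustrated with a short sketch that takes model outputs as given. It is plain Python rather than the PyTorch code used in this work; the labels are passed as class indices (equivalent to one-hot labels under cross-entropy), and all prediction vectors and the threshold 0.95 are made-up example values.

```python
import math


def cross_entropy(target_index, probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -math.log(probs[target_index])


def fixmatch_loss(labels, weak_labeled_preds, weak_unlabeled_preds,
                  strong_unlabeled_preds, tau=0.95, lam=1.0):
    """l_s + lam * l_u, given predictions on the augmented samples."""
    l_s = sum(cross_entropy(y, p)
              for y, p in zip(labels, weak_labeled_preds)) / len(labels)
    l_u = 0.0
    for q, p_strong in zip(weak_unlabeled_preds, strong_unlabeled_preds):
        if max(q) >= tau:
            pseudo = q.index(max(q))        # hard pseudo-label from the weak view
            l_u += cross_entropy(pseudo, p_strong)
    l_u /= len(weak_unlabeled_preds)
    return l_s + lam * l_u


loss = fixmatch_loss(
    labels=[0],
    weak_labeled_preds=[[0.8, 0.2]],
    weak_unlabeled_preds=[[0.98, 0.02], [0.6, 0.4]],   # only the first passes tau
    strong_unlabeled_preds=[[0.7, 0.3], [0.5, 0.5]],
)
print(loss)
```

Here the second unlabeled sample is masked out, so the result is -log(0.8) from the labeled part plus -log(0.7)/2 from the unlabeled part.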


Figure 4.5: Example of how FixMatch handles an unlabeled sample. The upper row is the weakly augmented sample and how the prediction is made into a label. The bottom row is the strongly augmented sample and how the pseudo-label is used with cross-entropy [12].

4.5.2 Mean Teacher

The Mean Teacher model utilizes consistency regularization and self-ensembling to use both labeled and unlabeled data [13]. The authors take inspiration from the Temporal Ensembling model [28] and extend it by averaging model weights during training instead of averaging the predictions made during training. Mean Teacher initializes two models, one being the student, which learns fast, and the other being the teacher, which is initialized as an Exponential Moving Average (EMA) model. An EMA model is a model which takes the exponential moving average of another model during training; this is the "slow" learning of the teacher.

Fig. 4.6 shows how the method functions with the two models, student to the left in orange and teacher to the right in blue. The sample is augmented once for each model, giving each model a slightly diﬀerent input. If the sample is labeled as this example is, two losses are calculated, otherwise only the consistency loss is calculated. The regular supervised loss is a cross-entropy, and this is only calculated if we have a label and taken between the label and the prediction of the student. The second loss is the consistency loss. It is a divergence loss between the prediction of the two models. The idea behind the divergence loss is that for any sample the prediction of the two models should be the same, and the divergence loss is used to minimize this diﬀerence.

Figure 4.6: The example shows a batch with a single labeled sample. During training both the student and the teacher give a prediction on the sample, with two different noises (η, η′). Two losses are calculated, the classification cost and the consistency cost. The student network is then optimized by minimizing the sum of these two loss functions to get the weights θ. The teacher's weights, θ′, are then updated as an exponential moving average of the student's weights. Figure from [13].

The EMA model weights of the teacher, θ′ in Fig. 4.6, are updated using the student's weights, θ, as

θ′_t = α θ′_{t−1} + (1 − α) θ_t    (4.4)

where α is a smoothing coefficient hyperparameter. The teacher is thus a kind of mean of the student over the training period, hence the name Mean Teacher.
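A minimal sketch of the update in Eq. 4.4, with the weights represented as flat lists of floats, is given below. The function name `ema_update`, the toy weights and the value of α are assumptions chosen for illustration and are not the settings used in the experiments.

```python
def ema_update(teacher_weights, student_weights, alpha):
    """Apply theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t per weight."""
    return [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


# Toy run: the student weights stay fixed, the teacher moves towards them.
teacher = [0.0, 0.0]
for student in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    teacher = ema_update(teacher, student, alpha=0.5)  # alpha is illustrative
```

A value of α close to one makes the teacher change slowly and average over many student updates, while a small α makes the teacher follow the student closely.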

4.6 Network

Two deep networks will be used for the supervised deep learning, FixMatch and Mean Teacher: one state-of-the-art 3D CNN made for hyperspectral image classification by Hamida et al. [7], and one simpler 1D CNN as used by Nalepa et al. [38]. These networks will be tested with the different semi-supervised methods as well as with purely supervised training.

The 3D CNN proposed by Hamida et al. [7] uses both the spectral and the spatial information in the data by sampling 3D patches. The input volume in Fig. 4.7 shows that the input is 3D. The light blue blocks show the subsequent filters, which pass over the input to extract the features. At the end, a fully connected layer is used to predict the classes.
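To make the notion of a 3D patch concrete, the sketch below cuts a spatial neighbourhood with all spectral bands out of a toy hyperspectral cube stored as nested lists. The function name `extract_patch`, the cube layout `cube[row][col][band]`, the patch size and the lack of border padding are assumptions of this sketch and are not taken from Hamida et al. [7] or from the implementation used in the experiments.

```python
def extract_patch(cube, row, col, size):
    """Return a size x size x bands patch centered on pixel (row, col).

    cube is indexed as cube[row][col][band]. No padding is applied, so the
    center pixel must lie at least size // 2 pixels away from the border.
    """
    assert size % 2 == 1, "patch size must be odd to have a center pixel"
    half = size // 2
    return [
        [cube[r][c] for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]


# Toy cube: 5 x 5 pixels with 4 spectral bands; values encode (row, col, band).
cube = [
    [[100 * r + 10 * c + b for b in range(4)] for c in range(5)]
    for r in range(5)
]
patch = extract_patch(cube, row=2, col=2, size=3)
```

The label of the center pixel is then used as the label of the whole patch, so the network sees the spectrum of the pixel together with its spatial context.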
