A comparison of different machine learning algorithms applied to hyperspectral data analysis
Axel Vikström axvi0004@student.umu.se
June 15, 2021
Master’s Thesis in Engineering Physics
Supervisors: Andreas Vidman (andreas@prediktera.se)
Abstract
Hyperspectral image analysis works with image data where each pixel contains hundreds of wavelengths acquired from spectral measurements. It is a growing field of research in the sciences and industries because it can distinguish visually similar objects. While many machine-learning methods work well for analysing regular images, little is known about how they perform on hyperspectral data. Standard methods for quantifying and classifying hyperspectral data include the chemometric methods PLS, PLS-DA and SIMCA. They provide rapid computations along with intuitive modelling and diagnostic tools, but cannot capture more complex data. I benchmarked the chemometric methods against machine learning methods from Microsoft’s ML.NET library on six classification and two quantification problems. The ML.NET methods proved to be good complements to the chemometric methods. In particular, the decision tree methods provided accurate classification and quantification, while the maximum entropy classification methods struck the best balance between accuracy and computational time. While the remaining ML.NET methods performed equally well or better than the chemometric methods, finding their use requires testing on data sets with a wider range of properties. The best ML.NET methods are suitable for analysing more complex hyperspectral images by capturing nonlinearities disregarded by standard image analysis.
Acknowledgements
Primarily I would like to thank Andreas, Oskar and Thomas of Prediktera for guiding and aiding me through the process of conducting a full master’s thesis from home. I could not have asked for a more genuine and helpful setting to work in than what I was given. Additionally, I thank Lucas and Martin from the physics institution; Lucas for guiding a nervous student in the beginning of the project and Martin for showing me what’s what in terms of academic writing.
Contents
1 Introduction
2 Theory
2.1 Hyperspectral data
2.2 The chemometric methods
2.2.1 PLS and PLS-DA
2.2.2 SIMCA
2.3 Machine learning methods
2.3.1 Linear regression
2.3.2 Logistic regression
2.3.3 Poisson regression
2.3.4 SVM
2.3.5 Decision trees
2.3.6 Maximum entropy
2.3.7 Neural networks
2.4 Evaluation metrics
2.4.1 Classification metrics
2.4.2 Quantification metrics
3 Method
3.1 The data sets
3.1.1 The waste data
3.1.2 The cheese data
3.1.3 The Indian Pines data
3.1.4 The nuts data
3.1.5 The powder data
3.1.6 The strips data
3.1.7 Metadata of the sets
3.2 Examined complement methods
3.3 Settings
3.3.1 Software
3.3.2 Data segmentation
3.3.3 Model settings
4 Results
4.1 Classification
4.1.3 The Indian Pines study
4.1.4 The nuts study
4.1.5 Computational speed
4.1.6 Introducing the convolutional neural network
4.2 Regression
4.2.1 The powder study
4.2.2 The strip study
5 Discussion
5.1 Future work
5.2 My recommendation
6 Conclusion
A Appendix
A.1 Data segmentation
A.2 PLS-DA model settings
A.3 SIMCA model settings
A.4 PLS model settings
A.5 ML.NET model settings
1 Introduction
Much data is produced through imaging, where image data consists of measures of light intensity in each pixel in one or more channels. In a grey scale image, each pixel consists of one channel while an RGB image consists of three channels containing red, green and blue light intensities. Including more channels in each pixel increases the amount of information and a typical procedure is including wavelengths from outside the visible spectrum. This is known as hyperspectral imaging, where hundreds of channels may be included in the pixels [14]. Hyperspectral imaging implies that visibly similar objects may be distinguished by analysing the reflected wavelengths from outside the visible spectrum. The technique therefore excels when the visible spectrum gives little to no information, such as realtime classification of waste materials or quantification of the content in food samples. While many studies have examined the performance of methods on regular image data, the range of methods available for analysing hyperspectral image data has been less investigated.
Prediktera AB is a software company that specializes in data analysis of hyperspectral images. At the beginning of this project, they employed the PLS, PLS-DA and SIMCA methods. These methods are chemometric methods, able to pixel-wise quantify and classify hyperspectral images. PLS is a quantification method for regression analysis [39], while PLS-DA and SIMCA are methods for solving classification problems [12]. While being quick solvers with satisfactory diagnostic tools and intuitive modelling, their performance is reduced when handling more complex data. This implies that for larger, non-linear data sets these methods may not be satisfactory. Therefore, investigating additional methods that may complement the chemometric methods may let Prediktera offer efficient solutions for problematic data sets. The machine learning research field is likely to contain such complements, due to machine learning algorithms’ ability to make efficient data-driven recommendations and decisions purely based on input data.
This thesis aims to determine whether specific machine learning algorithms outperform other methods when analysing hyperspectral data, by creating and conducting a benchmark test. These aims are met by comparing the performance of machine learning algorithms from Microsoft’s ML.NET library [21] with that of the chemometric methods. I utilized data sets from the hyperspectral industry: two quantification and six classification data sets. Evaluating the performance of a classification method was done by measuring its total accuracy and speed of training and […] goodness of fit and prediction. By examining the performance I hope to find methods that, compared to the chemometric methods, allow for analysis of more complex data, consequently expanding the number of usable methods within hyperspectral data analysis.
2 Theory
2.1 Hyperspectral data
As a way to aid remote sensing studies, in 1985 Alexander Goetz introduced hyperspectral imaging [13]. The technique combines the spatial information given by imaging and spectral information of spectroscopy. Each pixel is allotted a multiple of values, called channels, corresponding to the spectroscopically measured wavelengths. This is illustrated in figure 1 along with a comparison with a grey scale and an RGB image.
Figure 1 – Comparison of grey scale, RGB and hyperspectral images. The digital representation corresponds to the hypercube of data that is characteristic in hyperspectral image analysis. In this case, the pixels in the hyperspectral image have 256 channels, corresponding to 256 spectroscopically measured wavelengths.
Source: Prediktera AB [27].
One of the most common ranges of wavelengths to measure hyperspectral data from is the visual and near infrared spectrums. The near infrared spectrum ranges between the visible and the radio wave spectrums, with wavelengths from 780 nm […] difference in energy levels intrinsic to the molecules [37].
There are a number of methods used to sample hyperspectral data: point scanning, line scanning, area scanning and single shot [28]. Point scanning measures all channels in a single pixel, one pixel at a time. Line scanning implies scanning all channels of a single row of pixels, one line at a time. Area scanning is scanning the entire image, one channel at a time. Lastly, single shot is the acquiring of all channels for all pixels at the same time. Utilizing any of these methods produces the three-dimensional hyperspectral cube, or hypercube, seen in figure 1, that contains the spatial and spectral data.
As a final step before analysis the hypercube typically undergoes spectral preprocessing. Initially, the raw intensity measurements can be adjusted by using white and dark reference values sampled by the measurement system. They are measurements of the brightest and darkest intensities, often used for calibration.
Denoting the reference values respectively as $W$ and $D$, the absorbance $A$ is computed according to

$$A = \log_{10}\left(\frac{I_0 - D}{W - D}\right), \tag{1}$$

where $I_0$ is the raw intensity values [37]. Having converted the raw intensities to absorbance according to equation 1, another standard procedure is to apply mean-centering, which places the origin at the center of the data. Then, it is suitable to apply one or more pre-treatments to the centered data. Two common approaches are applying Savitzky-Golay filtering for de-noising and/or standard normal variate (SNV) for scatter correction [1].
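The preprocessing chain described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the hypercube, the white and dark references and all function names are synthetic and hypothetical, and the absorbance follows equation 1 exactly as printed (many texts define absorbance with a leading minus sign).

```python
import numpy as np

def absorbance(raw, white, dark):
    """Equation 1 as printed: log10((I0 - D) / (W - D)), per channel."""
    return np.log10((raw - dark) / (white - dark))

def mean_center(spectra):
    """Subtract the per-channel mean so the origin sits at the data center."""
    return spectra - spectra.mean(axis=0, keepdims=True)

def snv(spectra):
    """Standard normal variate: each spectrum (row) gets zero mean, unit std."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Synthetic hypercube (assumption): 4 x 5 pixels with 8 channels each.
rng = np.random.default_rng(0)
cube = rng.uniform(0.2, 0.8, size=(4, 5, 8))
white = np.ones(8)   # hypothetical white reference per channel
dark = np.zeros(8)   # hypothetical dark reference per channel

# Pixel-wise analysis treats every pixel as one observation (row).
pixels = cube.reshape(-1, cube.shape[-1])
centered = mean_center(absorbance(pixels, white, dark))
treated = snv(centered)
```

The order (absorbance, mean-centering, then SNV) mirrors the text; Savitzky-Golay filtering is left out to keep the sketch short.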
2.2 The chemometric methods
The chemometric methods PLS, PLS-DA and SIMCA are widely used for spectral data analysis. The methods take a set of independent variables as input and return a predicted variable, also known as a dependent variable. In pixel-wise hyperspectral imaging, these methods are used for quantification or classification of the pixels. The independent variables are the channels of a pixel, while the returned dependent variable is a quantification value or a class.
A trait shared by the methods is that they are all based on principal component analysis (PCA) modelling [8]. PCA modelling removes covariance by making new orthogonal dimensions of the independent data through linear orthogonal transforms. These new dimensions are the principal components, where the first component explains the largest amount of variance among the independent variables, the second the next largest, and so on. SIMCA inherently uses PCA modelling for each class while PLS and PLS-DA utilize a computation of orthogonal components in a similar manner to PCA.
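As a rough illustration of the PCA step, the sketch below computes principal components through a singular value decomposition of the mean-centered data. It is a generic textbook formulation, not the implementation used in Breeze; the data and the function name `pca` are made up for the example.

```python
import numpy as np

def pca(X, n_components):
    """Return scores, orthonormal components and explained variance ratios."""
    Xc = X - X.mean(axis=0)                      # mean-center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # rows are principal directions
    scores = Xc @ components.T                   # data in the new coordinates
    explained = S**2 / np.sum(S**2)              # share of variance per component
    return scores, components, explained[:n_components]

# Synthetic data (assumption): 50 observations, 6 variables, two of them correlated.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

scores, components, explained = pca(X, 3)
```

The scores are uncorrelated by construction, which is the "removal of covariance" referred to above, and the explained variance decreases from the first component onwards.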
2.2.1 PLS and PLS-DA
Partial least squares (PLS) was introduced to spectral data analysis by Svante Wold as a way to extend traditional linear regression. PLS computes a linear regression model on a space defined by the components that best explain the variance of the independent variables [39]. These components are known as latent variables and are orthogonal linear combinations of the independent variables given to the algorithm. Essentially, PLS creates a traditional regression model, but on a space with orthogonal components computed from the independent and dependent data correlations.
PLS is also suitable for classification problems, when the dependent variables are categorical. In those cases the method is called partial least squares discriminant analysis (PLS-DA) [3].
The method provides simple yet powerful modelling as long as the data does not grow too complex [39]. As the data becomes more complicated, non-linearities override the possibility for a single regression model to be well established. To combat this the analyst can introduce a series of connected PLS models, known as hierarchical modelling, in which later models predict using the results from previous models. Hierarchical modelling is however tedious and in many cases difficult to build efficiently due to the propagation of predictions over the models.
2.2.2 SIMCA
Soft independent modelling of class analogy (SIMCA) is a classification method [4]. A PCA model is generated for each class and a distance metric is used to measure the distance between data points and the hyperplane defined by the PCA model. A threshold, the critical distance, is set on the distance metric, defining whether the entry is close enough to the span in order for it to be classified as that class. One PCA model per class implies that an entry may be classified as several, or none, of the classes.
SIMCA is likely to perform better than PLS-DA as the number of classes increases. However, besides being a lot of work, it suffers from a lack of interpretation due […] of accuracy.
2.3 Machine learning methods
Here follow the mathematical bases that each of the possible complement methods is built upon. In machine learning, a prediction is generated from input data through a model. The model is then evaluated by computing a metric of choice, called the loss function, that is dependent on the generated prediction. If the loss function is below a certain threshold value, the model is deemed accurate. Otherwise, the model is re-configured. This is re-iterated until the model stops improving.
The methods will be considered as generating a dependent variable $y$ from a set of $n$ independent variables $x = (x_1, x_2, \ldots, x_n)$. When analysing hyperspectral image data this corresponds to quantifying or classifying a pixel as $y$ from its set of channels $x$.
2.3.1 Linear regression
Linear regression implies employing a linear model for predicting continuous data. The predicted variable is computed as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n, \tag{2}$$

where $(\beta_0, \beta_1, \ldots, \beta_n)$ is a set of unknown parameters. To determine these parameters, the model undergoes optimization by minimizing a metric partially defined by the predicted value of the model and the observed values. A commonly used optimization metric is the sum of least squares, defined as

$$\text{minimize} \sum_{i=1}^{N} \left(y_{i,\text{obs}} - y_{i,\text{pred}}\right)^2, \tag{3}$$

where $N$ is the number of data points, $y_{\text{obs}}$ an observed value and $y_{\text{pred}}$ the value predicted by the model. Note that the difference is measured in the vertical distance between the observed value and the produced model [15]. Figure 2 illustrates this by showing a linear regression defined by equation 2 and optimized by equation 3 over a set of data.
Figure 2 – A linear regression, the blue line, over a set of data. The red dotted line illustrates the vertical distance used to compute the sum of least squares metric.
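A minimal sketch of equations 2 and 3, assuming NumPy's least-squares solver as the optimizer (the ML.NET trainers are not reproduced here). The data and the helper names `fit_ols` and `predict` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Find (beta_0, ..., beta_n) minimizing the sum of squares in equation 3."""
    design = np.column_stack([np.ones(len(X)), X])   # column of ones for beta_0
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    """Equation 2: beta_0 + beta_1 x_1 + ... + beta_n x_n."""
    return beta[0] + X @ beta[1:]

# Toy data (assumption) lying exactly on the line y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

beta = fit_ols(X, y)
sum_of_squares = np.sum((y - predict(beta, X)) ** 2)
```

Because the toy data is noise-free, the fitted parameters recover the intercept and slope and the sum of squares is numerically zero.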
2.3.2 Logistic regression
While linear regression best suits predicting continuous data, logistic regression performs better when handling binary categorical data. The regression utilizes the sigmoid function

$$y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}, \tag{4}$$

which is optimized to minimize a metric of choice, the sum of least squares for example [5]. This yields a situation as illustrated in figure 3, where a logistic regression in the form of equation 4 is fitted to a set of categorical data with two classes. A threshold is set on the logistic model vertically halfway between the two classes, and the data on each side of the threshold is classified accordingly.
Figure 3 – A logistic regression, the blue line, optimized to fit binary categorical data. The method classifies all items right of the threshold as the circle class and everything to the left as the square class. Thus, due to overlapping, the model will misclassify the left-most circle and the right-most square.
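The thresholding step can be illustrated as follows. The parameters are hand-picked for the example rather than fitted, and the function names are hypothetical.

```python
import math

def sigmoid_model(beta, x):
    """Equation 4 evaluated for one observation x = (x_1, ..., x_n)."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(beta, x, threshold=0.5):
    """Assign class 1 at or above the threshold and class 0 below it."""
    return 1 if sigmoid_model(beta, x) >= threshold else 0

# Hypothetical, non-fitted parameters: the model output crosses 0.5 at x_1 = 2.
beta = [-3.0, 1.5]

left_of_threshold = classify(beta, [1.0])    # class 0
right_of_threshold = classify(beta, [3.0])   # class 1
```

With the two classes encoded as 0 and 1, the threshold halfway between them is 0.5, which corresponds to the point where the linear term inside the exponent equals zero.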
2.3.3 Poisson regression
Poisson regression, used for quantification, stems from the Poisson distribution with an expected value $\mu$,

$$P(y = k \mid x) = \frac{\mu^k}{k!} e^{-\mu}, \quad k = 0, 1, 2, \ldots \tag{5}$$

In turn, the Poisson regression model assumes that

$$\ln(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n, \tag{6}$$

which yields the combination of equations 5 and 6 as

$$P(y = k \mid x) = \frac{e^{k(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}{k!} e^{-e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}, \quad k = 0, 1, 2, \ldots \tag{7}$$

The Poisson distribution in equation 7 is optimized by tuning the unknown $\beta$ parameters, and the predictions will thereafter follow this Poisson distribution, with a mean dictated by equation 6 [6].
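As a small numerical check of the substitution, the sketch below evaluates equation 7 directly and compares it with equation 5 using the mean from equation 6. The parameter values are arbitrary and only serve the illustration.

```python
import math

def poisson_pmf(k, mu):
    """Equation 5: probability of observing k given the expected value mu."""
    return mu**k / math.factorial(k) * math.exp(-mu)

def poisson_regression_pmf(k, beta, x):
    """Equation 7: the same probability written in terms of the beta parameters."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(k * eta) / math.factorial(k) * math.exp(-math.exp(eta))

# Arbitrary parameters and input (assumption), giving ln(mu) = 0.2 + 0.5 * 1.0.
beta = [0.2, 0.5]
x = [1.0]
mu = math.exp(0.7)   # equation 6 solved for mu

total_probability = sum(poisson_regression_pmf(k, beta, x) for k in range(30))
```

Both expressions agree for every k, and the probabilities sum to one over the non-negative integers (here truncated at 30 terms, which is ample for this mean).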
2.3.4 SVM
Support vector machines (SVM) is a binary classification technique known to provide high classification accuracy when classifying hyperspectral data [20]. The linear case consists of placing a linear hyperplane decision boundary between the classes, as seen in figure 4. The closest data points of either class are called the support vectors to the hyperplane. The method tries to maximize the distance between each side’s support vector, reducing the margin of error.
Figure 4 – A support vector machine classifier separating two classes. The dotted lines are the support vectors, defined by the closest data points to the classifier, for which the area between is to be maximized.
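To make the geometry concrete, the sketch below evaluates a hand-picked hyperplane on a toy data set; nothing is trained. It assumes the common canonical scaling in which the support vectors satisfy |w·x + b| = 1, so that the margin width equals 2/‖w‖. The data and names are hypothetical.

```python
import numpy as np

def decision(w, b, X):
    """Signed score w.x + b; the sign gives the predicted side of the hyperplane."""
    return X @ w + b

def margin_width(w):
    """Distance between the two margin boundaries under canonical scaling."""
    return 2.0 / np.linalg.norm(w)

# Toy two-class data (assumption) with labels -1 and +1.
X = np.array([[-2.0, 1.0], [-1.0, 0.0], [1.0, 0.0], [2.0, -1.0]])
y = np.array([-1, -1, 1, 1])

# Hand-picked hyperplane x_1 = 0, written as w.x + b = 0.
w = np.array([1.0, 0.0])
b = 0.0

scores = decision(w, b, X)
predicted = np.where(scores >= 0, 1, -1)
is_support_vector = np.isclose(np.abs(scores), 1.0)   # points lying on the margin
```

Only the two points closest to the boundary are flagged as support vectors; moving any of the other points further away would leave the hyperplane unchanged.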
2.3.5 Decision trees
Decision tree modelling divides the data set into sub-regions [30]. The sub-regions are defined by splits in the data set that correspond to certain values or functions of the independent variables. To traverse the decision tree to a sub-region, boolean decisions are made over these splits. When the data point to be predicted reaches one of these sub-regions, it is classified or quantified as the content of that sub-region. A common quantification technique is giving the data point to be predicted the mean value of the sub-region. A classification tree example is illustrated in figure 5.
Figure 5 – Illustration of a classification decision tree formed over a data set.
The methods previously discussed are optimized by tuning the parameters of mathematical models. Here, the model is optimized by finding splits that partition the data set to a certain degree, which often is measured in entropy reduction [35].
Entropy reduction optimization tries to eliminate uncertainty between each level of the decision tree. At the top of the tree, there is full uncertainty as to where the new data point belongs. At the bottom, there is no uncertainty as to where it belongs. The entropy reduction algorithm tries to place the splits in the data set such that uncertainty is removed as early as possible.
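Entropy reduction for a single split can be sketched as below. The example uses Shannon entropy in bits and an exhaustive search over candidate thresholds on one variable; real tree learners such as those in ML.NET use more elaborate procedures, and the data and function names here are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a region."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, split):
    """Entropy removed by splitting the region at `values <= split`."""
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

def best_split(values, labels):
    """Try every observed value (except the largest) as a threshold."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda s: information_gain(values, labels, s))

# Toy data (assumption): one independent variable and two classes.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["A", "A", "A", "B", "B", "B"]

split = best_split(values, labels)
gain = information_gain(values, labels, split)
```

Before the split the two classes are equally likely, which is the state of full uncertainty; the best threshold separates the classes completely and removes all of it in one level.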
2.3.6 Maximum entropy
Maximum entropy is a distribution-based model and can be seen as a generalization of logistic regression from the binary case to the multiclass case [43]. This is done by extending the denominator in equation 4 by summing over exponential functions created for each class. The probability $p$ that a set of independent variables $x$ belongs to the class $y_i$ thus becomes

$$p(y_i \mid x) = \frac{e^{\beta_i^T x}}{\sum_{c=1}^{C} e^{\beta_c^T x}}. \tag{8}$$
The method is built upon the principle of maximum entropy, which states that the probability distribution with the largest informational entropy best describes the system. An example is coin flipping in which a fair coin gives maximum entropy by being the most uncertain, while a rigged coin has bias and as a consequence, less entropy.
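Both equation 8 and the coin example can be checked numerically. The sketch below uses hypothetical parameter vectors, one per class, and measures entropy in bits; subtracting the largest score before exponentiating is a standard numerical safeguard that does not change the result.

```python
import math

def maxent_probabilities(betas, x):
    """Equation 8: class probabilities from one parameter vector per class."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    shift = max(scores)                       # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def coin_entropy(p_heads):
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    return -sum(p * math.log2(p) for p in (p_heads, 1.0 - p_heads) if p > 0)

# Hypothetical parameters for three classes and one two-channel observation.
betas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 1.0]
probabilities = maxent_probabilities(betas, x)

fair = coin_entropy(0.5)     # the most uncertain coin
rigged = coin_entropy(0.9)   # a biased coin carries less entropy
```

The probabilities sum to one and the class with the largest linear score receives the highest probability.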
2.3.7 Neural networks
Neural network modelling acquired its intuition from the biological nervous system [2]. The system is mimicked by letting the input data pass through several layers of nodes, called neurons, transforming the data into the dependent variable. In mathematical modelling, neural networks are referred to as artificial neural networks.
The layer-based architecture of artificial neural networks, seen in figure 6, consists of three segments: the input, the hidden layers and the output. The input data is fed to the first hidden layer such that each independent variable is connected to all neurons of the hidden layer with different sets of weights. Thus, the value in each first layer neuron is a weighted sum of the independent variables. Additionally, the hidden layers employ activation functions [34], allowing for extraction of non-linearities in the data. This process of layers being connected with weights to the neurons of the upcoming layer is repeated throughout the hidden layers until it reaches the output layer, where the network returns the dependent variable.
During training this result is checked against the actual value and if the model is not sufficiently accurate, the weights in the model are updated.
Figure 6 – An artificial neural network with two hidden layers.
Source: Dertat [7].
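The forward pass through such a network can be written compactly. The sketch below assumes ReLU activations in the hidden layers, a linear output and randomly drawn weights; the layer sizes are arbitrary and training is left out entirely.

```python
import numpy as np

def relu(z):
    """Activation function: keeps positive values and zeroes the rest."""
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Propagate x through (weights, bias) pairs; the last pair is the output layer."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)      # weighted sum followed by the activation
    W, b = layers[-1]
    return W @ a + b             # linear output: the dependent variable

# Arbitrary sizes (assumption): 5 channels, two hidden layers, one output.
rng = np.random.default_rng(2)
sizes = [5, 4, 3, 1]
layers = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=5)
y = forward(x, layers)
```

Each row of a weight matrix holds the weights connecting all neurons of the previous layer to one neuron of the next, so every neuron value is a weighted sum as described above.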
An alternative to artificial neural networks is the convolutional neural network [24].
Convolutional neural networks are analogous to artificial neural networks in the sense of layering and iteration. However, the layering is slightly different. The layers are traversed in strides by a kernel, also called filter, of a certain size that adds adjacent values together with some weighting, reducing the dimensions of the data for the upcoming layer. The process is illustrated in figure 7. For hyperspectral data, convolutional neural networks can be used in a one-dimensional manner, predicting the image pixel-wise.
Figure 7 – A one-dimensional convolution from the input layer to the first convolutional layer. A size three kernel with weights strides across the input, adding adjacent values and thus reducing the dimensions from p to s [26].
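The reduction from p to s values can be reproduced with a few lines of NumPy. The sketch assumes no padding and, as is customary in deep learning libraries, does not flip the kernel; the spectrum and kernel weights are invented for the example.

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Slide the kernel along the signal and take a weighted sum at each stop."""
    k = len(kernel)
    s = (len(signal) - k) // stride + 1        # output length without padding
    return np.array([np.dot(signal[i * stride:i * stride + k], kernel)
                     for i in range(s)])

# Invented spectrum with p = 6 channels and a size three kernel.
spectrum = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
kernel = np.array([0.25, 0.5, 0.25])

out = conv1d(spectrum, kernel)                 # s = 6 - 3 + 1 = 4 values
out_stride_2 = conv1d(spectrum, kernel, stride=2)
```

With stride one the output has p − 3 + 1 entries; a larger stride shortens the output further.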
2.4 Evaluation metrics
2.4.1 Classification metrics
The total accuracy and F1-score can be determined from the confusion matrix, of which an example is illustrated in table 1.
Table 1 – Multiclass confusion matrix of the classes A, B and C. a-i represent integers explaining classification amounts. As an example, A was correctly classified as A a times and was misclassified as B b times.
              Predicted
              A     B     C
        A     a     b     c
True    B     d     e     f
        C     g     h     i
In a multiclass confusion matrix, a value can be true positive (TP), false positive (FP) or false negative (FN) depending on what class is examined. Examining class A, a true positive is the correct classification, A classified as A. A false positive is when another class is classified as the class you are examining, B or C classified as A. A false negative is when the class you are examining is classified as another class, A classified as B or C. The total accuracy is computed by summing the true positives for all classes and dividing by the sum of all elements in the matrix.
The F1-score of a class is constructed from the precision and recall, which in the multiclass case are defined as

$$\text{Precision} = \frac{TP}{TP + \sum FP}, \qquad \text{Recall} = \frac{TP}{TP + \sum FN}, \tag{9}$$

where the false positives and negatives over the other classes are summed. In turn, the F1-score is defined as

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{10}$$

This definition of the F1-score makes it possible to examine a method’s performance regarding a single class, allowing for pinpointing of problematic classes [9].
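The metrics can be computed directly from a confusion matrix laid out as in table 1, with true classes as rows and predicted classes as columns. The counts below are invented for the illustration.

```python
def total_accuracy(cm):
    """Sum of the diagonal (true positives) divided by the sum of all elements."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def precision_recall_f1(cm, k):
    """Equations 9 and 10 for the class with index k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm)) if r != k)   # others predicted as k
    fn = sum(cm[k][c] for c in range(len(cm)) if c != k)   # k predicted as others
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts for the classes A, B and C (rows: true, columns: predicted).
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 3, 7]]

accuracy = total_accuracy(cm)
precision_a, recall_a, f1_a = precision_recall_f1(cm, 0)
```

For class A the column sum excluding the diagonal gives the false positives and the row sum excluding the diagonal gives the false negatives.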
2.4.2 Quantification metrics
Quantification, or regression, can be analysed by examining the goodness of fit $R^2$ and goodness of prediction $Q^2$. Goodness of fit can be computed by

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \tag{11}$$

where $SS_{\text{res}}$ is the sum of squares between the model and the observed training values and $SS_{\text{tot}}$ the sum of squares between the mean and the observed training values [40]. Thus, if the model is perfect, $SS_{\text{res}}$ approaches zero and $R^2$ approaches 1. When the model is poor the metric may become close to zero or even negative, the latter meaning that the model performs worse than using the mean value as the model.
The goodness of prediction is computed similarly to equation 11 as

$$Q^2 = 1 - \frac{SS_{p}}{SS_{\text{tot}}}, \tag{12}$$

where $SS_{p}$ instead is the sum of squares between the values predicted by the model and the observed test data [38].
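Since equations 11 and 12 share the same form, one helper suffices for both. The sketch assumes that the total sum of squares in equation 12 is taken around the mean of the observed test values, which the text does not state explicitly; all numbers are invented.

```python
import numpy as np

def one_minus_ss_ratio(observed, predicted):
    """1 - SS_error / SS_tot, with SS_tot taken around the mean of `observed`."""
    ss_error = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_error / ss_tot

# Invented training values with the model's fitted values (goodness of fit).
y_train = np.array([1.0, 2.0, 3.0, 4.0])
fitted = np.array([1.1, 1.9, 3.2, 3.8])
r2 = one_minus_ss_ratio(y_train, fitted)

# Invented test values with the model's predictions (goodness of prediction).
y_test = np.array([2.0, 4.0, 6.0])
predicted = np.array([2.5, 4.0, 5.5])
q2 = one_minus_ss_ratio(y_test, predicted)

# Using the mean as the model gives exactly zero.
baseline = one_minus_ss_ratio(y_train, np.full(4, y_train.mean()))
```

A value near one indicates that the model explains almost all variation around the mean, while the mean-only baseline lands at zero.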
3 Method
3.1 The data sets
I examined six classification and two quantification data sets. The classification problems were the primary focus. The goal was to acquire data sets that originate from different areas of the industry, from food processing to waste sorting.
3.1.1 The waste data
Being the first classification problem, acquired from Mälardalens högskola and inspired by Ševčík’s thesis [31], the waste data consisted of waste material collected from three sub-groups: organics, plastics and in-combustibles. The objective in this problem was to separate between classes from these sub-groups. Pseudo-RGB images, images with three channels from the spectral data, of four classes from the waste study are illustrated in figure 8.
Figure 8 – Pseudo-RGB images of the cardboard, ceramics, LDPE (low-density polyethylene) and PP (polypropylene) classes, in row-wise order.
3.1.2 The cheese data
[…] separate data sets; uncoated, wax and paraffin, or cheese (U), cheese (W) and cheese (P). They corresponded to different treatments of the cheese, yielding different surfaces to inspect. The problems consisted of detecting mold and other deficiencies. Pseudo-RGB images of a few examples from the cheese study are illustrated in figure 9.
Figure 9 – Pseudo-RGB image examples from the cheese, wax and paraffin data sets, in row-wise order.
3.1.3 The Indian Pines data
The Indian Pines data set belongs to NASA’s AVIRIS data collection [22], a common benchmarking data set for classification algorithms. Indian Pines is a remote sensing data set, implying that the data was measured at a large distance from an aircraft or satellite. The data set is a single image consisting of a variety of agriculture, grass fields, forests and man-made structures which were to be classified.
A pseudo-RGB image and the ground truth are shown in figure 10.
Figure 10 – Pseudo-RGB image of the Indian Pines data set along with its colorized ground truth, the classes of the pixels in color coding.
3.1.4 The nuts data
Being Prediktera’s data set for their classification tutorial, the nuts study contains measurements for a couple of nut types and their respective shells. The task consisted of separating the different nut types as well as distinguishing if pixels were shell pixels or not. Pseudo-RGB examples of measurements from the study are illustrated in figure 11.
Figure 11 – Pseudo-RGB images of the hazelnut, hazelnut shell, pecan and pecan shell measurements.
3.1.5 The powder data
The powder data set is used for Prediktera’s quantification tutorial. It consisted
of bags in which three types of powder had been inserted; vanilla powder, baking
soda and potato starch. Some bags contained purely one type of powder while
others were mixtures between two powders or all three. Pseudo-RGB examples
are shown in figure 12.
3.1.6 The strips data
The second quantification data set is a part of a current project at Research Institutes of Sweden (RISE). The data set was made up of long strips of paper on which a number of physical properties, such as thickness, varied. Two pseudo-RGB examples of the strips are shown in figure 13.
Figure 13 – Pseudo-RGB images of two of the examined strips.
3.1.7 Metadata of the sets
Tables 2 and 3 list the properties of the classification and quantification data sets.

Table 2 – Metadata of the classification data sets.
                    Waste       IP          Nuts
Classes             18          16          5
Channels            208         184         266
Min channel (nm)    963.27      476.7       999.06
Max channel (nm)    1691.59     2231.5      2486.96

                    Cheese (U)  Cheese (W)  Cheese (P)
Classes             5           6           3
Channels            191         191         191
Min channel (nm)    1000.18     1000.18     1000.18
Max channel (nm)    2198.04     2198.04     2198.04

Table 3 – Metadata of the quantification data sets.
                    Powder      Strips
Variables           3           8
Channels            236         236
Min channel (nm)    1061.92     997.33
Max channel (nm)    2454.64     2438.7

3.2 Examined complement methods
The possible complements to the chemometric methods were acquired from Microsoft’s ML.NET library [21], with the exception of the self-made TensorFlow convolutional neural network. Table 4 describes each classification method. The overview includes which machine learning basis it belongs to, a short description that highlights what makes the method different along with an in-depth source, as well as whether it utilizes the One-Versus-All (OVA) approach. The OVA approach uses binary classifiers for multiclass problems by classifying the examined class versus all other classes, making all the other classes the second "class" in the binary classification. With a similar overview, the regression algorithms are described in table 5.
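The One-Versus-All approach can be sketched independently of any particular binary classifier. In the example below the binary scorers are hypothetical hand-written functions standing in for trained models, and the class names are placeholders.

```python
def one_versus_all_labels(labels, target):
    """Relabel a multiclass problem as binary: the target class versus the rest."""
    return [1 if label == target else 0 for label in labels]

def predict_ova(scorers, x):
    """Return the class whose binary scorer is the most confident for x."""
    return max(scorers, key=lambda cls: scorers[cls](x))

# Hypothetical binary scorers, one per class; higher means "more likely this class".
scorers = {
    "A": lambda x: x[0] - x[1],
    "B": lambda x: x[1] - x[0],
    "C": lambda x: -abs(x[0]) - abs(x[1]),
}

binary_targets = one_versus_all_labels(["A", "B", "A", "C"], "A")
first = predict_ova(scorers, [2.0, 0.0])
second = predict_ova(scorers, [0.0, 3.0])
```

During training, each scorer would be fitted on the relabelled targets for its own class; at prediction time the scores are compared and the largest one decides the class.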
Table 4 – Overview of the classification methods from the ML.NET library and the convolutional neural network. The method names are the abbreviations of the machine learning methods from the ML.NET library, often named after the optimization technique. OVA was applied to binary classifiers, generalizing them to multiclass classification.
Method | Base method | OVA | Description
BFGS-LR | Log. reg. | Yes | Utilizes the BFGS, or L-BFGS, optimization technique. BFGS is a quasi-Newton method that reduces the computational cost of the Hessian yet keeps the fast convergence rate of an ordinary Newton method [10].
SGD | Log. reg. | Yes | Stochastic gradient descent, in this case the Hogwild version, is a stochastic variant of the classic gradient descent optimization method. The stochasticity implies replacing the actual gradient with an estimation of it, alleviating computational load [23].
SSGD | Log. reg. | Yes | Being an extension of SGD, Symbolic SGD replaces the sequential process by local models in separate threads. A probabilistic model combines the local models to produce an expectation of what a regular SGD would have produced [19].
LSVM | SVM | Yes | LSVM, or Linear SVM, implements the PEGASOS optimization technique. PEGASOS is a modified SGD method where each SGD step is accompanied by a projection step. For SVM problems, this produced a higher convergence rate [32].
LGBM | Dec. tree | No | Light Gradient Boosting Machine, LGBM, is a gradient boosting decision tree which is a high performing method for multiclass classification. Gradient boosting is trained as a sequence of decision trees improving upon one another, making it an ensemble model. LGBM adds to this model by excluding data instances with small gradients and excluding less significant independent variables, resulting in higher computation speeds [16].
FT | Dec. tree | Yes | Like LGBM, Fast Tree (FT) is built upon the gradient boosting decision tree method. FT uses the MART algorithm which specializes in broadening the ensemble process, such that later iterations do not impact the prediction of only a few of the independent variables [29].
FF | Dec. tree | Yes | Fast Forest, FF, implements the random forest technique. Random forest creates an ensemble of independent decision trees and creates a distribution type model from the decision trees that make up the forest [18].
SDCA-ME | Max. Ent. | No | Stochastic dual coordinate ascent is an optimization technique that solves the dual problem. The dual problem sets the lower bound of the primal problem, where the primal problem is the one that usually is solved by optimization [33].
BFGS-ME | Max. Ent. | No | See BFGS-LR.
AP | Neu. Net. | Yes | Averaged perceptron is a single-layer neural network. Being an online algorithm, it updates its weights after each training instance if the label is incorrect and keeps them otherwise. The averaging in this method comes from the final prediction being calculated by averaging the result of each iteration [11].
CNN | Neu. Net. | No | Made by myself in Python using TensorFlow. Starts with one convolutional layer with the ReLU activation function followed by max pooling and flattening. The resulting vector is reduced to 64 neurons with a ReLU activation and the output layer utilizes the Softmax activation function, providing the probability vector for the classes. Further information in section 3.3.3.
Table 5 – Overview of the examined regression algorithms.
Method | Base method | Description
OLS | Lin. reg. | Ordinary least squares, OLS, is the traditional linear regression that optimizes by minimizing the sum of squares as described by the linear regression section [15].
SDCA-LR | Lin. reg. | See SDCA-ME in table 4.
OGD | Lin. reg. | The online gradient descent method in the ML.NET library corresponds to utilizing stochastic gradient descent optimization on linear regression. For more information on SGD, see table 4.
BFGS-PR | Poi. reg. | See BFGS-LR in table 4.
LGBM | Dec. tree | See LGBM in table 4.
FT | Dec. tree | See FT in table 4.
FTT | Dec. tree | FTT, Fast Tree Tweedie, is an alternative FT (see table 4) that minimizes the Tweedie loss function. The loss function is constructed from the Tweedie distribution, which is a distribution where a majority of the samples are at the origin and a separate normal or Gaussian distribution is situated further down the tail [42].
FF | Dec. tree | See FF in table 4.
3.3 Settings
3.3.1 Software
I conducted the project in Prediktera’s modelling and prediction tool Breeze [27].
Breeze allows for image segmentation as well as sample, quantification and classification modelling of hyperspectral data, making it an ideal environment to conduct the benchmark study.
Starting by importing the data, the user is situated in the Record segment of
the software. In Record you work with the segmentation; which pixels in the
such that only the objects in the image remain. Sample modelling (PCA) provides a solution where you may exclude pixels from a variance scatter plot, illustrated in figure 14. Because PCA finds the principal components that explain the most variance, the variance scatter plot often splits into distinct clusters, and the unwanted pixels can easily be identified and excluded. Once any unwanted data has been removed, a segmentation type that extracts data is selected. The two segmentation types I used were representative spectrum and grids and insets. The former randomly picks a number of pixels from each sample. The latter places a grid on each sample, and each rectangle in the grid becomes an observation whose channels are averaged over the pixels in that rectangle.
Figures 15 and 16 illustrate the results of removal of background pixels through PCA modelling as well as the usage of the two mentioned segmentation types, respectively.
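The two segmentation types can be illustrated with a minimal Python sketch. It assumes an image stored as nested lists, where image[row][column] is a list of channel values, and it samples uniformly at random. The function names are hypothetical and the code is not Breeze's implementation, which for instance also offers a Gaussian setting for the representative spectrum.

```python
import random
from statistics import mean


def representative_spectrum(pixels, n_samples, seed=0):
    """Pick n_samples pixel spectra uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(pixels, n_samples)


def grid_observations(image, cell_rows, cell_cols):
    """Lay a grid over the image and return one mean spectrum per rectangle."""
    n_rows, n_cols = len(image), len(image[0])
    observations = []
    for r0 in range(0, n_rows, cell_rows):
        for c0 in range(0, n_cols, cell_cols):
            cell = [image[r][c]
                    for r in range(r0, min(r0 + cell_rows, n_rows))
                    for c in range(c0, min(c0 + cell_cols, n_cols))]
            # zip(*cell) groups the values of each channel across the cell's pixels
            observations.append([mean(channel) for channel in zip(*cell)])
    return observations


# Toy 2 x 4 image with 2 channels per pixel (made-up values).
image = [[[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]],
         [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]]

grid = grid_observations(image, cell_rows=2, cell_cols=2)
flat = [pixel for row in image for pixel in row]
picked = representative_spectrum(flat, n_samples=3)
print(grid)    # two observations, one per 2 x 2 rectangle
print(picked)  # three randomly chosen pixel spectra
```

The grid variant trades spatial detail for less noisy observations, which suits cases where the ground truth is only known per region rather than per pixel.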
Figure 14 – A variance scatter plot generated by a PCA model with SNV pre-treatment and its corresponding measurement. The yellow selected pixels in the scatter plot belong to the cluster that corresponds to the relevant data in the measurement, colored brown.
Figure 15 – Sampling of the textile measurement from the waste study along with the choice of pixels through the representative spectrum segmentation type. The yellow line in the upper image encloses the area that the PCA model deemed to not be background. In this secluded area a representative spectrum extracts pixels at random, visible as the red dots. This particular representative spectrum has a Gaussian setting, drawing randomly from a two-dimensional Gaussian distribution.
Figure 16 – Sampling of one of the powder measurements from the powder study along with the choice of pixels through the grids and insets segmentation type.
The yellow line in the left image encloses the area that the PCA model deemed to not be background and the grids in the right image correspond to the areas of pixels to be investigated.
Having extracted the data to be examined you should introduce the ground truth, the quantification variables or classification classes, to the pixels. This can be done in any of the stages of the segmentation; the entire measurement, the PCA sampling or the final segmentation type.
The next segment of Breeze is the Model segment. Here, you create your sampling, quantification or classification models. When creating a quantification or classification model the first three steps are the same:
1. Variables - In the first step, you choose which segmentation from the Record that you want to include and which set of variables or classes that you are interested in modelling.
2. Samples - In the second step, you choose which of the pixels from the
segmentation that should be included and individually set them as training
or test data. Note that this can be done by forming groups in the Record
step that correspond to a training and test group.
3. Wavelengths - In the third step, you choose which channels you want to utilize. Additionally, the pre-treatments of the data are applied here. Mean-centering is by default always applied.
At this point the data is ready for modelling. Therefore, the remainder of the process varies between the models:
1. PLS - Consists of manually excluding outliers and setting the number of components of the model. The goodness of prediction Q² reaches a maximum after a certain number of components, corresponding to the number of components needed for the best PLS model for your data. Breeze has an automatic feature that estimates the number of components needed for an acceptable model.
2. PLS-DA - As PLS-DA is PLS for classification problems, the process is similar. However, there is an added feature of confidence level, which corresponds to the level at which the model classifies pixels as "No class". A broad confidence level forces the model to classify problematic pixels as one of the defined classes, erasing the "No class" prediction from the model.
3. SIMCA - As with the PLS methods, you can exclude outliers, but the modelling consists of setting a critical distance for each of the classes. The critical distance of a class is set such that all pixels belonging to that class are precisely within the critical distance. Breeze has an automatic feature for acceptable critical distances. Additionally, Breeze automatically determines the number of principal components needed in the PCA models of each class.
4. ML.NET - In Breeze, you choose an algorithm from the ML.NET library and a training time. However, this is not the time that a single model spends training on the data. It is the amount of time that ML.NET runs its self-updating process, during which several models are created and the input parameters of the model are optimized. Letting the process run for a minute means that it might create a couple of models of the chosen algorithm and then output the model whose input parameters yielded the best performance.
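The idea of a time budget that covers a whole parameter search, rather than the training of one model, can be sketched as follows. This is a generic random search under stated assumptions, not ML.NET's actual procedure: the names timed_search, sample_params and train_and_score are hypothetical, and the toy objective stands in for training and validating a real model. A fake clock is injected so the example is deterministic.

```python
import random
import time


def timed_search(train_and_score, sample_params, budget_s,
                 clock=time.monotonic, seed=0):
    """Train models with sampled parameters until the budget is spent.

    Returns the best parameters, their score and the number of models built.
    At least one model is always trained.
    """
    rng = random.Random(seed)
    deadline = clock() + budget_s
    best_params, best_score, n_models = None, float("-inf"), 0
    while n_models == 0 or clock() < deadline:
        params = sample_params(rng)
        score = train_and_score(params)
        n_models += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_models


def make_fake_clock(step):
    """Clock that advances by `step` seconds every time it is read."""
    t = [-step]

    def clock():
        t[0] += step
        return t[0]

    return clock


def sample_params(rng):
    # Hypothetical single hyperparameter, e.g. a tree depth.
    return {"depth": rng.randint(2, 10)}


def train_and_score(params):
    # Toy stand-in for validation performance; best at depth 6.
    return -(params["depth"] - 6) ** 2


best_params, best_score, n_models = timed_search(
    train_and_score, sample_params, budget_s=3, clock=make_fake_clock(1))
print(best_params, best_score, n_models)
```

With a real clock, a longer budget simply allows more candidate models, which is why the entered time bounds the search rather than the training of the final model.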
The results of a quantification model are given in R² and Q², while a classification model outputs a confusion matrix. For analysis and plotting, the results were exported to MATLAB.
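As a reminder of what the two quantification metrics measure, the sketch below computes both with the common definition 1 − SS_res/SS_tot, assuming that R² is evaluated on the training data and Q² on held-out data, with SS_tot taken around the mean of the evaluated set. Breeze may compute Q² differently (for example through cross-validation), and all numbers are made up.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for paired true and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Goodness of fit: predictions on the training observations.
y_train = [1.0, 2.0, 3.0, 4.0]
y_train_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_train, y_train_pred)

# Goodness of prediction: predictions on held-out observations.
# A model that predicts worse than the mean gives a negative value.
y_test = [1.0, 2.0, 3.0]
y_test_pred = [3.0, 3.0, 0.0]
q2 = r_squared(y_test, y_test_pred)

print(r2, q2)
```

A negative value therefore does not indicate a computational error; it means the model predicts worse than a constant equal to the mean.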
3.3.2 Data segmentation
Having removed the background, observations from the classification data sets were selected. The representative spectrum segmentation type was applied to all data sets except the strip data set. On the strip data set, grids and insets were used because the variables were measured segment-wise by RISE. The number of examined pixels for each class in the classification studies is shown in appendix A.1. In the same appendix you find the number of pixels, or segments in the strip study, that contained each variable of the quantification problems.
3.3.3 Model settings
The settings of a model depend on the chosen channels, their pre-treatments as well as the parameters intrinsic to the model. The chosen channel ranges and amounts can be seen in section 3.1.7. To determine which pre-treatments to use, each method was tested with four combinations of centering, SNV and Savitzky-Golay (SGF). As centering was applied in all cases, the four combinations were: centering only; centering and SGF; centering and SNV; and centering, SNV and SGF. The pre-treatments providing the highest accuracy for a specific combination of method and data set were the ones I used. Keeping to one combination of pre-treatments was sufficient for the ML.NET methods in each of the classification studies. It was not sufficient for the quantification problems, as the performance of the ML.NET methods depended heavily on the choice of pre-treatments.
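A minimal sketch of the pre-treatment combinations follows. It implements SNV and mean-centering in plain Python and only enumerates SGF, which is left out of the code. The sketch assumes that SNV standardizes each spectrum with its own mean and sample standard deviation and that centering subtracts the per-channel mean over all observations; the function names and toy spectra are hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: standardize one spectrum by its own statistics."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(x - m) / s for x in spectrum]


def mean_center(spectra):
    """Subtract the per-channel mean, computed over all observations."""
    channel_means = [mean(channel) for channel in zip(*spectra)]
    return [[x - m for x, m in zip(s, channel_means)] for s in spectra]


# The four tested combinations: centering is always on, SNV and SGF optional.
names = ("SNV", "SGF")
combinations = []
for flags in product((False, True), repeat=2):
    extra = tuple(n for n, on in zip(names, flags) if on)
    combinations.append(("centering",) + extra)

# Two toy spectra that differ only by a multiplicative factor.
spectra = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
snv_spectra = [snv(s) for s in spectra]
centered = mean_center(snv_spectra)

print(combinations)
print(snv_spectra)  # SNV removes the scale difference between the spectra
print(centered)
```

The toy spectra become identical after SNV, which shows why the choice of pre-treatment can change what a model is able to learn from the data.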
All model settings are noted in tables in the appendices. Appendix A.2 contains the settings of the used PLS-DA models and appendix A.3 the settings of the SIMCA models. The settings of the PLS models used for quantification analysis are situated in appendix A.4. Lastly, the ML.NET settings for classification and quantification are both placed in appendix A.5.
The self-made convolutional neural network for classification was constructed as shown in table 6. It is a simple network that takes input in batches of 16, corresponding to 16 pixels with the number of channels defined by the problem. The convolutional layer increases the dimensions from (16 × features × 1) to (16 × features × 128), with the values in the neurons being subject to a ReLU activation function. Then, the network is max pooled, bringing the dimensions down to (16 × features/2 × 128), which is thereafter flattened out to a regular one-dimensional layer of size (16 × features/2 × 128) × 1. This layer is then connected to a layer of 64 neurons with a ReLU activation function. Lastly, the 64-neuron layer is connected to a layer whose size equals the number of classes of the problem. The neurons of the last layer employ the Softmax activation function, so the resulting layer is a vector of probabilities. The neuron with the highest probability gives the class that the convolutional neural network predicts for the input data. The network was an extra task in the thesis and was used to model the waste, Indian Pines and nuts data. It was modelled and tested in Google Colab, and only the accuracy was measured.
Table 6 – Summary of convolutional neural network used to classify the waste, Indian Pines and nuts data. Input was batches of 16 pixels from the examined data set. Columns correspond to the input to the TensorFlow functions. Modelling settings are found in the bottom half of the table.
Type                     Filters  Kernel size  Stride  Padding  Units
Convolutional 1D (ReLU)  128      3            1       2        -
Max pooling              -        2            1       0        -
Flatten                  -        -            -       -        -
Dense (ReLU)             -        -            -       -        64
Dense (Softmax)          -        -            -       -        no. classes

Optimizer         Adam
Loss function     Cross entropy
Batch size        16
Epochs            40
Validation split  5%
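The layer dimensions described in the text can be traced with a small helper. This is a bookkeeping sketch in plain Python, not the TensorFlow model itself. It follows the text in assuming that the convolution preserves the number of features and that max pooling halves it, and it reports the flattened size per pixel rather than for the whole batch. The function name and the example values of 200 features and 16 classes are hypothetical.

```python
def cnn_shapes(n_features, n_classes, batch=16, filters=128,
               kernel=3, dense_units=64):
    """Return the output shape of every layer and the trainable parameter counts."""
    pooled = n_features // 2  # assumption: pooling halves the feature axis
    shapes = [
        ("input", (batch, n_features, 1)),
        ("conv1d + ReLU", (batch, n_features, filters)),
        ("max pooling", (batch, pooled, filters)),
        ("flatten", (batch, pooled * filters)),
        ("dense + ReLU", (batch, dense_units)),
        ("dense + softmax", (batch, n_classes)),
    ]
    params = {
        # kernel weights for one input channel, plus one bias per filter
        "conv1d": kernel * 1 * filters + filters,
        # fully connected weights plus one bias per unit
        "dense": pooled * filters * dense_units + dense_units,
        "output": dense_units * n_classes + n_classes,
    }
    return shapes, params


shapes, params = cnn_shapes(n_features=200, n_classes=16)
for name, shape in shapes:
    print(f"{name:16s} {shape}")
print(params)
```

The trace makes it clear that nearly all trainable parameters sit in the first dense layer, since the flattened vector grows with both the number of features and the number of filters.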
4 Results
The results show that there exist several machine learning methods in the ML.NET library that perform equally well or better than the chemometric methods PLS, PLS-DA and SIMCA.
In particular, the classification methods built on decision trees generally provided the highest accuracy and F1-scores. However, they also had the longest training and prediction times. The maximum entropy methods struck the best balance between computation time and accuracy, being slightly less accurate than the decision tree methods yet quick enough when predicting to be on par with PLS-DA and SIMCA.
Regarding the regression problems, the decision tree methods performed marginally better than PLS. PLS in turn performed on par with the remainder of the ML.NET methods.
4.1 Classification
The methods have been assigned markers in the plots that correspond to their machine learning basis. For example, PLS-DA and SIMCA, being chemometric methods, were assigned the square marker. Similarly, the decision tree methods LGBM, FT and FF were assigned the star marker ∗. The same colors and markers were assigned in the corresponding F1-score subplots.
4.1.1 The waste study
The results of the waste study, illustrated in figure 17, showed the strength of the decision tree methods. The most accurate method, LGBM, almost reached 90% accuracy. It was closely followed by FT, and FF came in third a few percentage points further down. Slightly less accurate than FF were the maximum entropy methods SDCA and BFGS-ME. They in turn were slightly more accurate than BFGS-LR, LSVM, AP and SGD, a collection of logistic regression, neural network and support vector machine methods. Last came the chemometric methods SIMCA and PLS-DA as well as the logistic regression method SSGD. Thus, the five most accurate methods were the five available non-linear classifiers.
[Figure 17: plot of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM and AP, with per-class F1-score [1] panels LGBM - FT - FF - SDCA, BFGS-ME - BFGS-LR - LSVM - AP and SGD - SIMCA - SSGD - PLS-DA.]
Figure 17 – The total accuracy and generated F1-scores for the methods' classification of the waste study. Methods with the same machine learning basis have the same markers, and the methods are assigned to the subplots in the F1-segment of the figure according to their accuracy ranking.
Examining the F1-score we find that there were large differences in the classifications of the cardboard, HDPE and recycled print classes. The main issue for the less accurate methods was that they misclassified these classes as a similar class. The cardboard pixels and recycled print pixels were principally misclassified as each other, and the HDPE pixels were mostly misclassified as HDPE2. This indicates that one problem in this data set is distinguishing between very similar classes, which the non-linear classifiers could handle well. For the remaining classes the more accurate methods performed marginally better.
4.1.2 The cheese studies
While being studies of the same object, cheese, the uncoated, wax and paraffin studies yielded a variety of results, not particularly favouring one method over another. A common trait of the studies was the imbalanced data, in which the cheese, wax and paraffin classes were in a strong majority in their respective studies.
The uncoated cheese study gave results similar to those of the waste study, as shown in figure 18. The decision tree and maximum entropy methods, with the addition of AP and closely followed by BFGS-LR, provided the most accurate results with a total accuracy of just above 95%. Within a span of 87 and 94% PLS-DA, LSVM
[Figure 18: plot of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM and AP, with per-class F1-score [1] panels LGBM - SDCA - FT - AP, FF - BFGS-ME - BFGS-LR - SSGD and LSVM - PLS-DA - SGD - SIMCA.]
Figure 18 – The total accuracy and generated F1-scores for the methods' classification of the uncoated cheese study. Methods with the same machine learning basis have the same markers, and the methods are assigned to the subplots in the F1-segment of the figure according to their accuracy ranking.
Judging from accuracy, SIMCA underperformed. Additionally, SIMCA consistently provided a smaller F1-score than all other methods for all classes but the edge class. SIMCA's problem was the misclassification of mold and reflection as cheese while simultaneously misclassifying cheese as white stain pixels. The second least accurate method, SGD, also misclassified mold and reflection as cheese, but could correctly classify the cheese pixels to a far greater extent. This was the case for all methods but SIMCA, meaning that the performance depended on the degree to which the minority classes were classified as the cheese majority class.
The wax study showed that no single method is always the best, as a new most accurate method emerged. Here, as seen in figure 19, AP yielded the highest accuracy at approximately 92%. The span of 84 to 90% contained all other methods but SIMCA, with no pattern suggesting that one group of methods was better than another. SIMCA achieved an accuracy of roughly 78%.
[Figure 19: plot of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM and AP, with per-class F1-score [1] panels AP - BFGS-ME - PLS-DA - LGBM, BFGS-LR - LSVM - FT - SSGD and SGD - SDCA - FF - SIMCA.]
Figure 19 – The total accuracy and generated F1-scores for the methods' classification of the wax cheese study. Methods with the same machine learning basis have the same markers, and the methods are assigned to the subplots in the F1-segment of the figure according to their accuracy ranking.
Except for SIMCA, which struggled with the reflection class, the F1-score behaviour was consistent over the methods. The performance became dependent on the number of misclassifications of the minority classes as the wax majority class, similarly to the uncoated cheese study. Additionally, no method could correctly identify a single pixel of the dirt class, making the F1-score undefined.
Finally, the paraffin cheese study was handled well by all methods. The most accurate method was LGBM with an accuracy of 99.3% and the least accurate method was SSGD with a 95.7% accuracy. The variety in performance depended on the classification of the edge class, as seen by the F1-score in figure 20. The minority class edge was misclassified as the paraffin majority class, similar to the previous cheese studies.
[Figure 20: plot of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM and AP, with per-class F1-score [1] panels LGBM - SIMCA - FT - FF, LSVM - BFGS-ME - AP - SDCA and BFGS-LR - PLS-DA - SGD - SSGD.]
Figure 20 – The total accuracy and generated F1-scores for the methods' classification of the paraffin cheese study. Methods with the same machine learning basis have the same markers, and the methods are assigned to the subplots in the F1-segment of the figure according to their accuracy ranking.
4.1.3 The Indian Pines study
Having nearly the same number of classes but less data, the Indian Pines study provided results similar to the waste study. The decision tree methods LGBM and FT had the highest accuracy, reaching just over 90%, followed by FF at 84%. FF was at the top of a group with five other methods, BFGS-ME, SDCA, BFGS-LR, LSVM and AP, ranging between 79 and 84% accuracy. The least accurate methods were SGD and SSGD at roughly 73%, PLS-DA at just below 60% and, most notably, SIMCA at almost 0%.
[Figure 21: plot of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM and AP, with per-class F1-score [1] panels LGBM - FT - FF - BFGS-ME, BFGS-LR - SDCA - LSVM - AP and SGD - SSGD - PLS-DA - SIMCA.]
Figure 21 – The total accuracy and generated F1-scores for the methods' classification of the Indian Pines study. SIMCA is excluded from the accuracy plot as it misclassified almost all pixels. Methods with the same machine learning basis have the same markers, and the methods are assigned to the subplots in the F1-segment of the figure according to their accuracy ranking.
As seen from the F1-scores illustrated in figure 21, SIMCA achieved one defined F1-score. SIMCA predicted all pixels as the oats class, resulting in practically zero accuracy as well as zero F1-score for that class. This is likely because the PCA model for the oats class failed to produce an accurate model, which in combination with a large critical distance classifies all pixels as that class. Besides SIMCA, the other less accurate methods struggled with some classes. As an example, PLS-DA failed to generate a defined F1-score for the alfalfa, corn, grass-pasture-mowed, oats, BGTD (buildings-grass-trees-drives) and stone-steel-tower classes. The inability to generate a defined F1-score stems from the precision being undefined due to the method not classifying any pixel as that class.
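How an F1-score becomes undefined can be seen in a short sketch that derives per-class precision, recall and F1 from a confusion matrix. It assumes the convention that rows are true classes and columns are predicted classes; the function name and the numbers are made up and do not come from the studies.

```python
def per_class_f1(confusion):
    """Return one F1-score per class, or None where it is undefined.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        precision = tp / predicted if predicted else None   # 0/0 if never predicted
        recall = tp / actual if actual else None
        if precision is None or recall is None or precision + recall == 0:
            scores.append(None)
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return scores


# Toy imbalanced problem: the third class is never predicted by the model.
confusion = [[90, 10, 0],
             [20, 30, 0],
             [5, 5, 0]]

scores = per_class_f1(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / total
print(scores)    # the last entry is None
print(accuracy)
```

The toy model still reaches a fair total accuracy because the majority class dominates, while the F1-scores reveal that one class is never found.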
Among the rest of the methods, the largest performance factor was the classification of the corn and soybean field classes. Consisting of three classes each, the corn fields tended to be misclassified as each other and as the soybean fields, and vice versa. The decision tree methods could handle this distinction best, similarly to the case of the waste study.
4.1.4 The nuts study
The final classification study most methods could handle well, as seen in figure
between the hazelnut and almond classes. The degree to which the methods were able to distinguish between them determined the accuracy.
[Figure 22: plot of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM and AP, with per-class F1-score [1] panels LGBM - FT - SDCA - BFGS-ME, LSVM - BFGS-LR - AP - SGD and FF - SIMCA - SSGD - PLS-DA.]
Figure 22 – The total accuracy and generated F1-scores for the methods' classification of the nuts study. Methods with the same machine learning basis have the same markers, and the methods are assigned to the subplots in the F1-segment of the figure according to their accuracy ranking.
4.1.5 Computational speed
In any real-time classification operation, the training speed and especially the prediction speed are important. The training and prediction times of the classification methods from four of the studies are shown in figures 23 and 24, respectively.
[Figure 23: plot of training Time [s] for LGBM, FT, FF, SDCA, BFGS-ME, BFGS-LR, SSGD, SGD, AP and LSVM in the Waste, Wax, Indian Pines and Nuts studies.]
Figure 23 – Training time of the models of the ML.NET methods that were chosen
by the self-updating iterative process. PLS-DA and SIMCA excluded due to close
to instant training.
[Figure 24: plots of Time [ms] for PLS-DA, SIMCA, LGBM, FT, FF, SDCA, BFGS-ME, BFGS-LR, SSGD, SGD, AP and LSVM, with panels Waste study prediction times, Cheese (wax) study prediction times, Indian Pines study prediction times and Nuts study prediction times.]
Figure 24 – Prediction times of the models of the ML.NET methods on the test data of each problem. The range covers one standard deviation.
Examining the training time of the machine learning methods, we find that the Indian Pines and nuts studies yielded a larger variety of training times while the waste and wax cheese studies were more consistent. In all cases the linear classifiers, that is, the ones built on logistic regression as well as AP and LSVM, reached roughly the same training times. They were quickest in the waste study, training for around 10 s. The slowest case was Indian Pines, where their training times were around 30 s. The maximum entropy methods BFGS-ME and SDCA achieved similar results, with the exception of SDCA reaching a training time of approximately a minute in the Indian Pines study. However, the decision tree methods required 2-4 times as long to train on these data sets. Worst were FT and FF; FT, for example, needed up to 200 seconds of training time for the Indian Pines data set.
The behaviour of the training times carried over to the prediction times, as most methods remained on par while the decision tree methods required 2-4 times as much prediction time. PLS-DA and SIMCA generally provided the quickest prediction times, but they were closely followed by the maximum entropy methods and the linear classifiers.
4.1.6 Introducing the convolutional neural network
As an extra segment, the intent was to examine hand-picked methods from libraries other than ML.NET. This was narrowed down to constructing a simple convolutional neural network, as that is something the ML.NET library lacks. As seen from the results in figure 25, the simple network was the most accurate in the waste study, third in the Indian Pines study and second to last in the nuts study.
[Figure 25: three plots of Accuracy [1] for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM, AP and CNN.]
Figure 25 – The total accuracy of the examined methods with the addition of
the self-made convolutional neural network. The upper left, right and lower plots
represent the waste, Indian Pines and nuts studies, respectively.
4.2 Regression
4.2.1 The powder study
The powder study was well modelled by most methods. As seen in figure 26, the goodness of fit and prediction lie between 0.9 and 1 for all methods and variables except the combination of the potato starch variable and the OGD method, for which the metrics lie between 0.85 and 0.9. However, more interesting to the comparison is that the decision tree methods FF, FT, FTT and LGBM performed slightly better than PLS and the remaining ML.NET methods. They in turn were on par, excluding OGD which performed a bit worse.
[Figure 26: plots of R² and Q² for PLS, FF, FT, FTT, LGBM, OGD, OLS, BFGS-PR and SDCA, for the variables Vanilla, Baking soda and Potato starch.]
Figure 26 – The goodness of fit R² along with the goodness of prediction Q² of the three variables of the powder study.
4.2.2 The strip study
The methods struggled with the variables of the strip study, as illustrated in figure 27. The only moderately well modelled variables were TSI CD and TSI Min, in contrast to the Air Permeance and Thickness variables, for which no method yielded a significant R² or Q²; the values were sometimes negative. In fact, OGD could not produce a single positive value for either metric. More positively, three of the decision tree methods improved on PLS for the two TSO Angle variables: FF, FT and LGBM produced R² and Q² values between 0.5 and 0.6, whereas PLS barely reached 0.2. Besides this, PLS could rival any of the ML.NET methods on this data set.
[Figure 27: plots of R² and Q² for PLS, FF, FT, FTT, LGBM, OGD, OLS, BFGS-PR and SDCA, each with panels TSI CD - TSI Min, TSI Max - TSI MD and TSO Angle+MD - TSO Angle-MD.]