
A comparison of different machine learning algorithms applied to hyperspectral data analysis


Academic year: 2021


Axel Vikström axvi0004@student.umu.se

June 15, 2021

Master’s Thesis in Engineering Physics

Supervisors: Andreas Vidman (andreas@prediktera.se)


Abstract

Hyperspectral image analysis works with image data where each pixel contains hundreds of wavelengths acquired from spectral measurements. It is a growing field of research in the sciences and industries because it can distinguish visually similar objects. While many machine-learning methods work well for analysing regular images, little is known about how they perform on hyperspectral data. Standard methods for quantifying and classifying hyperspectral data include the chemometric methods PLS, PLS-DA and SIMCA. They provide rapid computations along with intuitive modelling and diagnostic tools, but cannot capture more complex data. I benchmarked the chemometric methods against machine learning methods from Microsoft’s ML.NET library on six classification and two quantification problems. The ML.NET methods proved to be good complements to the chemometric methods. In particular, the decision tree methods provided accurate classification and quantification, while the maximum entropy classification methods struck the best balance between accuracy and computational time. While the remaining ML.NET methods performed as well as or better than the chemometric methods, finding their use requires testing on data sets with a wider range of properties. The best ML.NET methods are suitable for analysing more complex hyperspectral images by capturing nonlinearities disregarded by standard image analysis.


Acknowledgements

Primarily, I would like to thank Andreas, Oskar and Thomas of Prediktera for guiding and aiding me through the process of conducting a full master’s thesis from home. I could not have asked for a more genuine and helpful setting to work in than what I was given. Additionally, I thank Lucas and Martin from the physics institution: Lucas for guiding a nervous student in the beginning of the project and Martin for showing me what’s what in terms of academic writing.


Contents

1 Introduction
2 Theory
2.1 Hyperspectral data
2.2 The chemometric methods
2.2.1 PLS and PLS-DA
2.2.2 SIMCA
2.3 Machine learning methods
2.3.1 Linear regression
2.3.2 Logistic regression
2.3.3 Poisson regression
2.3.4 SVM
2.3.5 Decision trees
2.3.6 Maximum entropy
2.3.7 Neural networks
2.4 Evaluation metrics
2.4.1 Classification metrics
2.4.2 Quantification metrics
3 Method
3.1 The data sets
3.1.1 The waste data
3.1.2 The cheese data
3.1.3 The Indian Pines data
3.1.4 The nuts data
3.1.5 The powder data
3.1.6 The strips data
3.1.7 Metadata of the sets
3.2 Examined complement methods
3.3 Settings
3.3.1 Software
3.3.2 Data segmentation
3.3.3 Model settings
4 Results
4.1 Classification
4.1.3 The Indian Pines study
4.1.4 The nuts study
4.1.5 Computational speed
4.1.6 Introducing the convolutional neural network
4.2 Regression
4.2.1 The powder study
4.2.2 The strip study
5 Discussion
5.1 Future work
5.2 My recommendation
6 Conclusion
A Appendix
A.1 Data segmentation
A.2 PLS-DA model settings
A.3 SIMCA model settings
A.4 PLS model settings
A.5 ML.NET model settings


1 Introduction

Much data is produced through imaging, where image data consists of measures of light intensity in each pixel in one or more channels. In a grey scale image, each pixel consists of one channel, while an RGB image consists of three channels containing red, green and blue light intensities. Including more channels in each pixel increases the amount of information, and a typical procedure is to include wavelengths from outside the visible spectrum. This is known as hyperspectral imaging, where hundreds of channels may be included in the pixels [14]. Hyperspectral imaging implies that visibly similar objects may be distinguished by analysing the reflected wavelengths from outside the visible spectrum. The technique therefore excels when the visible spectrum gives little to no information, such as in real-time classification of waste materials or quantification of the content in food samples. While many studies have examined the performance of methods on regular image data, the range of methods available for analysing hyperspectral image data has been less investigated.

Prediktera AB is a software company that specializes in data analysis of hyperspectral images. At the beginning of this project, they employed the PLS, PLS-DA and SIMCA methods. These are chemometric methods, able to quantify and classify hyperspectral images pixel-wise. PLS is a quantification method for regression analysis [39], while PLS-DA and SIMCA are methods for solving classification problems [12]. While they are quick solvers with satisfactory diagnostic tools and intuitive modelling, their performance is reduced when handling more complex data. This implies that these methods may not be satisfactory for larger, non-linear data sets. Investigating additional methods that complement the chemometric methods may therefore let Prediktera offer efficient solutions for problematic data sets. The machine learning research field is likely to contain such complements, due to machine learning algorithms’ ability to make efficient data-driven recommendations and decisions purely based on input data.

This thesis aims to determine whether specific machine learning algorithms outperform other methods when analysing hyperspectral data, by creating and conducting a benchmark test. These aims are met by comparing the performance of machine learning algorithms from Microsoft’s ML.NET library [21] with that of the chemometric methods. I utilized data sets from the hyperspectral industry: two quantification and six classification data sets. Evaluating the performance of a classification method was done by measuring its total accuracy and speed of training and [...] goodness of fit and prediction. By examining the performance, I hope to find methods that, compared to the chemometric methods, allow for analysis of more complex data, consequently expanding the number of usable methods within hyperspectral data analysis.


2 Theory

2.1 Hyperspectral data

As a way to aid remote sensing studies, Alexander Goetz introduced hyperspectral imaging in 1985 [13]. The technique combines the spatial information given by imaging and the spectral information of spectroscopy. Each pixel is allotted multiple values, called channels, corresponding to the spectroscopically measured wavelengths. This is illustrated in figure 1 along with a comparison with a grey scale and an RGB image.

Figure 1 – Comparison of grey scale, RGB and hyperspectral images. The digital representation corresponds to the hypercube of data that is characteristic in hyperspectral image analysis. In this case, the pixels in the hyperspectral image have 256 channels, corresponding to 256 spectroscopically measured wavelengths.

Source: Prediktera AB [27].

One of the most common wavelength ranges for measuring hyperspectral data is the visible and near-infrared spectra. The near-infrared spectrum lies between the visible and the mid-infrared spectra, with wavelengths from 780 nm [...] difference in energy levels intrinsic to the molecules [37].

There are a number of methods used to sample hyperspectral data: point scanning, line scanning, area scanning and single shot [28]. Point scanning measures all channels in a single pixel, one pixel at a time. Line scanning implies scanning all channels of a single row of pixels, one line at a time. Area scanning is scanning the entire image, one channel at a time. Lastly, single shot is the acquisition of all channels for all pixels at the same time. Utilizing any of these methods produces the three-dimensional hyperspectral cube, or hypercube, seen in figure 1, that contains the spatial and spectral data.

As a final step before analysis the hypercube typically undergoes spectral pre-processing. Initially, the raw intensity measurements can be adjusted by using white and dark reference values sampled by the measurement system. They are measurements of the brightest and darkest intensities, often used for calibration.

Denoting the reference values respectively as W and D, the absorbance A is computed according to

$$A = \log_{10}\left(\frac{I_0 - D}{W - D}\right), \tag{1}$$

where $I_0$ is the raw intensity values [37]. Having converted the raw intensities to absorbance according to equation 1, another standard procedure is to apply mean-centering, which places the origin at the center of the data. Then, it is suitable to apply one or more pre-treatments to the centered data. Two common approaches are applying Savitzky-Golay filtering for de-noising and/or standard normal variate (SNV) for scatter correction [1].
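
The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the four-channel example values are hypothetical, SNV is taken in its usual form (each spectrum centred and scaled by its own mean and sample standard deviation), and each step is shown in isolation rather than as the exact pipeline used in Breeze.

```python
import math

def absorbance(raw, white, dark):
    """Equation 1 applied per channel: A = log10((I0 - D) / (W - D))."""
    return [math.log10((i0 - d) / (w - d)) for i0, w, d in zip(raw, white, dark)]

def mean_center(spectra):
    """Subtract the per-channel mean, taken over all spectra, from every spectrum."""
    n = len(spectra)
    means = [sum(column) / n for column in zip(*spectra)]
    return [[v - m for v, m in zip(row, means)] for row in spectra]

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]

# Hypothetical four-channel pixel with its white and dark references.
raw = [520.0, 410.0, 300.0, 150.0]
white = [1000.0, 1000.0, 1000.0, 1000.0]
dark = [20.0, 20.0, 20.0, 20.0]

a = absorbance(raw, white, dark)
a_snv = snv(a)
centered = mean_center([[1.0, 2.0], [3.0, 6.0]])
```

After SNV the spectrum has zero mean and unit sample standard deviation, which is what makes spectra with different overall scatter levels comparable.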

2.2 The chemometric methods

The chemometric methods PLS, PLS-DA and SIMCA are widely used for spectral data analysis. The methods take a set of independent variables as input and return a predicted variable, also known as a dependent variable. In pixel-wise hyperspectral imaging, these methods are used for quantification or classification of the pixels. The independent variables are the channels of a pixel, while the returned dependent variable is a quantification value or a class.

A trait shared by the methods is that they are all based on principal component analysis (PCA) modelling [8]. PCA modelling removes covariance by making new orthogonal dimensions of the independent data through linear orthogonal transforms. These new dimensions are the principal components, where the first component explains the largest amount of variance among the independent variables, [...] et cetera. SIMCA inherently uses PCA modelling for each class, while PLS and PLS-DA utilize a computation of orthogonal components in a similar manner to PCA.
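
To make the notion of a first principal component concrete, the sketch below finds it by power iteration on the sample covariance matrix. It is a minimal illustration, not how the chemometric software computes its components; the function name and the two-channel toy data are assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the direction of largest variance and the variance along it."""
    n, p = len(rows), len(rows[0])
    means = [sum(column) / n for column in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the mean-centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, variance

# Toy data in which the second channel is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, variance = first_principal_component(rows)
```

For this toy data all variance lies along the direction (1, 2), so a single component describes the data completely.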

2.2.1 PLS and PLS-DA

Partial least squares (PLS) was introduced to spectral data analysis by Svante Wold as a way to extend traditional linear regression. PLS computes a linear regression model on a space defined by the components that best explain the variance of the independent variables [39]. These components are known as latent variables and are orthogonal linear combinations of the independent variables given to the algorithm. Essentially, PLS creates a traditional regression model, but on a space with orthogonal components computed from the independent and dependent data correlations.

PLS is also suitable for classification problems, when the dependent variables are categorical. In those cases the method is called partial least squares discriminant analysis (PLS-DA) [3].

The method provides simple yet powerful modelling as long as the data does not grow too complex [39]. As the data becomes more complicated, non-linearities override the possibility for a single regression model to be well established. To combat this the analyst can introduce a series of connected PLS models, known as hierarchical modelling, in which latter models predict using the results from previous models. Hierarchical modelling is however tedious and in many cases difficult to efficiently build due to the propagation of predictions over the models.

2.2.2 SIMCA

Soft independent modelling of class analogy (SIMCA) is a classification method [4]. A PCA model is generated for each class and a distance metric is used to measure the distance between data points and the hyperplane defined by the PCA model. A threshold, the critical distance, is set on the distance metric, defining whether the entry is close enough to the span in order for it to be classified as that class. One PCA model per class implies that an entry may be classified as several, or none, of the classes.

SIMCA is likely to perform better than PLS-DA as the number of classes increases. However, besides being a lot of work, it suffers from a lack of interpretation due [...] of accuracy.

2.3 Machine learning methods

This section presents the mathematical basis that each of the possible complement methods is built upon. In machine learning, a model generates a prediction from the input data. The model is then evaluated by computing a metric of choice, called the loss function, that depends on the generated prediction. If the loss function is below a certain threshold value, the model is deemed accurate. Otherwise, the model is re-configured. This is repeated until the model stops improving.

The methods will be considered as generating a dependent variable y from a set of n independent variables $x = (x_1, x_2, \ldots, x_n)$. When analysing hyperspectral image data, this corresponds to quantifying or classifying a pixel as y from its set of channels x.

2.3.1 Linear regression

Linear regression implies employing a linear model for predicting continuous data. The predicted variable is computed as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n, \tag{2}$$

where $(\beta_0, \beta_1, \ldots, \beta_n)$ is a set of unknown parameters. To determine these parameters, the model undergoes optimization by minimizing a metric partially defined by the predicted value of the model and the observed values. A commonly used optimization metric is the sum of least squares, defined as

$$\text{minimize} \sum_{i=1}^{N} \left(y_{i,\text{obs}} - y_{i,\text{pred}}\right)^2, \tag{3}$$

where N is the number of data points, $y_{\text{obs}}$ an observed value and $y_{\text{pred}}$ the value predicted by the model. Note that the difference is measured in the vertical distance between the observed value and the produced model [15]. Figure 2 illustrates this by showing a linear regression defined by equation 2 and optimized by equation 3 over a set of data.


Figure 2 – A linear regression, the blue line, over a set of data. The red dotted line illustrates the vertical distance used to compute the sum of least squares metric.
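
For a single independent variable, equations 2 and 3 have a well-known closed-form solution, which the sketch below implements. The function names and the four data points are hypothetical and only meant to show the minimization at work.

```python
def fit_line(xs, ys):
    """Least-squares estimates of beta_0 and beta_1 for y = beta_0 + beta_1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1

def sum_of_squares(xs, ys, beta_0, beta_1):
    """Equation 3: squared vertical distances between observations and the line."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical observations lying close to y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.0]

beta_0, beta_1 = fit_line(xs, ys)
best = sum_of_squares(xs, ys, beta_0, beta_1)
worse = sum_of_squares(xs, ys, beta_0 + 0.1, beta_1)
```

Any other choice of parameters, such as the shifted intercept used for `worse`, gives a larger sum of squares than the fitted line.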

2.3.2 Logistic regression

While linear regression best suits predicting continuous data, logistic regression performs better when handling binary categorical data. The regression utilizes the sigmoid function

$$y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}, \tag{4}$$

which is optimized to minimize a metric of choice, the sum of least squares for example [5]. This yields a situation as illustrated in figure 3, where a logistic regression in the form of equation 4 is fitted to a set of categorical data with two classes. A threshold is set on the logistic model vertically halfway between the two classes, and the data on each side of the threshold is classified accordingly.


Figure 3 – A logistic regression, the blue line, optimized to fit binary categorical data. The method classifies all items right of the threshold as the circle class and everything to the left as the square class. Thus, due to overlapping, the model will misclassify the left-most circle and the right-most square.
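
A minimal sketch of equation 4 with a threshold at 0.5 is given below. The training loop is an assumption made for illustration: it uses plain stochastic gradient steps on the log-loss rather than the sum of squares mentioned in the text, and the one-channel data, learning rate and epoch count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    """Equation 4: beta[0] is the intercept, beta[1:] the channel weights."""
    return sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))

def classify(x, beta, threshold=0.5):
    """Assign class 1 on one side of the threshold and class 0 on the other."""
    return 1 if predict_proba(x, beta) >= threshold else 0

def fit(samples, labels, learning_rate=0.1, epochs=2000):
    """Stochastic gradient descent on the log-loss (an illustrative choice)."""
    beta = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(x, beta) - y
            beta[0] -= learning_rate * error
            for j, xj in enumerate(x):
                beta[j + 1] -= learning_rate * error * xj
    return beta

# Hypothetical one-channel data with two non-overlapping classes.
samples = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
labels = [0, 0, 0, 1, 1, 1]

beta = fit(samples, labels)
predictions = [classify(x, beta) for x in samples]
```

With overlapping classes, as in figure 3, the same threshold rule would misclassify the samples that fall on the wrong side.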

2.3.3 Poisson regression

Poisson regression, used for quantification, stems from the Poisson distribution with an expected value $\mu$,

$$P(y = k \mid x) = \frac{\mu^k}{k!} e^{-\mu}, \quad k = 0, 1, 2, \ldots \tag{5}$$

In turn, the Poisson regression model assumes that

$$\ln(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n, \tag{6}$$

which yields the combination of equations 5 and 6 as

$$P(y = k \mid x) = \frac{e^{k(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}{k!} e^{-e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}, \quad k = 0, 1, 2, \ldots \tag{7}$$

The Poisson distribution in equation 7 is optimized by tuning the unknown $\beta$ parameters, and thereafter the predictions will follow this Poisson distribution, with a mean dictated by equation 6 [6].
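
Equations 5 to 7 can be checked numerically with the short sketch below. The parameter values and the two-channel input are hypothetical; the point is only that the log-linear mean of equation 6 plugs into the Poisson probabilities of equation 5.

```python
import math

def poisson_mean(x, beta):
    """Equation 6 solved for mu: mu = exp(beta_0 + beta_1*x_1 + ... + beta_n*x_n)."""
    return math.exp(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))

def poisson_pmf(k, mu):
    """Equation 5: probability of observing the count k given the mean mu."""
    return mu ** k / math.factorial(k) * math.exp(-mu)

beta = [0.5, 0.2, -0.1]   # hypothetical fitted parameters
x = [1.0, 2.0]            # hypothetical two-channel input

mu = poisson_mean(x, beta)
probabilities = [poisson_pmf(k, mu) for k in range(50)]
expected_value = sum(k * p for k, p in enumerate(probabilities))
```

The probabilities sum to one and their expected value equals `mu`, which is the quantity a Poisson regression returns as its prediction.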


2.3.4 SVM

A support vector machine (SVM) is a binary classification technique known to provide high classification accuracy when classifying hyperspectral data [20]. The linear case consists of placing a linear hyperplane decision boundary between the classes, as seen in figure 4. The closest data points of either class are called the support vectors to the hyperplane. The method tries to maximize the distance between each side’s support vectors, reducing the margin of error.

Figure 4 – A support vector machine classifier separating two classes. The dotted lines are the support vectors, defined by the closest data points to the classifier, for which the area between is to be maximized.
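
The quantity that a linear SVM maximizes can be illustrated by comparing the margin of two candidate hyperplanes. This sketch does not train an SVM; the hyperplane parameters `w` and `b`, the data and the function names are hypothetical.

```python
import math

def decision_value(x, w, b):
    """Signed, unnormalized distance of x from the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(samples, labels, w, b):
    """Smallest distance from any correctly oriented sample to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(x, w, b) / norm for x, y in zip(samples, labels))

# Hypothetical two-channel samples; labels are -1 and +1.
samples = [[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 3.0]]
labels = [-1, -1, 1, 1]

narrow = geometric_margin(samples, labels, w=[1.0, 0.0], b=-2.5)
wide = geometric_margin(samples, labels, w=[1.0, 1.0], b=-5.0)
```

Both hyperplanes separate the classes, but the second leaves more room to the closest samples and is therefore the one an SVM would prefer.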

2.3.5 Decision trees

Decision tree modelling divides the data set into sub-regions [30]. The sub-regions are defined by splits in the data set that correspond to certain values or functions of the independent variables. To traverse the decision tree to a sub-region, boolean decisions are made over these splits. When the data point to be predicted reaches one of these sub-regions, it is classified or quantified as the content of that sub-region. A common quantification technique is giving the data point to be predicted the mean value of the sub-region. A classification tree example is illustrated in figure 5.

[Figure 5 labels: x1, x2, sx,1, sx,2, sy,1]

Figure 5 – Illustration of a classification decision tree formed over a data set.

The methods previously discussed are optimized by tuning the parameters of mathematical models. Here, the model is optimized by finding splits that partition the data set to a certain degree, which is often measured in entropy reduction [35].

Entropy reduction optimization tries to eliminate uncertainty between each level of the decision tree. At the top of the tree, there is full uncertainty as to where the new data point belongs. At the bottom, there is no uncertainty as to where it belongs. The entropy reduction algorithm tries to place the splits in the data set such that uncertainty is removed as early as possible.
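
Entropy reduction for a single split can be sketched as follows. The code assumes Shannon entropy in bits and a threshold split on one channel; the function names and the six labelled values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the class labels in one sub-region."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, split):
    """Entropy removed by splitting the data at 'value <= split'."""
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    n = len(labels)
    remaining = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remaining

def best_split(values, labels):
    """Candidate thresholds are the observed values, except the largest one."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda s: information_gain(values, labels, s))

# Hypothetical single-channel values with two classes.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["A", "A", "A", "B", "B", "B"]

split = best_split(values, labels)
gain = information_gain(values, labels, split)
```

Here one split removes all uncertainty, so the tree needs only a single level. On real data, the procedure is repeated within each sub-region.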


2.3.6 Maximum entropy

Maximum entropy is a distribution-based model and can be seen as a generalization of logistic regression from the binary case to the multiclass case [43]. This is done by extending the denominator in equation 4 by summing over exponential functions created for each class. The probability p that a set of independent variables x belongs to the class $y_i$ thus becomes

$$p(y_i \mid x) = \frac{e^{\beta_i^T x}}{\sum_{c=1}^{C} e^{\beta_c^T x}}. \tag{8}$$

The method is built upon the principle of maximum entropy, which states that the probability distribution with the largest informational entropy best describes the system. An example is coin flipping in which a fair coin gives maximum entropy by being the most uncertain, while a rigged coin has bias and as a consequence, less entropy.
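
Both equation 8 and the coin example can be reproduced with a few lines of Python. The parameter vectors below are hypothetical, and subtracting the largest score before exponentiating is a standard numerical safeguard that does not change the result.

```python
import math

def class_probabilities(x, betas):
    """Equation 8: one parameter vector per class, normalized over all classes."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probabilities):
    """Informational entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

x = [1.0, 2.0]
three_class = class_probabilities(x, [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.0]])

# With two classes and one parameter vector fixed at zero, equation 8
# reduces to the sigmoid of equation 4.
two_class = class_probabilities(x, [[0.3, -0.2], [0.0, 0.0]])

fair_coin = entropy_bits([0.5, 0.5])
rigged_coin = entropy_bits([0.9, 0.1])
```

The fair coin reaches the maximum of one bit, while the rigged coin carries less entropy because its outcome is easier to guess.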

2.3.7 Neural networks

Neural network modelling acquired its intuition from the biological nervous system [2]. The system is mimicked by letting the input data pass through several layers of nodes, called neurons, transforming the data into the dependent variable. In mathematical modelling, neural networks are referred to as artificial neural networks.

The layer-based architecture of artificial neural networks, seen in figure 6, consists of three segments: the input, the hidden layers and the output. The input data is fed to the first hidden layer such that each independent variable is connected to all neurons of the hidden layer with different sets of weights. Thus, the value in each first layer neuron is a weighted sum of the independent variables. Additionally, the hidden layers employ activation functions [34], allowing for extraction of non-linearities in the data. This process of layers being connected with weights to the neurons of the upcoming layer is repeated throughout the hidden layers until it reaches the output layer, where the network returns the dependent variable.

During training this result is checked against the actual value and if the model is not sufficiently accurate, the weights in the model are updated.


Figure 6 – An artificial neural network with two hidden layers.

Source: Dertat [7].
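
The forward pass through such a network can be sketched as below. The choice of ReLU in the hidden layer and softmax at the output is an assumption for illustration, and the weights are hypothetical rather than trained.

```python
import math

def dense(inputs, weights, biases):
    """Each neuron is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Activation function that introduces the non-linearity."""
    return [max(0.0, v) for v in values]

def softmax(values):
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """Pass x through the hidden layers, then through the output layer."""
    hidden = x
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))

# Hypothetical network: 3 channels -> 2 hidden neurons -> 2 classes.
layers = [
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
output = forward([1.0, 2.0, 3.0], layers)
```

Training would compare `output` with the true class and adjust the weights, which this sketch deliberately leaves out.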

An alternative to artificial neural networks is the convolutional neural network [24].

Convolutional neural networks are analogous to artificial neural networks in the sense of layering and iteration. However, the layering is slightly different. The layers are traversed in strides by a kernel, also called a filter, of a certain size that adds adjacent values together with some weighting, reducing the dimensions of the data for the upcoming layer. The process is illustrated in figure 7. For hyperspectral data, convolutional neural networks can be used in a one-dimensional manner, predicting the image pixel-wise.

Figure 7 – A one-dimensional convolution from the input layer to the first convolutional layer. A size three kernel with weights strides across the input, adding adjacent values and thus reducing the dimensions from p to s [26].
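
The one-dimensional convolution in figure 7 can be written out directly. The kernel weights, the stride of two and the seven-channel input below are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel across the signal and take a weighted sum at each stop."""
    size = len(kernel)
    return [sum(w * signal[start + j] for j, w in enumerate(kernel))
            for start in range(0, len(signal) - size + 1, stride)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # p = 7 channels of one pixel
kernel = [0.25, 0.5, 0.25]                      # size three kernel

output = conv1d(signal, kernel, stride=2)       # s = 3 values
```

With no padding, the output length is s = (p - k) // stride + 1, here (7 - 3) // 2 + 1 = 3.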


2.4 Evaluation metrics

2.4.1 Classification metrics

The total accuracy and F1-score can be determined from the confusion matrix, of which an example is illustrated in table 1.

Table 1 – Multiclass confusion matrix of the classes A, B and C. a-i represent integers explaining classification amounts. As an example, A was correctly classified as A a times and was misclassified as B b times.

                 Predicted
                 A    B    C
True    A        a    b    c
        B        d    e    f
        C        g    h    i

In a multiclass confusion matrix, a value can be true positive (TP), false positive (FP) or false negative (FN) depending on what class is examined. Examining class A, a true positive is the correct classification, A classified as A. A false positive is when another class is classified as the class you are examining, B or C classified as A. A false negative is when the class you are examining is classified as another class, A classified as B or C. The total accuracy is computed by summing the true positives for all classes and dividing by the sum of all elements in the matrix.

The F1-score of a class is constructed from the precision and recall, which in the multiclass case are defined as

$$\text{Precision} = \frac{TP}{TP + \sum FP}, \qquad \text{Recall} = \frac{TP}{TP + \sum FN}, \tag{9}$$

where the false positives and negatives over the other classes are summed. In turn, the F1-score is defined as

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{10}$$

This definition of the F1-score makes it possible to examine a method’s performance regarding a single class, allowing for pinpointing of problematic classes [9].
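
The metrics above follow mechanically from a confusion matrix laid out as in table 1, with rows as true classes and columns as predicted classes. The matrix values and function names in this sketch are hypothetical.

```python
def total_accuracy(matrix):
    """Sum of the diagonal divided by the sum of all elements."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def class_metrics(matrix, k):
    """Precision, recall and F1-score of class k (equations 9 and 10)."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # class k predicted as another class
    fp = sum(row[k] for row in matrix) - tp    # another class predicted as class k
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion matrix for the classes A, B and C.
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

accuracy = total_accuracy(matrix)
precision_a, recall_a, f1_a = class_metrics(matrix, 0)
```

Computing `class_metrics` for every class shows at a glance which class drags the total accuracy down.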


2.4.2 Quantification metrics

Quantification, or regression, can be analysed by examining the goodness of fit $R^2$ and goodness of prediction $Q^2$. Goodness of fit can be computed by

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \tag{11}$$

where $SS_{\text{res}}$ is the sum of squares between the model and the observed training values and $SS_{\text{tot}}$ the sum of squares between the mean and the observed training values [40]. Thus, if the model is perfect, $SS_{\text{res}}$ approaches zero and $R^2$ approaches 1. When the model is poor the metric may become close to zero or even negative, performing worse than using the mean value as your model.

The goodness of prediction is computed similarly to equation 11 as

$$Q^2 = 1 - \frac{SS_p}{SS_{\text{tot}}}, \tag{12}$$

where $SS_p$ instead is the sum of squares between the values predicted by the model and the observed test data [38].
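
Equations 11 and 12 share the same form, so a single helper covers both. The sketch assumes that $SS_{\text{tot}}$ is taken over the same observations as the numerator (training data for $R^2$, test data for $Q^2$), which the text does not spell out for $Q^2$; the data values are hypothetical.

```python
def goodness(observed, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of 'observed'."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical training and test observations with model predictions.
train_observed = [1.0, 2.0, 3.0, 4.0]
train_predicted = [1.1, 1.9, 3.2, 3.8]
test_observed = [1.5, 2.5, 3.5]
test_predicted = [1.4, 2.9, 3.3]

r2 = goodness(train_observed, train_predicted)   # goodness of fit
q2 = goodness(test_observed, test_predicted)     # goodness of prediction

# Predicting the mean everywhere gives exactly zero.
mean_model = goodness(train_observed, [2.5, 2.5, 2.5, 2.5])
```

A model that scores high on `r2` but much lower on `q2` fits the training data well without predicting new data well.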


3 Method

3.1 The data sets

I examined six classification and two quantification data sets. The classification data sets were the primary focus. The goal was to acquire data sets that originate from different areas of the industry, from food processing to waste sorting.

3.1.1 The waste data

Being the first classification problem, acquired from Mälardalens högskola and inspired by Ševčík’s thesis [31], the waste data consisted of waste material collected from three sub-groups: organics, plastics and in-combustibles. The objective in this problem was to separate between classes from these sub-groups. Pseudo-RGB images, images with three channels from the spectral data, of four classes from the waste study are illustrated in figure 8.

Figure 8 – Pseudo-RGB images of the cardboard, ceramics, LDPE (low-density polyethylene) and PP (polypropylene) classes, in row-wise order.

3.1.2 The cheese data

[...] separate data sets: uncoated, wax and paraffin, or cheese (U), cheese (W) and cheese (P). They corresponded to different treatments of the cheese, yielding different surfaces to inspect. The problems consisted of detecting mold and other deficiencies. Pseudo-RGB images of a few examples from the cheese study are illustrated in figure 9.

Figure 9 – Pseudo-RGB image examples from the cheese, wax and paraffin data sets, in row-wise order.

3.1.3 The Indian Pines data

The Indian Pines data set belongs to NASA’s AVIRIS data collection [22] and is a common benchmarking data set for classification algorithms. Indian Pines is a remote sensing data set, implying that the data was measured at a large distance from an aircraft or satellite. The data set is a single image consisting of a variety of agriculture, grass fields, forests and man-made structures which were to be classified.

A pseudo-RGB image and the ground truth are shown in figure 10.

Figure 10 – Pseudo-RGB image of the Indian Pines data set along with its colorized ground truth, the classes of the pixels in color coding.


3.1.4 The nuts data

Being Prediktera’s data set for their classification tutorial, the nuts study contains measurements for a couple of nut types and their respective shells. The task consisted of separating the different nut types as well as distinguishing if pixels were shell pixels or not. Pseudo-RGB examples of measurements from the study are illustrated in figure 11.

Figure 11 – Pseudo-RGB images of the hazelnut, hazelnut shell, pecan and pecan shell measurements.

3.1.5 The powder data

The powder data set is used for Prediktera’s quantification tutorial. It consisted of bags in which three types of powder had been inserted: vanilla powder, baking soda and potato starch. Some bags contained purely one type of powder while others were mixtures between two powders or all three. Pseudo-RGB examples are shown in figure 12.


3.1.6 The strips data

The second quantification data set is part of a current project at Research Institutes of Sweden (RISE). The data set was made up of long strips of paper on which a number of physical properties, such as thickness, varied. Two pseudo-RGB examples of the strips are shown in figure 13.

Figure 13 – Pseudo-RGB images of two of the examined strips.

3.1.7 Metadata of the sets

Tables 2 and 3 list the properties of the classification and quantification data sets.

Table 2 – Metadata of the classification data sets.

                    Waste       IP          Nuts
Classes             18          16          5
Channels            208         184         266
Min channel (nm)    963.27      476.7       999.06
Max channel (nm)    1691.59     2231.5      2486.96

                    Cheese (U)  Cheese (W)  Cheese (P)
Classes             5           6           3
Channels            191         191         191
Min channel (nm)    1000.18     1000.18     1000.18
Max channel (nm)    2198.04     2198.04     2198.04

Table 3 – Metadata of the quantification data sets.

                    Powder      Strips
Variables           3           8
Channels            236         236
Min channel (nm)    1061.92     997.33
Max channel (nm)    2454.64     2438.7

3.2 Examined complement methods

The possible complements to the chemometric methods were acquired from Microsoft’s ML.NET library [21], with the exception of the self-made TensorFlow convolutional neural network. Table 4 describes each classification method. The overview includes which machine learning basis it belongs to, a short description that highlights what makes the method different along with an in-depth source, as well as whether it utilizes the One-Versus-All (OVA) approach. The OVA approach uses binary classifiers for multiclass problems by classifying the examined class versus all other classes, making all the other classes the second "class" in the binary classification. With a similar overview, the regression algorithms are described in table 5.
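
The One-Versus-All approach can be sketched as two small steps: relabelling the data for each binary problem and picking the most confident binary classifier at prediction time. The scoring functions below are hypothetical stand-ins for trained binary classifiers, and the class names are invented for the example; this is not how ML.NET implements OVA internally.

```python
def binary_labels(labels, target):
    """Relabel a multiclass problem as 'target' (1) versus all other classes (0)."""
    return [1 if label == target else 0 for label in labels]

def ova_predict(x, scorers):
    """Pick the class whose one-versus-all scorer gives x the highest score."""
    return max(scorers, key=lambda label: scorers[label](x))

# Hypothetical stand-ins for three trained binary classifiers, each scoring
# 'this class versus the rest' from a two-channel pixel.
scorers = {
    "plastic": lambda x: 2.0 * x[0] - x[1],
    "paper": lambda x: x[1] - x[0],
    "ceramic": lambda x: 0.5 - x[0] - x[1],
}

labels = ["plastic", "paper", "plastic", "ceramic"]
plastic_vs_rest = binary_labels(labels, "plastic")
prediction = ova_predict([1.0, 0.2], scorers)
```

One binary model is trained per class, so the training cost of an OVA method grows with the number of classes.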


Table 4 – Overview of the classification methods from the ML.NET library and the convolutional neural network. The method names are the abbreviations of the machine learning methods from the ML.NET library, often named after the optimization technique. OVA was applied to binary classifiers, generalizing them to multiclass classification.

Method     Base method   OVA   Description
BFGS-LR    Log. reg.     Yes   Utilizes the BFGS, or L-BFGS, optimization technique. BFGS is a quasi-Newton method that reduces the computational cost of the Hessian yet keeps the fast convergence rate of an ordinary Newton method [10].
SGD        Log. reg.     Yes   Stochastic gradient descent, in this case the Hogwild version, is a stochastic variant of the classic gradient descent optimization method. The stochasticity implies replacing the actual gradient with an estimation of it, alleviating computational load [23].
SSGD       Log. reg.     Yes   Being an extension of SGD, Symbolic SGD replaces the sequential process by local models in separate threads. A probabilistic model combines the local models to produce an expectation of what a regular SGD would have produced [19].
LSVM       SVM           Yes   LSVM, or Linear SVM, implements the PEGASOS optimization technique. PEGASOS is a modified SGD method where each SGD step is accompanied by a projection step. For SVM problems, this produces a higher convergence rate [32].
LGBM       Dec. tree     No    Light Gradient Boosting Machine, LGBM, is a gradient boosting decision tree, which is a high performing method for multiclass classification. Gradient boosting is trained as a sequence of decision trees improving upon one another, making it an ensemble model. LGBM adds to this model by excluding data instances with small gradients and excluding less significant independent variables, resulting in higher computation speeds [16].
FT         Dec. tree     Yes   Like LGBM, Fast Tree (FT) is built upon the gradient boosting decision tree method. FT uses the MART algorithm, which specializes in broadening the ensemble process, such that later iterations do not impact the prediction of only a few of the independent variables [29].
FF         Dec. tree     Yes   Fast Forest, FF, implements the random forest technique. Random forest creates an ensemble of independent decision trees and creates a distribution type model from the decision trees that make up the forest [18].
SDCA-ME    Max. Ent.     No    Stochastic dual coordinate ascent is an optimization technique that solves the dual problem. The dual problem sets the lower bound of the primal problem, where the primal problem is the one that usually is solved by optimization [33].
BFGS-ME    Max. Ent.     No    See BFGS-LR.
AP         Neu. Net.     Yes   Averaged perceptron is a single-layer neural network. Being an online algorithm, it updates its weights after each training instance if the label is incorrect and keeps them otherwise. The averaging refers to the final prediction being calculated by averaging the result of each iteration [11].
CNN        Neu. Net.     No    Made by myself in Python using TensorFlow. Starts with one convolutional layer with the ReLU activation function followed by max pooling and flattening. The resulting vector is reduced to 64 neurons with a ReLU activation and the output layer utilizes the Softmax activation function, providing the probability vector for the classes. Further information in section 3.3.3.


Table 5 – Overview of the examined regression algorithms.

Method Base method Description

OLS Lin. reg. Ordinary least squares, OLS, is the traditional linear regression that optimizes by minimizing the sum of squares as described by the linear regression section [15].

SDCA-LR Lin. reg. See SDCA-ME in table 4.

OGD Lin. reg. The online gradient descent method in the ML.NET library corresponds to utilizing stochastic gradient descent optimization on linear regression. For more information on SGD, see table 4.

BFGS-PR Poi. reg. See BFGS in table 4.

LGBM Dec. tree See LGBM in table 4.

FT Dec. tree See FT in table 4.

FTT Dec. tree FTT, Fast Tree Tweedie, is an alternative FT (see table 4) that minimizes the Tweedie loss function. The loss function is constructed from the Tweedie distribution, which is a distribution where a majority of the samples are at the origin and a separate normal or Gaussian distribution is situated further down the tail [42].

FF Dec. tree See FF in table 4.
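As a complement to the FTT entry in table 5, the sketch below shows a loss that is commonly used for Tweedie regression when the index lies strictly between 1 and 2. It assumes a log link, so that the predicted mean is the exponential of the raw model output. The parameter name `p` and its default value 1.5 are illustrative assumptions and are not taken from the thesis or from the ML.NET settings.

```python
import math


def tweedie_loss(y, raw_score, p=1.5):
    """Per-sample Tweedie negative log-likelihood, up to a term that only depends on y.

    y:         observed non-negative target
    raw_score: model output on the log scale (assumed log link)
    p:         Tweedie index, assumed to satisfy 1 < p < 2
    """
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch only covers 1 < p < 2")
    mu = math.exp(raw_score)  # predicted mean, always positive
    return -y * mu ** (1.0 - p) / (1.0 - p) + mu ** (2.0 - p) / (2.0 - p)
```

For a fixed target the loss is smallest when the predicted mean equals the target, and it stays finite for targets that are exactly zero.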

3.3 Settings

3.3.1 Software

I conducted the project in Prediktera’s modelling and prediction tool Breeze [27].

Breeze allows for image segmentation as well as sample, quantification and classification modelling of hyperspectral data, making it an ideal environment to conduct the benchmark study.

Starting by importing the data, the user is situated in the Record segment of the software. In Record you work with the segmentation; which pixels in the

such that only the objects in the image remain. Sample modelling (PCA) provides a solution where you may exclude pixels from a variance scatter plot, illustrated in figure 14. Because PCA finds the principal components that explain the most variance, the variance scatter plot often splits into different clusters, and the unwanted pixels can easily be identified and excluded. Having removed any unwanted data, a segmentation type that extracts data is selected. The two segmentation types I used were representative spectrum and grids and insets. The former randomly picks a number of pixels from each sample. The latter places a grid upon each sample, and each rectangle in the grid becomes an observation by averaging the channels over the pixels in that rectangle.
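The averaging performed by the grids and insets segmentation type can be sketched as follows. This is a simplified illustration and not Breeze's implementation: the image is assumed to be a nested list indexed as image[row][column][channel], the background mask and the insets are ignored, and incomplete rectangles at the image border are skipped. The function name is hypothetical.

```python
def grid_observations(image, cell_rows, cell_cols):
    """Return one mean spectrum per complete grid rectangle."""
    n_rows, n_cols = len(image), len(image[0])
    n_channels = len(image[0][0])
    pixels_per_cell = cell_rows * cell_cols
    observations = []
    for r0 in range(0, n_rows - cell_rows + 1, cell_rows):
        for c0 in range(0, n_cols - cell_cols + 1, cell_cols):
            sums = [0.0] * n_channels
            # Sum every channel over the pixels of this rectangle.
            for r in range(r0, r0 + cell_rows):
                for c in range(c0, c0 + cell_cols):
                    for k, value in enumerate(image[r][c]):
                        sums[k] += value
            observations.append([s / pixels_per_cell for s in sums])
    return observations
```

Each returned spectrum then plays the role of one observation, in the same way that a single randomly drawn pixel does for the representative spectrum segmentation type.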

Figures 15 and 16 illustrate the results of removal of background pixels through PCA modelling as well as the usage of the two mentioned segmentation types, respectively.

Figure 14 – A variance scatter plot generated by a PCA model with SNV pre-treatment and its corresponding measurement. The yellow selected pixels in the scatter plot form the cluster that corresponds to the relevant data in the measurement, colored brown.

Figure 15 – Sampling of the textile measurement from the waste study along with the choice of pixels through the representative spectrum segmentation type. The yellow line in the upper image encloses the area that the PCA model deemed to not be background. In this enclosed area a representative spectrum extracts pixels at random, visible as the red dots. This particular representative spectrum has a Gaussian setting, drawing randomly from a two-dimensional Gaussian distribution.

Figure 16 – Sampling of one of the powder measurements from the powder study along with the choice of pixels through the grids and insets segmentation type. The yellow line in the left image encloses the area that the PCA model deemed to not be background and the grids in the right image correspond to the areas of pixels to be investigated.

Having extracted the data to be examined, you should introduce the ground truth (the quantification variables or classification classes) to the pixels. This can be done at any stage of the segmentation: the entire measurement, the PCA sampling or the final segmentation type.

The next segment of Breeze is the Model segment. Here, you create your sampling, quantification or classification models. When creating a quantification or classification model the first three steps are the same:

1. Variables - In the first step, you choose which segmentation from the Record that you want to include and which set of variables or classes that you are interested in modelling.

2. Samples - In the second step, you choose which of the pixels from the segmentation should be included and individually set them as training or test data. Note that this can also be done by forming groups in the Record step that correspond to a training and a test group.

3. Wavelengths - In the third step, you choose which channels you want to utilize. Additionally, the pre-treatments of the data are applied here. Mean-centering is by default always applied.

At this point the data is ready for modelling. Therefore, the remainder of the process varies between the models:

1. PLS - Consists of manually excluding outliers and setting the number of components of the model. The goodness of prediction $Q^2$ reaches a maximum after a certain number of components, corresponding to the number of components needed for the best PLS model for your data. Breeze has an automatic feature that estimates the number of components needed for an acceptable model.

2. PLS-DA - As PLS-DA is PLS for classification problems, the process is similar. However, there is an added feature of confidence level, which corresponds to the level at which the model classifies pixels as "No class". A broad confidence level forces the model to classify problematic pixels as one of the defined classes, erasing the "No class" prediction from the model.

3. SIMCA - As with the PLS methods you can exclude outliers, but the modelling consists of setting a critical distance for each of the classes. The critical distance of a class is set such that all pixels belonging to that class lie just within the critical distance. Breeze has an automatic feature for acceptable critical distances. Additionally, Breeze automatically determines the number of principal components needed in the PCA models of each class.

4. ML.NET - In Breeze, you choose an algorithm from the ML.NET library and a training time. This is not the time spent training a single model against the data. Instead, it is the time budget for ML.NET's self-updating process, which creates several models and optimizes the input parameters of the chosen algorithm. Letting the process run for a minute means that it might create a couple of models of the chosen algorithm and then output the model whose input parameters yielded the best performance.
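The idea behind the time setting of the ML.NET step can be illustrated with a generic time-budgeted parameter search. The sketch below uses plain random search as a stand-in, since the actual search strategy of ML.NET is not described here. All names (`time_budgeted_search`, `train_and_score`, `sample_params`) are hypothetical and do not correspond to the ML.NET or Breeze APIs.

```python
import random
import time


def time_budgeted_search(train_and_score, sample_params, budget_seconds, seed=0):
    """Train models with different parameters until the time budget is spent.

    train_and_score: callable taking a parameter dict and returning a score
                     where higher is better (for example validation accuracy)
    sample_params:   callable taking a random.Random and returning a parameter dict
    """
    rng = random.Random(seed)
    deadline = time.monotonic() + budget_seconds
    best_params, best_score, n_models = None, float("-inf"), 0
    # Always train at least one model, then continue while time remains.
    while n_models == 0 or time.monotonic() < deadline:
        params = sample_params(rng)
        score = train_and_score(params)
        n_models += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_models
```

A longer budget gives more candidate models rather than a longer training run of one model, which is why the reported training time of the selected model can be much shorter than the budget.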

The results of a quantification model are given in $R^2$ and $Q^2$ while a classification model outputs a confusion matrix. For analysis and plotting, the results were exported to MATLAB.
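A minimal sketch of the two quantification metrics is given below. It assumes that $R^2$ is the coefficient of determination evaluated on the training data and that $Q^2$ is the same expression evaluated on predictions for data not used in training. The exact definition used by Breeze, for example whether $Q^2$ is cross-validated, is not stated in this section, so this is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def q_squared(y_test, y_test_pred):
    """Same expression, applied to held-out observations (assumption)."""
    return r_squared(y_test, y_test_pred)
```

Both metrics equal 1 for perfect predictions and 0 for a model that always predicts the mean. They become negative when the model predicts worse than the mean does.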


3.3.2 Data segmentation

Having removed the background, observations from the classification data sets were selected. The representative spectrum segmentation type was applied on all data sets except the strip data set. On the strip data set, grids and insets were used because the variables were measured segment-wise by RISE. The number of examined pixels for each class in the classification studies is shown in appendix A.1. In the same appendix you find the number of pixels, or segments in the strip study, that contained each variable of the quantification problems.

3.3.3 Model settings

The settings of a model depend on the chosen channels, their pre-treatments as well as the parameters intrinsic to the model. The chosen channel ranges and amounts can be seen in section 3.1.7. To determine which pre-treatments to use, each method was tested with four combinations of centering, SNV and Savitzky-Golay filtering (SGF). As centering was applied in all cases, the four combinations were: centering only; centering and SGF; centering and SNV; and centering, SNV and SGF. The pre-treatments providing the highest accuracy for a specific combination of method and data set were the ones I used. Keeping to one combination of pre-treatments was sufficient for the ML.NET methods in each of the classification studies. It was not sufficient for the quantification problems, as the performance of the ML.NET methods depended heavily on the choice of pre-treatments.
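To make the pre-treatment combinations concrete, the sketch below implements SNV and mean-centering and enumerates the four tested combinations. It is an illustration under assumptions: SNV is computed per spectrum with the sample standard deviation, the order in which the pre-treatments are applied is a guess, and the Savitzky-Golay filter is only named, not implemented. The function names are hypothetical.

```python
import math
from itertools import product


def snv(spectrum):
    """Standard normal variate: scale one spectrum to zero mean and unit deviation."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


def mean_center(spectra):
    """Subtract the mean over all observations from every channel."""
    n = len(spectra)
    channel_means = [sum(channel) / n for channel in zip(*spectra)]
    return [[v - m for v, m in zip(row, channel_means)] for row in spectra]


def pretreatment_combinations():
    """The four tested combinations; centering is always included."""
    combinations = []
    for use_snv, use_sgf in product([False, True], repeat=2):
        steps = []
        if use_snv:
            steps.append("SNV")
        if use_sgf:
            steps.append("SGF")
        steps.append("centering")  # assumed to be applied last
        combinations.append(steps)
    return combinations
```

Note that SNV acts on each observation separately, whereas centering uses statistics over the whole set of observations and therefore has to be fitted on the training data.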

All model settings are noted in tables in the appendices. Appendix A.2 contains the settings of the used PLS-DA models and appendix A.3 the settings of the SIMCA models. The settings of the PLS models used for quantification analysis are situated in appendix A.4. Lastly, the ML.NET settings for classification and quantification are both placed in appendix A.5.

The self-made classification convolutional neural network was built as shown in table 6. It is a simple network that takes input in batches of size 16, corresponding to 16 pixels with the number of channels defined by the problem. The convolutional layer increases the dimensions from (16 × features × 1) to (16 × features × 128), with the values in the neurons being subject to a ReLU activation function. Then, the network is max pooled, bringing the dimensions down to (16 × features/2 × 128), which is thereafter flattened to a regular one-dimensional layer of size (16 × features/2 × 128) × 1. This layer is then connected to a layer of 64 neurons with a ReLU activation function. Lastly, the 64 neuron layer is connected to a layer whose size is the number of classes of the problem. The neurons of the last layer employ the Softmax activation function, implying that the resulting layer is a vector of probabilities. The neuron with the highest probability is the class that the convolutional neural network predicted the input data as. The network was an extra task in the thesis and was used to model the waste, Indian Pines and nuts data. It was modelled and tested in Google Colab, and only the accuracy was measured.

Table 6 – Summary of convolutional neural network used to classify the waste, Indian Pines and nuts data. Input was batches of 16 pixels from the examined data set. Columns correspond to the input to the TensorFlow functions. Modelling settings are found in the bottom half of the table.

Type                     Filters  Kernel size  Stride  Padding  Units
Convolutional 1D (ReLU)  128      3            1       2        -
Max pooling              -        2            1       0        -
Flatten                  -        -            -       -        -
Dense (ReLU)             -        -            -       -        64
Dense (Softmax)          -        -            -       -        no. classes

Optimizer         Adam
Loss function     Cross entropy
Batch size        16
Epochs            40
Validation split  5%
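The layer dimensions described in section 3.3.3 can be traced with a small helper. The sketch assumes that the convolution preserves the number of features (so-called "same" padding) and that the max pooling uses a pool size and stride of 2, since these choices reproduce the dimensions stated in the text. They are assumptions and not settings confirmed by table 6. The function names are hypothetical.

```python
def conv1d_output_length(length, kernel, stride=1, padding="same"):
    """Output length of a 1D convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return -(-length // stride)  # ceiling division
    return (length - kernel) // stride + 1


def trace_shapes(n_features, n_classes, batch=16, filters=128):
    """List the tensor shape after each layer of the described network."""
    shapes = [("input", (batch, n_features, 1))]
    conv_len = conv1d_output_length(n_features, kernel=3, stride=1, padding="same")
    shapes.append(("conv1d + ReLU", (batch, conv_len, filters)))
    pool_len = conv_len // 2  # assumed pool size 2 and stride 2, no padding
    shapes.append(("max pooling", (batch, pool_len, filters)))
    shapes.append(("flatten", (batch, pool_len * filters)))
    shapes.append(("dense + ReLU", (batch, 64)))
    shapes.append(("dense + softmax", (batch, n_classes)))
    return shapes
```

For a problem with 200 channels and 16 classes the flattened layer holds 100 × 128 = 12800 values per pixel, which shows that most of the trainable weights sit between the flattened layer and the 64 neuron layer.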


4 Results

The results show that there exist several machine learning methods in the ML.NET library that perform equally well or better than the chemometric methods PLS, PLS-DA and SIMCA.

In particular, the classification methods built upon decision trees generally provided the highest accuracy and F1-scores. However, they also had the longest training and prediction times. The maximum entropy methods struck the best balance between computation time and accuracy, being slightly less accurate than the decision tree methods yet quick enough when predicting to be on par with PLS-DA and SIMCA.

Regarding the regression problems, the decision tree methods performed marginally better than PLS. PLS in turn performed on par with the remainder of the ML.NET methods.

4.1 Classification

The methods have been assigned markers in the plots that correspond to their machine learning basis. For example, PLS-DA and SIMCA, being chemometric methods, were assigned the square marker. Similarly, the decision tree methods LGBM, FT and FF were assigned the star marker ∗. The same colors and markers were assigned in the corresponding F1-score subplots.

4.1.1 The waste study

The results of the waste study, illustrated in figure 17, showed the prowess of the decision tree methods. The most accurate method, LGBM, almost reached 90% accuracy. It was closely followed by FT, and FF came in third a few percentage points further down. Slightly less accurate than FF were the maximum entropy methods SDCA and BFGS-ME. They in turn were slightly more accurate than BFGS-LR, LSVM, AP and SGD, a collection of logistic regression, neural network and support vector machine methods. Last came the chemometric methods SIMCA and PLS-DA as well as the logistic regression method SSGD. Thus, the five most accurate methods were the five available non-linear classifiers.

[Plot: accuracy per method (axis 0.65 to 0.9); F1-score per class in three subplots: LGBM - FT - FF - SDCA; BFGS-ME - BFGS-LR - LSVM - AP; SGD - SIMCA - SSGD - PLS-DA. Classes: cardboard, food, ceramics, glass, hdpe, hdpe2, ldpe, metal, paper (hygienic), paper (white), pet, pp, print (recycled), print (white), ps, pvc, textile, wood.]

Figure 17 – The total accuracy and generated F1-scores for the methods' classification of the waste study. Methods with the same machine learning basis share markers, and the methods are distributed over the subplots in the F1 segment of the figure according to their accuracy ranking.

Examining the F1-scores, we find that there were large differences in the classifications of the cardboard, HDPE and recycled print classes. The main issue for the less accurate methods was that they misclassified these classes as a similar class. The cardboard and recycled print pixels were principally misclassified as each other, and the HDPE pixels were mostly misclassified as HDPE2. This indicates that one problem in this data set is distinguishing between very similar classes, which the non-linear classifiers could handle well. For the remaining classes the more accurate methods performed marginally better.

4.1.2 The cheese studies

While being studies of the same object, cheese, the uncoated, wax and paraffin studies yielded a variety of results, not particularly favouring any one method. A common trait of the studies was the imbalanced data, in which the cheese, wax and paraffin classes were in a strong majority in their respective studies.

The results of the uncoated cheese study were similar to those of the waste study, as shown in figure 18. The decision tree and maximum entropy methods, with the addition of AP and closely followed by BFGS-LR, provided the most accurate results with a total accuracy of just above 95%. Within a span of 87 to 94% PLS-DA, LSVM

[Plot: accuracy per method (axis 0.65 to 1); F1-score per class in three subplots: LGBM - SDCA - FT - AP; FF - BFGS-ME - BFGS-LR - SSGD; LSVM - PLS-DA - SGD - SIMCA. Classes: Mold, Cheese, Reflection, White, Edge.]

Figure 18 – The total accuracy and generated F1-scores for the methods' classification of the uncoated cheese study. Methods with the same machine learning basis share markers, and the methods are distributed over the subplots in the F1 segment of the figure according to their accuracy ranking.

Judging from accuracy, SIMCA underperformed. Additionally, SIMCA consistently provided a smaller F1-score than all other methods for all classes but the edge class. SIMCA’s problem was misclassifying mold and reflection as cheese while simultaneously misclassifying cheese as white stain pixels. The second least accurate method, SGD, also misclassified mold and reflection as cheese, but could correctly classify the cheese pixels to a far greater extent. This was the case for all methods but SIMCA, meaning that the performance depended on the degree to which the minority classes were classified as the cheese majority class.

The wax study highlighted that no single method is always the best, as a new most accurate method was introduced. Here, as seen in figure 19, AP yielded the highest accuracy at approximately 92%. The span of 84 to 90% contained all methods but SIMCA, with no pattern as to whether one group of methods was better than another. SIMCA achieved an accuracy of roughly 78%.

[Plot: accuracy per method (axis 0.78 to 0.92); F1-score per class in three subplots: AP - BFGS-ME - PLS-DA - LGBM; BFGS-LR - LSVM - FT - SSGD; SGD - SDCA - FF - SIMCA. Classes: Wax, Salt, White, Reflection, Mold, Dirt.]

Figure 19 – The total accuracy and generated F1-scores for the methods' classification of the wax cheese study. Methods with the same machine learning basis share markers, and the methods are distributed over the subplots in the F1 segment of the figure according to their accuracy ranking.

Except for SIMCA, which struggled with the reflection class, the F1-score behaviour was consistent across the methods. The performance depended on the number of misclassifications of the minority classes as the wax majority class, similarly to the uncoated cheese study. Additionally, no method could correctly identify a single pixel of the dirt class, making the F1-score undefined.

Finally, the paraffin cheese study was handled well by all methods. The most accurate method was LGBM with an accuracy of 99.3% and the least accurate method was SSGD with an accuracy of 95.7%. The variation in performance depended on the classification of the edge class, as seen from the F1-scores in figure 20. The minority class edge was misclassified as the paraffin majority class, similar to the previous cheese studies.

[Plot: accuracy per method (axis 0.95 to 1); F1-score per class in three subplots: LGBM - SIMCA - FT - FF; LSVM - BFGS-ME - AP - SDCA; BFGS-LR - PLS-DA - SGD - SSGD. Classes: Mold, Paraffin, Edge.]

Figure 20 – The total accuracy and generated F1-scores for the methods' classification of the paraffin cheese study. Methods with the same machine learning basis share markers, and the methods are distributed over the subplots in the F1 segment of the figure according to their accuracy ranking.

4.1.3 The Indian Pines study

Having nearly the same number of classes but less included data, the Indian Pines study provided results similar to the waste study. The decision tree methods LGBM and FT had the highest accuracy, reaching just over 90%, followed by FF at 84%. FF was at the top of a set with five other methods, BFGS-ME, SDCA, BFGS-LR, LSVM and AP, ranging between 79 and 84% accuracy. The least accurate methods were SGD and SSGD at roughly 73%, PLS-DA at just below 60% and, most notably, SIMCA at almost 0%.

[Plot: accuracy per method (axis 0.55 to 1); F1-score per class in three subplots: LGBM - FT - FF - BFGS-ME; BFGS-LR - SDCA - LSVM - AP; SGD - SSGD - PLS-DA - SIMCA. Classes: Alfalfa, Corn-notill, Corn-mintill, Corn, Grass-pasture, Grass-trees, Gr.-past.-mowed, Hay-windrowed, Oats, Soybean-notill, Soybean-mintill, Soybean-clean, Wheat, Woods, BGTD, Sto.-ste.-to.]

Figure 21 – The total accuracy and generated F1-scores for the methods' classification of the Indian Pines study. SIMCA is excluded from the accuracy plot as it misclassified almost all pixels. Methods with the same machine learning basis share markers, and the methods are distributed over the subplots in the F1 segment of the figure according to their accuracy ranking.

As seen from the F1-scores illustrated in figure 21, SIMCA achieved one defined F1-score. SIMCA predicted all pixels as the oats class, resulting in practically zero accuracy as well as zero F1-score for that class. This is likely because the PCA model for the oats class failed to produce an accurate model, which in combination with a large critical distance classifies all pixels as that class. Besides SIMCA, the other less accurate methods struggled with some classes. As an example, PLS-DA failed to generate a defined F1-score for the alfalfa, corn, grass-pasture-mowed, oats, BGTD (buildings-grass-trees-drives) and stone-steel-tower classes. The inability to generate a defined F1-score stems from the precision being undefined because the method did not classify any pixel as that class.
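How an undefined F1-score arises can be seen from a direct computation on a confusion matrix. The sketch below is an illustration with hypothetical function names. It assumes that rows hold the true classes and columns the predicted classes, and it reports an undefined score as None.

```python
def accuracy(confusion):
    """Share of correctly classified observations."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def per_class_f1(confusion):
    """F1-score per class; None where precision or recall is undefined.

    confusion[i][j] is the number of observations of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        true_positives = confusion[k][k]
        predicted_as_k = sum(confusion[i][k] for i in range(n))
        actually_k = sum(confusion[k])
        if predicted_as_k == 0 or actually_k == 0:
            # No pixel predicted as (or belonging to) class k: division by zero.
            scores.append(None)
            continue
        precision = true_positives / predicted_as_k
        recall = true_positives / actually_k
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return scores
```

With the made-up matrix [[8, 2, 0], [1, 9, 0], [3, 2, 0]] the third class is never predicted, so its F1-score is undefined even though the total accuracy is 0.68. This mirrors how a method can reach a moderate accuracy while failing completely on minority classes.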

Among the rest of the methods, the largest performance factor was the classification of the corn and soybean field classes. Consisting of three classes each, the corn fields tended to be misclassified as each other and as the soybean fields, and vice versa. The decision tree methods handled this distinction best, similarly to the case of the waste study.

4.1.4 The nuts study

Most methods could handle the final classification study well, as seen in figure

between the hazelnut and almond classes. The degree to which the methods were able to distinguish between them determined the accuracy.

[Plot: accuracy per method (axis 0.85 to 0.95); F1-score per class in three subplots: LGBM - FT - SDCA - BFGS-ME; LSVM - BFGS-LR - AP - SGD; FF - SIMCA - SSGD - PLS-DA. Classes: Shell, Walnut, Hazelnut, Almond, Pecan.]

Figure 22 – The total accuracy and generated F1-scores for the methods' classification of the nuts study. Methods with the same machine learning basis share markers, and the methods are distributed over the subplots in the F1 segment of the figure according to their accuracy ranking.

4.1.5 Computational speed

In any real-time classification operation, the training and especially the prediction speed are of importance. The training and prediction times of the classification methods from four of the studies are shown in figures 23 and 24, respectively.

[Plot: training time in seconds (axis 0 to 200) for LGBM, FT, FF, SDCA, BFGS-ME, BFGS-LR, SSGD, SGD, AP and LSVM. Series: Waste, Wax, Indian Pines, Nuts.]

Figure 23 – Training time of the models of the ML.NET methods that were chosen by the self-updating iterative process. PLS-DA and SIMCA are excluded due to close to instant training.

[Plot: prediction time in milliseconds (axis 0 to 400) for PLS-DA, SIMCA, LGBM, FT, FF, SDCA, BFGS-ME, BFGS-LR, SSGD, SGD, AP and LSVM in four panels: Waste study, Cheese (wax) study, Indian Pines study and Nuts study prediction times.]

Figure 24 – Prediction times of the models of the ML.NET methods on the test data of each problem. The range covers one standard deviation.

Examining the training time of the machine learning methods, we find that the Indian Pines and nuts studies yielded a larger variety of training times while the waste and wax cheese studies were more consistent. In all cases the linear classifiers, the ones built on logistic regression as well as AP and LSVM, reached roughly the same training times. They were quickest in the waste study, training for just around 10 s. The slowest case was Indian Pines, where their training times were around 30 s. The maximum entropy methods BFGS-ME and SDCA achieved similar results, with the exception of SDCA reaching a training time of approximately a minute in the Indian Pines study. However, the decision tree methods required 2-4 times as long training times over these data sets. Worst were FT and FF, where FT for example needed up to 200 seconds of training time for the Indian Pines data set.

The behaviour of the training times carried over to the prediction times, as most methods remained on par while the decision tree methods required 2-4 times as much prediction time. PLS-DA and SIMCA generally provided the quickest prediction times, but they were closely followed by the maximum entropy methods and the linear classifiers.

4.1.6 Introducing the convolutional neural network

As an extra segment, the intent was to examine hand-picked methods from other libraries than ML.NET. This was narrowed down to constructing a simple convolutional neural network, as that is something the ML.NET library lacks. As seen from the results in figure 25, the simple network was the most accurate in the waste study, third in the Indian Pines study and second to last in the nuts study.

[Plot: three accuracy panels for PLS-DA, SIMCA, LGBM, FT, FF, BFGS-ME, SDCA, SSGD, SGD, BFGS-LR, LSVM, AP and CNN, with axes 0.65 to 0.9, 0.55 to 1 and 0.85 to 0.95.]

Figure 25 – The total accuracy of the examined methods with the addition of the self-made convolutional neural network. The upper left, right and lower plots represent the waste, Indian Pines and nuts studies, respectively.

4.2 Regression

4.2.1 The powder study

The powder study was well modelled by most methods. As seen in figure 26, the goodness of fit and prediction lie between 0.9 and 1 for all methods and variables except the combination of the potato starch variable and the OGD method, for which the metrics lie between 0.85 and 0.9. However, more interesting to the comparison is that the decision tree methods FF, FT, FTT and LGBM performed slightly better than PLS and the remaining ML.NET methods. These in turn were on par with each other, excluding OGD, which performed a bit worse.

[Plot: $R^2$ (axis 0.8 to 1) and $Q^2$ (axis 0.75 to 1) for PLS, FF, FT, FTT, LGBM, OGD, OLS, BFGS-PR and SDCA. Variables: Vanilla, Baking soda, Potato starch.]

Figure 26 – The goodness of fit $R^2$ along with the goodness of prediction $Q^2$ of the three variables of the powder study.

4.2.2 The strip study

The methods struggled with the variables of the strip study, as illustrated in figure 27. The only moderately well modelled variables were TSI CD and TSI Min, in contrast to the Air Permeance and Thickness variables which did not yield a significant, and sometimes yielded a negative, $R^2$ or $Q^2$ from any method. In fact, OGD could not produce a single positive value for either metric. More positively, three of the decision tree methods could improve the two TSO Angle variables in comparison to PLS. FF, FT and LGBM produced $R^2$ and $Q^2$ values between 0.5 and 0.6 whereas PLS barely reached 0.2. Besides this, PLS could rival any of the ML.NET methods on this data set.

[Plot: $R^2$ and $Q^2$ (axes 0 to 1) for PLS, FF, FT, FTT, LGBM, OGD, OLS, BFGS-PR and SDCA, each in three subplots: TSI CD - TSI Min; TSI Max - TSI MD; TSO Angle+MD - TSO Angle-MD.]

Figure 27 – The goodness of fit $R^2$ along with the goodness of prediction $Q^2$ of six of the variables of the strip study. Air permeance and thickness were left out as no method could produce models that provided positive values of the metrics. The first mentioned variable in the title of each subplot corresponds to the squares and the second to the circles.
