Classification of a Sensor Signal Attained By Exposure to a Complex Gas Mixture

(1)

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2021 | LIU-IDA/STAT-A–21/002–SE

Classiﬁcation of a Sensor Signal

Attained By Exposure to a

Com-plex Gas Mixture

Rabnawaz Jan Sher

Supervisor : Annika Tillander Examiner : Jolanta Pielaszkiewicz

(2)

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publicer-ingsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka ko-pior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervis-ning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säker-heten och tillgängligsäker-heten ﬁnns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsman-nens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to down-load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

(3)

Abstract

This thesis is carried out in collaboration with a private company, DANSiC AB. This study is an extension of a research work started by DANSiC AB in 2019 to classify a source. This study is about classifying a source into two classes with the sensitivity of one source higher than the other as one source has greater importance. The data provided for this thesis is based on sensor measurements on different temperature cycles. The data is high-dimensional and is expected to have a drift in measurements. Principal component analy-sis (PCA) is used for dimensionality reduction. “Differential”, “Relative” and “Fractional” drift compensation techniques are used for compensating the drift in data. A compara-tive study is performed using three different classification algorithms, which are “Linear Discriminant Analysis (LDA)”, “Naive Bayes classifier (NB)” and “Random forest (RF)”. The highest accuracy achieved is 59%, with 57% sensitivity. Random forest is observed to perform better than the other classifiers.

keywords — SiC-FET, classification, random forest, linear discriminant analysis, naive Bayes, principal component analysis, drift, baseline compensation, normalization, sensor.

(4)

Acknowledgments

First of all, I thank Allah for rebuilding the paths when I saw no way, supporting me spiritu-ally when I was struggling, and for excellent health during this whole journey of my master’s program. I would also like to pay my appreciation to my supervisor, Annika Tillander, for guiding me throughout this thesis. Her helpful advice and assistance have been very up-lifting for me. I am privileged enough to have such a supervisor. I would also like to give credit to the examiner Jolanta Pielaszkiewicz and opponent Omkar Bhutra to give feedback on my thesis timely, which helped a lot, produce such a quality report. I would also thank DANSiC AB for giving me this chance to do my thesis, and external supervisor Anita Lloyd Spetz helped me understand the sensor’s working and properties.

I would also like to thank my family for continually inspiring me with the immense love and support they have given me throughout my life.

I would like to express my profound and heartfelt love to my friend Saman Zahid for her consistent guidance, encouragement, and extraordinary affection throughout this master’s program.

(5)

Abstract iii Acknowledgments iv Contents v Glossary vii Notation viii List of Figures ix List of Tables xi 1 Introduction 1 1.1 Sensor . . . 1 1.2 Motivation . . . 2 1.3 Objective . . . 2 1.4 Research questions . . . 2 1.5 Background . . . 2 2 Theory 4 2.1 Principal Component Analysis . . . 4

2.2 Naive Bayes Classifier . . . 6

2.3 Linear Discriminant Analysis . . . 6

2.4 Random Forest . . . 7 2.5 Performance Metrics . . . 8 2.5.1 Accuracy . . . 9 2.5.2 Precision . . . 9 2.5.3 Recall (Sensitivity) . . . 9 2.5.4 F1-score . . . 9 3 Method 11 3.1 Data Collection . . . 11 3.2 Data Description . . . 12 3.3 Data Preprocessing . . . 14 3.3.1 Baseline Compensation . . . 15 3.3.1.1 Differential method . . . 15 3.3.1.2 Relative method . . . 15 3.3.1.3 Fractional Method . . . 16 3.3.2 Cycles Selection . . . 16 3.3.3 Averaging Cycles . . . 16 3.3.4 Normalization . . . 16

(6)

3.4 Features Selection & Extraction . . . 17

3.5 Classification . . . 17

3.6 Environment . . . 19

4 Results 20 4.1 Differential Method . . . 20

4.1.1 K=10 Folds with Five cycles . . . 20

4.1.2 K=10 Folds with Fifteen cycles . . . 22

4.2 Relative Method . . . 23

4.3 Fractional Method . . . 25

4.4 Overall Analysis . . . 28

4.4.1 K=10 Folds with five cycles . . . 28

4.4.2 K=10 Folds with fifteen cycles . . . 30

5 Discussion 33 6 Conclusion 35 6.1 Future Work . . . 35

Bibliography 36 Appendix 39 A.1 Comparison of accuracy and sensitivity of classifier with five cycles . . . 39

(7)

Glossary

ANN Artificial Neural Network. 2, 3 CSV comma separated values. 12 CV Cross-validation. 18, 21–28 DNA deoxyribonucleic acid. 1 Hz Hertz unit of frequency. 12 KNN K-nearest-neighbor. 3

LDA linear discriminant analysis. 2, 3, 6, 7, 19–28, 33–35 MAP maximum posterior probability. 6

MFC mass flow control. 11 ml/min milliliter/minute. 11

NB Naive Bayes Classifier. 3, 6, 19–35 NN neural network. 3

PCA Principal Component Analysis. 3, 4, 6, 17, 19, 20, 33, 34 PLS-DA Partial Least Square-Discriminant Analysis. 3 RF Random Forest. 2, 3, 7, 19–31, 33, 34

SiC Silicon Carbide. 1, 3

SiC-FET Silicon Carbide Field-Effect Transistor. 1, 2, 11, 12, 17 SVM Support Vector Machine. 2, 3

(8)

Notation

Mathematical symbols used in this research study.

rXi,Xj correlation between i

th_{cycle and j}th_cycle.

¯

Zm column mean.

δA Noise in reference signal RNand response signal YN.

RN Reference signal response.

xnorm_im Normalize response of cycle i with m variable. XN Normalize response.

(9)

List of Figures

2.1 Construction of random trees . . . 8

3.1 Workflow of the thesis . . . 11

3.2 Measurement instrument setup . . . 12

3.3 Overview of data . . . 13

3.4 Source and Reference signal response . . . 13

(a) Source signal response . . . 13

(b) Reference signal response . . . 13

3.5 Frequency of each class and raw sensor signal . . . 14

(a) Frequency of each class . . . 14

(b) Raw sensor signal output of one measurement. . . 14

3.6 Illustration of the quasi-static plot of all observations . . . 14

(a) . . . 14

(b) . . . 14

3.7 Illustration of the data-set divided into two sets . . . 18

3.8 K-fold Cross validation . . . 18

4.1 Train LDA on differential baseline method using K=10 fold CV with 5 cycles . . . 21

(a) train on raw data . . . 21

(b) train on normal data . . . 21

(c) train on reduced normal data . . . 21

4.2 Train NB on differential baseline method using K=10 fold CV with 5 cycles . . . 21

4.3 Train RF on differential baseline method using K=10 fold CV with 5 cycles . . . 21

4.4 Train LDA on differential baseline method using K=10 fold CV with 15 cycles . . . 22

4.5 Train NB on differential baseline method using K=10 fold CV with 15 cycles . . . 22

4.6 Train RF on differential baseline method using K=10 fold CV with 15 cycles . . . 23

(a) train on raw . . . 23

(b) train on normal . . . 23

(10)

4.7 Train LDA on relative baseline method using K=10 fold CV with 5 cycles . . . 23

4.8 Train NB on relative baseline method using K=10 fold CV with 5 cycles . . . 24

4.9 Train RF on relative baseline method using K=10 fold CV with 5 cycles . . . 24

4.10 Train LDA on relative baseline method using K=10 fold CV with 15 cycles . . . 24

4.11 Train NB on relative baseline method using K=10 fold CV with 15 cycles . . . 25

4.12 Train RF on relative baseline method using K=10 fold CV with 15 cycles . . . 25

4.13 Train LDA on fractional baseline method using K=10 fold CV with 5 cycles . . . 26

4.14 Train NB on fractional baseline method using K=10 fold CV with 5 cycles . . . 26

4.15 Train RF on fractional baseline method using K=10 fold CV with 5 cycles . . . 27

4.16 Train LDA on fractional baseline method using K=10 fold CV with 15 cycles . . . 27

4.17 Train NB on fractional baseline method using K=10 fold CV with 15 cycles . . . 27

4.18 Train RF on fractional baseline method using K=10 fold CV with 15 cycles . . . 28

(11)

List of Tables

2.1 Confusion Matrix . . . 9 3.1 Temperature Cycle Operation . . . 12 4.1 Percentage of three kinds of Baseline Compensation methods . . . 20 4.2 Performance metrics of classifier using differential method on three data sets with

five cycles. . . 29 4.3 Performance metrics of classifier using relative method on three data sets with five

cycles. . . 29 4.4 Performance metrics of classifier using fractional method on five data sets with five

cycles. . . 30 4.5 Performance metrics of classifier using differential method on three data sets with

fifteen cycles. . . 31 4.6 Performance metrics of classifier using relative method on three data sets with

fif-teen cycles. . . 31 4.7 Performance metrics of classifier using fractional method on five data sets with

(12)

1 Introduction

This chapter explains the SiC-FET sensor, its types, and its properties. Furthermore, this sec-tion describes the main objective of this thesis, the motivasec-tion, the related research quessec-tions addressed in this thesis, and background studies.

1.1 Sensor

In recent years, sensors have become very important part of everyday life. Today sensors are used in most of the machines like in vehicles, smart devices, industries and public places etc. The purpose of the sensor is to capture the difference or change in the environment and record the response in the computer to use further. A simple example of a sensor is a temperature sensor, for example, Resistance Temperature Detector (Pt-100, resistance =100 ohms and 0˝_{C), the most commonly used sensor to measure the room temperature. There}

are different types of sensors such as physical, chemical and biosensor. The physical sensor measures the physical properties like pressure, temperature and flows while chemical sensor measures the chemical properties like different molecules in the gas phase and biosensor measure the biological properties such as cell, protein and DNA.

In this thesis, chemical gas sensor is used, which is “Silicon Carbon Field-Effect Transistor (SiC-FET)” sensor. The SiC-FET sensor is a field effect transistor that is based on silicon car-bide as the semiconductor (while silicon is the common semiconducting material in almost all electronics). Silicon carbide is used in devices that operate at high temperature or high volt-age and Field-effect transistors are used to control the flow of current in a device. SiC when used in transistors, results in lower on stage resistance and higher thermal conductivity. SiC-FET sensor is one of the sensors that have the ability to work in extreme conditions. Under high temperature and corrosive atmosphere, SiC-FET sensor not only continues to work, but also detects the changes quickly. In order to understand the SiC-FET sensor, it is necessary to know its parameters. As mentioned by Comini et al.[1] the most important parameter of this sensor are sensitivity, selectivity and stability.

Sensitivity of a sensor is referred as, the change in input gas causing unit change in output gas response. According to Vessman et al.[2], the Selectivity, of a sensor is to determine the target analyte components in the complex mixture of gases without the intrusion of other components in the mixture. Stability can be defined as a sensor giving the same response

(13)

1.2. Motivation

with time (under same conditions). Any change in the signal response, due to any chemical reaction other than the intended chemical reaction, disrupts the stability of the sensor.

The drift in a sensor is comprised of little change or variation in sensor response over a time when it is exposed to a complex mixture of gases under the same conditions[3]. The high frequency of the signal is noise, and the low frequency of the signal is drift. Drift can be additive or multiplicative. The drift compensation removes the little change or variation in sensor response by manipulating the sensor response.

As mentioned earlier, SiC-FET sensor has the ability to work at very high temperatures, but it affects the performance of the sensor. In his research Bur et al.[4] operated the SiC-FET sensor over a range of temperature cycles and evaluated the sensitivity and selectivity of it. In this research, SiC-FET sensors operate on temperature cycles ranging from 250˝_{C to 450}˝_C

for better sensitivity and selectivity.

1.2 Motivation

DANSiC AB is working with commercial sensor systems to control the indoor air quality like at homes, offices, school libraries, and public places like stores.

DANSiC AB started research work on the classification of a source in 2019. The study has been performed on a small data set which has shown that the source can be classified using supervised machine learning by exposing the gas mixture to a chemical sensor system. This thesis is the extension of the research work performed by DANSiC AB. This thesis work has been carried out with a collaboration of DANSiC AB to investigate how well classification methods will perform on a large data set.

1.3 Objective

The aim of this thesis is to classify the source. The source consists of two classes, i.e., class one and class two, of which one has greater importance than the other. To reduce the usage of resources on the less useful class, classification is done on the response recorded by SiC-FET sensor from a complex gas mixture. Since one class is more important to classify correctly than the other, we want to minimize the loss of specificity when tuning the model to have maximum sensitivity towards one class.

1.4 Research questions

This section describes the questions that will be answered by the end of this thesis. 1. How high accuracy can be obtained for the prediction of classification models?

2. Whether the maximum accuracy of classification models can be retained with fewer cycles or fewer features?

3. Which method will be appropriate for drift compensation?

1.5 Background

The human sense is limited, there are few gases that humans can detect and respond to. In order to identify the gases properly, sensors are used.

According to Qiang Li et al.[5], he conducted a study aiming to use gas-sensors to detect the characteristics of Chinese liquor. The method with highest classification accuracy was RF compared to SVM with linear kernel, SVM with Sigmoid function, back-propagation with ANN and LDA.

(14)

1.5. Background

Another research has been conducted by Christian Bur et al.[4]. to improve selectivity of SiC sensor by using temperature cycled operation (TCO). Different signal pre-processing techniques such as normalization and smoothing have been used, to compensate the additive drift and multiplicative drift. Feature selection using PCA and LDA, and different classifica-tion methods have been studied i.e K-nearest neighbor classifier and Mahalanobis distance classifier.

There is another research available for the identification of nutriment and taste of Italian hard cheese [6]. ANN and PLS-DA were used for the classification of cheese characteristic. Both techniques have provided good results but ANN turned out to be more promising with approximately 100% accuracy.

In research conducted by Chen et al.[7], which is to discriminate green tea quality. In this research, PCA has been used for dimensionality reduction than various linear and non-linear classification methods applied to the components extracted from PCA. SVM outperforms than ANN and KNN with 4 principal component. Another research has been conducted by Men et al [8] to classify odor of paraffin and grade where odor belongs to four different classes of paraffin in samples. Supervised (PCA) and unsupervised learning (PCA) has been used for dimensionality reduction. The method SVM performs better in terms of accuracy as compared to ELM (extreme learning machine) Feed-forward NN having a single hidden layer and RF.

The primary aim of this research is to classify the source. Based on the literature review, the dimensionality reduction is made using PCA. From the literature review, we can deduce that the ANN performs better with high accuracy for sensors data, but due to the presence of limited data, ANN did not seem to be a sensible choice for this study. The methods se-lected for classification are LDA, NB, and RF. The mentioned methods are discussed in detail in section 2. The performance metrics used for evaluation of methods are also discussed in section 2. In section 3, data collection and preprocessing is explained. The section further explains the implementation of methods that were carried out in this research. Section 4 illus-trates the results using tables and charts. The results are analyzed and discussed in section 5. Section 6 contains the conclusion based on the result and answers the research questions mentioned in section 1.4. Finally, future work is discussed in section 6.1.

(15)

2 Theory

A concise overview of the various classification method used in this thesis is explained in this chapter.

The following elements will be used throughout this section. Lets consider X is input space and and C is output space. The data set (Xi, Ci) P X ˆ C, i = 1, 2, 3, . . . , N where

N is the total number of observations. The Xi, is ith observation with M features i.e x =

(x1, x2, x3, . . . , xM), the xmis feature values and m=1, 2, 3 . . . , M. Furthermore, we consider

C=t1, 2, 3, . . . , cu where C represent classes.

2.1 Principal Component Analysis

PCA, analyzes the data set, in which fewer features or most of the features are correlated to each other [9]. The objective of this method is to:

1. Filter out the essential information from data set.

2. Reduce the dimension of data set and also retaining the necessary information. 3. It produces uncorrelated features.

4. It also recognizes the relationship between features that are not examined before. PCA is used in many applications such as medical sciences, facial recognition, and signal data, etc. In the medical sciences, data is collected in different ways, in which the number of features is more than the total measurements. For example, genome data, mRNA sequences of proteins, and DNA micro-array produce an abundance of data. This high dimensional data makes it difficult for statistical methods to visualize, but the principal component analysis is used to reduce the dimension of such data [10]. In face recognition applications, PCA is used in reducing the facial characteristics of the face from images [11]. In ECG signal processing, PCA is used for data compression, extraction of the different wave segments from the waveform; the purpose is to observe the change due to reduced in blood flow to the heart, irregularity in ventricular repolarization, body surface analysis, and the extraction of irregular change in heartbeat [12].

(16)

2.1. Principal Component Analysis

In this section, we use the same notation X for the transpose of input space, i.e., X has M ˆ N dimension. PCA describes the variance-covariance structure of a set of variables through fewer linear combinations of these variables [13]. So, to achieve the reduced dimensions; first, the covariance (Σ) of X is computed, which will be a M ˆ M matrix. Then, eigenvectors and corresponding eigenvalues are determined forΣ. The eigenvectors are ranked in decreasing order of eigenvalues. The linear transformation matrix "A" of M ˆ M dimension is found, which is orthogonal and forms the basis for the new coordinate system.

Let Y represents the N linear combinations of M variables, then it is given as:

Y=AX (2.1)        Y11 Y12 . . . Y1N Y21 Y22 . . . Y2N Y31 Y32 . . . Y3N .. . ... . .. ... YM1 YM2 . . . YMN        =        a11 a12 . . . a1M a21 a22 . . . a2M a31 a32 . . . a3M .. . ... . .. ... aM1 aM2 . . . aMM        .        X11 X12 . . . X1N X21 X22 . . . X2N X31 X32 . . . X3N .. . ... . .. ... XM1 XM2 . . . XMN        (2.2)

Each value in matrix Y is the dot product of the corresponding value in A to the value in X. Each column of Y has M linear combinations. For the first column of Y i.e. Y1, it can be given

as: Y1=      a11.X11 a12.X21 .. . a1M.XM1      (2.3)

Each dot product in equation 2.3 is linear combination which can be represented as:

Ym(1)=aT(1)mXm(1) (2.4)

where m = 1, 2, . . . , M and a(1)mis a transformation vector which is 1st row from matrix A.

The superscript T represents the transpose.

Var(Y_m(n)) =aT_(n)mΣa_(n)m m=1, 2, 3, . . . , M

n=1, 2, 3, . . . , N (2.5)

Cov(Y_m(n), Yl(n)) =aT(n)mΣa(n)l

m, l=1, 2, 3, . . . , M

n=1, 2, 3, . . . , N (2.6)

Often, k principal components are enough to explain the system’s variability. Therefore, the most uncorrelated k linear combinations that correspond to the highest variances are chosen. Thus, the dimension of Y is reduced to k ˆ N matrix. The first principal component is then the linear combination that maximizes the Var(Y11). kth principal component is the linear

combination Yk(n) which maximize the variance Var(Yk(n))and Cov(Yk(n), Yl(n)) = 0 where

(17)

2.2. Naive Bayes Classifier

2.2 Naive Bayes Classifier

Naive Bayes Classifier (NB) is the simplest and robust probabilistic classifier. NB classifier is used in any classification problem, for initial analysis due to naive assumption and simplicity. The assumption for NB classifier is that features are conditionally independent to the given class [14]. P(x|C) = M ź m=1 P(xm|C) (2.7)

This method is used, in many application such as document classification, recommendation system, spam filtering and sentimental analysis. Naive Bayes classifier originates from Bayes theorem. The expression for naive Bayes in terms of Bayes theorem is explained below. In Bayes theorem, P(C), which is the background information. If we do not have the prior information, then we will assign the same probability. P(Xi)is the probability of ith

observa-tion and P(C|Xi)is posterior probability then Bayes theorem as follow:

P(C|Xi) =

P(Xi|C)P(C)

p(Xi) (2.8)

Where P(Xi|C)is class conditional probability. P(Xi)is constant then Eq 2.8 re-written as:

P(C|Xi)9P(Xi|C)P(C) (2.9)

To maximize the posterior probability (MAP) of the most likely class is given as [15]

Cmap=arg max v P C

P(v|Xi) (2.10)

Cmap9arg max

v P C

P(Xi|v)p(v) (2.11)

Eq 2.11 re-written as:

Cmap9arg max

v P C

P(x1, x2, x3, . . . , xm|v)P(v) (2.12)

Since all attributes are independent of each with respect to class, therefore Eq: 2.12 is trans-formed to:

Cmap9arg max v P C P(v) M ź m=1 P(xm|v) (2.13)

2.3 Linear Discriminant Analysis

LDA usually used for both dimensionality reduction and classification, i.e. to find a set of features which are linear that characterizes two or more group of objects. The key difference between LDA and PCA is that LDA does not alter the shape of data, it classifies the data and defines the decision boundary between classes, whereas PCA transforms the data features into low dimensional space according to Balakrishnamat et al.[16].

(18)

2.4. Random Forest

According to Mai et al.[17] the assumption for the LDA model is that X belongs to the Gaus-sian distribution;

X|C „N (µC,ΣC) (2.14)

WhereΣC =Σ, same for all C classes,N is the normal distribution and µCis mean for class

C. If normal assumption is considered, the linear equation according to Bayes rule will be [17]

δBayes(X) =arg maxtlogπC+µ_CTΣ´1(X ´ µC)u (2.15)

The parameter πCis the class proportional probability. In LDA, the parameter estimates are

used to simplify the equation 2.15

δC(X) =XTΣ´1µC´1

2µ

T

CΣ´1µC+logπC (2.16)

According to James et al.[18], the estimated parameters are given as follows:

ˆ µC = 1 NC ÿ i:C_i Xi (2.17) ˆ πC= NC N (2.18) ˆ ΣC= 1 NC ÿ i:Ci (Xi´µˆC)(Xi´µˆC)T (2.19) ˆ Σ= 1 N ´ c c ÿ C=1 NCΣˆC (2.20)

Where ˆΣ is the weighted average of sample variances for each of C classes, thus it is same for all C=t1, 2, 3, . . . , cu classes and ˆµCis sample mean. T represent transpose. In Eq: (2.17

to 2.20), N is sample size and NCis number of sample observation from C class.

2.4 Random Forest

The random forest is an ensemble method that generates number of decision-making trees. Random forest randomly selects the features (through bagging) from the input vector to be used in each stage of the training as the decision tree created. RF creates a number of decision-making trees rather than one. The prediction is made based on selected features from the target class for each tree; the votes are cast for the predicted class. The class with more number of votes is selected.

Random f orest=bagging+decision trees+decorrelation (2.21) Where Bagging is a method to combine (weak) classifiers and thus to create more accurate classifier. RF [19] becomes a significant change in bagging, which constructs a large number of decorrelated tress and average them. The random forest does not have a closed-form equation it can be represented as a black box according to author [19]. If h(X)represents a

(19)

2.5. Performance Metrics

black box, then random forest consists of h1(X), h2(X), h3(X), . . . , hN(X), where X is random

input vector of features.

Figure 2.1: Randomly sample data from a training set of size N. Construct trees, cast votes, and class with more number of votes will be selected at the end [20]

.

The working of a random forest algorithm is given by Hastie:[21]: 1. for all bagging 1, 2, 3 . . . , B number of bags

a) Initially, it began with the collection of the random sample from the training data set of N size.

b) Build a decision tree recursively Figure 2.1, and repeat the following steps(i ´ iii) until the last node of the tree constructed.

i. From M random features, select the q features randomly. ii. Form q features choose the best features.

iii. split-down the node, in two child-nodes 2. Return the trees of all bags 1, 2, 3 . . . , B .

2.5 Performance Metrics

The output of the model cannot be measured simply by checking the accuracy of the model. Accuracy itself does not make any conclusion about the model performance. The confusion matrix is a significant consideration used to assess the output of the model. In this thesis, accuracy, recall, precision, sensitivity, and F1-score are used to assess the performance of the classifier. The Figure 2.1 taken from [22]

(20)

Table 2.1: Confusion Matrix comprise of true positive (TP), false negative (FN), true negative (TN) and false positive (FP).

2.5.1 Accuracy

Accuracy is the proportion of the number of the true positive(TP) and true negative(TN) over the total number of prediction from table 2.1

accuracy= TP+TN

TP+TN+FP+FN (2.22)

2.5.2 Precision

Precision is the fraction of true positive values to the number of true positive(TP), and false-positive(FP) represented as in table 2.1

precision= TP

TP+FP (2.23)

2.5.3 Recall (Sensitivity)

The recall is the proportion of true positive values and the number of false-negative(FN) and true positive(TP) in the table 2.1. A recall is often referred to as sensitivity.

recall= TP

FN+TP (2.24)

2.5.4 F1-score

F1-score is the Harmonic mean of precision and recall. Its values are between 0 and 1. If the value of the f1-score is closer to 1, this means that the classifier has good precision and recall. If its value is 0, then the classifier is not good. It is calculated as:

(21)

f 1 ´ score=2 ˆ precision ˆ recall

(22)

3 Method

The method that are used in this research explained in this section. The figure 3.1 shows the workflow that has been followed in conducting this study.

Figure 3.1: The complete workflow by taking raw signal as input, preprocessing of the raw signal, extract useful features from the signal and then pass the extracted features to the classification model for evaluation.

3.1 Data Collection

DANSiC AB gathered the data in collaboration with Linkoping University. The illustration of instrument setup is seen in Figure 3.2

The gas carrier, flow (clean air oxygen 21% and Nitrogen 78%) pass through mass flow control (MFC) with a flow rate of 100ml/min which is then pass through a SiC-FET sensor and recorded by computer program FETControl1_{[23] then the source is placed, the mixture of}

gases pass through a source with the same flow rate 100ml/min. The gases released from the source transmitted to the gas sensor, which is recorded by a computer program (FETControl). A total of 303 measurements has been recorded. This process has been repeated for every measurement.

(23)

3.2. Data Description

Figure 3.2: Illustration of instrument setup, black lines and green lines show the flow of gas and flow of current. The arrows show the direction of the flow of gas (black arrow) and flow of current (green arrow).

3.2 Data Description

DANSiC AB and Linkoping University provided the data for this thesis. Manuel Bastuck developed the FETControl[23] program during his Ph.D., which was used to record data set. A total of 303 samples have been taken using SiC-FET sensor, which was operated over six different temperatures as seen in table. 3.1 with a sampling frequency of 10 Hz. Initially, reference gas (which is clean air i-e oxygen 21% and nitrogen 78%) passes through the sensor for each temperature in the table. 3.1 then the source is exposed to a mixture of gases. The gases emitted from the source is passed to the sensor, and measurements are recorded for each temperature in table. 3.1. This process has been carried out 303 times for data collection. The data recorded by the FETControl program returns .H52_{file, in which data has been stored}

as a multidimensional-array like structure. Dot H5 file is parsed into CSV using Matlab. Temperature Cycle Operation

Temperature (˝_C) _{Time (S)} 392 18 362 11 326 11 287 10 251 10 213 10 ř time=70s

Table 3.1: Sensor is operated with six different temperatures.

Data set consists of a two-dimensional numeric matrix with N ˆ M. In which N =303 ˆ 35, where 303 is total number of sources and 35 is total number of cycles for each source, M= 700 total number of features f1, f2, f3, . . . , f700and o1, o2, o3...oNtotal number of observations

Figure 3.3. Class labels(C) = $ & % class1 i f c=1 class2 i f c=2 unknown i f c=3 (3.1)

(24)

3.2. Data Description

In Eq.3.1, YN represents the source. If c = 1 source belongs to class1, if c = 2 source

belongs to class2 and c=3 source is unknown or unidentified. For classification to be more accurate, we have used only those subjects (source data) which have known labels that are only subjects with class label class1 and class2 are being utilized. The sources with the labels c = 3 either belong to class1 or class2; those are undecided. We are not using those data points for classification.

class f1 f2 f3 .. fM           1 203.2 207.2 207.7 .. 210.2 observation1 2 303.2 307.2 307.7 .. 310.2 observation2 1 205.2 205.2 205.7 .. 211.2 observation3 .. .. .. .. .. .. .. 3 250.2 250.2 250.7 .. 250.2 observationN

Figure 3.3: Overview of data (N = 303 ˆ 35 and M=700)

Total 70 samples correspond to one source measurement, in which 35 samples are col-lected by only passing reference gas over the sensor, and the next 35 cycles, the sensor has been exposed to a mixture of gases from the source. Figure 3.4a the response signal when the source is exposed to the mixture of gases. Figure 3.4b the reference signal, the sensor signal when the carrier gas (clean air) passes through the sensor without any source.

(a) Source signal response (b) Reference signal response

Figure 3.4: The plot shows the signal response against different TCO. The X-axis represents the dataPoints, and Y-axis represents the drain current (ID(µA)). a) Signal of gases emitted

from source when it is exposed to sensor, b) Reference gas signal value when exposed to sensor.

(25)

3.3. Data Preprocessing

(a) Frequency of each class (b) Raw sensor signal output of one measure-ment.

Figure 3.5: a) Histogram shows the frequency count of each class, b) The raw sensor signal output of one measurement. (35x2) i.e 35 measurement cycles for each source and reference.

Figure 3.5a The frequency distribution of each class, 86 sources belongs to class1, 82 sources belongs to class2 and 131 sources which are unknown. Figure 3.5b shows sensor signal output of one measurement, the first 35 cycles (red) contains signal values of reference gas and the next 35 cycles (blue) contains the signal values of emitted gases from source at temperature 392˝_C.

(a) (b)

Figure 3.6: Illustration of the quasi-static plot of all observations. (black) signal shows the raw signal response of the sensor and (red) signal show the response of the signal after removing outlier/faulty measurements. Figure 3.6a, 3.6b

3.3 Data Preprocessing

Data based on the chemical gas sensor signals contain different effects due to different chemi-cal reactions happening on the sensor surface. These effects can be noise or drift which causes a change in signal. To remove noise and drift from the signal, the preprocessing of the signal is useful. The purpose of data preprocessing is to remove outlier, noise and drift from data to filter out unwanted effects, to prepare data for the feature selection and classification. It is important to filter out noise and drift effects without loss of information.

The data preprocessing is essential to filter out missing data and outlier. In the data set, there are some errors in measurement that is due to the loosening of the chamber (which contain the source, was not closed at some point), humidity mix with the source; the program crashed, measurement stops1₂cycle earlier, and some measurement affected a lot by humidity

(26)

3.3. Data Preprocessing

in the environment (warm and humid weather). This type of measurement was discarded. After removing the faulty/outlier measurements the new data set becomes N ˆ M where N=91 ˆ 35, M is same as in 3.2.

The pre-processing can be done in four steps which are as follows. 1. Baseline Compensation

2. Cycles Selection 3. Averaging Cycles 4. Normalization

3.3.1 Baseline Compensation

The baseline compensation or correction method is used for removing noise or drift using the reference signal. In this thesis, the reference signal is recorded prior to passing the gas through the source, that is the signal measurement without any change due to chemical reac-tions caused by the source.

Baseline drift is a common drift found in sensor measurements. There can either be addi-tive drift, multiplicaaddi-tive drift, or no drift. The method used for compensating the drift differs according to the type of drift. In this thesis, three different traditional baseline compensation methods are used to compensate for the drift, which is as follows [24], [25].

1. Differential method 2. Relative method 3. Fractional method

Let the random vector YN = [y1, y2, y3, . . . , yM]is the sensor response which manipulate

the reference signal vector RN = [r1, r2, r3, . . . , rM] and produces normalize response XN.

where N is describe in section 3.3. 3.3.1.1 Differential method

The transformation of sensor response YN by its reference signal RN is done by subtracting

the reference signal RN from the sensor response YN. It remove additive noise or drift δA

from sensor. The normalize response of sensor is XN: [26]

XN= (YN+δA)´(RN+δA)

=YN+δA´RN´ δA

(3.2)

XN =YN´RN (3.3)

3.3.1.2 Relative method

The transformation of sensor response YNby its reference signal RNis done by dividing the

sensor signal YNby the reference signal RN. It remove multiplicative noise or drift δMfrom

(27)

3.3. Data Preprocessing XN= YN+YNδM RN+RNδM = YN(1+δM) RN(1+δM) (3.4) XN= YN RN (3.5) 3.3.1.3 Fractional Method

The transformation of sensor response YNby its reference signal RNby subtracting and

divid-ing the reference signal RNfrom the sensor response YN. The normalize response of sensor is

XN:[26] XN = YN´RN RN (3.6)

3.3.2 Cycles Selection

For each source measurement, there are 35 cycles. One row represents one cycle. The values within each cycle lie between 100 to 300 that are sampled in sequence. The cycles which are highly correlated are selected using the Pearson correlation coefficient. It is the bi-variate correlation rXi,Xj method which measures the linear correlation between two cycles Xi and Xj, where (i ‰ j), i, j P 1, 2, 3, . . . , 35 and Xi, Xjare compensated response.

rXi,Xj = ř (Xi´ ¯Xi)(Xj´ ¯Xj) ař (Xi´ ¯Xi)2 b ř (Xj´ ¯Xj)2 (3.7)

If the value of rXi,Xj is+1 then the cycles Xi and Xj are highly positively correlated. If rXi,Xj is 0 then there is no correlation between cycles Xiand Xj. If the value of rXi,Xj is ´1 the cycles Xiand Xjare negatively correlated.

3.3.3 Averaging Cycles

Cycle selection is significant as the information in one cycle is closely related to each other cycles. With an average of very few selected cycles or all cycles, there is a chance to lose or include too much information for model processing. Thus, the highly correlated cycle will be selected and an average ¯Xithem into one cycle per source. After averaging the cycles, the

new data set is constructed, which is also two-dimensional matrix Z, with N ˆ M, in which N=91 i.e., one cycle for each source and M same as mentioned earlier in section 3.2.

3.3.4 Normalization

Normalization is changing the different scale values (actual values) to the same scale or range (normalize values). The normalization can be done in different ways. It has been used in different applications. [27] Normalization of the cycle is done by dividing each value zi,m in

that cycle to its mean ¯zi.

xnorm_im = zim ¯zi

(28)

3.4. Features Selection & Extraction

xnorm_im is normalized data with m = 1, 2, 3, . . . , M and i= 1, 2, 3, . . . , N. Another way to nor-malize the data is done by subtracting each cycle value in that cycle by its cycle mean value:

xnorm_im =zim´¯zi (3.9)

Sensor-auto-scaling is in which each features value zmiis set to have zero mean and unit

stan-dard deviation of all the cycles i [26].

xnorm_mi = b zmi´¯zm 1 i´1 ři k=1(zmk´¯zm) (3.10)

where cycle mean of each feature is represented by ¯zm, and standard deviation (std) of cycle

is b

1 i´1

ři

k=1(zmk´¯zm). The Eq 3.10 is standard normalization of data.

After compensating the drift from sensor using Eq: 3.3, 3.5, 3.6 it is necessary to select cy-cles which are highly correlated. For every source, we have 35 measurements. To reduce the computation of the classification algorithm, we choose only a few cycles which are positively correlated using Pearson correlation coefficient Eq: 3.7. Based on the correlation coefficient values, we chose cycles for further preprocessing. Selected cycles S are average them by tak-ing mean ¯Xsmwhere s= 1, 2, 3, . . . , S and for all M variables. For example, if S = 3, which

mean the first three highly correlated cycles are selected than s=1, . . . , S for all M variables, then average them into one cycle.

The final step of data preprocessing is the normalization of the average selected cycles S by setting mean feature value zmizero and standard deviation one for all cycles in the Eq: 3.10,

which is also known as a sensor auto-scaling.

After preprocessing the data, the three data sets formed by applying three baseline com-pensation methods, select the cycles which are more correlated, average the cycles and nor-malize them, further steps meant to perform on these three data sets.

3.4 Features Selection & Extraction

The SiC-FET sensor produces an immense amount of information that cannot be processed without re-ordering and summarizing efficiently. In N x M numerical matrix, the total num-ber of features determines dimensionality, and its relationship between M and N decides whether its curse of dimensionality problem or not [28]. In data set M " N, in this case, fea-tures are more than observations; this is the high-dimensional problem. It causes challenges for the traditional statistical method which works only, when M ! N, where N Ñ 8 and M=constant.

When the total number of features is much higher than the total number of observed measurements, data is called high-dimensional data. In high-dimensional data, usually, a small number of features are essential, representing the complete data-set. In this paper, PCA used to transform high-dimensional data into a low dimension. PCA is applied to the three normalized data sets, saving the results into a separate matrix, which was used further for classification.

3.5 Classification

Classification algorithm applied to the two components extracted from PCA (Principal Com-ponent Analysis). Three classification methods were implemented on raw data as well as two normalized data sets.

(29)

3.5. Classification

Before applying the classification algorithm, the data split into training and test with a ratio of 70% data within the training set and 30% data within the test set Figure 3.7. The test data is not involved in any sort of training. It keeps as a pure test set for prediction and model evaluation. The training of the model done using k-folds cross-validation.

Figure 3.7: Illustrate, the data-set divided into two sets. Training and test set with a ratio of 70 percent belong to the training set, and 30 percent belongs to the test set.

K-fold with CV applied to estimate and evaluate the performance of the models. That means a small sample of data used to estimate the model in general with the assumption that test data not used in any training. The k-fold cross-validation procedure as follow:

1. Randomly shuffle the data set.

2. Split the data set into K folds. Figure 3.8 3. For each Kth fold.

a) It keeps one fold for model testing and k ´ 1 folds kept as a training set. b) fit (training set) and predict (test set) the model.

c) calculate each metrics. d) save results.

4. Averaged the metric over each fold and evaluate the model performance.

Figure 3.8: K fold diagram, in which K is the total number of folds used to divide the data. In every iteration, it separates one fold and K ´ 1 fold as the training set. The one fold which kept separate only used for model evaluation. It runs K times.

(30)

3.6. Environment

Classification models LDA, RF and NB applied to all normalized data-sets to the compo-nents extracted from PCA. The model has trained using K-fold cross validation. The data-set divided into 5 folds and 10 folds. Each time k-1 fold keeps as the training set and hold one fold for test set to estimate and evaluate the performance of the model. Plot result, in terms of accuracy and sensitivity over each fold.

In the case of LDA and NB, the results are averaged over each fold, but in the case of the RF model, results averaged over each number of tree. The number of trees consider training RF model 10, 15, 20, 25, 30...100.

3.6 Environment

The language and software used for this thesis is an R Programming language which is an open-source statistical programming language, uses with R-studio Software (IDE for that runs R language and provide interface to write user-friendly code).

The following R-libraries did use in this thesis.

1. caret, e701, randomForest, corrplot, MASS package used for data preprocessing and classification.

(31)

4 Results

This section illustrates the results, plots and different tables that are obtained. The results represented aiming to achieve the objective of this thesis.

The Principal Component Analysis (PCA) is applied to reduce data dimensions using three baseline compensation methods: differential, relative, and fractional. It is evident from the Table 4.1 that the differential method between the two other methods relative and fractional, reflects the total variance explained by the first two components are slightly more when five and fifteen highest correlated cycles are selected for each source.

Percentage of three kinds of Baseline Compensation methods Five Cycles Selected Fifteen Cycles Selected

Baseline Compensation %C1 %C2 %C1 %C2

1 Differential 96.1 1.2 97.3 1.1

2 Relative 95.3 1.4 96.0 1.4

3 Fractional 95.3 1.4 96.0 1.4

Table 4.1: Principal Component Analysis (PCA) applied to three baseline compensation meth-ods with different number of cycles. %C1,%C2 represents the percentage of total variance explained be each component.

4.1 Differential Method

4.1.1 K=10 Folds with Five cycles

The raw sensor signal contains drift due to different chemical reactions happening on the sensor surface. A Drift can be additive or multiplicative. As mentioned in section 3.3.1, the traditional baseline compensation methods are used to remove drift. This section includes the differential method result (remove the additive drift) when three different models (LDA,NB, and RF) are trained on using five cycles.

According to figure 4.1b, the highest accuracy achieved is about 80%, with LDA on normal data if we consider the highest accuracy and sensitivity on one fold only. On the same fold,

(32)

4.1. Differential Method

when accuracy is highest, the sensitivity is below 75%. The highest sensitivity reached is 100% on exactly one fold with LDA on normal and naive Bayes with normal as well as reduced normal data Figure 4.2b, 4.1c. The perfect sensitivity i-e 100% is achieved on three folds with LDA on reduced normal data (Figure 4.1c), though the accuracy on these folds is below 70%.

(a) train on raw data (b) train on normal data (c) train on reduced normal data

Figure 4.1: The training of a model LDA using K=10 fold CV (accuracy vs sensitivity) over each fold using differential baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.2: The training of a model NB using K=10 fold CV (accuracy vs sensitivity) over each fold using differential baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.3: The training of a model RF using K=10 fold CV (accuracy vs sensitivity) over each fold using differential baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

(33)

4.1. Differential Method

4.1.2 K=10 Folds with Fifteen cycles

The raw sensor signal contains drift due to different chemical reactions happening on the sensor surface. A Drift can be additive or multiplicative. As mentioned in section 3.3.1, the traditional baseline compensation methods are used to remove drift. This section includes the differential method result (remove the additive drift) when three different models (LDA,NB, and RF) are trained on using fifteen cycles.

For fifteen cycles (Figure 4.4c, 4.5c), the highest accuracy on one fold is about to 86% with LDA and naive Bayes on reduced normal data. The sensitivity on these folds is 75% and 100%, respectively. Considering only one fold’s result, the naive Bayes on reduced normal data seems to perform the best. The accuracy and sensitivity of both five and fifteen cycles (Figure 4.3b, 4.6c) the random forest remains between 50% and 75% of normal data. The accuracy and sensitivity of the reduced normal data is around 50%.

Figure 4.4: The training of a model LDA using K=10 fold CV (accuracy vs sensitivity) over each fold using differential baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.5: The training of a model NB using K=10 fold CV (accuracy vs sensitivity) over each fold using differential baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

(34)

4.2. Relative Method

(a) train on raw (b) train on normal (c) train on reduced normal

Figure 4.6: The training of a model RF using K=10 fold CV (accuracy vs sensitivity) over each fold using differential baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

4.2 Relative Method

The raw sensor signal contains drift due to different chemical reactions happening on the sensor surface. A Drift can be additive or multiplicative. As mentioned in section 3.3.1, the traditional baseline compensation methods are used to remove drift. This section includes the relative baseline method result (remove the multiplicative drift) when three different models (LDA,NB, and RF) are trained on using five cycles and fifteen cycles.

Figure 4.7c shows that the highest accuracy with LDA on one fold is about 87.5% on re-duced normal. The sensitivity on this fold is 75% though the maximum sensitivity is 100% for one fold on normal and three folds on reduced normal data Figure 4.7b, 4.7c. The highest accuracy and sensitivity on one fold is obtained by naive Bayes classifier, which is 100% on reduced normal data Figure 4.8c. The cent percent accuracy and sensitivity is also achieved for fifteen cycles on reduced normal data with naive Bayes classifier on one fold Figure 4.11c. For fifteen cycles, the highest accuracy and sensitivity reached by LDA on one fold re-mains the same.

The random forest’s sensitivity reaches up to 75% on normal data with five cycles, while the accuracy is about 65% Figure 4.9b. For fifteen cycles, the accuracy and sensitivity seem to decrease with the highest accuracy of about 62% and the highest sensitivity of 65%.

4.2.1 K=10 Folds with Five cycles

Figure 4.7: The training of a model LDA using K=10 fold CV (accuracy vs sensitivity) over each fold using relative baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

(35)

4.2. Relative Method

Figure 4.8: The training of a model NB using K=10 fold CV (accuracy vs sensitivity) over each fold using relative baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.9: The training of a model RF using K=10 fold CV (accuracy vs sensitivity) over each fold using relative baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

4.2.2 K=10 Folds with Fifteen cycles

Figure 4.10: The training of a model LDA using K=10 fold CV (accuracy vs sensitivity) over each fold using relative baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

(36)

4.3. Fractional Method

Figure 4.11: The training of a model NB using K=10 fold CV (accuracy vs sensitivity) over each fold using relative baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.12: The training of a model RF using K=10 fold CV (accuracy vs sensitivity) over each fold using relative baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

4.3 Fractional Method

The raw sensor signal contains drift due to different chemical reactions happening on the sensor surface. A Drift can be additive or multiplicative. As mentioned in section 3.3.1, the traditional baseline compensation methods are used to remove drift. This section includes the fractional baseline method result (which remove the additive drift as well as multiplicative drift) with three different models (LDA,NB, and RF) are trained on using five cycles and fifteen.

The highest accuracy on one fold with LDA is about 81% on reduced normal data on five cycles (Figure 4.13c). The highest sensitivity on this fold is 100%. LDA achieves the same highest accuracy and sensitivity on normal data with 15 cycles (Figure 4.16b). The accuracy of reduced normal data with LDA decreases on fifteen cycles (Figure 4.16c).

Figure 4.14, the highest accuracy reached by NB is about 86% on three different folds with normal data and on two folds with reduced normal data for five cycles. The highest sensitivity is 100%. The same highest accuracy is achieved on two different folds with normal as well as reduced normal data on fifteen cycles (Figure 4.17). The highest sensitivity on one fold is 100%. The accuracy and sensitivity for five cycles drop to 0% on one fold on reduced normal data.

The random forest gives accuracy between 63% to 58% and sensitivity between 76% to 58% on normal data with five cycles (Figure 4.15b). The accuracy of reduced normal data with RF is 62% to 50% and sensitivity is between 63% to 50%. The accuracy range is almost the same with fifteen cycles, but the sensitivity on normal data drops to 64% to 58% (Figure 4.18).

(37)

The LDA and RF do not seem to have much difference from five to fifteen cycles. NB with normal and reduced normal data seems to have better accuracy on more folds in fifteen cycles than the accuracy on five cycles. Like other methods, RF is not producing the best results, but the accuracy and sensitivity are consistent and close to each other.

4.3.1 K=10 Folds with Five cycles

Figure 4.13: The training of a model LDA using K=10 fold CV (accuracy vs sensitivity) over each fold using fractional baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.14: The training of a model NB using K=10 fold CV (accuracy vs sensitivity) over each fold using fractional baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

(38)

Figure 4.15: The training of a model RF using K=10 fold CV (accuracy vs sensitivity) over each fold using fractional baseline method on five selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

4.3.2 K=10 Folds with Fifteen cycles

Figure 4.16: The training of a model LDA using K=10 fold CV (accuracy vs sensitivity) over each fold using fractional baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

Figure 4.17: The training of a model NB using K=10 fold CV (accuracy vs sensitivity) over each fold using fractional baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

(39)

4.4. Overall Analysis

Figure 4.18: The training of a model RF using K=10 fold CV (accuracy vs sensitivity) over each fold using fractional baseline method on 15 selected cycle. The accuracy is represented by red line and sensitivity is represented by blue line.

4.4 Overall Analysis

In general, the accuracy and sensitivity of each of the three methods vary on each fold. How-ever, the variance is more substantial for the LDA and Naive Bayes classifier than for the random forest. The random forest shows more consistent results on each fold as compared to the other two methods. The difference in accuracy and sensitivity is very significant for LDA and NB while for RF.

LDA and naive Bayes’ training accuracy and sensitivity are better if we consider the best result i-e, the highest accuracy and sensitivity on any fold. However, the training accuracy and sensitivity is the average of accuracy and sensitivity obtained on all folds. The perfor-mance of a classifier is assessed on training as well as test accuracy and sensitivity. Some other performance measures, such as F1-score, precision, and recall, have also been measured.

4.4.1 K=10 Folds with five cycles

Table 4.2, reveals that RF with normal has the best training accuracy with very high sensitivity and F1-score. The highest test accuracy is achieved by NB on normal data, but the sensitivity and recall are quite low. The LDA with normal data has the second-best average highest accuracy with values of all other performance metrics relatively higher than that of NB with normal.

(40)

Performance metrics of Classifier on Differential method

method type Accuracy Sensitivity Recall F1-score Precision

1 LDA-RAW training 0.50 0.49 0.49 0.58 0.50 2 LDA-RAW testing 0.30 0.21 0.21 0.24 0.27 3 LDA-N training 0.56 0.61 0.61 0.62 0.59 4 LDA-N testing 0.56 0.57 0.57 0.57 0.57 5 LDA-RN training 0.61 0.72 0.72 0.64 0.63 6 LDA-RN testing 0.41 0.57 0.57 0.50 0.44 7 NB-RAW training 0.44 0.62 0.62 0.50 0.44 8 NB-RAW testing 0.48 0.86 0.86 0.63 0.50 9 NB-N training 0.49 0.27 0.27 0.27 0.33 10 NB-N testing 0.59 0.36 0.36 0.48 0.71 11 NB-RN training 0.49 0.38 0.38 0.39 0.54 12 NB-RN testing 0.48 0.36 0.36 0.42 0.50 13 RF-RAW training 0.49 0.46 0.46 0.48 0.56 14 RF-RAW testing 0.37 0.29 0.29 0.32 0.36 15 RF-N training 0.71 0.65 0.65 0.74 0.71 16 RF-N testing 0.52 0.50 0.50 0.52 0.54 17 RF-RN training 0.56 0.58 0.58 0.56 0.58 18 RF-RN testing 0.44 0.43 0.43 0.44 0.46

Table 4.2: Performance metrics of classifier using differential method on three data sets with five cycles.

From Table 4.3, the five cycles’ highest training accuracy with the relative method is obtained by random forest on normal data. The maximum test accuracy is achieved by NB on normal and reduced normal data as well as by RF on normal data. All three methods achieve the same test accuracy, but based on other performance metrics, RF with normal is comparatively better.

Performance metrics of Classifier on Relative method

Table 4.3: Performance metrics of classifier using relative method on three data sets with five cycles.

(41)

It can be seen from Table 4.4 that the RF with normal data has the maximum training accuracy with the highest sensitivity and F1-score for the fractional method—the accuracy of the test is the highest for RF with reduced normal data.

Performance metrics of Classifier on Fractional method

Table 4.4: Performance metrics of classifier using fractional method on five data sets with five cycles.

4.4.2 K=10 Folds with fifteen cycles

Table 4.5, represent the result for the differential method with fifteen cycles. It can be noticed that RF with normal has the highest training accuracy, i.e., 56% with considerably good sensi-tivity and F1-score. The NB with normal and reduced normal have the highest test accuracy, but the former has a lower sensitivity and recall as compared to the later.

(42)

Performance metrics of Classifier on Differential method

Table 4.5: Performance metrics of classifier using differential method on three data sets with fifteen cycles.

Table 4.6, represent the result for the relative method with fifteen cycles. RF with normal data again has the best training performance. The highest test accuracy is reached by NB with normal data, but the sensitivity, recall, and F1-score are very low.

Performance metrics of Classifier on Relative method

Table 4.6: Performance metrics of classifier using relative method on three data sets with fif-teen cycles.

(43)

In case of fifteen cycle Table 4.7, the NB with reduced normal data has the highest training accuracy with a sensitivity of 51%. The highest test accuracy is also produced by NB classifier, but the sensitivity and F1-score, in this case, are very low on normal data.

Performance metrics of Classifier on Fractional method

Table 4.7: Performance metrics of classifier using fractional method on five data sets with fifteen cycles.