MACHINE LEARNING FOR MECHANICAL ANALYSIS

(1)

V¨

aster˚

as, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics

30.0 credits

MACHINE LEARNING FOR

MECHANICAL ANALYSIS

Sebastian Bengtsson

sbn12009@student.mdh.se

Examiner: Ning Xiong

M¨

alardalen University, V¨

aster˚

as, Sweden

Supervisors: Martin Ekstr¨

om

M¨

alardalen University, V¨

aster˚

as, Sweden

Company supervisor: Henrik Grunditz,

Prevas, V¨

aster˚

as, Sweden

Per ˚

Ahman,

Prevas, V¨

aster˚

as, Sweden

June 20, 2019

(2)

Abstract

It is not reliable to depend on a persons inference on dense data of high dimensionality on a daily basis. A person will grow tired or become distracted and make mistakes over time. Therefore it is desirable to study the feasibility of replacing a persons inference with that of Machine Learn-ing in order to improve reliability. One-Class Support Vector Machines (SVM) with three different kernels (linear, Gaussian and polynomial) are implemented and tested for Anomaly Detection. Principal Component Analysis is used for dimensionality reduction and autoencoders are used with the intention to increase performance. Standard soft-margin SVMs were used for multi-class clas-sification by utilizing the 1vsAll and 1vs1 approaches with the same kernels as for the one-class SVMs. The results for the one-class SVMs and the multi-class SVM methods are compared against each other within their respective applications but also against the performance of Back-Propagation Neural Networks of varying sizes. One-Class SVMs proved very effective in detecting anomalous samples once both Principal Component Analysis and autoencoders had been applied. Standard SVMs with Principal Component Analysis produced promising classification results. Twin SVMs were researched as an alternative to standard SVMs.

(3)

1. Introduction

Machine Learning, in all its diverse forms, has been used to solve a multitude of problems - for example classification[1, 2], detection[3, 4], regression[5] and optimization[6, 7]. At its core Ma-chine Learning is a marriage between statistical theory and signal processing, which gave birth to Support Vector Machines (SVM), Neural Networks (NN), Genetic algorithms and a plethora of other algorithms with various ad-ons.

The task in this thesis is detecting flaws and shortcomings in a mechanical product based on the measured moments from rotation of a key component of the product. The moments reveal not only if something is wrong within the product but also clues as to what the source of the problem might be. Currently these measurements are interpreted manually by employees at the production site, however it is anticipated that an autonomous solution will bring several benefits including, but not limited to, improved anomaly detection and better reliability. In general this solution could be useful for a multitude of industries where anomaly detection and classification is needed. There are two main questions regarding this task: 1) How is accurate and reliable anomaly de-tection as well as classification achieved, and 2) How does it compare to a person performing the task manually? To answer these questions research was conducted broadly to find what methods has been used prior and what their results were. Once a promising direction was found a deeper study was made and extensive testing and analysis was performed. With an initially broad scope which narrows as the research progresses a generally favorable solution is expected to be found.

For this particular task a fully autonomous solution is not only favorable to the production but to the workers as well. Over time a worker will grow tired which increases the risk of overlooking measurements and misclassifying samples which will reduce the overall quality of the products -but there is also the risk of repetitive strain-injuries. Such an injury is difficult to recover from and can hurt both the company and the individual. With an autonomous solution it is anticipated that these issues can be overcome and the production process might even become more time efficient.

SVMs have been used and developed over several decades and have been proven useful in binary classifications but also for multi-class classification through application of approaches based on cross-class comparisons and Fuzzy Logic [8, 9]. A wide selection of data types and applications have been delt with by SVMs by utilizing various combinations of methods for creating the sep-arating hyperplanes, estimating classification errors and adjusting the hyperplane in an efficient manner. [10, 11,12].

This thesis will focus mainly on various forms of SVMs for the sake of anomaly detection and classification which will be compared to the capabilities of Back-Propagation NN. To help the al-gorithms achieve a higher accuracy some pre-processing will be applied to the data sets in advance

Figure 1: A model of the testbench. A sensor is attached to a shaft between an electric motor and a mechanism which can not be observed from outside. When the shaft is turned the sensor sends the moment measurements to a computer where a person will analyse the moment plot. In the illustration it is indicated that the mechanism is not properly lubricated, which should be revealed in the plot.

(5)

and the effects of this will be studied and analysed.

Due to the versatility of SVMs it is reasonable to believed that a SVM-based solution can be found that can produce a satisfactory accuracy and reliability for anomaly detection and classification for the mentioned task.

The remainder of the report will be structured as; The Problem Formulation will firstly detail the problem which this thesis will attempt to solve as well as stating the hypothesis and research questions. The Background-section will cover the notation and basic concepts used in this report after which the key algorithms and methods used will be introduced more in depth than has been done so far. Related Work will go through the development of SVMs and NNs from conception to the modern era and end with an account of the state of the art. All methods, algorithms and techniques used as well as the reasoning to why they were used will be discussed in the Method section. The stance towards use of personal data and ethical considerations will be stated in Ethical and Societal Considerations, followed by a section dedicated to describing the work process of this thesis. In Results the findings will be presented in detail through tables and analysis. Conclusions will seek to concisely summarize the findings and results of this thesis. In Discussion the results will be examined in regard to the research questions. The report will close with Future Work, suggesting what more might be done to continue the development of a Machine Learning solution with increased capabilities.

(6)

2. Problem Formulation

It may not be reliable to have people making inferences on complicated and information-dense data on a daily basis. People are prone to becoming tired, stressed and distracted which can lead to bad decisions. No decisive interpretation can be made with no fitness criteria to speak of and only relying on the experience a person might have in recognizing a good sample from a bad one. Therefore it is desirable to study the feasibility of replacing a persons inference with that of a machine.

2.1 Hypothesis

The hypothesis for this thesis is stated as follows;

A Machine Learning implementation can handle anomaly detection and classification of measure-ments from a mechanism at least as well as an experienced person.

2.2 Research Questions

A set of research questions will be stated in order to test the hypothesis.

RQ1 Can Machine Learning perform anomaly detection on a comparable level as a person in terms of accuracy and time on the given task?

RQ2 How does a person compare to Machine Learning in terms of anomaly detection rate and time spent on the given task?

RQ3 How reliably can Machine Learning detect specific faults through the received data?

RQ1 is, in other words, a Yes or No question regarding the possibility of Machine Learning having capabilities similar to an experienced person in anomaly detection. RQ2 then asks for the difference in performance between a person and Machine Learning in anomaly detection, which can be answered in numerous ways depending on what differences were noted and whichever answer to RQ1 was found. Finally, RQ3 concerns the multi-class classification performance.

2.3 Limitations

This thesis have a couple of initial limitations. Firstly, the thesis is limited in time as it shall be done in 20 weeks which shall cover initial research, implementation and writing of the report. Secondly, the hardware on which to run the implementations for the sake of testing and evaluating is limited to a computer with Windows 7 installed, equipped with 16 gigabytes of RAM and an Intel Core i5-9600K CPU running at 3.7 Ghz.

Thirdly, the initially available data is not produced by the testbench which a Machine Learning solution is researched for, but a prototype of such a testbench. Proper data is anticipated to be produced during the time-span of this thesis.

(7)

3. Background

As the name suggests, Machine Learning is the scientific study of getting machines to produce some seemingly cognitive response to external information, such as inferring that a cluster of pix-els is a specific hand-written character, based on prior examples. By training a Machine Learning algorithm with appropriate examples (or ”training data”, ”samples”) it can be taught to classify similar examples in a predetermined set of classes. Depending on the specifics of the algorithm implementation and the quality as well as the volume of the training data it is possible to make highly accurate classifications - some of which might be difficult or even impossible for a human to make.

For the sake of legibility and understanding of the equations and expressions which will be presented further into this report, a notation convention will be determined below.

3.1 Concepts and notations

Before delving into the Machine Learning algorithms the necessary mathematics will be introduced.

Scalars Scalars will be denoted with italicized lower-case letters, for example u.

Vectors Vectors will be denoted by bold lower-case letters, for example u. All vectors will be column vectors unless transposed, which will be notated as uT_.

Matricies Matricies will be denoted with bold upper-case letters, for example A. Input domain (X ) The domain from which data will be drawn.

Feature map (Φ : X → H) A function which maps a space X into another space H.

Kernel (K) A kernel, be it linear, Gaussian or polynomial, performs a similarity measure over pairs of data samples often for the sake of pattern analysis. It is the same as calculating the dot product of two data samples in feature space H, ie K(x1, x2) = (Φ(x1) · Φ(x2)), but by

not calculating the coordinates for the samples in the feature space and instead calculating the dot products of the images of the samples in the feature space a lot of computational effort can be avoided.

Feature space (H) The space that an input space X is mapped into in order to enable linear separation.

γ: Functional margin of an example with respect to a hyperplane The perpendicular dis-tance from a hyperplane to the closest member of a class element. An important value during the training of a SVM, this value indicates how big a difference the SVM currently has be-tween the classes to be separated. It is generally desired to achieve a large γ value through the training.

ζ: Slack variable In the cases where a SVM is to find a line or plane separating the training samples strictly into their respective classes but is unable to achieve this it is necessary to introduce a ”slack variable”. The slack variable turns the SVM from a ”hard-margin” into a ”soft-margin” SVM where some misclassification is permissible during training. Each training sample is given a slack variable which will add a new term to the minimization expression of the SVM: The sum of all slack variables multiplied by the cost parameter C. The slack variable is commonly initialized to 1 and training is regarded as successful when every slack variable has been minimized to zero.

L: Loss function A measurement of how badly an example was classified. L is generally positive and can be calculated according to several expressions which affects the overall training of the given algorithm. A loss function can be used instead of a slack variable depending on the formulation of the SVM.

(8)

Figure 2: A Neural Network with three input-nodes and an arbitrary number of nodes in an arbitrary number of hidden layers leading to a final output-node.

3.2 Back-Propagation Neural Network (BPNN)

A collection of nodes, grouped into layers, intended to mimic the core functionality of a biological brain to some degree. It is constructed with an initial input-layer of nodes, each node correspond-ing to some data point for example an element in a vector, followed by any number of ”hidden” layers containing additional nodes which are all connected to every node in the preceding layer with an associated weight coupled with an activation function, for example the sigmoid function. The final layer of the Neural Network is the output layer where a final output or ”decision” is produced[13]. An illustration is shown in Figure 2.

The network may be trained using the algorithm known as Back-Propagation: Whenever the net-work produces the wrong output a ”wave of correction” sweeps through the netnet-work to change the weights associated with the node connections in order to affect future decisions of the algorithm[14]. How much a weight is changed during each correction is determined to the most part by the Learn-ing Ratio, which is passed as a parameter. A large LearnLearn-ing Ratio might seem advantageous as the weights would take larger steps towards making better outputs but this could very well lead to the algorithm ”stepping over” the optimal solution and never really converging. Conversely, too small of a learning rate would give a very slow convergence. There exists solutions to the problem of having an appropriate learning rate, such as self-adaptive learning rate proposed in 2009 where the learning rate is adapted with the Taylor theorem [15]. With this method the learning rate will decrease in accordance with the rate at which the classification error of the network decreases. BPNN is quire popular, particularly for solving classification problems and has won several classi-fication competitions.

3.3 Support Vector Machines (SVM)

Often abbreviated to just ”SVM”, this is an algorithm attributed to Vapnik and Lerner who in 1962 proposed a Generalized Portrait algorithm which was later developed, in large part by Vapnik but also by others, into the algorithm that is frequently used and studied today [8,16,17,18]. The core concept of a SVM is creating a hyperplane, also known as the decision boundary, in a space which will separate two classes of points from each other. On either side of the hyperplane lies the boundaries of the margin, which is desired to be as large as possible because a wide mar-gin indicates a high separability of the classes. The training of a SVM is essentially solving for a margin of maximum width in which none of the training examples fall into. It is however not always possible to have a large margin or to even have a proper margin at all - there might not be a feasible hyperplane which perfectly separates the desired classes in the input space X . In this case a certain amount of slack can be allowed, in which some misclassification is permitted during training granted that the distance from the margin of the correct class to the individual point is

(9)

Figure 3: An illustration of a Support Vector Machine. The thick straight line represents the plane which separates the two classes with the margin γ.

not too large. Instead of a slack variable a loss function can be used, commonly the Hinge-loss function but other functions exists. A SVM allowing some level of misclassification during training is known as a ”soft-margin SVM” and a SVM not allowing any degree of misclassification of the training data is known as a ”hard-margin SVM”.

The outputs of a SVM is ±1, the signage indicating the classification.

Equation 1, which is subject to Equation 2, shows the basic problem that needs to be solved in a soft-margin SVM: Equation 1 needs to be minimized by adjusting the weight vector w, the bias term b and the slack variables ζi for the given l training samples. A training sample can be defined

as xi = {x ∈ X , y ∈ [±1]} where y is the label of the data sample. The feature map Φ can be a

simple linear map which just multiplies the input sample xi by 1 or it can be any other linear or

non-linear map. min w∈H,b∈R,ζi∈R 1 2||w|| 2₊1 l l X i=1 ζi (1) subject to yi(wTφ(xi) − b) ≥ 1 − ζi, ζi ≥ 0, i = 1, . . . , l (2)

The primal objective function for a Support Vector Machine with constraints.

During the 1990’s an addition to the SVM method was proposed [16]: The kernel trick, in which a carefully chosen kernel function K is applied to the input vectors in order to transform them into some feature space H. This addition made it possible to separate sample vectors which were not linearly separable otherwise.

3.3.1 One-class Support Vector Machines (OC-SVM)

In the year 2000, Sch¨olkopf et al formulated a One-Class Support Vector Machine which centers around the probability density of the input space through the application of a kernel[19]. It creates a hyperplane which encapsulates the spaces which the provided training samples are most frequently in. Because of this the OC-SVM is an unsupervised classifier which labels all samples which fall within the encapsulated spaces as +1 and all others as -1. The quadratic problem that

(10)

min w∈H,ζ∈Rl_,ρ∈R 1 2||w|| 2₊ 1 v · l l X i=1 ζi− ρ (3) subject to (w · Φ(xi)) ≥ ρ − ζi, ζi≥ 0 (4)

The primal OC-SVM objective function.

v is a parameter, l is the number of data samples, ρ is a bias term.

needs to be solved for a OC-SVM is given in Eq. 3 and is subject to 4. Figure 4 illustrates a OC-SVM.

Figure 4: An illustration of a One-Class Support Vector Machine. All samples which are within the boundaries set by the positive training samples are labeled +1 while all other samples are labeled -1.

3.3.2 SVM for multiclass classification

SVMs are normally binary classifiers, however, there exists simple methods to apply them to multi-class problems[20]. One method is the 1vsAll, also known as 1vsRest, approach in which one classifier is constructed for each class - see Figure 5. Every classifier is then trained using data for their respective classes as positive examples and all other classes as negative examples. As long as there is no ambiguity amongst the classes leading to an input being classified positively for multiple classes, whichever class that is assigned +1 is considered the correct classification. This approach requires comparably few classifiers, however it is susceptible to being poorly balanced due to uneven distribution of training examples.

Another method is the 1vs1 method, also known as pairwise decomposition, where k classes re-quires kk−1

2 classifiers - see Figure 6. Each classifier returns the most probable of two classes

to a given input, with the complete number of classifiers giving an exhaustive comparison of all available classes to each other. Whichever class was deemed the most probable the most times is considered the correct classification. This approach is more robust than 1vsAll though it does lead to a quadratic increase of the number of classifiers.

3.4 Data processing

To increase the capabilities of an algorithm the data can be processed in advance. Depending on the type, shape and volume of the data it might be appropriate to decrease the dimensionality, extract features or somehow model the data in a novel way. Some methods of processing data which will be used in this report will be presented below. It is important to note that the normalization described below will be used on every data set throughout all testing and evaluation before any other kind of processing is applied.

(11)

k1 k2 k3 k4 c1 c2 c3 c4 Other c1 c2 c3 c4 c5 c6 k1 k2 k3 k4

Figure 5: 1vsAll. Each classifier cn

(represented by circles) labels an input as either in the given class kn or as not

in kn.

Figure 6: 1vs1. Each classifier cn

(represented by lines between the classes kn) compares an input to every

combination of two available classes.

3.4.1 Normalization

Before any data is used in this thesis it will be normalized according to Equation 5. This is also known as feature scaling.

xnew =

x − xmin

xmax− xmin

(5)

The equation used to normalize a sample vector x.

In Equation 5 a sample x is normalized by subtracting the minimum elemental value of x from every element of x and dividing this by the scalar difference between the maximum and minimum elemental value of x. This causes every element in x to have a value between 0 and 1.

3.4.2 PCA

Principal Component Analysis (PCA) is a useful tool for dimension reduction. It is commonly used as a pre-processing step in order to improve the performance of classification algorithms by reducing the dimensions of the input data to a small number of features which still represents the data to a large degree. Depending on the data, the dimensions of a sample can be decreased from thousands to just a few without much loss in information. This is achieved by projecting the points onto the orthogonal eigenvectors of the covariance matrix. Each projection onto an eigenvector gives one principal component. [21].

3.4.3 Autoencoder

An autoencoder is a two-step method that is built on Neural Networks. In the first step a set of ”normal” data samples are fed to the auto-encoder, which will train a Neural Network that encodes the data samples. The autoencoder will then attempt to recreate the encoded samples into their original form. Given that the autoencoder has been trained to only encode and decode the desired or ”normal” samples, any anomalous sample is probable to be decoded with a noticeable error to its original form[22].

3.5 Cross-validation

A common method for thoroughly evaluating the performance of a classifier on a data set is called cross-validation, illustrated in Figure 7. The process is fairly straight-forward and is used

(12)

through-out in this thesis.

Figure 7: 3-fold cross-validation on a set with four classes. For each of the three iterations the shaded section is excluded from the training of the classifier and used for testing.

A data set is partitioned into a number of equally large groups, five and ten groups being common choices. Iteratively every group except one is used for training while the remaining group is used for testing. When every group has been used for testing an overall performance can be calculated. It is desirable that all classes are represented in each group in relation to the proportion it has in the set.

(13)

4. Related Work

In 1943 Warren McCulloch and Walter Pitts sowed the seed for what would become Artificial Neu-ral Networks[23]. They proposed a mathematical model called threshold logic which attempted to mimic some of the functionality of the neurons in a brain. The ”perceptron” was proposed by Rosenblatt in 1958, a further mathematical and computational mimicry of biological cellular functionality. From this perceptron came the first Neural Networks. It was however not until 1974 that the back-propagation algorithm was proposed, stemming from H. Kelleys work in 1960[24,25]. This invention revitalized the research into Neural Networks to the point that the major limiting factor of the algorithm was the hardware. Due to the size of the network and the large amounts of neurons necessary for it to produce a useful result the amount of time to train the network was often in the range of months, depending on the problem and application of the network. As the speed of processors increased and computer memory became more cheaply accessible the popular-ity of Neural Networks grew.

Wei Zhang et al constructed a multi-layered feed-forward parallel distributed processing model in 1990 which used the Neural Network methodology[26]. This model was capable of classifying letters even when they were tilted, shifted or distorted and would serve as the foundation for Con-volutional Neural Networks (CNN or ConvNets).

Another branch of Neural Networks is the Recurrent Neural Network (RNN), which is based on the work of Rumelhart in 1986. RNNs incorporate its own output values as inputs to itself to give temporal information. A kind of RNN is Long Short-Term Memory (LSTM) discovered in 1997 by Hochreiter and Schmidhuber which proved excellent in speech-recognition and other context-dependent applications[27,14].

In 1962 Vapnik and Lerner published an article (translated to English from Russian in 1963) about their idea of a ”Generalized Portrait algorithm” where they also gave an axiomatic definition of patterns based on decomposition of images into subsets[8]. This proposed algorithm was only applicable to linear sets of data and was highly susceptible to noise and outliers. T.M. Cover de-veloped the idea of hyperplanes for pattern separation in 1965, which laid the foundation for large margin hyperplanes[17]. After some development, this algorithm was still fairly limited as it could only be applied to linearly separable binary classes. However, this changed with the introduction of kernels which had previously been researched by Aiserman, Braverman and Rozonoer in 1964[28]. Kernels were realized as a useful tool in SVMs in 1992 by Boser, Guyon and Vapnik and enabled classification of non-linearly separable data by transforming the given data into a feature space where linear separability was possible[16].

In 1995 Cortes and Vapnik introduced the ”soft margin” where each data sample xi is assigned

a variable ζi ≥ 0[18]. During training it is attempted to find a solution where these ζi values

are minimized as they are indicators of how ill-fitting the current iteration of the classification hyperplane is. Up until this point SVMs utilized what is called a ”hard margin”, meaning that a point of data is either on the right or wrong side of the classification hyperplane with no indication of how right or wrong the samples was classified in terms of distance from the hyperplane.

There exists a myriad of adaptations of SVM for various problem solutions[8]. Xi-Zhao Wang and Shu-Xia Lu incorporated Fuzzy Logic into a SVM where a fuzzy membership value was made part of the objective function as a factor to the loss values[9]. By utilizing Fuzzy Logic some of the inherent sensitivity to outliers in ordinary SVMs was overcome. The proposed Improved Fuzzy Multi-category SVM (IFMSVM) achieved a slight but noticeable improvement to the classification scores as compared to a 1vsAll, 1vs1 and Multi-category SVM on various sets of data.

Support Vector Machines continue to be developed in the 20th _{century. Since its conception all}

SVMs used a single plane with a surrounding parallel margin to perform classification. In 2006 Mangasarian and Wild introduced the Generalized Eigenvalue Proximal SVM (GEPSVM), in which

(14)

min w(1)_,b(1)_,ζ 1 2(Aw (1)_{+ e} 1b(1))T(Aw(1)+ e1b(1)) + c1eT2ζ (6) subject to − (Bw(1)_{+ e} 2b(1)) + ζ ≥ e2, ζ ≥ 0 (7) min w(2)_,b(2)_,ζ 1 2(Bw (2)_{+ e} 2b(2))T(Bw(2)+ e2b(2)) + c2eT1ζ (8) subject to (Aw(2)+ e1b(2)) + ζ ≥ e1, ζ ≥ 0 (9)

TSVM as a constrained minimization problem. A and B are matrices containing the training vectors, w is the weight-vector, e is a 1-vector of appropriate dimensions, b is the bias-terms, c is

a trade-off constant and ζ is the slack-vector.

a plane with a maximized margin towards the given samples are replaced by two non-parallel planes of maximum distance from each other[29]. The planes are eigen-vectors gained from finding a pair of related generalized eigen-value problems with the smallest eigen-values. Each plane attempts to be as close as possible to one class while creating as large distance as possible to the other class, see Figure 8.

In 2007 Jayadeva et al proposed another multi-plane SVM which, just like Mangasarians and Wilds GEPSVM, replaces the maximum-margin classifier with two non-parallel planes[30]. The main dif-ference between these two approaches is to do with the formulation of the planes. In GEPSVM the planes are eigen-vectors while in Jayadevas approach, coined ”Twin Support Vector Machine” (TSVM), the planes are very much the same as in normal SVMs. Though they could not display a significant increase in accuracy from a conventional SVM, the TSVM did have a remarkably shorter training time since instead of a single major QP problem, as solved for in ordinary SVMs, two smaller Quadratic Programming (QP) problems are solved for the TSVM. Jayadeva et als for-mulation for this Twin SVM is given in Equations 6 - 9 as formulated in the notation convention adopted by this thesis.

min w(1) 1 2||w (1)_||2₊v1 l2 l1 X i=1 L x+_i , yi, fi(x+i) (10) min w(2) 1 2||w (2) ||2−v2 l1 l2 X j=1 L x−_j, yj, fj(x−j) (11)

TSVM as an Unconstrained Minimization Problem with a loss-function. w is the weight-vector, v is the difference of the current iteration of w and the previous iteration, l is the number of

samples, f() is the classification function.

In 2019 Sharma et al reformulated Jayadeva et als TSVM into two unconstrained minimization problems with a loss function. The objective functions are given in Equation 10 and 11. In their report Sharma et al proposed a stochastic solution to this minimization problem using a quasi-Newton method and approximations of the Hessian matrices as these are computationally expensive to calculate. This TSVM is denoted as SQN-PTWSVM (Stochastic Quasi-Newton Pin-ball Twin Support Vector Machine).

Sharma et al also made a comparison between the conventional Hinge loss function and the Pin-ball loss function with the conclusion that the latter had some prominent advantages over the former[31]. Most importantly, the Pinball loss function produces stability by removing some sensi-tivity to noisy training data, thus promoting quicker convergence. If τ is set to zero the Pinball loss function would be the exact same as the Hinge loss function. Equation 12 showcases the Pinball

(15)

Lτ(x, y, f (x)) =

(

−yf (x), if − yf (x) ≥ 0 (Incorrect classification)

τ yf (x), if − yf (x) < 0 (Correct classification) (12) The Pinball loss function. x is a data sample, y is the label of x, f(x) is the classification value of

x, τ is the penalty rate applied to correctly classified samples.

loss function. With extensive testing and comparisons of various SVMs it was shown that, on aver-age, the stochastic quasi-Newton optimization technique coupled with the Pinball loss function for a Twin Support Vector Machine was greater than both SVMs and conventional TSVMs in terms of both accuracy and training time.

Figure 8: An illustration of a Twin Support Vector Machine. A central separating plane is here replaced by two non-parallel planes - one for each group of samples.

(16)

5. Method

The given problem which this thesis centers around is finding an algorithm (or system of algo-rithms) which reliably can detect anomalies in a univariate time-series and then go on to pin-point what may be the cause of the anomalous behavior. This begs the question: Is this possible given the time-frame of this thesis? Prior knowledge suggests that this is very much possible. The question-ing then moves to how to achieve this and also how does the solution compare to a person manually solving this problem. These latter questions are the basis for the research questions RQ1 and RQ2.

Research was conducted to find what solutions have been attempted for similar problems and what the outcomes were. It was found that Support Vector Machines (SVM) have been used for anomaly detection with good results[32,33, 34]. A special case of SVM known as the One-Class SVM (OC-SVM) is particularly good for anomaly detection since it does not require any balance of the number of samples in each class. This is important when dealing with anomaly detection since anomalous samples by definition are rare in comparison with the normal samples. Additionally, other variations of SVM have been used with good results for classification [1, 2, 10]. With the 1vs1 and 1vsAll approaches it is possible to use SVM, by default a binary classifier, for multi-class classification.

Since SVM shows promise for the primary problem of anomaly detection - and can indeed also be used for multi-class classification - this thesis will focus heavily on SVM which will cause the scope of the thesis to become deep at the expense of width.

In addition to finding methods and algorithms for anomaly detection and classification it was also necessary to pre-process the data. The samples from the company was in the range of several thousand points of data, which can become unwieldy. After further research two data processing solutions with good potential were found: Principal Component Analysis (PCA) and Autoencod-ing. These methods would be used to decrease the dimensions of the data samples while still retaining any useful information and to hopefully increase the performance of the solutions.

Neural Networks are very popular and frequently considered a go-to method for all kinds of clas-sifications. As a means of comparison a Back-Propagation Neural Network (BPNN) was used as a counter-point for the OC-SVM in the case of anomaly detection as well as for the binary SVM used for classification.

5.1 Constructing a system with alternative data

The provided data set from the company had a very high resolution but a low number of labeled samples - particularly in the anomalous class. Therefore a handful of alternative data sets were gathered in order to more thoroughly evaluate the selected methods. The data sets were gathered from the UCI Machine Learning Repository[35] and the UEA & UCR Time Series Classification Repository[36]. All data sets are either drawn from the real world or produced synthetically, and are put into one of two groups: ’Binary’ and ’Multiclass’.

The ’Binary’ group consists of data with binary classes where one class will be regarded as the ”normal” occurrences and the other class represents anomalies. The number of anomalous samples will be reduced to 10% of the data set to make them seemingly anomalous. The ’Multi-class’ group of data sets will have more than two classes to enable the testing of the researched solutions in terms of classification accuracy.

(17)

5.1.1 Data sets

The data sets used in this report are presented below. The data which was acquired from the company will be referred to as ”Company data”.

Binary

Hill Valley A synthetic data set where the data, when plotted along the horizontal axis, will form steep ramps either going up then down (class +1) or down then up (class -1). There are 606 samples with 100 points in each sample. Additionally, each sample is available either with or without noise.

Dodger Loop Weekend (DLW) A real-world data set gathered from a sensor measuring the amount of traffic on the 101 North Freeway in Los Angeles, close to the Dodgers Stadium which is frequented during weekends. The classes are +1 (weekday) and -1 (weekend). There are 158 samples with 288 attributes each.

PowerCons A series of measurements of house-hold electric power consumption distributed into the two classes +1 (warm season, April-September) and -1 (cold season, October-March). There are a total of 360 samples of 144 data points each.

Strawberry Obtained using Fourier transform infrared (FTIR) spectroscopy with attenuated total reflectance (ATR) sampling, this set contains 983 samples of length 235 points each with the classes +1 (strawberry) and -1 (not a strawberry or adulterated strawberry). Company data A moment-sensor has been mounted on an arm which turns a key component in a

mechanical apparatus. This data set is made of the measured moments and the corresponding angular degree which the moment was measured at. An approved sample is labeled +1 and a failed sample is labeled -1.

Multiclass

Waveform A synthetic set containing three classes (1-3) corresponding to different waveforms. Each waveform is a combination of two out of three base-forms. In the set are 5000 samples composed of 40 attributes, including noise.

Ethanol Level The data samples are spectographs of bottles containing alcoholic beverages of four different ethanol levels which correspond to the following classes: 1) E35, 2) E38, 3) E40, 4) E45. There are 1004 samples with 1751 data points in each.

UMD A synthetic set with 3 classes (1-3), 180 samples and 150 points per sample.

Melbourne Pedestrian (MP) A real-world data set gathered from a pedestrian counting sys-tem used at 10 different locations in and around Melbourne, Australia. There are 3450 samples, each containing discrete measurements corresponding to 24 consecutive hours and each class (1-10) corresponding to a specific location.

5.2 Stochastic quasi-Newton method for Twin SVM

Sharma et al used a stochastic quasi-Newton method to optimize their Twin SVM. This method is reproduced below. Equation 10 and 11 must be re-formulated as multi-variate functions in order to apply the quasi-Newton method. The input for the functions is the corresponding weight-vector and a sample θ which contains both the data x for the sample and the label y.

The stochastic gradients ˆs are calculated according to Equation 13 and 14. Here c1and c2are

pa-rameters which dictate the importance of the classification penalties, k is +1 if the current weight for the given plane and sample incorrectly classifies the sample and −τ if it correctly classifies the sample. y is the label of the sample.

(18)

ˆ s1(w1,t, θ1,t) = (w1,t+ v1 l2 x−_t −c1 l1 ky_t(x+_t)) (13) ˆ s2(w2,t, θ2,t) = (w2,t− v2 l1 x+_t −c2 l2 ky_t(x−_t)) (14)

The stochastic gradients of Equations 10 and 11 when re-formulated as multi-variate functions.

The quasi-Newton method as described by Sharma et al is given in Equations 13 to 22, adjusted to fit the notation of this thesis. The process of the method can be described as such in plain text: While the weight-vectors have not stagnated to a certain tolerance and the maximum number of it-erations have not been exceeded, a sample from both classes are gathered. The stochastic gradients are calculated according to Equation 13 and 14 which are then used to update the weight-vectors according to Equation 15 and 16. The variable variations are then calculated as given in Equations 17 to 20. Lastly the Hessian approximation matrices are updated as in Equation 21 and 22. When the tolerance or iteration limit has been exceeded it is anticipated that the weight-vectors are fit to perform binary classification with the use of Equation 23.

w1,t+1= w1,t− αt( ˆB1,t+ ΓI)−1ˆs1(w1,t, θ1,t) (15)

w2,t+1= w2,t− αt( ˆB2,t+ ΓI)−1ˆs2(w2,t, θ2,t) (16)

Formula for updating the weights in the SQN-PTWSVM.

Equation 15 and 16 show how the weight-vectors are updated in the described quasi-Newton method. The matrices ˆB are square matrices with the same length as the weight-vectors. α is a parameter dictating the step-size of updating the weight-vectors. Γ is a regularization constant which prevents ill conditioning of the Hessian approximations. I is the identity matrix of equal size to ˆB. v1,t= w1,t+1− w1,t (17) v2,t= w2,t+1− w2,t (18) ˆ r1,t= ˆs1(w1,t+1, θ1,t) − ˆs1(w1,t, θ1,t) (19) ˆ r2,t= ˆs2(w2,t+1, θ2,t) − ˆs2(w2,t, θ2,t) (20)

Variable variations for the SQN-PTWSVM.

The only term yet introduced for the described method is the bias term δ which shall be greater than 0. ˆ B1,t+1= ˆB1,t+ ˆr1,tˆrT1,t v1,tˆr1,t − ˆ B1,tv1,tvT1,tBˆ1,t vT 1,tBˆ1,tv1,t + δI (21) ˆ B2,t+1= ˆB2,t+ ˆr2,tˆrT2,t v2,tˆr2,t − ˆ B2,tv2,tvT2,tBˆ2,t vT 2,tBˆ2,tv2,t + δI (22)

(19)

The classification for the Twin SVM is made with Equation 23. Using the two planes given by the weight-vectors w1 and w2 with their corresponding bias terms b1 and b2 the seemingly correct

class is given by whichever plane produces the smallest absolute value.

Class(i ) = arg min

i=1,2

|¯xT_w i+ bi|

||wi||

(23)

Classification formula for the SQN-PTWSVM. Here i indicates the two planes of the TSVM.

5.3 Result evaluation

To evaluate a solution, 10-fold cross-validation will be used. The predictions from a solution will be used to calculate the recall and specificities in the case of anomaly detection and the accuracies in the case of classification. These values are calculated with the equations in Figure 9(b) where the components are given in 9(a). All tests will be performed on a computer running Windows 7 with an Intel Core i5-9600K CPU running at 3.7 GHz equipped with 16 gigabytes of RAM.

TP True Positive TN True Negative FP False Positive FN False Negative

(a) Abbriviation explanation

Accuracy = T P + T N T P + T N + F P + F N Recall = T P T P + F N Specif icity = T N T N + F P (b) Evaluation equations

(20)

6. Ethical and Societal Considerations

In this thesis all data concerning personal information, company property (including data sets) and company procedures will be anonymized or omitted to the greatest extent possible. All descrip-tions and mendescrip-tions of machines, products and procedures will be made as ambiguous as possible.

(21)

7. Work process

Below will be given a thorough narration of the process, progress and obstacles of the thesis.

7.1 Algorithm research and preparation

The first step to take was acquiring the data from the company. This proved more difficult than it first seemed as the only available data was from a prototype of the testbench which was fairly limited in resolution and quality - the final version of the hardware had yet to be installed at this point in time. Proper data was promised to be delivered eventually, which was no pressing issue since a reasonable understanding of the data was gained through the prototype outputs. The data was a univariate time-series with a couple of hundred features per sample and each sample was labeled as either ”approved” or ”not approved”. With this knowledge research into potential solutions was initiated.

Classification and anomaly detection methods within Machine Learning was researched. Support Vector Machines (SVM) were understood to be a suitable family of classifiers for RQ1, as it was a binary classification problem. After some more research it was learned that SVM has been used with good results for multiclass classification by utilizing the 1vs1 approach (pairwise decomposition) and the 1vsAll approach which opened up for SVM being the topic of RQ3 as well. Though Fuzzy SVM has been shown to have a slightly higher accuracy in the case of multiclass SVM the improvement was so small that the increased complexity as compared to 1vs1 and 1vsAll was not deemed a feasible trade-off at this stage [9].

As a frame of reference for the SVM implementations Back-Propagation Neural Network (BPNN) was chosen based on its current popularity as a classifier.

From the multitude of SVMs found during the research three were chosen to be in the scope of this thesis: Standard SVM, One-Class SVM (OC-SVM) and Stochastic Quasi-Newton Pinball Twin SVM (SQN-PTWSVM)[31].

7.2 Data processing research

Eventually, data from the final version of the testbench was produced and made available. The data was composed of several thousand features per sample which was a greater number of features than anticipated. This prompted the need to decrease the size of the data samples as the raw data was bulky and cumbersome. Further research was conducted which covered Independent Component Analysis (ICA), subtractive clustering, Fuzzy c-means clustering and more[37,38,39]. Eventually it was settled for Principal Component Analysis (PCA) to decrease the dimensionality of samples. It was also learned about the positive effects which autoencoders can have on classification performance, which led to incorporating this into the thesis as well.

7.3 Anomaly detection implementation

It was thought best to tackle the research questions chronologically, meaning that the capabilities of OC-SVM on anomaly detection was the first to be implemented and evaluated in conjunction with a BPNN. For the sake of rigorous testing and demonstrating the versatility of the implemen-tations additional data sets were gathered from online sources, such as the UCI Machine Learning Repository and timeseriesclassification.com[35].

Using MatLab, a BPNN was set up and its accuracy was tested on the data sets gathered for the sake of anomaly detection through 10-fold cross-validation. A OC-SVM was also set up and tested in the same fashion with three different kernels: Linear, Gaussian and Polynomial. Once the base-line performance of these two methods was established the data processing methods were applied and the results documented. To gain a solid understanding of the effects of the data processing methods the classifiers were initially tested with the processing methods separately before being tested with both methods applied simultaneously. All results were recorded and can be found in the corresponding subsection in section Results.

(22)

Figure 10: An illustration of the iterative method used for this thesis.

7.4 Classification

The final version of the testbench which the company had built features multiclass labeling capa-bilities. These had yet to be used for the data which had been made available for this thesis. There was hope that multiclass-labeled data would be produced in the time-span of this thesis, however this did not happen. Instead, more data sets were gathered to test the general performance of the implemented classifiers.

The testing procedure of the classifiers were very much the same as for anomaly detection: Each solution was tested with 10-fold cross-validation on each data set and each SVM implementa-tion was tested with the linear, Gaussian and polynomial kernel. Once a baseline performance was established the solutions were tested together with PCA. All the results are available in the corresponding subsection in Results.

7.5 SQN-TWSVM

Due to its reportedly good performance, Sharma et als Stochastic Quasi-Newton Pinball Twin SVM (SQN-PTWSVM) was implemented based on their descriptions[31]. It was hoped that this new and promising SVM would provide even better results than the other solutions. No conclusive results could be made however, even after weeks of work. The notation and descriptions used in their report was ambiguous and unclear at parts and there were no recommendations to the ranges of several parameters, making it difficult to properly reproduce this algorithm. When it was believed that a functional SQN-PTWSVM had been created matrix search was used to iteratively find functional parameter values. With extensive testing the results were still not very satisfying and this method was eventually abandoned.

(23)

8. Results

In the sections below all algorithms and their variations will be discussed with respect to their individual results. The first subsection will cover the results of the implemented anomaly detec-tion soludetec-tions while the second subsecdetec-tion will cover the multiclass classificadetec-tion results. A final comparison of all the algorithms in their respective groups (anomaly detection and multiclass classification) will be summarized in the third subsection.

For every test, the random number generator of Matlab will be set to default for the sake of test–retest reliability. 10-fold cross-validation is used exclusively on all data sets as is normaliza-tion according to Equanormaliza-tion 5.

Back-Propagation Neural Network will be abbreviated to simply ”NN” with the subscript indicat-ing the number of hidden layers.

8.1 Anomaly detection results

In the following sections the performance of BPNN and OC-SVM will be compared.

The tables below will show the performance of BPNN and OC-SVM with and without data pre-processing in the form of the recall and specificity proportions as well as the training and testing times. As a reminder, the Recall is the proportion of correctly classified positive samples and Specificity is the proportion of correctly classified negative samples. As this subsection concerns itself with anomaly detection the specificity is of greater interest than the recall as it is preferable to not give any sample the benefit of the doubt.

For each test the anomalous samples, labeled -1, will make up 10% of the training data in order for them to be regarded as ”rare” occurrences. For the testing data the original relation of negative samples to positive samples determines the proportional occurrence of the classes.

8.1.1 Baseline performance

Table 1 and Table 2 shows the recall and specificity of the NN and OC-SVM solutions without any pre-processing other than normalization. It is important to look at both tables when evaluating these results: Only looking at Table 1 might give the false impression that all the solutions have quite good performance and that the OC-SVM with a Gaussian kernel is the best of them. Looking at Table 2 however reveals that the Gaussian OC-SVM failed to catch a single anomalous sample in three out of six data sets. It goes without saying that this is dreadful for an anomaly detection method.

As for the other solutions they all had a mixed score. For example the NN3 achieved a recall in

the range of 94-100% but a specificity of 8-97%. Both the linear and polynomial OC-SVMs scored

Table 1: Anomaly Detection baseline recall

Method

Data set _HV (smooth)

HV

(noisy) DLW PowerCons Strawberry Company data

NN1 1 0.98 0.99 0.98 1 0.93

NN3 1 1 0.99 0.94 1 0.99

OC-SVM (Lin.) 1 1 0.99 0.97 0.98 0.99

OC-SVM (Gaus.) 1 1 1 1 1 1

OC-SVM (Poly.) 1 1 0.99 0.98 1 0.98

This table shows the performance of the anomaly detection methods in terms of their recall on a variety of data sets. No pre-processing has been applied to the data other than normalization.

(24)

a bit more balanced, having similar recalls as the NN3 but specificities in the ranges 32-100% and

45-100% respectively.

Table 2: Anomaly Detection baseline specificity

Method

HV

NN1 0.93 0.19 1 0.74 0.91 0.45

NN3 0.11 0.08 0.93 0.63 0.97 0.29

OC-SVM (Lin.) 1 1 0.93 0.77 0.65 0.32

OC-SVM (Gaus.) 1 0.25 0 0 0.79 0

OC-SVM (Poly.) 1 0.97 0.93 0.77 0.96 0.45

This table shows the performance of the anomaly detection methods in terms of their specificities on a variety of data sets. No pre-processing has been applied to the data.

From the vast variation of these scores it is clear that there is room for improvement, particularly on the PowerCons, Strawberry and Company data sets where not a single method achieved 100% specificity. Most of the methods did score quite well on the Dodger Loop Weekend set and the smooth Hill-Valley set.

In Table 3 the training and testing times of all the solutions are shown. Here it is very clear that the Neural Networks are far slower than the OC-SVMs. However, the times are still measured in at most tens of seconds.

Table 3: Anomaly Detection baseline performance times

Method

HV

NN1 3.46 1.36 1.87 0.96 1.31 4.17

NN3 3.52 1.73 3.08 1.61 3.56 56.88

OC-SVM (Lin.) 0.35 0.72 0.05 0.08 0.54 0.80 OC-SVM (Gaus.) 0.17 0.55 0.06 0.07 0.10 0.58 OC-SVM (Poly.) 0.29 0.22 0.05 0.06 6.59 0.70

The times, measure in seconds, for performing 10-fold cross-validation on the anomaly detection methods. No pre-processing has been applied to the data.

8.1.2 Performance with PCA

In an attempt to improve performance and possibly decrease training and testing time PCA was applied to the data sets. The evaluation procedure was performed iteratively in order to find the smallest number of principal components which still gave a satisfying performance. The results are presented in Table 4 and Table 5 with the number of principal components presented in Table 7. The training and testing times are presented in Table 6.

For this test no solution achieved a specificity lower than 16%, an improvement from the baseline where the Gaussian OC-SVM scored 0% on three sets. Regarding the specificity, the Gaussian OC-SVM gained a lot from the application of PCA - its specificity on the Dodger Loop Weekend data set rose from 0% to 86% - but it still remains the weakest classifier with a specificity as low as 16% on the Company data set. Meanwhile, all classifiers except the Gaussian OC-SVM achieved a perfect specificity of 100% on the Dodger Loop Weekend set. The polynomial OC-SVM was alone

(25)

Table 4: Anomaly Detection recall with PCA

Method

HV

NN1 1 0.97 1 0.98 1 0.98

NN3 1 0.97 1 0.97 0.99 0.99

OC-SVM (Lin.) 1 1 0.97 0.94 0.99 0.98

OC-SVM (Gaus.) 1 1 1 0.99 1 0.98

OC-SVM (Poly.) 1 1 1 0.99 0.99 0.96

Recall of the anomaly detection methods with PCA applied to the data sets. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

in managing 100% recall and specificity on more than two data sets.

At this point it seems as if all the methods perform comparatively well to each other. The biggest obstacle appears to be the Company data set where only one method managed a specificity greater than guessing-level.

Table 5: Anomaly Detection specificity with PCA

Method

HV

NN1 0.99 0.47 1 0.80 0.96 0.42

NN3 0.71 0.33 1 0.86 0.99 0.39

OC-SVM (Lin.) 1 1 1 0.74 0.58 0.32 OC-SVM (Gaus.) 1 1 0.86 0.71 0.79 0.16 OC-SVM (Poly.) 1 1 1 0.80 0.90 0.61

Specificity of the anomaly detection methods with PCA applied to the data sets. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

Though PCA seems to have given the solutions an overall boost in performance, there were some decreases as well. Several methods lost 1-3% in recall on some sets while the linear and polynomial OC-SVMs lost 8% and 6% specificity on the Strawberry set respectively. The gains are however greater than the losses.

Regarding the times, almost all solutions received a positive effect from PCA with a few exceptions where the time was increased. For example the polynomial OC-SVMs time on the Strawberry set increased from 6.59 seconds to 19.37 seconds and on the Company data set from 0.70 to 4.00 seconds. This is however the by far greatest increase in time of all the methods, all other increases were but 0.01 seconds which could very well be due to rounding-off errors within Matlab.

On the other hand, the NNs gained greatly from PCA. The NN3 sped up from 56.88 seconds to

0.91 seconds - a decrease of 98.4%. When it comes to dimension reduction the NNs are the biggest benefactors, mainly due to their initial time consumption as compared to the OC-SVMs.

The capabilities of PCA is shown in Table 7 as only 3 to 5 principal components on the Dodger Loop Weekend set are used out of 288 available - yet the recall and specificities in earlier tables show all-around good results.

(26)

Table 6: Anomaly Detection times with PCA

Method

HV

NN1 1.11 1.03 0.91 0.90 0.88 0.79

NN3 1.66 1.31 1.28 0.95 1.00 0.91

The times, measure in seconds, for performing 10-fold cross-validation on the anomaly detection methods with PCA applied to the data. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

Table 7: Number of principal components used for the Anomaly Detection methods with PCA applied

Method

HV

NN1 14 9 4 10 16 10

NN3 23 40 4 5 13 9

OC-SVM (Lin.) 1 1 3 7 13 11

OC-SVM (Gaus.) 1 1 3 3 9 3

OC-SVM (Poly.) 1 1 5 6 5 26

This table shows the number of principal components used for each combination of Anomaly De-tection method and data set for testing the impact of PCA.

8.1.3 Performance with autoencoders

Another attempted way to increase performance was by utilizing autoencoders. The test setup here was the same as in previous experiments: 10-fold cross-validation on normalized data where the training data is made to consist of 10% anomalous samples while the testing data has a the original proportion of anomalies to normal samples.

Table 8: Anomaly Detection recall with autoencoders

Method

HV

NN1 1 1 1 0.99 1 1

NN3 1 0.99 1 1 1 1

OC-SVM (Lin.) 1 1 1 1 1 1

OC-SVM (Gaus.) 1 0.95 1 0.99 1 1

OC-SVM (Poly.) 1 0.91 1 0.99 1 1

Recall of the anomaly detection methods with autoencoders applied to the data sets.

(27)

in Table 1. Every method managed to achieve perfect recall on at least four out of six data sets - four methods even got 100% on every data set. The overall lowest recall was 91%, done by the polynomial OC-SVM.

Table 9: Anomaly Detection specificities with autoencoders

Method

HV

NN1 0 0 1 0.86 1 1

NN3 0.97 0.97 1 0.94 1 1

OC-SVM (Lin.) 0.01 0.01 1 0.66 1 1 OC-SVM (Gaus.) 0.82 0.59 1 0.69 1 0.61

OC-SVM (Poly.) 0.79 1 1 0.91 1 1

Specificities of the anomaly detection methods with autoencoders applied to the data sets.

On the other side of the evaluation, the application of an autoencoder was a blessing in some cases and a curse in other. For example, the linear OC-SVM which scored 100% baseline speci-ficity on both of the Hill-Valley sets now scored just 1%. Simultaneously, the Neural Networks were close to polar opposites to each other on this same set: The NN1 managed 0% specificity

and the NN3 managed 97%. This latter NN only managed 11% and 8% in baseline specificity,

highlighting the fact that an autoencoder has a large effect on the data. Whether this effect is positive or negative is unclear at this point, as it seems to depend on both the data and the method.

Table 10: Anomaly Detection times with autoencoders

Method

HV

NN1 2.44 7.40 2.25 1.75 6.13 106.65

NN3 27.49 28.53 20.97 6.93 45.80 805.98

The times, measure in seconds, of the anomaly detection methods with autoencoders applied to the data sets.

As for the effect of autoencoders on time the results were clear. Only two data set/method com-binations had its time reduced while all others had it increased. This increase was in several cases by no small amount either. Primarily the Neural Networks suffered with the application of an autoencoder with some times increasing by a factor of 16 from 1.73 seconds to 28.53 seconds. The by far largest time measured in this thesis can be found in Table 10 at 805 seconds.

It is worth mentioning that the OC-SVMs also had their times increased a fair lot. The polynomial OC-SVM had its time on the noisy Hill-Valley set increased from 0.22 seconds to 11.52 - a factor of 52. It is also worth mentioning that autoencoders were not employed to in any way better the times of the methods but to boost the recall and specificity, however it can be of interest to study its effects regarding time as well.

(28)

Table 11: Anomaly Detection recall with PCA and Autoencoders

Method

HV

NN1 1 1 1 0.99 1 1

NN3 1 1 1 1 1 1

OC-SVM (Lin.) 1 1 1 1 1 1

OC-SVM (Gaus.) 1 1 0.99 1 1 1

OC-SVM (Poly.) 0.99 1 0.99 1 1 1

Recall of the anomaly detection methods with PCA and autoencoders applied to the data sets. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

8.1.4 Performance with PCA and autoencoders

The final set of experiments for anomaly detection combines both of the data processing tech-niques: PCA and autoencoders. For all tests PCA will be applied on the data first, followed by autoencoders. It would have been interesting to investigate the effect of reversing this order but time is a factor. The order was decided on the fact that PCA was implemented to decrease the dimension of the samples to hasten the trial process. Having an autoencoder, in essence a double Neural Network, handle samples in the range of hundreds and even thousands of points followed by dimension reduction seemed absurd.

The two methods were implemented sequentially and iteratively evaluated in order to find the best number of principal components. Initially the dimensions of the data samples were reduced to a single principal component and fed into the autoencoder which attempted to create a model to recreate the positive samples. Gradually the number of components were increased until the performance stagnated.

The recalls for the solutions as presented in Table 11 shows that the vast majority of the recall scores either are perfect or close to perfect. Two solutions achieved perfect recalls on all sets: The NN3 and the linear OC-SVM. All other solutions missed having an all-together perfect score by

just 1% on one or two sets each.

Table 12: Anomaly Detection specificity with PCA and Autoencoders

Method

HV

NN1 0.91 0.76 1 0.91 1 1

NN3 1 1 1 1 1 1

OC-SVM (Lin.) 0.59 1 1 1 1 1

OC-SVM (Gaus.) 0.78 1 1 1 1 1

OC-SVM (Poly.) 0.81 1 1 1 1 1

Specificities of the anomaly detection methods with PCA and autoencoders applied to the data sets. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

The specificities in Table 12 are in large part almost as good as the recalls in Table 11, indicating a very good performance. The most challenging data set appeared to be the smooth Hill-Valley set on which only one method scored a perfect 100% specificity: The NN3. The NN1 is arguably

(29)

the worst performing anomaly detector, scoring 76-91% on three sets while all other methods at most failed a perfect score on one and the same set, namely the aforementioned Hill-Valley set. This could be a testament to the fact that Neural Networks achieve better performance as they grow in size.

Table 13: Anomaly Detection times with PCA and Autoencoders

,

Method

HV

NN1 5.93 3.70 2.54 2.83 2.59 2.52

NN3 9.04 8.85 3.66 6.44 8.62 5.58

The times, measure in seconds, of the anomaly detection methods with PCA and autoencoders applied to the data sets. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

Table 13 clearly shows the difference in times between the Neural Nets and the OC-SVMs: The worst times for the OC-SVMs are better than the best times for the Neural Networks.

It can be of interest to compare the number of principal components used for anomaly detection with only PCA applied and when both PCA and autoencoders have been applied, as seen in Tables 7 and 14. It can be seen that on average the numbers seem unchanged - some are a bit higher while some are a bit lower. For the Company data set the number of components used change from the range of 3-26 to 2’s on all methods except one which use 10 components, in other words a large overall decrease for this particular set. At the same time, however, the smooth Hill-Valley set seems a lot more challenging for the OC-SVMs. Initially these methods only needed a single component to manage great scores on this set, but at this point they need 4 and all the way to 40 components without even performing as well as they did without autoencoders. Autoencoders are evidently not fit for every situation.

Table 14: Number of principal components with PCA and autoencoders

Method

HV

NN1 14 14 4 10 16 10

NN3 5 4 2 7 3 2

OC-SVM (Lin.) 30 3 2 4 2 2

OC-SVM (Gaus.) 4 3 3 4 2 2

OC-SVM (Poly.) 19 3 2 4 2 2

This table shows the number of principal components used for each combination of Anomaly De-tection method and data set for evaluating the impact of both PCA and autoencoders being applied to the data.

(30)

8.2 Classification results

The recorded accuracies and times for the selected classification methods will be presented in the following sections. The selected methods are Neural Networks (one and three hidden layers), lin-ear/Gaussian/polynomial 1vsAll SVMs and linlin-ear/Gaussian/polynomial 1vs1 SVMs.

8.2.1 Classification baseline performance

Below all classification solutions and their baseline accuracies as well as their times are presented.

Table 15: Classification baseline accuracy

Method

Data set

Waveform data Ethanol data UMD MP

NN1 0.58 0.74 0.95 0.10 NN3 0.74 0.85 0.98 0.24 1vsAll SVM (Lin.) 0.78 0.23 1 0.61 1vsAll SVM (Gaus.) 0.84 0.03 0.78 0.80 1vsAll SVM (Poly.) 0.81 0.62 1 0.86 1vs1 SVM (Lin.) 0.85 0.62 1 0.84 1vs1 SVM (Gaus.) 0.85 0.34 0.94 0.89 1vs1 SVM (Poly.) 0.82 0.90 1 0.89

The baseline accuracies of the selected classification methods. No pre-processing has been applied to the data.

As can be seen in Table 15 the methods perform vastly different on the data sets. For example, the Gaussian 1vsAll SVM had an overall good performance on every set except the Ethanol data set where it scored just 3%. Meanwhile, the NN1scored acceptably on all sets except the Melbourne

Pedestrian data set where it only classified 10% correctly. At this initial step it can be seen that the polynomial 1vs1 SVM is the on average best classifier, scoring higher than any other classifier on every set except the Waveform data set where it was 3% under the linear and Gaussian 1vs1 SVMs and 2% under the Gaussian 1vsAll SVM.

Table 16: Classification baseline times

Method

Data set

NN1 1.42 43.37 0.94 1.16 NN3 2.07 250.37 2.01 2.05 1vsAll SVM (Lin.) 6.46 23.76 0.16 8.77 1vsAll SVM (Gaus.) 6.02 17.02 0.17 5.79 1vsAll SVM (Poly.) 200.64 251.28 1.39 55.50 1vs1 SVM (Lin.) 1.95 3.30 0.13 3.14 1vs1 SVM (Gaus.) 2.21 2.98 0.15 3.09 1vs1 SVM (Poly.) 44.14 54.27 1.04 9.47

The baseline times, measure in seconds, of the selected classification methods. No pre-processing has been applied to the data.

(31)

other due to the quite consistently larger time one of them requires on training and testing. However, it is not the 1vs1 approach with its greater number of classifiers that takes the longest time -it is instead the 1vsAll approach. This could be due to the classifiers in the 1vs1 approach training on significantly smaller sets of data than the 1vsAll approach. The 1vsAll approach trains every classifier on all of the training data every time, while the 1vs1 only trains its classifiers on two subsets of the training data.

The overall speed of the methods tested here is difficult to compare as some methods are faster than others on some data sets but slower on others. What can be said though is that the poly-nomial methods, particularly the 1vsAll SVM, take much longer time than any other method on almost every data set.

8.2.2 Classification performance with PCA

Below are the accuracies and times for all the classification methods with PCA applied to the data. The number of principal components used for each method and data set can be found in Table 19.

Table 17: Classification accuracies with PCA

Method

Data set

NN1 0.59 0.95 1 0.17 NN3 0.75 0.95 1 0.33 1vsAll SVM (Lin.) 0.78 0.23 0.99 0.61 1vsAll SVM (Gaus.) 0.78 0.23 0.99 0.61 1vsAll SVM (Poly.) 0.85 0.62 1 0.86 1vs1 SVM (Lin.) 0.85 0.62 1 0.84 1vs1 SVM (Gaus.) 0.85 0.34 1 0.89 1vs1 SVM (Poly.) 0.85 0.89 1 0.88

Accuracies of the classification methods. Each combination of classification method and data set uses a number of principal components according to what gives the highest accuracy.

As can be seen in Table 15 the accuracies vary quite a bit between both the data sets and the methods and the kernels. The NN1achieved the lowest overall accuracy of 17% which was closely

followed by the linear and Gaussian 1vsAll SVMs with 23%. The best performing classifier at this stage is the polynomial 1vs1 SVM, having accuracies within the range of 85-100%.

In Table 17 it is shown that the time difference between the 1vsAll and 1vs1 SVMs still hold true. There is just one minor exception to this relationship, namely the polynomial SVMs on the UMD set, where the 1vsAll approach is 0.3 seconds faster.

Regarding the effect of PCA on the times it can be seen that it is not only positive. The NN3

did get its time on the Ethanol data set diminished from 250 seconds to almost 2 seconds and the polynomial 1vsAll SVM had its time on the Waveform data set cut from 200 seconds to 5 seconds. However, in some places the times did go up slightly. An example is the polynomial 1vsAll SVM on the Melbourne Pedestrian data set: Here the time went from 55.50 seconds to 59.04 seconds. The increases were by no means as dramatic and damning as the decreases and in this regard PCA can still be regarded as a positive influence.

MACHINE LEARNING FOR MECHANICAL ANALYSIS

V¨

aster˚

as, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics

30.0 credits

MACHINE LEARNING FOR

MECHANICAL ANALYSIS

Sebastian Bengtsson

sbn12009@student.mdh.se

Examiner: Ning Xiong

M¨

alardalen University, V¨

aster˚

as, Sweden

Supervisors: Martin Ekstr¨

om

M¨

alardalen University, V¨

aster˚

as, Sweden

Company supervisor: Henrik Grunditz,

Prevas, V¨

aster˚

as, Sweden

Per ˚

Ahman,

Prevas, V¨

aster˚

as, Sweden

June 20, 2019

Table of Contents

1.

Introduction

2.

Problem Formulation

2.1

Hypothesis

2.2

Research Questions

2.3

Limitations

3.

Background

3.1

Concepts and notations

3.2

Back-Propagation Neural Network (BPNN)

3.3

Support Vector Machines (SVM)

3.4

Data processing

3.5

Cross-validation

4.

Related Work

5.

Method

5.1

Constructing a system with alternative data

5.2

Stochastic quasi-Newton method for Twin SVM

5.3

Result evaluation

6.

Ethical and Societal Considerations

7.

Work process

7.1

Algorithm research and preparation

7.2

Data processing research

7.3

Anomaly detection implementation

7.4

Classification

7.5

SQN-TWSVM

8.

Results