
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2021 | LIU-IDA/STAT-A--21/036--SE

Machine Learning to Detect Anomalies in the Welding Process to Support Additive Manufacturing

Vinod Kumar Dasari

Supervisor: Amanda Olmin
Examiner: Krzysztof Bartoszek


Upphovsrätt (Copyright)

This document is made available on the Internet, or its possible replacement, for a period of 25 years from the date of publication, barring exceptional circumstances.

Access to the document implies permission for anyone to read, download, and print single copies for individual use, and to use it unchanged for non-commercial research and for teaching. Subsequent transfers of copyright cannot revoke this permission. All other use of the document requires the consent of the copyright owner. To guarantee authenticity, security, and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or integrity.

For additional information about Linköping University Electronic Press, see the publisher's website http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Additive Manufacturing (AM) is a fast-growing technology in manufacturing industries. Applications of AM are spread across a wide range of fields. The aerospace industry is one of the industries that use AM because of its ability to produce light-weighted components and its design freedom. Since the aerospace industry is conservative, quality control and quality assurance are essential. The quality of the welding is one of the factors that determine the quality of the AM components; hence, detecting faults in the welding is crucial. In this thesis, an automated system for detecting faults in the welding process is presented. For this, three methods are proposed to find anomalies in the process. The process videos that contain weld melt-pool behaviour are used in the methods. The three methods are 1) Autoencoder method, 2) Variational Autoencoder method, and 3) Image Classification method. Methods 1 and 2 are implemented using Convolutional-Long Short Term Memory (LSTM) networks to capture anomalies that occur over a span of time. For this, instead of a single image, a sequence of images is used as input to track abnormal behaviour by identifying the dependencies among the images. The training of these methods to detect anomalies is unsupervised. Method 3 is implemented using Convolutional Neural Networks; it takes a single image as input and predicts whether the process image is stable or unstable. Its learning is supervised. The results show that, among the three models, the Variational Autoencoder model performed best in our case for detecting the anomalies. In addition, it is observed that in methods 1 and 2, the sequence length and the number of frames retrieved per second from the process videos have an effect on model performance. Furthermore, it is observed that considering the time dependencies is very beneficial in our case, as the difference between the anomalous and the non-anomalous process is very small.


Acknowledgments

I would like to express my greatest gratitude to my supervisor Amanda Olmin for her supervision throughout the thesis and for making it easy for me to ask even trivial questions. Thank you so much for your continuous support, the material you provided me, and your valuable suggestions and feedback. I would like to thank my examiner Krzysztof Bartoszek for his valuable feedback and suggestions.

I would like to thank my industrial supervisor Jonatan Palmquist from GKN Aerospace for his support, for helping me with data whenever I needed it, for sharing his domain knowledge with me, and for his feedback and discussions.

I would especially like to thank my mentor and sister Siva Krishna Dasari for guiding me throughout my Master's studies, for her suggestions about my career, and for her feedback. I am very thankful for her continuous care and support. I would like to thank Markus Wejletorp for inspiring me in every aspect, for helping me settle down in Sweden, and for his continuous care and support. A special thanks to Bodil and Per for their support, for letting me spend all my holidays with them, and for the fun times together.

Finally, I would like to thank my family in India: my mother Simhachalam, father Naidu, grandmother Ramudamma, aunt Satyavathi, and my two other sisters Ramana and Lakshmi, for believing in me and supporting me all the time.


Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures
1 Introduction
1.1 Background
1.2 Motivation
1.3 Aim
1.4 Research Questions
1.5 Delimitations
2 Theory
2.1 Artificial Neural Networks
2.2 Convolutional Neural Networks
2.3 Recurrent Neural Networks (RNNs)
2.4 Long Short Term Memory (LSTM)
2.5 Convolutional LSTM
2.6 Autoencoder
2.7 Variational Autoencoder
3 Method
3.1 Data
3.2 Autoencoder Models
3.3 Method3: Binary Classification Model
3.4 Model Evaluation Metrics
4 Results
4.1 Model Evaluation Process
4.2 Threshold
4.3 Model Evaluation Results
4.4 The Process Video Anomaly Detection
5 Discussion
5.1 Effect of Sequence Length on Autoencoder Model Performance
5.2 Effect of Frames Extracted Per Second (FPS) on Autoencoder Model Performance
5.3 Performance Differences
5.4 Effect of Camera Setup
6 Conclusion
6.1 Conclusion
Bibliography


List of Tables

4.1 Comparison of the three proposed models
6.1 Model performances with different settings
6.2 Model performances with different settings


List of Figures

2.1 Neuron computation
2.2 Artificial Neural Network
2.3 Max pooling
2.4 Standard RNN structure
2.5 Architectures of standard RNN and LSTM
2.6 Gate
2.7 Forget gate layer
2.8 Input gate layer
2.9 Cell state update
2.10 Output gate layer
2.11 Autoencoder architecture
2.12 Variational Autoencoder architecture
3.1 Process images
3.2 Training process for Autoencoder models
3.3 Autoencoder architecture
3.4 Variational Autoencoder architecture
3.5 Classification model architecture
4.1 ROC curve for selecting optimal threshold value
4.2 Barplot showing Precision scores for models
4.3 Barplot showing Recall scores for models
4.4 Plot showing anomalies captured by the model
4.5 Plot showing high regularity score throughout the process indicating stable process

1 Introduction

1.1 Background

GKN Aerospace supplies engine parts and develops technologies used in military and commercial aircraft that help them fly faster, safer, and with greater fuel efficiency. The company specializes in manufacturing and maintaining components of engine systems, aircraft structures, and cabin windows for aircraft. It is known for its expertise in a wide range of engine components, covering compressor exhaust cases and intermediate cases for civil and military aircraft engines, as well as nozzles and turbines for the most advanced space rockets. To reduce fuel consumption, and in turn the emissions from aviation, the company is developing specialized technologies and applying them in the production process. One of the technologies that the company is using in production is additive manufacturing, which makes components lightweight by manufacturing them in one piece.

Additive Manufacturing (AM) is a layer-by-layer addition of material to produce a three-dimensional object [12]. An AM process uses a computer-aided design for component production, which increases the design freedom and results in light-weighted components. Thus, AM can produce components with complex internal geometries which would otherwise have to be manufactured by joining multiple components. This process reduces material wastage and manufacturing time. AM is well suited for manufacturing a variety of customized components because the design needs to be altered only in the CAD model to modify any component. AM technology is applied in various domains, such as medical [20], aviation and aerospace [27], automotive and tool manufacture [31], and several other manufacturing industries [24]. Among the several AM techniques, the Laser Metal Deposition technique is preferred when it comes to manufacturing large components or extending an existing component [13].

Laser Metal Deposition (LMD) is an AM technique where a metal component is manufactured through layer-by-layer deposition of the metal [1]. Usually, metal wire or metal powder is deposited on a metal substrate. The metal wire or powder is continuously fed from a nozzle for deposition. The base metal substrate and the metal fed through the nozzle are melted simultaneously and fused. The melting is done with the help of a laser beam. A melt pool is formed at the joint as the laser beam is applied throughout the process. The properties of the generated melt pool are good indicators for identifying the stability of the welding process [28].


Despite the several advantages gained through AM techniques, their use is still a challenge because of quality assurance, as it is a newly evolving technology. The application of AM techniques is limited in producing flying components because the aerospace industry is very conservative, and providing quality assurance for an AM-manufactured component is a key factor. The current quality control measures for AM components carried out at GKN Aerospace are fixing the process parameters and ensuring that they are the same throughout the process. All components are inspected with surface and volumetric non-destructive testing (NDT) techniques such as X-ray and ultrasonic scanning. Furthermore, manual inspection by the operator is a standard approach used to identify faults during the process.

1.2 Motivation

Additive manufacturing is a fast-growing technology in the production industry. Welding is a key process in producing the product in welding-based AM processes. Therefore, the quality of the weld is one of the factors that determine the quality of the AM component. Detecting faults in the welding process is a key step in the production process; hence, monitoring of the welding process is essential. Furthermore, it helps in the quality control of the product. One of the existing approaches for monitoring this process is human inspection. However, human inspection is expensive and not suitable at all times. For instance, based on the size of the component, some processes take a long time to manufacture, making the process tedious. Long production times make the monitoring difficult and also prone to human inaccuracy.

Neethu N.J. et al. [22] discuss the role of computer vision in automatic inspection and the advantages of using computer vision. They state that it is easy, quick, cost-effective, efficient, and robust to collect data using machines. Using automated tools makes it possible to record data about the process and store it for future analysis. Hence, monitoring and ensuring quality control through these automated tools makes the process much faster and more accessible than manual inspection. Jiang et al. [16] presented the merits of using hybrid inspection systems where humans work in cooperation with automated systems. They also reported an increase in accuracy when using these systems. In addition, with the advent of machine learning and deep learning technologies, the usage of automated systems has increased. Furthermore, computer vision is used for quality control in diverse applications, for example, textile quality control [2], abnormal crowd behavior [25], inspection and quality control of fruits and vegetables [30], and automated machine visual inspection [32]. The use of computer vision addresses the problems with human inspection mentioned above.

The information acquired from the melt pool behavior is an important factor that is used for traceability if the subsequent NDT methods find any indication of faults. Furumoto et al. [10] focused on the melt pool behavior of metal powder and the influence of the metal substrate temperature on the build quality of the component, detecting the spatter particles using high-speed imaging. This approach is used to identify the droplets formed during the melting process and is limited to finding the formation of spatter particles. Siva et al. [9] used the traditional machine learning method Random Forests to classify melt pool images in a supervised manner. Caggiano et al. [7] used a machine learning approach based on a Deep Convolutional Neural Network (DCNN), where the DCNN finds the features and embedded patterns relevant to the specified fault processes from the melt pool images captured at each layer. Since it is supervised, it is limited to identifying the known and specified faults.

Unlike the supervised methods cited above [7, 9, 10], in this study, Autoencoder architectures are implemented in an unsupervised manner to detect the faults of the process. The proposed method aims to identify all the faults of the welding process which can be identified through monitoring. However, all the fault instances in any process are unknown beforehand. Furthermore, it is difficult to collect all the data relevant to faults, as they occur rarely in any process.


Hence, it is impossible to build a model which learns all the irregularities. How, then, can we address the problem of detecting the faults without knowing all of them? A popular approach for this type of problem is to exploit the abundant stable data, which is observed most of the time, to identify anomalies. The model learns all the regular patterns during the training phase, and anomalies are then flagged when the process deviates significantly from the learned patterns. Some example studies where only stable data is used for anomaly detection are [8, 21, 33]. Cewu Lu et al. [21] implemented sparsity-based abnormality detection to identify abnormal events captured by surveillance cameras. Cong et al. [8] used a Multi-scale Histogram of Optical Flow (MHOF) to identify abnormal events using only normal positive training samples, using weighted L1 minimization to construct a sparse reconstruction cost for abnormal events. Zhao et al. [33] used 3-dimensional convolutions to extract features from the spatial and temporal dimensions, which are used to identify anomalous events in video scenes.

1.3 Aim

The main aim of this thesis is to implement an automated fault detection system for welding faults in the Laser Metal Deposition technique of an AM process. The purpose of the developed system is to certify that the manufactured component is produced without any defects. For this purpose, machine learning methods and statistical inference are used to address the studied problem. The methods are 1) Convolutional-LSTM Autoencoder, 2) Variational Convolutional-LSTM Autoencoder, and 3) Binary Classification model with CNN. The first two methods build generative models and the last one builds a discriminative model, all based on image data generated by a camera during the welding process. Similar to the example studies [8, 21, 33], only stable process data is used in the proposed generative methods to detect the anomalies for the studied application, whereas in the discriminative model, both stable and unstable process data are used.

1.4 Research Questions

More specifically, the research questions of the thesis are the following.

• How can machine learning techniques be used effectively

1. to extract useful information from process data and help to reduce the time spent on manual monitoring?

2. to facilitate the certification of AM products by identifying faults in the welding process?

1.5 Delimitations

The study is conducted using only the welding process data generated by GKN Aerospace. The generated process data include process videos and numerical parameters of the welding robot. Only the process videos are used in this work.

2 Theory

Three methods are implemented in this work: 1) Convolutional-LSTM Autoencoder model, 2) Variational Convolutional-LSTM Autoencoder model, and 3) Binary Classification model. In methods 1 and 2, LSTM networks are used in combination with CNNs. These two methods are based on the Autoencoder architecture. In method 3, only CNNs are used to classify the images. In this chapter, Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs), on which the models are based, are explained. These are followed by descriptions of Autoencoder and Variational Autoencoder related concepts.

2.1 Artificial Neural Networks

An artificial neural network (ANN) is a parametric model consisting of a set of connected units called artificial neurons, where each neuron receives a signal, processes it, and passes it to the connected neurons [15]. Neurons are connected with each other by directed edges that have associated weights. Each individual neuron is like a linear regression model composed of inputs, weights, a bias, and an output. The output at each node is computed by a nonlinear function applied to the weighted sum of the inputs plus a bias. The computation at a single neuron is represented in Figure 2.1 below:

Figure 2.1: Neuron computation, where X is the input vector, W is the weight vector, b is the bias, f is the non-linear function, and y is the output of the neuron.
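The computation in Figure 2.1 can be written in a few lines. Below is a minimal NumPy sketch; NumPy and the tanh activation are illustrative choices, since the figure does not fix a particular function f.

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single artificial neuron: apply the activation f to the weighted input sum plus bias."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input vector X
w = np.array([0.1, 0.4, -0.2])   # weight vector W
b = 0.05                         # bias b
y = neuron(x, w, b)              # neuron output y
```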


The non-linear function is called the activation function. Neural networks have a layered structure where each layer has multiple neurons. The layers between the input layer and the output layer are called hidden layers, and the neurons in these layers are called hidden units. In a fully connected neural network, every neuron in a layer is connected to every neuron in the following layer. The output from a neuron in layer l is passed as input to the neurons in layer l+1. It is assumed that neurons in a single layer are independent of each other. Graphically, an ANN looks as follows:

Figure 2.2: Artificial Neural Network [5]

The input layer contains the input vector X, the output layer contains the outputs y, and W represents the weights of the network. Every layer has corresponding weights associated with each input.

For training a neural network, an optimization criterion, called the loss function, is specified to learn the weights of the network. The weights are learned automatically during the training phase by the backpropagation method. In backpropagation, the gradient of the loss function is computed with respect to the weights of the network for a single input-output example. It uses the chain rule: the gradient is computed for one layer at a time, iterating backward from the last layer.

2.2 Convolutional Neural Networks

The input volume size is very large for images; because of this, the number of parameters adds up quickly if standard fully connected neural networks are used, hence convolutions are used. Convolutional Neural Networks (CNNs) are similar to ordinary neural networks and are known for their good performance with images and audio [19]. The convolutional layer extracts interesting patterns and features from images by using convolution. For example, it can detect edges of some orientation, lines of some blocks, a blotch of some color, honeycomb- or wheel-like patterns, etc. Consider a single image of shape [32 × 32 × 3]; for this shape, a single fully connected neuron in a regular neural network needs 32 × 32 × 3 = 3072 weights. If the shape of the image increases to [200 × 200 × 3], it needs 120,000 weights. Hence, unlike the hidden layers in fully connected neural networks, here the neurons are connected only to a small region in the previous layer, i.e. the region where the filters interact. The neurons in regular neural networks are assumed completely independent of each other and do not share any connections between them, whereas the neurons in a CNN share common parameters called filters, which results in a decrease in the number of parameters. In general, the convolutional layers are followed by pooling layers, which are used for downsampling the data in the previous layer by applying different mathematical functions such as maximum, minimum, average, aggregation, or sampling to localized regions of the input volume. The local regions are specified using filters which are slid over the input volume with a specified stride length. An example of a pooling operation, max pooling, is shown in Figure 2.3 with stride length 2 and filter size 2 × 2. In max pooling, the maximum value of the selected region is returned; for example, in the upper-left region (yellow color), the highest value, 49, is selected.

Figure 2.3: Max pooling
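A small sketch of the max-pooling operation just described; the matrix values below are illustrative and do not reproduce the exact numbers in Figure 2.3.

```python
import numpy as np

def max_pool_2x2(a):
    """2x2 max pooling with stride 2: keep the largest value in each non-overlapping region."""
    h, w = a.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            out[i // 2, j // 2] = a[i:i + 2, j:j + 2].max()
    return out

a = np.array([[49, 12,  7,  3],
              [20,  5, 11,  8],
              [ 6,  2, 30, 15],
              [ 9,  1,  4, 22]])
print(max_pool_2x2(a))   # [[49. 11.] [ 9. 30.]]
```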

Convolution: Convolutions are used to extract spatial features from the input image. As explained in the previous section, unlike the fully connected layers where each input has corresponding weights, the convolutional layer has a set of learnable parameters called filters. These filters are small in terms of spatial size (width and height) and extend through the full depth (channels) of the input volume. Suppose that the input volume has a shape of [256 × 256 × 3] and the filter has a size of [5 × 5]; then each neuron in that layer has 5 × 5 × 3 = 75 weights. These filters are slid over the entire input volume to generate low-dimensional feature maps (output volume). Different stride lengths S are used to slide the filter. Sometimes the input volume is padded with zeros around the border to get the desired dimensions when the filter is slid over it. At each position, the dot product is computed between the values of the filter and the corresponding region of the input volume to produce the output volume. As the filter slides over, it produces a low-dimensional response at each spatial position of the input volume. The spatial size of the output volume V can be computed as a function of the input volume size I, the filter size F, the stride S with which the filter is slid over the input volume, and the amount of zero padding, P, used on the border. The output volume V is computed using Equation 2.1 below.

V = \frac{I - F + 2P}{S} + 1 \qquad (2.1)

For example, a 7 × 7 input and a 3 × 3 filter with padding 0 give a 5 × 5 output for stride 1 and a 3 × 3 output for stride 2. The network learns filters that activate when particular visual features are observed, such as an edge of some entity, spherical shapes, or other structural patterns. Multiple filters can be used at each layer, and more features are captured with an increase in the number of filters; an increased number of filters also increases the risk of overfitting. Several low-dimensional responses are produced for the different filters, and all these responses are stacked to produce the output volume.
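Equation 2.1 can be checked directly; the small helper below reproduces the 7 × 7 example from the text.

```python
def conv_output_size(I, F, S, P):
    """Spatial output size V = (I - F + 2P) / S + 1 from Equation 2.1."""
    return (I - F + 2 * P) // S + 1

print(conv_output_size(I=7, F=3, S=1, P=0))    # 5
print(conv_output_size(I=7, F=3, S=2, P=0))    # 3
print(conv_output_size(I=256, F=5, S=1, P=2))  # 256: this padding keeps the size unchanged
```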

2.3 Recurrent Neural Networks (RNNs)

RNNs are networks with loops that allow them to store information over time. Figure 2.4 below shows a network with a repeating loop, where a module of a neural network R takes some input X_t and outputs a value h_t. The loop in the network allows information to be passed from one step of the network to the next. If we unroll the network, the repeating chunk of the network can be thought of as multiple copies of the same network, where each chunk passes a message h to its successor [4]. These intermediate messages make RNNs capable of capturing the dependencies in sequential data.

The repeating module of a standard RNN consists of a simple structure, such as a single tanh layer. The learning algorithms used in standard RNNs are gradient-based. During backpropagation, RNNs run into problems like exploding gradients and vanishing gradients when some of the values involved in the gradient operation are greater than 1 or less than 1. With an increase in the distance between two inputs of the sequence, the performance of standard RNNs decreases. Thus, it affects the model's ability to capture long-term dependencies, which is addressed by Long Short Term Memory (LSTM) networks [6].

Figure 2.4: Standard RNN structure

2.4 Long Short Term Memory (LSTM)

LSTMs are a special kind of RNN that are capable of capturing long-term dependencies in sequential data. Unlike standard RNNs, which consist of a single neural network layer in their repeating module, the LSTM contains four such layers interacting in a special way [23]. The repeating module architectures of the standard RNN and the LSTM are shown in Figure 2.5.

Figure 2.5: Architectures of standard RNN and LSTM

The LSTM has the ability to add or remove important information from the network using special structures called gates, and the kept information is saved at each module as a cell state. The LSTM preserves long-term dependencies by using this cell state, which runs down the network chain like a conveyor belt. Using these gated structures alleviates the gradient problems faced in standard RNNs. Figure 2.6 represents a single gate in an LSTM network.


Figure 2.6: Gate

The gates are composed of a point-wise multiplication and a sigmoid neural net layer (activation function) that manages the flow of information. An LSTM module contains three such gates to control the flow of information. The sigmoid layer takes an input vector and returns a number between 0 and 1 for each value in the vector. Each value determines how much information should be let through: a 0 indicates that the value should be completely forgotten and a 1 indicates that the full value should be kept. The two activation functions used in the LSTM network are the sigmoid (σ) and the hyperbolic tangent (tanh), which are mathematically defined as follows.

\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (2.2)

\tanh(x) = \frac{\exp(2x) - 1}{\exp(2x) + 1} \qquad (2.3)

The function of the LSTM network is explained in four steps as follows:

Step 1: The first step in the LSTM module is to decide what information the network should forget from the cell state. The "forget gate layer", shown in the red box in Figure 2.7, uses a sigmoid layer to decide what to forget from the cell state C_{t-1}. The sigmoid layer takes the previous module output h_{t-1} and the current input X_t and returns a number between 0 and 1 for each value in the cell state C_{t-1}. A 0 means "completely get rid of the value" and a 1 means "completely keep the value".

f_t = \sigma(W_f \odot [h_{t-1}, X_t] + b_f) \qquad (2.4)

Figure 2.7: Forget gate layer

In Equation 2.4, [h_{t-1}, X_t] represents the concatenation of the previous module output and the current input, W_f represents the trainable weight matrix that is learned during training, b_f represents the bias vector, σ represents the sigmoid activation function applied point-wise to its multidimensional input, and ⊙ represents the Hadamard product.

Step 2: The second step in the LSTM module is to decide what new information the network should preserve from the current input X_t. This step has two parts. First, a sigmoid layer called the "input gate layer" decides which values in the input are to be updated by producing the vector i_t. Second, a tanh layer generates new candidate values C_{cand} which are to be added to the previous cell state C_{t-1}.

i_t = \sigma(W_i \odot [h_{t-1}, X_t] + b_i) \qquad (2.5)

C_{cand} = \tanh(W_{cand} \odot [h_{t-1}, X_t] + b_{cand}) \qquad (2.6)

Figure 2.8: Input gate layer

Step 3: In the third step, as shown in the red box of Figure 2.9, the current cell state C_t is updated using the outputs f_t, i_t, and C_{cand} from steps one and two. The output of the "forget gate layer" f_t is multiplied with the previous cell state C_{t-1}, and the result is added to the product of the new candidate values C_{cand} and the "input gate layer" output i_t.

C_t = (f_t \odot C_{t-1}) + (i_t \odot C_{cand}) \qquad (2.7)


Step 4: The last step decides the output of the current module, h_t, as shown in the red box in Figure 2.10. A sigmoid layer called the "output gate layer" is run on the current input X_t and the previous module output h_{t-1}, and decides which parts of the current cell state should be given as output. The current cell state C_t is filtered by passing it through the tanh function, which normalizes the values between -1 and 1, and is multiplied by the sigmoid layer output o_t to generate the final output of the current module.

o_t = \sigma(W_o \odot [h_{t-1}, X_t] + b_o) \qquad (2.8)

h_t = o_t \odot \tanh(C_t) \qquad (2.9)

Figure 2.10: Output gate layer

The output h_t and the cell state C_t are passed to the next module of the network.
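To make the four steps concrete, here is a minimal NumPy sketch of one LSTM module step following Equations 2.4-2.9; the dictionaries W and b, holding one weight matrix and one bias vector per gate, are an implementation convenience and not notation from the thesis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM module step (Equations 2.4-2.9).
    W and b map the gate names 'f', 'i', 'c', 'o' to weight matrices and bias vectors."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, X_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. 2.4
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. 2.5
    c_cand = np.tanh(W["c"] @ z + b["c"])    # candidate values, Eq. 2.6
    c_t = f_t * c_prev + i_t * c_cand        # cell state update, Eq. 2.7
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. 2.8
    h_t = o_t * np.tanh(c_t)                 # module output, Eq. 2.9
    return h_t, c_t
```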

2.5 Convolutional LSTM

The limitation of the fully connected LSTM when handling spatiotemporal data is the use of full connections during the input-to-state and state-to-state transitions. The full connections require a large number of weights in the model. In contrast, the convolutional LSTM replaces these matrix operations with convolutions and thus reduces the number of weights. By using these convolutions, the convolutional LSTM generates better spatial features with a smaller number of weights. All the Equations from 2.4 to 2.9 in Section 2.4 are computed in this network, except that the input here is an image. The weights are replaced with convolutional filters and the Hadamard product is replaced by the convolution operation in this computation. Thus, the convolutions allow the convolutional LSTM to work with images, with the ability to propagate the spatial features temporally through the state-to-state transitions. This makes convolutional LSTMs better at capturing the spatial feature maps in the image than standard fully connected LSTMs.

2.6 Autoencoder

Autoencoders are fully connected neural networks with the same number of input units (first layer) and output units (final layer). They replicate the input data in an unsupervised manner. Autoencoders are used to learn a representation of a given dataset by training the neural network. They reconstruct each input dimension by passing it through several layers of the network. The middle layers of the network have a smaller number of units than the input and output layers, which reduces the input to a smaller representation (encoding). Along with the reduction side, the network learns the reconstruction side, which is used to reconstruct from the reduced representation an output that is as close as possible to the original input (decoding). During the reduction, the network extracts only the important features of the given input that describe the whole data. An Autoencoder consists of three components:

1) Encoder: A fully connected feed-forward neural network that compresses the input data into a latent representation with reduced dimension.

2) Bottleneck: The encoded input in the reduced dimension, which is fed to the decoder.

3) Decoder: A fully connected feed-forward neural network that has the opposite structure to the encoder. The input shapes of the decoder are symmetric to those of the encoder. Thus, it reconstructs the reduced encoded representation back to the original input dimension. A nonlinear activation function is used in autoencoders, which makes them extract more useful features than methods like PCA, where linear transformations are used. Figure 2.11 illustrates the architecture of the Autoencoder.

Figure 2.11: Autoencoder architecture

Autoencoders reconstruct the given input by minimizing the difference between the input and the output. Autoencoders are considered unsupervised since they do not need any labeled data for learning. They are trained to minimize the reconstruction error, for example the squared error between the original and the reconstructed input, which serves as the loss function. The loss can be computed as follows:

L(x, \hat{x}) = \lVert x - \hat{x} \rVert^2 \qquad (2.10)

where x is the given input and \hat{x} is the output from the autoencoder network. The reconstruction error used in this work is explained in Section 3.2.3.
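As a concrete illustration of the encoder-bottleneck-decoder structure and of the loss in Equation 2.10, here is a minimal fully connected autoencoder sketch in Keras; the thesis does not state its implementation framework, and the layer sizes (784, 128, 32) are placeholders, not values from the thesis.

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(784,))                            # original input dimension
encoded = layers.Dense(128, activation="relu")(inp)         # encoder
bottleneck = layers.Dense(32, activation="relu")(encoded)   # reduced latent representation
decoded = layers.Dense(128, activation="relu")(bottleneck)  # decoder (mirrors the encoder)
out = layers.Dense(784, activation="sigmoid")(decoded)      # reconstruction of the input

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")           # squared reconstruction error, Eq. 2.10
# autoencoder.fit(x_train, x_train, ...)                    # input equals target: unsupervised
```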

2.7 Variational Autoencoder

Variational Autoencoders [18] are similar to Autoencoders in terms of architecture, except for the latent representation produced by the encoder. Instead of the latent vector (bottleneck) produced in Autoencoders, a latent probability distribution is generated. The architecture of the Variational Autoencoder is shown in Figure 2.12. The latent distribution is modeled from the data. The latent variable is sampled from the latent distribution and given as input to the decoder, which is responsible for the reconstruction of the input. Variational inference and the Kullback–Leibler divergence, which are used in Variational Autoencoders, are explained in the following sections.

Figure 2.12: Variational Autoencoder architecture

2.7.1 Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence [11] is a measure that quantifies how much one probability distribution differs from another. The KL divergence between two distributions P and Q is written KL(P‖Q) and can be read as P's divergence from Q. It is calculated as follows for discrete (Equation 2.11) and continuous (Equation 2.12) distributions; both can also be written as the expectation in Equation 2.13.

KL(P(x) \| Q(x)) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \qquad (2.11)

KL(P(x) \| Q(x)) = \int_x P(x) \log \frac{P(x)}{Q(x)} \, dx \qquad (2.12)

= \mathbb{E}_{P(x)}[\log P(x) - \log Q(x)] \qquad (2.13)

The intuition behind the KL divergence score is that if the probability of an event x in distribution P, i.e. P(x), is large while the probability of that same event in Q, i.e. Q(x), is small, then there is a large divergence between P and Q. Note that KL(P‖Q) and KL(Q‖P) are not equal.
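A small numerical example of Equation 2.11 and of the asymmetry noted above; the distributions P and Q below are made up for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions, Equation 2.11."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(kl_divergence(p, q))   # large value: P diverges strongly from Q
print(kl_divergence(q, p))   # a different value: KL(P||Q) != KL(Q||P)
```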

2.7.2 Variational Inference

Variational inference [29] is used for approximating the intractable distributions that arise in Bayesian inference. To understand this, let us introduce a problem: given the original data, x ∈ X, and the latent variables, z ∈ Z, the aim is to estimate the conditional density of the posterior of the latent variables, i.e. p(z | x). It can be computed using Bayes' theorem as follows.

p(z \mid x) = \frac{p(z, x)}{p(x)} \qquad (2.14)

The denominator in Equation 2.14 is called the evidence. To compute the evidence, one needs to marginalize out z from the joint distribution using the following integral.

p(x) = \int_z p(z, x) \, dz \qquad (2.15)

However, the above integral is often intractable to compute. Two paradigms are used to estimate the conditional density of the posterior of the latent variables, i.e. p(z | x):

(1) Markov Chain Monte Carlo (MCMC) [3] approximates the posterior by sampling from an ergodic Markov chain on the latent variable z whose stationary distribution is p(z | x).

(2) Variational Inference (VI): instead of sampling, optimization is used to approximate the posterior by minimizing the Kullback–Leibler (KL) divergence between an approximate posterior and the exact one p(z | x).

f^*(z) = \arg\min_{f \in \mathcal{F}} KL(f(z) \| p(z \mid x)) \qquad (2.16)

where f(z) is an arbitrary function defined over z, \mathcal{F} is the domain of all possible candidates for the function f, and f^*(z) is the function that achieves the minimum KL divergence to the posterior p(z | x). Expanding Equation 2.16 with the definition of KL divergence from Equation 2.13:

KL(f(z) \| p(z \mid x)) = \mathbb{E}_{f(z)}[\log f(z)] - \mathbb{E}_{f(z)}[\log p(z \mid x)] \qquad (2.17)

Replacing the conditional probability p(z | x) with the joint and marginal probabilities using Equation 2.14:

KL(f(z) \| p(z \mid x)) = \mathbb{E}_{f(z)}[\log f(z)] - \mathbb{E}_{f(z)}[\log (p(z, x) / p(x))] \qquad (2.18)

= \mathbb{E}_{f(z)}[\log f(z)] - \mathbb{E}_{f(z)}[\log p(z, x)] + \mathbb{E}_{f(z)}[\log p(x)] \qquad (2.19)

Since \log p(x) is independent of the variable z, taking \log p(x) out of the expectation gives

KL(f(z) \| p(z \mid x)) = \mathbb{E}_{f(z)}[\log f(z)] - \mathbb{E}_{f(z)}[\log p(z, x)] + \log p(x) \qquad (2.20)

The right-hand side expression is not tractable, as it contains the computation of \log p(x). Due to this dependence, an alternative objective function is defined by rearranging the equation. This alternative objective, called the Evidence Lower Bound (ELBO), is the variational lower bound on \log p(x):

KL(f(z) \| p(z \mid x)) - \log p(x) = \mathbb{E}_{f(z)}[\log f(z)] - \mathbb{E}_{f(z)}[\log p(z, x)] \qquad (2.21)

Changing the sign on both sides:

\log p(x) - KL(f(z) \| p(z \mid x)) = \mathbb{E}_{f(z)}[\log p(z, x)] - \mathbb{E}_{f(z)}[\log f(z)] \qquad (2.22)

ELBO(f) = \mathbb{E}_{f(z)}[\log p(z, x)] - \mathbb{E}_{f(z)}[\log f(z)] \qquad (2.23)

Maximizing ELBO(f) results in minimizing the KL divergence, which is the main objective of variational inference, i.e. Equation 2.16.

2.7.3 Variational Autoencoder (VAE)

The objective of both the standard Autoencoder and the VAE is to model the training data. If we denote the training data by x, then p(x) should be modeled.

Let us introduce some notation: let p(x) be the probability distribution of the training data, p(z) the latent probability distribution, and p(x | z) the distribution reconstructing the data given the latent variable z.

Using Equation 2.15, p(x) can be computed as

p(x) = \int_z p(x \mid z) \, p(z) \, dz \qquad (2.24)

In Equation 2.24, p(x) is computed by marginalizing out z from the joint distribution p(x, z). This is possible if we know p(x, z), or p(x | z) and p(z). The idea here is that p(z) is inferred using p(z | x), which is not known yet. Hence, p(z | x) is approximated with some known and simpler distribution Q that is easy to evaluate, e.g. a Gaussian, using the variational inference introduced in Section 2.7.2. The KL divergence between these two distributions is formulated according to Equation 2.22 with f(z) = q(z | x); rearranging gives:

\log p(x) - KL(q(z \mid x) \| p(z \mid x)) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \left( \mathbb{E}_{q(z \mid x)}[\log q(z \mid x)] - \mathbb{E}_{q(z \mid x)}[\log p(z)] \right) \qquad (2.25)

\log p(x) - KL(q(z \mid x) \| p(z \mid x)) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathbb{E}_{q(z \mid x)}[\log q(z \mid x) - \log p(z)] \qquad (2.26)

The second part of the right-hand side of Equation 2.26 has the form of Equation 2.13 and can be written as a KL divergence:

\log p(x) - KL(q(z \mid x) \| p(z \mid x)) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - KL(q(z \mid x) \| p(z)) \qquad (2.27)

Equation 2.27 is the key equation of the Variational Autoencoder. The left-hand side is interpreted as modeling the data, represented by \log p(x), under some error KL(q(z | x) ‖ p(z | x)); i.e. the Variational Autoencoder computes a lower bound on \log p(x). The VAE model is found by maximizing the left-hand side of the equation, i.e. by maximizing \log p(x | z) and minimizing the KL divergence between the simple distribution Q, i.e. q(z | x), and the prior distribution p(z). The equation can be optimized by stochastic gradient descent given a suitable choice for q; hence the objective function is:

\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - KL(q(z \mid x) \| p(z)) \qquad (2.28)

Maximizing \log p(x | z) is a maximum likelihood estimation. It can be implemented using any predictive model, such as linear regression, an SVM, or logistic regression, with input z and output x, and optimized with a regression loss. The term KL(q(z | x) ‖ p(z)) needs the prior distribution p(z). Since p(z) may need to be sampled later, the easiest choice is \mathcal{N}(0, 1), which is the standard procedure in Variational Autoencoder implementations. Thus we want to make our simple distribution q(z | x) as close as possible to \mathcal{N}(0, 1). Having p(z) = \mathcal{N}(0, 1) adds a further benefit: if we let q(z | x) be Gaussian with mean and variance parameterized by the data x, then the KL divergence between the two can be computed in closed form as:

KL(\mathcal{N}(\mu(x), \Sigma(x)) \| \mathcal{N}(0, 1)) = \frac{1}{2} \sum_k \left( \exp(\Sigma_k(x)) + \mu_k^2(x) - 1 - \Sigma_k(x) \right) \qquad (2.29)

where \Sigma(x) is treated element-wise as the log-variance. Equation 2.29 is the final KL divergence objective function. This objective function is used, together with the reconstruction error defined in Section 3.2.3, as the loss function during training of the model.
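A minimal sketch of the resulting training loss, assuming the encoder outputs a mean vector and a log-variance vector (the reading of Σ(x) used in Equation 2.29) and that the reconstruction error is the squared error of Section 3.2.3:

```python
import numpy as np

def kl_term(mu, log_var):
    """Closed-form KL(N(mu, exp(log_var)) || N(0, 1)) summed over latent dimensions, Eq. 2.29."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def vae_loss(x, x_reconstructed, mu, log_var):
    """VAE training loss: reconstruction error plus the KL regularization term."""
    reconstruction = np.sum((x - x_reconstructed) ** 2)
    return reconstruction + kl_term(mu, log_var)
```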

3 Method

In this chapter, the methods implemented for anomaly detection and the data used in them are explained. Three methods are proposed to detect faults from the process videos. The proposed models are: 1) Convolutional-LSTM Autoencoder model, 2) Variational Convolutional-LSTM Autoencoder model, and 3) Binary Classification model. Each method is explained in turn.

3.1 Data

The data used in this study are welding process videos which are captured co-axially by a camera mounted on the laser optics. For training, 30 stable process videos with durations varying from 30 seconds to 5 minutes are used. All the videos used in this study are well inspected before they are used, to make sure that there is no anomalous behavior in the process. For testing and evaluating the method, separate experiments are conducted at the company in which the process is intentionally tampered with by the robot operator. The robot settings are manually changed to induce anomalous behavior in the process. For the videos generated during this process, the exact times at which the process is unstable are noted.

Some of the common faults that can be observed by monitoring the melt pool videos of the LMD process are stubbing, dripping, and overshooting, which are explained below.

• Stubbing is a continuous wobbling of the wire fed in the LMD process that leads to uneven surfaces on the manufactured component. It is caused when the wire speed is high and the distance from the laser beam nozzle to the metal substrate is too small. It is shown in the middle of Figure 3.1.

• Dripping is the formation of droplets at the wire tip that causes continuous drops on the metal substrate. Dripping is caused when the wire speed is low and the distance from the laser beam nozzle to the metal substrate is too large. It is shown on the right side of Figure 3.1.

• Overshooting is a condition where the wire is supplied beyond the melt pool due to excessive wire feed.


Figure 3.1: Process images

Figure 3.1 shows some images from the process video: the center image is observed when the process is stubbing, and the image on the right side is observed when the process is dripping.

3.2 Autoencoder Models

Two Autoencoder models are proposed for solving the problem. These Autoencoder models learn regular spatial and temporal patterns in the training videos using convolutional neural networks in combination with LSTMs. The proposed models are then used to reconstruct the input sequences. The regularity of a frame sequence in a process video is computed from the reconstructed input sequences. For this, the models are trained with normal process videos. During the training, the aim is to minimize the reconstruction error between the original input sequences and the sequences reconstructed by the models. Figure 3.2 illustrates the training process of the proposed models. The intuition behind this approach is that the learned model reconstructs the motion patterns present in the training videos with low error. On the other hand, it will not accurately reconstruct the motion patterns of anomalous videos, which the model has not seen before. The reconstruction errors for all the input sequences are calculated to identify the abnormal sequences. By setting a threshold on the reconstruction error, the anomalous sequences are identified.


Figure 3.2: Training Process for autoencoder models

3.2.1 Data Preprocessing

The Autoencoder models require the input to be given as ordered sequences of frames from the process videos. First, the videos are sliced to remove the unwanted sections before the start and after the end of the process. Then, the sliced videos are further split into raw frames. The obtained videos have a frame rate of 30 frames per second. Since the whole process is stable, all the frames are extracted for training the built model; 20,000 frames are extracted from all the videos. In the process videos, information about the process is displayed at the top alongside the process itself, for example the process running time, robot wire speed, and other numerical information. The surrounding area of the process is also captured in the process video and is considered noise. The process stability is identified using the melt pool and the position of the fed wire. Hence, to remove this noise, the raw frames are further cropped to capture the melt pool properly. The obtained frames are resized to 256 × 256 and converted to grayscale to minimize the input memory and to recognize the features more efficiently.

Since the input to the autoencoder models should be ordered sequences, sequences of frames of length L are constructed from the pre-processed frames and used for training the model. The reason to use sequences is to capture the temporal dependencies between the frames. Hence, the sequence length is selected based on the number of seconds to be considered for finding those dependencies. For example, if 4 equidistant frames per second are extracted, then a sequence of 12 frames constitutes 3 seconds. The shape of a single frame is 256 × 256 × 1, which represents the height and width of the image with one channel. A sequence of 12 images is formed to provide input with the shape 12 × 256 × 256 × 1. Different configurations of the sequence length and the number of frames extracted per second may change the performance of the model.
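A rough OpenCV sketch of the preprocessing pipeline described above; the video path, the crop window, and the target FPS are placeholders, since the exact crop region depends on the camera setup.

```python
import cv2
import numpy as np

def extract_frames(video_path, fps_to_keep=4, crop=(slice(100, 356), slice(100, 356))):
    """Read a process video, keep a subset of frames per second, crop out the on-screen
    text and surrounding area, convert to grayscale, and resize to 256x256."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(round(native_fps / fps_to_keep)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = gray[crop]                          # keep only the melt-pool region
            gray = cv2.resize(gray, (256, 256))
            frames.append(gray[..., np.newaxis] / 255.0)
        idx += 1
    cap.release()
    return np.array(frames)

def make_sequences(frames, L=12):
    """Stack consecutive frames into ordered sequences of length L: shape (N, L, 256, 256, 1)."""
    return np.array([frames[i:i + L] for i in range(len(frames) - L + 1)])
```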

3.2.2 Autoencoder Methods

Using the Autoencoder architecture, two models are proposed. The first model is developed using the standard Autoencoder architecture with Convolutional-LSTMs. The second model is implemented using the Variational Autoencoder architecture with Convolutional-LSTMs.


3.2.2.1 Method1: Convolutional-LSTM Autoencoder model

The proposed architecture, shown in Figure 3.3, consists of two parts, an encoder and a decoder, where each part consists of spatial and temporal encoder-decoder structures. The spatial encoder-decoder captures the spatial patterns in the input sequence and the temporal encoder-decoder captures the temporal patterns. The encoder has two convolutional layers and the decoder has two deconvolutional layers, which are responsible for capturing the spatial patterns of the input sequences. There are three convolutional long short term memory (LSTM) layers that are responsible for capturing the temporal patterns in the convolved sequences from the previous layers. The second convolutional LSTM layer produces the bottleneck (latent vector) that is fed to the decoder. These seven layers are followed by a final convolutional layer which brings the reconstructed sequence back to the initial dimension.

Figure 3.3: Autoencoder Architecture.

In Figure 3.3, a sequence of length L = 12 is given as input, and the model outputs a reconstructed input sequence. The layers used in the architecture and the corresponding output shapes are given in the first row of each box in the figure. The yellow boxes represent the spatial encodings and decodings, and the green boxes represent the temporal encodings and decodings. The top row is responsible for encoding the input sequence into a lower-dimensional latent vector representation. The bottom row is responsible for the reconstruction of the input sequence. The loss function used for training the model is the reconstruction error, which is defined in Equation 3.1.
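A sketch of this kind of architecture in Keras; the thesis does not state its framework, and the filter counts, kernel sizes, and strides shown here are assumptions rather than the thesis's exact configuration. Two time-distributed convolutions form the spatial encoder, three ConvLSTM layers form the temporal encoder-decoder with the middle one acting as the bottleneck, two time-distributed transposed convolutions form the spatial decoder, and a final convolution maps back to one channel.

```python
from tensorflow.keras import layers, Model

L = 12  # sequence length
inp = layers.Input(shape=(L, 256, 256, 1))

# Spatial encoder: convolutions applied to every frame in the sequence.
x = layers.TimeDistributed(layers.Conv2D(64, 11, strides=4, padding="same", activation="relu"))(inp)
x = layers.TimeDistributed(layers.Conv2D(32, 5, strides=2, padding="same", activation="relu"))(x)

# Temporal encoder-decoder: ConvLSTM layers capture dependencies across frames.
x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
x = layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True)(x)   # bottleneck
x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)

# Spatial decoder: transposed convolutions restore the original frame size.
x = layers.TimeDistributed(layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu"))(x)
x = layers.TimeDistributed(layers.Conv2DTranspose(64, 11, strides=4, padding="same", activation="relu"))(x)
out = layers.TimeDistributed(layers.Conv2D(1, 3, padding="same", activation="sigmoid"))(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")   # minimize the reconstruction error
```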

3.2.2.2 Method2: Variational Convolutional-LSTM Autoencoder model

The Variational Autoencoder (VAE) has a similar architecture to the Autoencoder explained in the previous section, except for the intermediate latent representation. The input training data is modeled over a latent distribution. The encoding is a sample from that latent distribution instead of a fixed vector as in the Autoencoder. A point sampled from the latent distribution is used by the decoder for reconstructing the input. The same configuration of the neural networks used in method1 is used for all the layers of the encoder and decoder networks. In addition, some extra layers are added in the encoder part of the VAE. The final output from the encoder in method1 is multidimensional; to make it possible to generate the latent distribution, the output should be a vector. For this, a flatten layer is used to flatten the two-dimensional output into a vector.

To generate the latent distribution, a mean vector and a variance vector are required. The latent space dimensions of the distributions are adjusted using several dense layers, i.e. if the latent space dimension is d, then 2 × d outputs are generated from the final dense layer. Of these, d outputs are used as the mean vector and d outputs as the variance vector. Hence the output from the flatten layer is passed through several dense layers to produce the required number of inputs for generating the parameters of the latent distribution. In this method, a latent representation z is randomly sampled using the mean and variance vectors to give the encoder output. The architecture of the VAE model used in this study is shown below in Figure 3.4.

Figure 3.4: Variational Autoencoder Architecture

Unlike the loss function in method1, the loss function used in this method contains two parts. The first part is the reconstruction error between the input and the output at the final layer of the decoder, which makes the VAE reconstruct as well as possible; the reconstruction error is defined in Equation 3.1. The second part is the regularization term, the objective function defined in Equation 2.29, which is responsible for regularizing the latent space at the final layer of the encoder.
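A minimal Keras sketch of the sampling step described above, using the reparameterization trick so that the model remains trainable by gradient descent; the latent dimension and variable names are placeholders, not values taken from the thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 64   # assumed latent space dimension d

class Sampling(layers.Layer):
    """Draw z from N(mu, exp(log_var)) with the reparameterization trick: z = mu + sigma * eps."""
    def call(self, inputs):
        mu, log_var = inputs
        eps = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps

# flat = layers.Flatten()(encoder_output)              # flatten the multidimensional encoder output
# params = layers.Dense(2 * latent_dim)(flat)           # 2*d outputs: d means and d (log-)variances
# mu, log_var = params[:, :latent_dim], params[:, latent_dim:]
# z = Sampling()([mu, log_var])                         # encoder output fed to the decoder
```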

3.2.3 Reconstruction Error

The reconstruction error for the reconstructed frames is calculated using the original image sequence and the reconstructed sequence. The pixel difference between each input frame and the corresponding reconstructed frame at time stamp t is calculated by taking the Euclidean distance between the pixel intensities. All the differences are added to get the overall difference. In this work, the L2 norm is used for calculating the error, from which the regularity score is calculated later.

\mathrm{Error}(seq) = \sum_t \lVert x(t) - M(x(t)) \rVert_2 \qquad (3.1)

In Equation 3.1, x(t) represents the input frame at time stamp t and M(x(t)) represents the reconstructed frame at the corresponding time stamp. From the obtained reconstruction error, the abnormality score S_a(seq) of a reconstructed sequence is computed by scaling Error(seq) between 0 and 1. The regularity score S_r is then calculated by simply subtracting the abnormality score from 1, as shown in Equation 3.3.

S_a(seq) = \frac{\mathrm{Error}(seq) - \mathrm{Error}(seq)_{\min}}{\mathrm{Error}(seq)_{\max}} \qquad (3.2)

S_r(seq) = 1 - S_a(seq) \qquad (3.3)

The regularity score is used to describe the process stability.
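A small NumPy sketch of Equations 3.1-3.3, computing the error for each sequence and then the abnormality and regularity scores over a whole process video:

```python
import numpy as np

def sequence_error(x_seq, x_rec):
    """Equation 3.1: sum over time stamps of the L2 norm of the pixel-wise difference."""
    return sum(np.linalg.norm((x_t - r_t).ravel()) for x_t, r_t in zip(x_seq, x_rec))

def regularity_scores(errors):
    """Equations 3.2-3.3: abnormality score scaled with the min and max over all sequences,
    then regularity score = 1 - abnormality score."""
    errors = np.asarray(errors, dtype=float)
    s_a = (errors - errors.min()) / errors.max()   # abnormality score, Eq. 3.2
    return 1.0 - s_a                               # regularity score, Eq. 3.3
```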

3.2.4 Threshold Selection

After finding the regularity score for all the frame sequences, a threshold is used to determine whether a reconstructed frame is regular or anomalous. Selecting the threshold value gives the user control over how much deviation from the stable data can be accepted before identifying an anomaly. A large threshold requires the reconstructed frames to be highly similar to the training data and may lead to identifying frames as anomalous even though they are not. On the other hand, a small threshold allows a higher deviance before classifying frames as anomalous and may result in classifying anomalous frames as normal. The threshold is selected such that the system separates anomalous and normal frames with high accuracy. For selecting the threshold, a validation video is used in which each frame is labeled manually by the domain expert as anomalous or normal. A Receiver Operating Characteristic (ROC) curve is created for selecting the threshold. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) for various threshold values. The TPR is the proportion of actual positive instances that are correctly classified as positive, and the FPR is the proportion of actual negative instances that are incorrectly classified as positive. A threshold is selected where the TPR is maximal and the FPR is minimal.
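One common way to realize "TPR maximal and FPR minimal" is Youden's J statistic (TPR − FPR) over the ROC curve; the scikit-learn sketch below assumes stable frames are labeled 1, and the criterion is an illustrative choice rather than the thesis's stated rule.

```python
import numpy as np
from sklearn.metrics import roc_curve

def select_threshold(frame_labels, frame_regularity):
    """Pick the regularity-score threshold with maximum TPR - FPR on the labeled validation video.
    frame_labels: 1 for stable frames (positive class), 0 for anomalous frames."""
    fpr, tpr, thresholds = roc_curve(frame_labels, frame_regularity)
    best = np.argmax(tpr - fpr)
    return thresholds[best]
```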

3.2.5 Model Parameters and Training time

The Autoencoder models are trained by minimizing the reconstruction error. The Adam [17] optimizer is used for optimization, as it adapts the learning rates automatically during training. Adam is a stochastic gradient descent based optimizer which makes use of exponentially decaying averages of previous gradients, i.e. it is based on adaptive estimation of the first- and second-order moments. It is a commonly used optimizer [17] for problems that require large amounts of training data, as it is computationally efficient and has little memory requirement. Because the training is computationally expensive, layer normalization is used to normalize the activities of the neurons. Layer normalization is selected because recurrent neural networks are used in the methods. Layer normalization normalizes the inputs across the features, i.e. it normalizes the activations of each given input independently rather than across a batch. Since the inputs to the models are sequences of images, a small batch size of 5 is used. Each model is trained for 3 epochs since the data is large. The Autoencoder models took 100 seconds per step in each epoch for all the batches during training, while the VAE models took 120 seconds per step. The computer used for the training had the following configuration.

Processor: Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz
Installed RAM: 32.0 GB (31.9 GB usable)
System type: 64-bit operating system, x64-based processor
OS edition: Windows 10 Pro for Workstations

3.3 Method3: Binary Classification Model

In this method, an image classification system is developed using convolutional neural networks. Image classification is defined as a task that takes an image as input and outputs the class of the image, or the probability of the image belonging to a class. For this task, the images are labeled with classes, and thus it is a supervised learning task. In our case, the images are labeled 'stable' for stable process images and 'unstable' for unstable process images. Convolutional neural networks are used to train the model to learn the patterns that distinguish stable and unstable images. For training this model, 10,000 images are used, of which 3,500 belong to the unstable class. Except for the construction of sequences, the frames are pre-processed in the same way as for the Autoencoder models. The architecture of the method is illustrated in Figure 3.5.

Figure 3.5: Classification model architecture

The proposed model network contains four convolutional layers followed by a flatten layer and three dense layers. Each convolutional layer is followed by a max-pooling layer and a batch normalization layer. The pooling layer performs down-sampling along the spatial dimensions and makes the model capable of recognizing the image even when its appearance changes in some way. Batch normalization [26] is used for rescaling and recentering the inputs to make the network faster and more stable. The flatten layer is used to flatten the two-dimensional output from the convolutional layers. Using the dense layers, the input is compressed to obtain the final classes. Since it is binary classification, the sigmoid activation function is used at the final layer and binary cross-entropy is used as the loss function for training the model.
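The following sketch illustrates the described architecture under the assumption of a Keras/TensorFlow implementation; the filter counts, dense-layer sizes, and input resolution are illustrative placeholders, since the exact values are not given above.

```python
# Hedged sketch of the described classifier: four convolutional blocks
# (conv -> max-pooling -> batch normalization), a flatten layer, and three
# dense layers ending in a sigmoid, trained with binary cross-entropy.
from tensorflow.keras import layers, models

def build_classifier(h=128, w=128, c=1):          # input size is a placeholder
    inputs = layers.Input(shape=(h, w, c))
    x = inputs
    for filters in (16, 32, 64, 128):              # illustrative filter counts
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)         # spatial down-sampling
        x = layers.BatchNormalization()(x)         # rescale/recenter activations
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # probability of one class
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```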

3.4 Model Evaluation Metrics

The metrics used for model evaluation are precision and recall, which are defined as follows:


Precision: It is the fraction of correctly classified positive instances among all the instances classified as positive by the model.

\[
\text{Precision} = \frac{TP}{TP + FP} \qquad (3.4)
\]

Recall: It is the fraction of correctly classified positive instances among all the actual positive instances.

\[
\text{Recall} = \frac{TP}{TP + FN} \qquad (3.5)
\]

Where TP is true positive, in this case a stable process that is correctly identified; FP is false positive: an unstable process that is incorrectly identified as stable; and FN is false negative: a stable process that is identified as anomalous by mistake. The overall F-measure (F1-score) is then computed using both precision and recall.

\[
\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3.6)
\]

For all the models, the stable process is defined as positives and the unstable process as negatives for the evaluation.
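A small worked example of these metrics, with made-up counts purely for illustration:

```python
# Worked example of the metrics above, with 'stable' as the positive class.
# The counts are hypothetical and chosen only to illustrate the formulas.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)                           # Eq. (3.4)
    recall = tp / (tp + fn)                              # Eq. (3.5)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (3.6)
    return precision, recall, f1

print(precision_recall_f1(tp=80, fp=20, fn=25))  # -> (0.80, ~0.762, ~0.780)
```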


4 Results

In this section, the results of the three proposed models for the studied problem are presented. The models are compared to select the best model, i.e. the one that identifies the anomalies with the greatest accuracy. The proposed models for the specified problem are 1) the Convolutional-LSTM Autoencoder model, 2) the Variational Convolutional-LSTM Autoencoder model, and 3) the Binary Classification model. The data is divided into a training dataset to learn the model weights and a testing dataset to evaluate the models.

4.1 Model Evaluation Process

For training the autoencoder models, the frames are extracted from the process videos. The process videos have a frame rate of 30 frames per second (FPS). To select the best choice for the number of frames retrieved per second, different FPS values are considered. The models are trained with the following training dataset configurations for both the Autoencoder model and the Variational Autoencoder model:

1. Frames extracted with 4 FPS.
2. Frames extracted with 10 FPS.
3. Frames extracted with 15 FPS.
4. Frames extracted with 30 FPS.

From the above settings, it can be observed that with an increase in the FPS value, the training dataset captures more information about the process behavior from a single second of the process video. In addition, to capture the temporal dependencies efficiently, a sequence of images is passed as input, and each of the above four settings is used in combination with different sequence lengths to train the models. The sequence lengths 4, 8, 10, and 12 are tested to select the best sequence length.

For validating the models, a process video with a duration of 5 min, consisting of 9000 frames, was labeled manually with the help of professionals from GKN Aerospace. Labeling is done as "stable" for stable process behaviour and "unstable" for anomalous process behaviour. Since the inputs used for the models are sequences and each model is trained with a different configuration, it is hard to evaluate the models based on individual frames. Hence, the validation of the models is done based on the time variable, i.e. every half second of the process video is labeled as stable or unstable.

As mentioned above, different sequence lengths are used as inputs for the models. For uniformity across all the models, a different number of frames is extracted during the testing of the models. If a model needs an input sequence of length L, then 2L frames are extracted per second at equal intervals. For example, if a model requires an input of sequence length 10, then 20 frames per second are extracted, so that 10 frames represent half a second of the process video. For the classification model, a confusion matrix is generated for the predictions.
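For illustration, the following sketch samples frames at a chosen rate and groups them into fixed-length sequences; it assumes OpenCV, and the file name, image size, and grayscale/normalization steps are placeholders rather than the exact pre-processing used in the thesis.

```python
# Sketch: equal-interval frame sampling and grouping into input sequences.
import cv2
import numpy as np

def extract_frames(video_path: str, frames_per_second: int, size=(64, 64)):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)              # 30 for the process videos
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                              # equal-interval sampling
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # placeholder pre-processing
            frames.append(cv2.resize(gray, size) / 255.0)
        idx += 1
    cap.release()
    return np.array(frames)

def to_sequences(frames: np.ndarray, seq_len: int):
    n = len(frames) // seq_len
    return frames[: n * seq_len].reshape(n, seq_len, *frames.shape[1:], 1)

# Testing convention from above: a model with sequence length L uses 2L frames
# per second, so each sequence covers half a second of the process video.
# seqs = to_sequences(extract_frames("process_video.avi", frames_per_second=2 * 10), seq_len=10)
```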

4.2 Threshold

As mentioned in Section 3.2.4, the threshold for the regularity score of each model is selected based on ROC curves. For every model, a different threshold value is selected based on the ROC curve given by the corresponding model predictions, shown in Table 6.3 in the Appendix. The following plot (Figure 4.1) shows an example ROC curve generated for one of the Autoencoder models (sequence length 12 and FPS value 15).

Figure 4.1: ROC curve for selecting optimal threshold value

The threshold value 0.97 is selected from the above plot. It can be observed that the true positive rate is high and the corresponding false positive rate is low for the selected threshold value. If the threshold is increased, more and more instances fall below it and are identified as unstable. On the other hand, if the threshold is decreased, the false positive rate gradually increases, eventually resulting in every instance being identified as stable by the model.

4.3 Model Evaluation Results

For all the models, the metrics defined in Section 3.4 are calculated and shown in Tables 6.1 and 6.2 in the Appendix. The bar plots for all those values are presented below. From the evaluation scores, it can be observed that the Autoencoder architecture models outperformed the standard Classification model. Of the two Autoencoder models, the Variational Autoencoder model performed better in terms of every measure. Across all the models, the Variational Autoencoder models (30 FPS, sequence lengths 8 and 10) achieved the best scores, with 82% precision and 86% recall. The Variational Autoencoder model (10 FPS, sequence length 12) achieved 73% precision and 81% recall with less training data, which is roughly equal to the best Autoencoder model (30 FPS, sequence length 8). For the same settings (10 FPS, sequence length 12), the Autoencoder model gave 70% precision and 50% recall.

Figure 4.2: Barplot showing Precision scores for models.


In the above plots (Figure 4.2 and Figure 4.3), the Y-axis denotes the precision (Figure 4.2) and recall (Figure 4.3) scores obtained for each model, and the X-axis denotes the model names. In the model names, AE represents Autoencoder and VAE represents Variational Autoencoder. The FPS suffix denotes the FPS value used for the corresponding model, and Bin_Class denotes the binary classification model.

It is observed from the bar plots that the precision and recall values of each model gradually increase with the increase in the FPS value selected for training the model. There is also a noticeable difference in the scores with respect to the sequence length used for the input sequences. For each model, the scores are very low when the sequence length is 4. The precision and recall for most of the models reach their maximum when the sequence length is 8. It is also noticed that the performance of each model drops when the sequence length increases beyond 8.

The model with 30 FPS and sequence length 8 from the Autoencoder models and the models with 30 FPS and sequence lengths 8 and 10 from the Variational Autoencoders are selected for comparison with the Classification model, shown in Table 4.1. The Autoencoder model achieved an F-measure 0.26 higher than the Classification model, and the VAE model 0.33 higher. The Variational Autoencoder model performed equally well as the standard Autoencoder while using less training data.

Model                                                     Precision   Recall   F-measure
Classification                                            0.49        0.53     0.51
Autoencoder (30 FPS) - Sequence length (8)                0.80        0.75     0.77
Variational Autoencoder (30 FPS) - Sequence length (8)    0.82        0.86     0.84

Table 4.1: Comparison of the three proposed models

4.4 The Process Video Anomaly Detection

The anomalies are identified using the regularity score generated by the models. The following graph (Figure 4.4) represents the regularity score generated for a process video to detect the anomalies using the Autoencoder model (10 FPS and sequence length 12). The X-axis values indicate the input frame sequences generated for the process video. The Y-axis indicates the regularity score generated according to the formula in Section 3.2.3. A regularity score of 1 means the process is completely stable or very similar to the stable process; if the score is 0, the process is very different from the stable process. The red line in the graph indicates the threshold, and the scores below this line are considered anomalies, highlighted by red dots. A drop in the regularity score shows that there is anomalous behavior in the process video; the greater the drop, the greater the possibility that an anomaly is present in the process. Around the frame sequence numbers 90 to 100, 145, 180, 230, 350, 490, 520, and 560, the drops in the scores can be observed clearly. Among them, at 180, 230, and 490 the scores dropped drastically, indicating that the process at those frames deviates more from the stable process. From these frame sequence numbers, the time in the process video at which the anomaly occurred can be retrieved based on the sequence length and FPS values used in the corresponding model. When a stable process is tested with the Autoencoder model (10 FPS and sequence length 12), only high scores are observed; the results are shown in Figure 4.5.
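The mapping from a flagged sequence back to a time in the process video can be sketched as follows; the function names are illustrative and the sketch assumes non-overlapping sequences produced in video order.

```python
# Sketch: flagging anomalous sequences from regularity scores and converting
# a sequence index into a time offset in the process video. Assumes the scores
# are ordered as the sequences appear in the video.
import numpy as np

def anomalous_sequences(scores: np.ndarray, threshold: float) -> np.ndarray:
    # A sequence is considered anomalous when its regularity score falls below the threshold.
    return np.where(scores < threshold)[0]

def sequence_to_seconds(seq_index: int, seq_len: int, fps: int) -> float:
    # Each sequence of `seq_len` frames sampled at `fps` frames/second covers seq_len/fps seconds.
    return seq_index * seq_len / fps

# Example with a model like the one discussed above (10 FPS, sequence length 12),
# where `scores` and `threshold` come from the model and the ROC-based selection:
# indices = anomalous_sequences(scores, threshold)
# times = [sequence_to_seconds(i, seq_len=12, fps=10) for i in indices]
```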


Figure 4.4: Plot showing anomalies captured by the model

Figure 4.5: Plot showing high regularity scores throughout the process, indicating a stable process


In the plot (Figure 4.5), all the regularity scores for the process are above the threshold value, i.e. they lie above the red line. Such a plot can be used to certify an AM component with more certainty, by stating that the welding process was stable while manufacturing the component.


5 Discussion

5.1 Effect of Sequence Length on Autoencoder Model Performance

As can be seen from the bar plots (Figure 4.2 and Figure 4.3) in the results section, the sequence length is a key parameter to be considered while training the model. With a sequence length of 4, the performance of every model is very low. The cause for the low performance is likely that the anomalies present in the process occur over some time, i.e. more frames are required to capture an anomaly. Hence, capturing a particular behavior entirely requires more frames, and the changes observed in a sequence of just four frames do not seem to be sufficient in our case. In addition, choosing a large value for the sequence length also affects the model performance negatively: the evaluation metrics precision, recall, and F-measure start to decrease when the sequence length is increased further. The possible cause for the drop in performance is that the models are not able to capture and retain all the dependencies between the given sequence of frames as the length becomes larger. The LSTM layers capture the most dependencies between the frames when the sequence length is 8. It can be observed from the bar plots in the results section that the orange bar, which represents sequence length 8, is the highest in the precision plot for every model. The recall values are also highest for this sequence length, except for the models with lower frame rates (4 and 10 FPS).

5.2 Effect of Frames Extracted Per Second (FPS) on Autoencoder Model Performance

The process videos have a frame rate of 30 frames per second, i.e. 30 frames represent the process behavior for one second. From the bar plots (Figure 4.2 and 4.3), it can be observed that the performance of the models gradually increases with the increase in the FPS value. There is a significant increase in performance from 4 FPS to 15 FPS. The performance is low for lower FPS values because the frames are extracted at equal intervals: if we consider 4 FPS, only two frames are retrieved per half-second for training the model. The anomalies observed in the welding process occur over a span of time, so two frames do not seem to be enough to capture the changes that occur in the process. With an increase in the FPS value, the training data covers the full process and preserves the information necessary to identify an unstable process. Furthermore, if the FPS value is low, a huge number of process videos is required to generate the training data.

References
