
Anomaly Detection on Gas Turbine Time-series’ Data Using Deep LSTM-Autoencoder

Marzieh Farahani
Autumn 2020

Degree Project in Computational Science and Engineering, 30 credits
Supervisor: Lili Jiang
External Supervisor: Mohamed Elhafiz Hassan
Examiner: Eddie Wadbro

Master of Science Programme in Computational Science and Engineering, 120 credits


Anomaly detection, with the aim of identifying outliers, plays an important role in many applications (e.g., online spam, manufacturing, and finance). An automatic and reliable anomaly detection tool with accurate predictions is essential in many domains. This thesis proposes an anomaly detection method that applies deep LSTM (long short-term memory) networks to time-series data. Validated on real-world data from Siemens Industrial Turbomachinery (SIT), the proposed method shows promising performance and can be employed in different data domains, such as device logs of turbine machines, to provide useful information on abnormal behavior.

In detail, the proposed method applies an autoencoder to perform feature selection by keeping vital features and learning an encoded representation of the time series. This reduces the extensive input data by extracting the output of the autoencoder's latent layer. For prediction, we then train a deep LSTM model with three hidden layers on the encoder's latent-layer output. Finally, given the output of the prediction model, we detect the anomalous sensors of a specific gas turbine using a threshold approach.

Our experimental results show that the proposed method performs well on a noisy, real-world dataset for detecting anomalies. Moreover, they confirm that making predictions on the reduced, encoded representation is more accurate; applying an autoencoder improves both the anomaly detection and prediction tasks. Additionally, the performance of deep neural networks improves significantly for data with high complexity.


I am grateful to have completed this master's final project, anomaly detection on a gas turbine using a deep LSTM-Autoencoder, within the given time as a student of the Master's Programme in Computational Science and Engineering.

This thesis could not have been completed without the effort and cooperation of Siemens and Umeå University.

I also thank my supervisors at Siemens and Umeå University, Mr. Mohamed Elhafiz Hassan and Dr. Lili Jiang, for their guidance and encouragement in finishing the final project.

Last but not least, I would like to thank the data scientists at Siemens, my family, and Mr. Mehrdad Farahani for being a constant source of inspiration and guidance.

Contents

1 Introduction
1.1 Objectives
1.2 Scope and Limitation
1.3 Literature Review
1.3.1 Statistical-based methods
1.3.2 Prediction-based methods
1.3.3 Reconstruction-based methods
1.4 Thesis Structure

2 Principles and Concepts
2.1 Time Series and Anomaly Forecasting
2.1.1 Key Components Associated with an Anomaly Detection Problem
2.1.1.1 Nature/Type of Anomaly
2.1.1.2 Type of Time-spaces
2.2 Time Series and Deep LSTM
2.3 Dimensionality Reduction (Autoencoder)

3 Methodology
3.1 Dataset
3.2 Model Design
3.3 Prediction Model
3.3.1 Reconstruction Autoencoder
3.3.1.1 Reduction Using AE
3.3.2 Deep LSTM
3.4 Detection Model
3.4.1 Anomaly Scoring and Selection of Candidate Set

4 Experimental Study and Results Analysis
4.1 Prediction Model
4.1.1 Reconstruction Autoencoder
4.1.2 Deep LSTM
4.2 Detection Model

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

References

1 Introduction

Anomaly detection (also known as outlier detection) is the process of identifying unexpected items, observations, or events in data sets that differ from the norm. As an integral part of most companies and businesses, anomaly detection can significantly reduce financial and technical losses. Time-ordered data on various platforms (industrial, health, economic, and financial) has been increasing exponentially along with the emerging Internet of Things (IoT) [18], which enables devices to collect and share data. This growth creates new business opportunities but also brings new challenges in detecting outliers in time-series data. Many companies still rely on manual monitoring to identify anomalies, which requires substantial human effort to review daily or weekly reports on operations or performance. It is therefore challenging for companies to track all metrics simultaneously and find correlations between them.

Besides these difficulties, time series data in industry is noisy and large-scale, and labels for anomalous data are usually lacking. Much research therefore tries to apply data-driven methods. Data-driven methods for anomaly detection can be mainly categorized into three types: statistical modeling, such as k-means clustering and random forests; temporal feature modeling, which is mainly based on Long Short-term Memory (LSTM); and spatial feature modeling, which takes advantage of Convolutional Neural Networks (CNNs) [14]. The primary purpose of these methods is to develop stable algorithms that adapt to system conditions and detect outliers even in different environments.

Deep learning methods have become successful due to their capacity to handle non-linearity in complex temporal correlations [8]. Deep learning (DL) is derived from classical machine learning (ML), and it has driven much of the recent growth of artificial intelligence by improving on existing algorithms. It has shown high performance because of its power to deal with unstructured and unlabeled data, and because no domain knowledge is needed to extract features. Nevertheless, deep learning approaches have limitations, such as longer training times and the need for large amounts of training data. Moreover, one of the most frequently cited drawbacks of deep learning is that the neural networks at its core are black boxes.

This project aims to apply a deep LSTM method combined with an autoencoder in an unsupervised way to detect time series anomalies. Apart from working with an unlabeled dataset, there is no need to conduct feature engineering, which is a complicated task. Instead, the many parameters of the deep neural network are trained to learn the critical features of the input data during the training stage. In addition, the autoencoder helps deal with the high-dimensional, time-ordered input data.


1.1 Objectives

Siemens Industrial Turbomachinery (SIT) is one of the biggest international companies in power generation. The company has invested in diverse projects to examine and study machine lifetime and the corresponding components, in order to identify how and when various failures influence the system. In recent years, the digitalization transformation has benefited from collecting and maintaining data in databases that carry valuable information about unexpected events, component repairs, and operation outage history. The controlling system, which includes a computing device, provides information about the hardware components' thermodynamic and operating parameters using sensors placed along the turbine sections.

This thesis's principal goal is to develop an advanced maintenance strategy that can help power plant operators increase their assets' availability and reliability and minimize their CAPEX and OPEX. Capital Expenditures (CAPEX) are significant purchases a company makes on goods or services to develop its future performance. Operating Expenses (OPEX) are the typical costs, such as salaries and rent, that a company incurs to run its day-to-day operations [36].

To reach this goal, we have designed a general deep learning (DL) model with two aims:

• Daily forecasting, to predict each sensor's value five minutes ahead for a specific gas turbine machine.

• Detection, to find the list of sensors with anomalous behavior that strain the system. Together with the daily forecasting model, it reduces system cost.

The model is expected to meet two requirements:

• The model must be generalizable. In this study, the model was built for a specific case study; for a different case study, results should be obtainable with minimal changes.

• The model should be accurate and validated under different environmental conditions.

The prediction/estimation approach detects abnormal behavior (in the shape of anomalies) in the collected data by comparing it with the desired network outputs. It helps automate the decision-making process for the power plant operators and classify useful patterns during operations. Detecting anomalies can give early warnings and reduce system costs for the manufacturers. The anomaly detection work in this thesis especially provides useful information to the relevant department within Siemens Industrial Turbomachinery (SIT).

1.2 Scope and Limitation

The scope of the thesis was decided after careful analysis of the customer service dataset. A turbomachine, such as a gas turbine, generally includes several sections, and each section includes numerous hardware components. Over time, multiple quantities, including thermal cycling, vibration, and pressure pulses within the gas turbine, are measured by sensors (also called signals).

Fifteen gas turbine units were considered, chosen according to the customers' most commonly requested units with the most frequent turbine-model package. In the end, the project began with one final gas turbine unit; the rest were left out because they were in the commissioning phase, and the quality and quantity of their signal values were not good enough.

Another limitation is related to the records of signal values for the specific gas turbine unit at different time intervals. The records were collected from 2012 until the ongoing year. Nevertheless, some signals have no records for some months between 2012 and 2020.

As the project's complexity was already high, the author of this thesis studied the quality and quantity of the signals for each year. Finally, it has to be pointed out that the primary dataset used in this thesis is from the year 2013.

1.3 Literature Review

There has been a considerable amount of research in the field of anomaly detection.

The simplest and most common approach to time series anomaly detection is to set thresholds and generate warnings whenever a metric goes above or below its threshold. However, finding the threshold for each metric requires a deep understanding of the indicator's behavior, and it is difficult to capture the desired output from complex structures in the data.

To overcome this difficulty, more advanced techniques are applied, namely statistical-based, prediction-based, and reconstruction-based methods.

1.3.1 Statistical-based methods

Statistical-based methods [35] can be classified into supervised and unsupervised approaches. Both aim to isolate anomalies within the time series. In the supervised setting, observations are labeled as healthy or faulty based on historical data; this dataset is then used to build classification models that can predict unseen records. Support Vector Machines (SVM) and k-nearest neighbors (KNN) are representative algorithms of this category. These algorithms rely on a distance measure between objects: objects that are distant from others are considered anomalies. Such approaches are also called distance-based methods [30]. Both KNN and SVM are classical machine learning methods and are generally used for classification [15] [17]. Nevertheless, standard SVM and KNN may fall short when dealing with anomaly detection, so researchers have investigated how these methods can be adapted to anomaly detection problems.

A vital factor in composing an anomaly-based detection model is selecting significant features for making decisions. In recent research, the KNN (k-nearest-neighbor) approach, in combination with the main algorithm, showed excellent feature selection and weighting performance [34]. The procedure is simple: all initial features are weighted in the training stage based on distance measures, and the top-ranked ones are selected for the testing stage.

The KNN algorithm identifies the nearest neighbors of each observation. In most cases, KNN is used as a classification technique. In this study [35], KNN is presented as a semi-supervised learning approach to determine the performance of an indicator in the health domain.

This paper [26] also showed how mapping the data into a kernel space and separating it from the origin with maximum margin can address the weaknesses of the standard SVM on anomaly detection problems. This method is known as the one-class support vector machine (OCSVM).

The application of these techniques is restricted by the availability of training data for anomalies. Several researchers use density-based methods such as the Local Outlier Factor (LOF) and k-means clustering [37] to handle the limitations of distance-based methods. Still, the success of these techniques depends on the similarities between clusters and the characteristics of the anomalies. Based on mutual similarity, observations are grouped into different clusters; for example, normal data may come from large and dense clusters, while anomalies may arise from small and sparse clusters.

In summary, statistical-based methods cover two families, distance-based and density-based. They face two major obstacles with time-series data: they require prior knowledge about anomaly duration, and they cannot capture temporal correlations.

1.3.2 Prediction-based methods

All of these methods try to highlight the difference between normal and faulty behavior. Prediction-based methods fit a predictive model to the given time series to predict future values. A data point is flagged as an anomaly if the difference between the predicted and actual value exceeds a certain threshold. Several traditional prediction models use the relationship between the time series and its lag features to predict future values, such as Auto-Regressive (AR), Moving Average (MA), Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated Moving Average (SARIMA) models.

There are many papers focusing on the above techniques. However, in most cases these time series prediction methods were not applied to anomaly detection [4] [31]. Still, there is work that extends traditional time series models so that they can detect anomalies [38] [23].

These techniques have some significant limitations. For instance, this study [24] discusses trend and seasonal time series forecasting methods and their importance for making critical decisions. The research shows that traditional forecasting models, such as ARIMA, have difficulty modeling nonlinear relationships between variables. Moreover, the ARIMA model assumes a constant standard deviation in its errors, which may not hold in different problems [32].

Deep learning-based approaches attempt to overcome these challenges. LSTM (Long Short-Term Memory) is a particular form of Recurrent Neural Network (RNN) that was initially proposed to solve the vanishing gradient problem in RNNs by replacing their simple internal loop with a different structure, making LSTMs capable of tracking variables in a sequence and learning long-term dependencies between them. In numerous studies, the LSTM alone or in combination with other approaches can effectively detect anomalies [16] [20].

1.3.3 Reconstruction-based methods

Reconstruction-based models learn by encoding their input data into a lower-dimensional representation in a latent structure and decoding it back to the original input. According to this research [9], "Reconstruction-based methods assume that anomalies lose information when they are mapped to a lower dimension space, thereby cannot be effectively reconstructed; thus, high reconstruction errors suggest high chances of being anomalies."

There are several dimensionality-reduction techniques, such as Principal Component Analysis (PCA) and the autoencoder. Of these two, the autoencoder has received more attention because it can better handle PCA's limitations; the most visible limitations of PCA are that it is restricted to linear reconstruction and requires positively correlated data that follows a Gaussian distribution.

Lately, the use of autoencoders in anomaly detection has grown. For instance, in this paper [28], a Variational Autoencoder (VAE) with attention could provide a structured and expressive representation to detect anomalous behavior in time series. Furthermore, other work [19] uses Recurrent Neural Networks (RNNs) to generate multiple autoencoders with different neural network connection structures; the resulting framework outperforms alternatives on time series outlier detection problems. It is also worth mentioning that an Encoder-Decoder based on Long Short-term Memory (LSTM) showed excellent performance on multi-sensor anomaly detection [25].

1.4 Thesis Structure

Chapter 2 presents the theoretical background of this project. It covers the summary of time series and deep learning forecasting methods, followed by the theory of dimensionality reduction techniques focused on Autoencoders.

Chapter 3 focuses on the methods used in this project. It starts with the dataset and data preparation. Next, the design of the prediction and detection models is described.

Chapter 4 shows and discusses the results of each prediction and detection model based on the chosen case study.

Chapter 5 concludes with an outline of the outcomes and the future work for this project.


2 Principles and Concepts

This chapter defines some principles and concepts related to the context of this study.

The first section of this chapter describes time series and their properties and reviews the necessary tools for studying time series anomaly detection, which provides organizations with useful information for making significant decisions. The second section introduces Deep Learning (DL) and Long Short-term Memory (LSTM) algorithms. The third section discusses the autoencoder algorithm and its applicability to the anomaly detection problem.

2.1 Time Series and Anomaly Forecasting

A time series [30] is a collection of random observations S = {X_t, t ∈ T} made sequentially through time T. In time-series data, we have only one realization and a finite number of variable records. If only one variable changes over time, the time series is called a univariate time series (UTS); otherwise, the set S is a multivariate time series (MTS). In Figure 1, one sensor variable of the time-series dataset is shown over a chosen time period.

Figure 1: active load sensor’s behavior through time


Figure 2: active load sensor’s seasonal decomposition

The time interval for collecting data could be, for example, seconds, minutes, hours, days, weeks, months, or years. Time-series data arise naturally in various disciplines, namely finance, economics, environmental science, electrical engineering, and computer science [11].

A stationary time series has a constant long-term mean and variance, independent of time. Stationarity (or its absence) can be checked by differencing the data with a shifted version of itself after subtracting its trend and seasonality. As a rule, non-stationary data is unpredictable and cannot be modeled or forecast; for time series prediction, the data is expected to be stationary.
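A minimal pandas sketch of the differencing idea described above; the synthetic sensor signal and the daily period of 288 five-minute samples are illustrative assumptions, not values taken from the thesis.

import numpy as np
import pandas as pd

# Illustrative stand-in for one sensor signal sampled every 5 minutes: trend + daily cycle
series = pd.Series(np.sin(np.arange(2000) * 2 * np.pi / 288) + 0.001 * np.arange(2000))

first_diff = series.diff()          # difference with the previous value (removes the trend)
seasonal_diff = series.diff(288)    # difference with the value one day earlier (removes the cycle)
print(first_diff.var(), seasonal_diff.var())

If the mean and variance of the differenced series stay roughly constant over time, the series can be treated as approximately stationary for prediction.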

Forecasting [40] in time series is the process of predicting the changes present in the given data and the values that will occur in the future. Prediction methods can be used on the data whenever:

• Firstly, each variable's records have a time dimension and are arranged in temporal order.

• Secondly, the recorded values are continuous over a fixed period and follow specific laws.

Temporal features such as trend, seasonality, and residuals give important and useful information for the prediction scheme; they can be obtained by decomposing the series as in equation 2.1. The result of the time series decomposition is shown in Figure 2.


X_t = m_t + s_t + Y_t    (2.1)

• m_t, trend: a long-term non-periodic movement in the mean.

• s_t, seasonal variation: cyclic fluctuations, for example due to calendar or daily variations.

• Y_t, residuals: random noise and all other unexplained variation.
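The decomposition in equation 2.1 can be reproduced with standard tooling. The sketch below uses statsmodels (a library not referenced in the thesis) on a synthetic 5-minute signal with an assumed daily period; only the shape of the computation is meant to be informative.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in for one sensor: trend + daily seasonality + noise, 5-minute samples
idx = pd.date_range('2013-01-01', periods=288 * 14, freq='5T')
values = (0.001 * np.arange(len(idx))                       # m_t, trend
          + np.sin(2 * np.pi * np.arange(len(idx)) / 288)   # s_t, daily seasonality
          + np.random.normal(scale=0.1, size=len(idx)))     # Y_t, residuals
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model='additive', period=288)
print(result.trend.dropna().head(), result.seasonal.head(), result.resid.dropna().head(), sep='\n')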

2.1.1 Key Components associated with An Anomaly Detection Problem

To study time series anomaly detection and prediction, the first thing to establish is what an anomaly is and what types exist; it is essential to agree on what counts as an exception. The second is to understand the different prediction time-spaces.

As discussed before, anomalies are patterns in data that do not fit a well-defined notion of normal behavior. Most existing anomaly detection techniques solve a particular problem, and the solution is influenced by several circumstances, such as the type of anomaly [2] and the prediction time-space for the numerical data at hand.

2.1.1.1 Nature/Type of Anomaly

Point anomalies: a single record deviates strongly from the rest of the data points in the dataset. A common example is credit card fraud detection.

Contextual anomalies: anomalies whose detection depends heavily on contextual information. This type of anomaly is common in time-series data.

Collective anomalies: a collection of related data instances is anomalous with respect to the entire dataset, rather than any individual value.

2.1.1.2 Type of Time-spaces

Time series prediction on numerical data can be made over three kinds of time-spaces: short-term, mid-term, and long-term periods [24]. Short-term forecasting covers a time frame of less than three months, the mid-term focuses on a time frame of three months to one year, and the long-term covers more than a year. It is fair to note that this categorization can change with the problem's circumstances; for example, in traffic time series prediction, the short-term, mid-term, and long-term periods may be seconds, minutes, and hours.


2.2 Time Series and Deep LSTM

The definition of deep learning varies somewhat. However, most researchers agree on the core: deep learning is a sub-field of machine learning that can learn from high-dimensional data in a supervised, unsupervised, or hybrid manner [22].

The word "deep" evokes a network of layers stacked on top of each other. Each layer can be seen as a non-linear module that receives the previous layer's output as its input and automatically transforms the input data into a meaningful output, which is one reason deep learning has become so popular.

In recent years, deep learning has become increasingly popular and has been applied in various anomaly detection algorithms, as illustrated in Figure 3. Deep anomaly detection (DAD) [1] techniques can automatically learn and extract features without manual feature engineering by domain experts. Unsupervised learning is expected to gain even more attention because collecting labels for an imbalanced dataset is difficult; a dataset is imbalanced when anomalous behavior happens rarely and most of the records are normal.

Figure 3: Performance Comparison of traditional vs Deep Learning algorithms. Picture adopted from [1]

The first group of techniques deals with supervised classification. In these methods, records are labeled as anomalous or normal based on historical data; this dataset is then used to create classification models that can predict the state (normal or anomalous) of unseen records. The second group deals with unsupervised methodologies, which work on unlabeled data. This approach aims to detect outlier behavior that contrasts with legitimate behavior; to do so, the model needs to learn the normal behavior of each state and then identify anomalous activities.

In summary, unsupervised deep learning models are usually used for denoising, compression, or finding correlations. One such model is the Long Short-term Memory (LSTM) network.

Long Short-term Memory networks (LSTMs) are well suited to classifying, processing, and making predictions based on time-series-like data. The LSTM was developed to deal with the exploding and vanishing gradient problems encountered when training traditional Recurrent Neural Networks (RNNs) [27]. LSTMs are capable of learning dependencies between variables over long periods of time.

In general, as shown in Figure 4, an LSTM [13] contains a hidden state h_t, a cell state c_t, and LSTM gates (input, output, and forget). The hidden state and cell state are also known as the external and internal state, respectively. The external state is the output of the network and reflects the LSTM's capacity; the choice of the hidden size is left to the user.

Figure 4: LSTM structure. Picture adopted from [5]

The cell state is one of the significant differences between LSTM and RNN networks, because the internal state acts as a memory cell for the LSTM and keeps information from the past. However, it is not required to appear in the output of the LSTM network.

The gates of the LSTM [12] provide continuous analogs of writing, reading, and resetting information. It is important to note that each gate's result passes through a sigmoid function, which keeps its values between zero and one.

The forget gate is the first gate in the LSTM network. It is responsible for deciding how much information should be kept from the past. The closer the sigmoid output is to one, the more past information the LSTM unit retains; the closer it is to zero, the less is kept. Its result acts on the previous cell state c_{t-1}.

The input gate is the second gate. It is responsible for choosing how much new information is added to the LSTM's previous knowledge. This choice is made by applying the sigmoid function to the new input and the past state.

The cell state is updated by multiplying the input gate's result with the candidate values C̃_t to obtain a new vector that is added to the recurrent cell state. The output gate decides on the LSTM output and thereby also affects the hidden state value.

Each LSTM gate has its own weights (W_xh, W_hh) and bias (b). The weights are matrices that represent a linear transformation of the input; their sizes follow automatically from the input and the desired output shape. The functions of the LSTM unit are shown in detail in Equations (2.2-2.6).


It is useful to know that for an LSTM layer with h units and num_features input features, the number of parameters is 4 * (h * h + h * num_features + h * 1).

Forget gate:    f_t = σ(W_xhf x_t + W_hhf h_{t-1} + b_f)        (2.2)

Input gate:     i_t = σ(W_xhi x_t + W_hhi h_{t-1} + b_i)        (2.3)

Information:    C̃_t = tanh(W_xhc x_t + W_hhc h_{t-1} + b_c)     (2.4)

Cell state:     C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t                 (2.5)

Hidden state/output:    h_t = o_t ⊙ tanh(C_t)                    (2.6)
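As a quick sanity check of the parameter-count formula above, the following sketch builds a single Keras LSTM layer with assumed sizes (8 units, 5 input features, 12 timesteps; these numbers are illustrative, not from the thesis) and compares Keras's own count with 4 * (h*h + h*num_features + h).

import tensorflow as tf

h, num_features, timesteps = 8, 5, 12

inputs = tf.keras.Input(shape=(timesteps, num_features))
lstm = tf.keras.layers.LSTM(h)
outputs = lstm(inputs)                          # calling the layer builds its weights

expected = 4 * (h * h + h * num_features + h)
print(lstm.count_params(), expected)            # both print 448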

2.3 Dimensionality Reduction (Autoencoder)

There are different techniques to reduce input dimensionality. For instance, Principal Component Analysis (PCA) performs dimensionality reduction as feature extraction, while an autoencoder performs dimensionality reduction as feature selection.

PCA uses statistical techniques to reduce the dimension of an unlabeled, high-dimensional dataset. The autoencoder, in addition, combines dimensionality reduction with feature engineering: it is useful for extracting meaningful features from the input data in an unsupervised way [22].

The autoencoder is a special design of neural network that tries to learn a copy of its input. It consists of two main models that together perform an operation called reconstruction: the encoder encodes its input data into a lower- (or higher-) dimensional hidden layer (the latent space), and the decoder tries to decode the original input back from that space.

There are different types of autoencoders, such as variational, sparse, and denoising autoencoders [21] [10].

Autoencoders know beforehand how their output should look; therefore, they are considered self-supervised models [39]. Figure 5 shows the general structure of an autoencoder. The structure is usually symmetric: the encoder layers have the same sizes as the decoder layers, but in reverse order.


Figure 5: Autoencoder structure: f and g represent the encoder and decoder functions

Equations 2.7 and 2.8 give the general functions of a basic autoencoder with one hidden layer. The functions f(x) and g(h) represent the encoder and decoder, respectively; σ_1 and σ_2 are activation functions, W^(1) and W^(2) are weight matrices, and b^(1) and b^(2) are bias vectors. The entire reconstruction of the input x is given by g∘f(x).

h = f(x) = σ_1(W^(1) x + b^(1))    (2.7)

x̃ = g(h) = σ_2(W^(2) h + b^(2))    (2.8)

To reconstruct the input well, the system needs to minimize the error given by the loss function defined in equation 2.9.

L(x, x̃) = ||x − x̃||² = ||x − σ_2(W^(2) σ_1(W^(1) x + b^(1)) + b^(2))||²    (2.9)

In this paper [3], autoencoders are categorized by the number of layers into two main types: shallow and deep. A shallow autoencoder is the basic structure and contains three layers: input, encoding (one hidden layer), and output. A deep autoencoder, in contrast, has more than one hidden layer. Figure 6 shows four types of autoencoder built from these two combinations.
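To make equations 2.7-2.9 concrete, here is a minimal NumPy forward pass of a one-hidden-layer autoencoder. The dimensions (828 inputs, 120 latent units) follow the model described later in the thesis, but the weights are random and untrained, so the loss value is meaningless; the sketch only illustrates the computation.

import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 828, 120

W1, b1 = rng.normal(scale=0.01, size=(latent_dim, input_dim)), np.zeros(latent_dim)
W2, b2 = rng.normal(scale=0.01, size=(input_dim, latent_dim)), np.zeros(input_dim)

def sigma(z):
    return np.tanh(z)                   # illustrative activation choice

x = rng.normal(size=input_dim)          # one input sample
h = sigma(W1 @ x + b1)                  # encoder, equation (2.7)
x_rec = sigma(W2 @ h + b2)              # decoder, equation (2.8)
loss = np.sum((x - x_rec) ** 2)         # squared reconstruction error, equation (2.9)
print(h.shape, x_rec.shape, loss)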


Figure 6: Different types of autoencoder structure. Picture adopted from [3]


3 Methodology

3.1 Dataset

For this project, we received data for one specific gas turbine at Siemens Industrial Turbomachinery (SIT), consisting of the records of 69 sensors in 2013, between January and December. The data is a daily log of each sensor's KPIs at one-minute intervals for one year.

In Figures 7-9, three arbitrary examples of the 69 time series are shown. Each time series behaves differently over time, depending on the sensor's location on the gas turbine. We consider unexpected changes (spikes and drops) in the time series patterns as anomalous behavior, which we intend to predict.

Figure 7: inlet pressure sensor after Standardization: The real value of the sensor could not be shown because of the Siemens company restriction


Figure 8: air temperature sensor after Standardization: The real value of the sensor could not be shown because of the Siemens company restriction

Figure 9: outlet pressure sensor after Standardization: The real value of the sensor could not be shown because of the Siemens company restriction

Since the sensors cannot be shown with their real values, it is important to mention properties of the dataset that affect the preprocessing steps. Each of these time series is on a different scale. Furthermore, the original dataset contains two quality labels for the sensor values (good-quality and bad-quality); after removing the bad-quality values, missing values appear. Additionally, there are sensors that show no signal behavior, with values lying only on two numbers (zero or one), which makes the data noisy.

Therefore, data preprocessing is needed to make the raw data ready to be fed to the neural network. Data preprocessing covers handling missing values, normalization, and vectorization.

There are standard preprocessing techniques for time series, for instance power transformation, difference transformation, standardization, and normalization. A power transformation transforms the data toward a normal (Gaussian) distribution. A difference transform removes trend and seasonality structure from the time series. Standardization transforms the data to zero mean and unit standard deviation, as shown in equation 3.1. Normalization scales the data to between zero and one (or minus one and plus one), as noted in equation 3.2; it is also called the Min-Max scaler.

Z_x = (x_i − x̄) / σ    (3.1)

MinMax_x = (x_i − x_min) / (x_max − x_min)    (3.2)

The goal is a mid-term prediction of each of the 69 sensors based on historical data, and to determine whether there will be abnormal behavior in each sensor's performance. To this end, the flowchart in Figure 10 shows the steps followed in the preprocessing stage.

Figure 10: data preprocess steps

The first step is the preparation of the time interval. This step lets the user determine the duration of the time interval; for instance, the user can change the one-minute interval of the raw data to a five-minute interval. The implementation of this part is shown in Block 3.1.

import pandas as pd

def data_preparation(df, dt_col_name, val_col_name, interval='5T'):
    # Parse the timestamp column and average the values over the requested interval
    df[dt_col_name] = pd.to_datetime(df[dt_col_name])
    df = df.groupby(pd.Grouper(key=dt_col_name, freq=interval))[val_col_name].mean()
    df = pd.DataFrame(df)
    df[dt_col_name] = df.index
    df = df.reset_index(drop=True)

    return df

Block 3.1: time interval preparation code block

Across 2013, some records are missing because of operations performed in the preprocessing phase and mechanical/electrical failures during the data recovery process. These missing values are treated as unknown data (incomplete feature vectors). Different approaches exist for dealing with missing values; one of them is imputation, i.e., estimation of the missing values. Imputation can be implemented with statistical methods such as mean imputation, regression imputation, and multiple imputation [7]. In this project, we use mean imputation to fill in the missing sensor values, as shown in Block 3.1.
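A minimal pandas sketch of mean imputation as described above; the column name sensor_value and the toy data are illustrative assumptions, not the thesis's actual schema.

import numpy as np
import pandas as pd

df = pd.DataFrame({'sensor_value': [1.0, np.nan, 3.0, np.nan, 5.0]})   # toy data with gaps
df['sensor_value'] = df['sensor_value'].fillna(df['sensor_value'].mean())
print(df)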

This research deals with a time series prediction problem. Prediction on a non-time-series dataset is easier than on a time series dataset, because scoring a new record can be performed independently of the other records; in time series data, by contrast, scoring a new record depends on a look-back window of recent records. Hence, the next steps are normalizing the data and generating the timesteps, the so-called "multi-feature window method" (sketched after Figure 11).

First, we used standard scaling to scale the time series. Next, in the multi-feature window method, we chose a window size of 12 timesteps, and the prediction target is the value at timestep zero. In the multi-feature setting, each time series is one feature; since there are 69 selected time series, this gives 69 features. We put the first timestep of all 69 time series in the first positions, then the second timestep of all time series after them, and so on, as illustrated in Figure 11.

Figure 11: Multi-feature approach: the training data starts with timestep 12 of all 69 time series, then timestep 11 of all 69 time series, down to timestep 1 of all 69 time series. The targets are the future timestep of all time series. Each time series (ts) is the data of one sensor located on the gas turbine.
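The windowing described above can be sketched as follows. This is an illustrative implementation, not the thesis's exact code: it turns a scaled array of shape (n_samples, 69) into flattened 12-step windows of 12 * 69 = 828 values and next-step targets for all 69 sensors.

import numpy as np

def make_windows(values, window=12):
    X, y = [], []
    for t in range(window, len(values)):
        X.append(values[t - window:t].reshape(-1))   # 12 timesteps * 69 features = 828
        y.append(values[t])                          # all 69 sensors at the next step
    return np.array(X), np.array(y)

data = np.random.rand(1000, 69)     # stand-in for the standardized sensor matrix
X, y = make_windows(data)
print(X.shape, y.shape)             # (988, 828) and (988, 69)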

from sklearn import preprocessing
import pandas as pd

x_cols = list(d_tot_copy.columns[1:])
ts_col = d_tot_copy.columns[0]

# standard scaler
s_data = d_tot_copy[[ts_col] + x_cols]
scaler = preprocessing.StandardScaler()
scaler_data = scaler.fit_transform(s_data[x_cols].values).tolist()
scaler_data = pd.DataFrame(scaler_data, columns=x_cols)
s_data = pd.DataFrame(pd.concat([scaler_data, s_data[ts_col]], axis=1), columns=s_data.columns)
s_data.head()

Block 3.2: standardization code block


The last step in the data preprocessing flowchart is splitting the data into train, validation, and test sets. It is vital to note that a random split is not the right choice for a time series dataset: choosing random rows would lose valuable information, because of the continuous and time-ordered nature of time-series data. Figures 12-14 show a normalized sample of the final data split into training, validation, and test sets.
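A minimal sketch of the chronological (non-random) split described above, applied to the window/target arrays X and y from the earlier sketch; the 85% / 7.5% / 7.5% proportions are illustrative assumptions that only roughly match the shapes reported later in Table 3.

n = len(X)
n_train = int(0.85 * n)
n_valid = int(0.925 * n)

X_train, y_train = X[:n_train], y[:n_train]
X_valid, y_valid = X[n_train:n_valid], y[n_train:n_valid]
X_test,  y_test  = X[n_valid:], y[n_valid:]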

Figure 12: Standardized data example: the sensor-9 time series is split into train data (blue), validation data (red), and test data (green)

Figure 13: Standardized data example: the sensor-7 time series is split into train data (blue), validation data (red), and test data (green)


Figure 14: Standardized data example: the sensor-467 time series is split into train data (blue), validation data (red), and test data (green)

3.2 Model Design

This section presents the framework of the research methods followed in this study. It details the structure of the prediction and detection models that help solve the anomaly detection problem at Siemens Industrial Turbomachinery (SIT), as described in Figure 15.

Figure 15: proposed model in the study

3.3 Prediction Model

The prediction model consists of two parts: a reconstruction autoencoder, and a Deep Long Short-term Memory model.

3.3.1 Reconstruction Autoencoder

This part generates a representation of the input time series. For this purpose, an autoencoder with a reconstruction objective was implemented. It takes 2D input data of shape (None, 828) and outputs a reconstruction of its input. The autoencoder is composed of two models, an encoder and a decoder, each built as a multilayer perceptron network.

The encoder model has two hidden dense layers and one latent dense layer. The decoder model takes the latent layer as its input and has two hidden dense layers and an output layer.

Figure 16: Encoder model structure: it consists of the input layer, two hidden dense layers with 512 and 256 units, and a latent layer with 120 units

Figure 16 shows the structure of the encoder model, built as a multilayer perceptron network in which each neuron in one layer is connected to all neurons of the next layer. With 69 features (the number of available time series) and 12 timesteps, the input has shape (None, 12 * 69) = (None, 828). The first hidden layer has 512 neurons, so its output has shape (None, 512). The second hidden layer has 256 neurons, giving an output of shape (None, 256). The last layer of the encoder is the latent layer, which has 120 neurons, since the goal is to extract the essential features by reducing the size of the input layer; its output therefore has shape (None, 120).

Equation 3.3 gives the computation of the whole encoder structure. o^(3)_n denotes the n:th output of the third dense layer (the latent layer, with 120 neurons). In this equation, x_i is the input to the model, and w^(l)_{uv} denotes the connection between the v:th neuron in layer l-1 and the u:th neuron in layer l. The bias of the u:th neuron in layer l is written b^(l)_u. σ_1 and σ_2 are the activation functions of the first and second dense layers, respectively.


o^(3)_n = Σ_k w^(3)_{nk} [ σ_2( Σ_j w^(2)_{kj} [ σ_1( Σ_i w^(1)_{ji} x_i + b^(1)_j ) ] + b^(2)_k ) ] + b^(3)_n    (3.3)

The decoder model is a multilayer perceptron network like the encoder, with each neuron in one layer connected to all neurons of the next layer. The decoder's input is the output of the encoder's latent layer, of shape (None, 120). The first hidden layer has 256 neurons and the second 512, so their outputs have shapes (None, 256) and (None, 512), respectively. The last layer of the decoder has 828 neurons, since the goal is to reconstruct the input layer; its output therefore has shape (None, 828). The structure is the same as in Figure 16, but in the opposite direction.

We applied the mean squared error (MSE) as the loss function for this model. Equation 3.4 defines this loss, where y is the reconstructed value, y' is the input value, and N is the number of features (observations). The loss function measures the error of the reconstruction compared to the expected result. Another vital step is updating the weights to improve the reconstruction; backpropagation does this by minimizing the error given by the loss function, which requires calculating the gradient. First, we collect the network parameters that affect the loss function, such as the weight and bias matrices, into θ. The gradient of the loss with respect to these parameters is then ∂L(θ)/∂θ. Equation 3.5 shows how the parameters θ of each layer are updated using gradient descent, where γ is the learning rate that controls the size of the update.

L = (1/N) Σ_{i=1}^{N} (y_i − y'_i)²    (3.4)

θ = θ − γ ∂L(θ)/∂θ    (3.5)

Since the network is large and the training data is vast, the optimizer algorithm helps speed up learning while minimizing the loss function. Stochastic Gradient Descent (SGD) is a variant of gradient descent: instead of computing the loss and updating the parameters on the whole dataset, SGD divides the dataset into batches and updates the parameters after each batch's loss calculation. There are other optimizer algorithms, such as RMSprop, AdaGrad, and Adam; the choice of optimizer is a hyperparameter, and tuning it helps to get better results. In this study, we mainly used Adam.

3.3.1.1 Reduction Using AE

The autoencoder is a way to transform the representation of the input. There are two kinds of design: sparse and compressed. A sparse autoencoder is obtained by keeping the number of hidden-layer nodes greater than the number of original input nodes; a compressed autoencoder, by contrast, uses fewer hidden-layer nodes than input nodes. This study focuses on a compressed representation of the input, which achieves the desired dimensionality reduction.

Here we seek a non-linear projection that maps the data from a high-dimensional feature space to a lower-dimensional one. Sample data in a high-dimensional space generally cannot fill the whole space; it lies on a low-dimensional manifold embedded in the high-dimensional space.

The dimensionality reduction is performed by first designing the non-linear autoencoder for reconstruction, as shown in code Block 3.3.

import tensorflow as tf

def build_ae(input_dim, latent_dims, lr, dropout_rate):
    # inputs
    inputs = tf.keras.layers.Input(shape=[input_dim], name='inputs')
    x = inputs

    hidden_dims = latent_dims[:-1]
    latent_dim = latent_dims[-1]

    # encoder: hidden dense layers, each followed by dropout
    for hidden_dim in hidden_dims:
        x = tf.keras.layers.Dense(hidden_dim, activation='linear')(x)
        x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    # latent (bottleneck) layer
    x = tf.keras.layers.Dense(latent_dim, activation='linear', name='latent_layer')(x)

    # decoder: mirror of the encoder
    for hidden_dim in hidden_dims[::-1]:
        x = tf.keras.layers.Dense(hidden_dim, activation='linear')(x)
        x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    outputs = tf.keras.layers.Dense(input_dim, activation='sigmoid')(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    opt = tf.keras.optimizers.Adam(lr=lr)
    model.compile(optimizer=opt, loss='mse')

    return model

Block 3.3: autoencoder build model code block

The critical point in designing the autoencoder model is keeping the number of hidden-layer nodes smaller than the number of input-layer nodes. The final step is selecting the latent layer, which contains the compressed information of the input layer. The reduction process is shown in code Block 3.4 and Figure 17.


Figure 17: Summary of the reduction model

def dr_model(ae, layer_name='latent_layer'):
    # Build a reduction model that maps the autoencoder's input to its latent-layer output
    inputs = ae.input
    outputs = ae.get_layer(layer_name).output

    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='mse')

    return model

Block 3.4: reduction build model code block

3.3.2 Deep LSTM

As the second part of the model, we applied a deep LSTM model with three LSTM layers. It is worth discussing the model structure in more detail.

One sample is a sequence of inputs that overlaps with the next sequence, and one feature is one observation at a time step. The number of timesteps is the number of times the LSTM is unfolded. Accordingly, the LSTM's input must be a 3-dimensional tensor representing the time sequence order, with shape (n_samples, timesteps, n_features), as shown in Figure 18.

Figure 18: A 3D time series data tensor. Picture adopted from [6]

Figure 19 shows that this model's input has shape (None, 12, 10), where 12 is the number of timesteps and 10 is the number of features. The LSTM is therefore unfolded 12 times.

Figure 19: An LSTM layer with input shape of (None, 12, 10)

The units of an LSTM layer are its number of hidden units, and they define the dimension of its output. The units can be seen as the LSTM's capacity: the larger the number of units, the more learning capacity the LSTM has. This is one of the parameters that must be tuned to counter overfitting during the training phase. The hidden layers of this model were chosen with 200, 100, and 200 units for the first, second, and third layers, respectively. The output layer is a dense layer with 69 units, to predict the value of all 69 targets (sensors) five minutes into the future based on information from the last 60 minutes.

Depending on the desired LSTM model, the output can be handled in different ways; the hidden states are the outputs of an LSTM layer. Each LSTM layer has an option called return_sequences. It is False by default, which means only the last hidden state (the last time step of the current sequence) is returned as output. Setting return_sequences to True makes the LSTM output the hidden states of all time steps in the sequence, not only the last one. In this model, return_sequences is True for the first and second hidden layers (200 and 100 units) and False for the last hidden layer (200 units). In addition, if the return_state option of an LSTM layer is set to True, then c_t is returned alongside h_t. The outputs of the three layers are shown in Table 1.

Table 1: Outputs of the three hidden layers

Input shape:            (None, 12, 10)
Hidden layer 1 output:  (None, 12, 200)
Hidden layer 2 output:  (None, 12, 100)
Hidden layer 3 output:  (None, 200)
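A small sketch of the effect of return_sequences, reproducing the shapes listed in Table 1; the layer sizes follow the text above, and the snippet only builds the layers to inspect their output shapes.

import tensorflow as tf

inp = tf.keras.Input(shape=(12, 10))
h1 = tf.keras.layers.LSTM(200, return_sequences=True)(inp)    # (None, 12, 200)
h2 = tf.keras.layers.LSTM(100, return_sequences=True)(h1)     # (None, 12, 100)
h3 = tf.keras.layers.LSTM(200, return_sequences=False)(h2)    # (None, 200)
print(h1.shape, h2.shape, h3.shape)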

def lstm_model(n_timestamps, n_features, n_outputs, n_units=None, dropout_rate=0.2, lr=2e-4):

    n_units = n_units if isinstance(n_units, list) else [100, 100]

    # create the input
    inputs = tf.keras.layers.Input(shape=[n_timestamps, n_features], name='inputs')

    x = inputs

    # stacked LSTM layers returning full sequences, each followed by dropout
    for units in n_units[:-1]:
        x = tf.keras.layers.LSTM(units, return_sequences=True)(x)
        x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    # last LSTM layer returns only the final hidden state
    x = tf.keras.layers.LSTM(n_units[-1], return_sequences=False)(x)
    x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    outputs = tf.keras.layers.Dense(n_outputs, activation='linear')(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss='mse')

    return model

Block 3.5: Deep LSTM build model preparation code block

Statefulness in an LSTM is related to how batches are used in the training process. The batch size is the number of samples the network sees before updating the weights [33]; in this model, the batch size is 64.

To prevent overfitting during training, besides other methods and hyperparameter tuning, one can use dropout regularization. Dropout randomly sets the output of some hidden units of a layer to zero during training; the dropout rate in this study was chosen as 0.1.

A simple way to describe statefulness is to imagine a long sequence divided into smaller pieces (batches): the cell state of the last timestep of the i:th sample of the current batch is passed to the i:th sample of the next batch to initialize its value.

The math behind an LSTM was described in equations (2.2-2.6) of Section 2.2, and we used MSE as the loss function. Adam was used as the optimizer, and a linear activation was chosen for the output layer. The implementation of the deep LSTM and its summary is shown in code Block 3.5.

The prediction model is a combination of the autoencoder reduction and the deep Long Short-term Memory models. First, a three-layer multilayer perceptron autoencoder is used for automatic feature selection and representation learning (encoded features), as explained before. The purpose of the autoencoder is to learn the behavior across multiple time series with a variety of patterns, in order to capture the correlation among them and obtain useful features of a fixed dimension. By taking the input's reduced representation from the autoencoder's latent layer, we obtain a compressed form of the input features. It is essential to mention that if there is abnormal behavior in the input, it will be captured by the encoder. The next step is to feed these embedded features to the prediction part of the model: a deep LSTM with three layers whose input is the autoencoder's reduced representation. This new representation needs to be in 3D shape, as described earlier; to fulfill this need, we reshape (expand the dimensions of) the latent output, as sketched below. Figure 20 shows the combination of the two models from Sections 4.1.1 and 4.1.2.
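A minimal sketch (with assumed shapes) of turning the 120-dimensional latent output into the 3D tensor the deep LSTM expects; arranging the 120 latent features as 12 timesteps of 10 features matches the (None, 12, 10) input shown in Figure 19, but the batch size here is arbitrary.

import numpy as np

latent = np.random.rand(64, 120)           # stand-in for dr_model.predict(batch)
lstm_input = latent.reshape(-1, 12, 10)    # expand to (samples, timesteps, features)
print(lstm_input.shape)                    # (64, 12, 10)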


Figure 20: A prediction model based on features extracted by an autoencoder model. Picture adopted from [41]

3.4 Detection Model

From the start, we have no prior knowledge of which data is normal and which is abnormal; nevertheless, our goal is to predict and detect anomalous behavior. Section 3.3 described the prediction model and how autoencoder reduction with a deep LSTM can forecast each of the 69 signals five minutes ahead based on one hour of history.

Figure 21 displays the main steps of the detection model. The first step is to feed the prediction model's output into the detection model. The next step is measuring the aggregate error: the prediction error is computed with the formula in equation 3.6, and the aggregate error is the mean of the prediction errors of all 69 sensors, wrapping several errors into a single one. The next step in the flowchart is finding candidate anomaly dates and the involved sensors, which is achieved with the following key steps.


Figure 21: scheme of the Detection Model Steps

Error_i = (y_i − y'_i)²  ⇒  Aggregate_Error = (1/N) Σ_{i=1}^{N} √((y_i − y'_i)²)    (3.6)

3.4.1 Anomaly Scoring and Selection of Candidate Set

To accomplish this step, we must distinguish the observations whose anomaly scores deviate significantly from the others. The scoring technique is applied to the aggregate errors of equation 3.6. The critical problem is finding the best cut-off threshold when the boundary between normal and anomalous behavior is not obvious, so as to minimize the false positive rate while maximizing the detection rate. Following two assumptions from this paper [29] about anomaly detection on unlabeled data, anomalies are assumed to make up a small portion of the data, not exceeding five percent, and we are more interested in finding a fraction of the anomalies with high confidence than in finding all of them. However, the dataset might not consist almost entirely of normal data; to address this, we consider a suitable portion of the data as normal, for example 80%. To score the aggregate errors, we applied the quantile method to fix the confidence area of anomalies. The implementation of this step is shown in Block 3.6.

import numpy as np

# calculate the per-sensor squared error
error = np.square(yy - yy_pred).T

# calculate the aggregate error across sensors
agg_error = np.mean(error, axis=0)

# keep only aggregate errors above the 99% quantile as anomaly candidates
ac_agg_error = np.where(agg_error > np.quantile(agg_error, 0.99), agg_error, 0)

Block 3.6: Threshold on detection model code block. yy represents y-test and yy_pred represents the Deep LSTM's predictions on the test data

A quantile determines how many values in a distribution are above or below a specific limit. As Figure 22 shows, 1% of the dataset is considered anomalous and 99% normal.

Next, we select the candidate set of dates on which anomalous events occurred. The first step is to obtain all dates in the test dataset that fall in the anomaly confidence area, i.e., the top 1% of the whole dataset, as illustrated in Figure 23, part one. Then, based on the dates suspected of being anomalous, we determine the involved sensors by checking all 69 sensors against the anomaly confidence area with the same quantile technique, as shown in Figure 23, part two. In other words, we check whether each sensor's error lies in the top 1% of the dataset.
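A minimal sketch of the two-part selection described above, continuing from Block 3.6. The test_dates timestamps (aligned with agg_error) and the per-sensor error matrix error of shape (69, n_test) are assumptions for illustration; the 5-minute date range is synthetic.

import numpy as np
import pandas as pd

test_dates = pd.date_range('2013-10-01', periods=len(agg_error), freq='5T')   # illustrative

# Part one: candidate anomaly dates (aggregate error above its 99% quantile)
candidate_idx = np.where(agg_error > np.quantile(agg_error, 0.99))[0]
candidate_dates = test_dates[candidate_idx]

# Part two: for each candidate date, sensors whose own error exceeds its 99% quantile
sensor_thresholds = np.quantile(error, 0.99, axis=1)              # one threshold per sensor
involved = {date: np.where(error[:, idx] > sensor_thresholds)[0]  # indices of involved sensors
            for date, idx in zip(candidate_dates, candidate_idx)}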

Figure 22: Confidence area of anomalies and area of normality

Figure 23: Quantile technique to find the candidate set of anomaly cases


4 Experimental Study and Results Analysis

In this chapter, the results found by this study are presented and discussed. The chapter is arranged into two sections. The first presents the prediction model results obtained through hyperparameter tuning to find the best combination. The second discusses the detection model and its limitations.

4.1 Prediction Model

The first model concentrates on prediction and is a combination of autoencoder reduction and deep LSTM. The results of each model are described individually.

4.1.1 Reconstruction Autoencoder

For the reconstruction autoencoder model with multi-feature experiments, combinations of the following values were tried, as listed in Table 2.

Table 3 shows the settings applied to the Autoencoder model experiment. The model is trained on the training data and validated on the validation dataset.

Table 4 and Figure 24 display the tuning results for the top five combinations, given the fixed parameters in Table 3. As Table 4 shows, the reconstruction autoencoder's best result with multi-features belongs to the model using the Adam optimizer, 512 and 256 units in the first and second hidden layers, and a dropout rate of 0.1. Figure 25 shows the loss functions during training of the autoencoder model.

Table 2 Combination of different hyper-parameter values used for training the Autoencoder model

First hidden layer     512, 256
Second hidden layer    256, 128
Dropout Rate           0.1, 0.2
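The thesis does not specify the tuning framework, so the following is only a minimal sketch of how the grid in Table 2 could be enumerated; build_autoencoder (a sketch of which is given after Table 3) and evaluate_mse, x_train, x_valid, x_test are assumed names, not the actual thesis code.

from itertools import product

first_units   = [512, 256]
second_units  = [256, 128]
dropout_rates = [0.1, 0.2]

results = []
for u1, u2, dr in product(first_units, second_units, dropout_rates):
    # build_autoencoder and evaluate_mse are hypothetical helpers standing in
    # for the actual model construction, training, and test evaluation
    model = build_autoencoder(u1=u1, u2=u2, dr=dr)
    mse = evaluate_mse(model, x_train, x_valid, x_test)
    results.append(((u1, u2, dr), mse))

# keep the five combinations with the lowest test MSE, as reported in Table 4
best_five = sorted(results, key=lambda r: r[1])[:5]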


Table 3 Parameter settings for the autoencoder model

Parameter               Value
Batch-size              64
Epochs                  100
Train data shape        (89586, 828)
Validation data shape   (7905, 828)
Test data shape         (7905, 828)
Latent layer            120
Optimizer               Adam
Activation              linear
Learning Rate           2e-4
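To make the settings in Tables 2 and 3 concrete, below is a minimal Keras sketch of the best configuration (512/256 encoder units, 120-unit latent layer, dropout 0.1, linear activations, Adam at 2e-4, MSE loss). The exact layer ordering, the placement of dropout, and the mirrored decoder are assumptions; the thesis fixes only the sizes, activation, optimizer, and learning rate.

from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features=828, latent_dim=120, u1=512, u2=256, dr=0.1):
    inputs = keras.Input(shape=(n_features,))
    # encoder: two hidden layers with dropout, then the latent representation
    x = layers.Dense(u1, activation="linear")(inputs)
    x = layers.Dropout(dr)(x)
    x = layers.Dense(u2, activation="linear")(x)
    x = layers.Dropout(dr)(x)
    latent = layers.Dense(latent_dim, activation="linear", name="latent")(x)
    # decoder: assumed to mirror the encoder back to the input dimension
    x = layers.Dense(u2, activation="linear")(latent)
    x = layers.Dense(u1, activation="linear")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)

    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-4), loss="mse")
    return autoencoder

# example usage with the batch size and epochs of Table 3:
# autoencoder = build_autoencoder()
# autoencoder.fit(x_train, x_train, validation_data=(x_valid, x_valid),
#                 batch_size=64, epochs=100)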

Table 4 Error on the test data set during the hyperparameter tuning process for the autoencoder model. U1: number of units in the first hidden layer; U2: number of units in the second hidden layer

Optimizer   Best Combination              Mean-squared-error (MSE)
Adam        dr=0.1, U1=512, U2=256        0.04205
Adam        dr=0.1, U1=256, U2=256        0.04787
Adam        dr=0.1, U1=512, U2=128        0.05704
Adam        dr=0.1, U1=256, U2=128        0.05902
Adam        dr=0.2, U1=512, U2=256        0.07366

Figure 24: Visualisation of the error on the test data set during the hyperparameter tuning process for the autoencoder model. U1: number of units in the first hidden layer; U2: number of units in the second hidden layer. Each combination of parameters is shown by a colored line ending at its mean-squared-error value; cold colors show the lowest MSE and warm colors the highest MSE


Figure 25: Train and validation loss plots for the best result of Table 4 for the multi-feature AE model

In Chapter 3, we selected three time series as samples to follow up and illustrated them in Figures 12-14. Figure 26 shows the reconstruction results of the autoencoder model for those time series. The data in Figure 26 correspond to the green part (test data) of Figures 12-14. The error rates for these reconstructions can be found in Table 5.

Table 5 The reconstruction results on the train/valid/test data sets.

              MSE       RMSE
Train Error   0.01694   0.13015
Valid Error   0.00471   0.06862
Test Error    0.01642   0.12814
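For reference, the MSE and RMSE values reported in Tables 5 and 9 can be computed as in the following minimal sketch, assuming y_true and y_pred are NumPy arrays of the expected and reconstructed (or predicted) values; this is an illustration, not the thesis's evaluation code.

import numpy as np

def mse_rmse(y_true, y_pred):
    mse = np.mean(np.square(y_true - y_pred))   # mean squared error over all sensors and timesteps
    rmse = np.sqrt(mse)                          # root mean squared error
    return mse, rmse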


Figure 26: Reconstruction results of the autoencoder multi-feature model for three time series samples. The x-axis shows dates in the test data and the y-axis shows the reconstructed and expected values


4.1.2 Deep LSTM

The second model in the prediction pipeline was the Deep LSTM described in the methodology chapter. The model was trained using combinations of the following parameters during the tuning process, as illustrated in Table 6.

Table 6 Combination of different hyper-parameter values and hidden network sizes used to train the Deep LSTM model

Dropout             0.1, 0.2
First LSTM layer    100, 200
Second LSTM layer   100, 200
Third LSTM layer    100, 200

Table 7 and Figure 27 display the tuning results for the five best combinations with the learning rate fixed at 2e-4. As the table shows, the Deep LSTM's best result belongs to the model using the Adam optimizer with 200, 100, and 200 units in the first, second, and third hidden layers. The best dropout rate is 0.1 and the best activation function is linear.

Figure 27: Visualisation of the error on the test data set during the hyperparameter tuning process for the Deep LSTM model. lstm-U1, lstm-U2, lstm-U3: number of units in the first, second, and third hidden layers. Each combination of parameters is shown by a colored line ending at its mean-squared-error value; cold colors show the lowest MSE and warm colors the highest MSE


Table 7 Error on the test data set during the hyperparameter tuning process for the Deep LSTM model. lstm-U1, lstm-U2, lstm-U3: number of units in the first, second, and third hidden layers

Optimizer   Activation   Best Combination                                MSE
Adam        linear       dr=0.1, lstm-U1=200, lstm-U2=100, lstm-U3=200   0.13568
Adam        linear       dr=0.1, lstm-U1=200, lstm-U2=200, lstm-U3=200   0.14199
Adam        linear       dr=0.1, lstm-U1=200, lstm-U2=100, lstm-U3=100   0.14234
Adam        linear       dr=0.2, lstm-U1=200, lstm-U2=200, lstm-U3=200   0.14546
Adam        linear       dr=0.1, lstm-U1=100, lstm-U2=100, lstm-U3=200   0.14845

The final chosen parameters for the Deep LSTM model are presented in Table 8. Figure 28 shows the prediction results of the Deep LSTM model for the three example signals of the specific gas turbine shown in Figures 12-14 of Chapter 3. The data in this figure correspond to the green part (test data) of those figures. The error rates for these predictions can be found in Table 9. Figure 29 shows the loss functions during training of the Deep LSTM model.

Table 8 Parameter settings for the Deep LSTM model. lstm-U1, lstm-U2, lstm-U3: number of units in the first, second, and third hidden layers

Parameter               Value
Batch-size              64
Epochs                  100
Train data shape        (89586, 12, 10)
Validation data shape   (7905, 12, 10)
Test data shape         (7905, 12, 10)
Best Combination        dr=0.1, lstm-U1=200, lstm-U2=100, lstm-U3=200
Learning Rate           2e-4
Optimizer               Adam
Activation              linear
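A minimal Keras sketch of the configuration in Table 8 is given below; the input shape (12, 10) and the stack of three LSTM layers with 200, 100, and 200 units follow the table, while the 69-unit output layer (one value per predicted signal) and the placement of dropout between LSTM layers are assumptions for illustration rather than the exact thesis implementation.

from tensorflow import keras
from tensorflow.keras import layers

def build_deep_lstm(timesteps=12, n_features=10, n_outputs=69, dr=0.1):
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        layers.LSTM(200, return_sequences=True),   # first hidden LSTM layer
        layers.Dropout(dr),
        layers.LSTM(100, return_sequences=True),   # second hidden LSTM layer
        layers.Dropout(dr),
        layers.LSTM(200),                          # third hidden LSTM layer
        layers.Dense(n_outputs, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-4), loss="mse")
    return model

# example usage with the batch size and epochs of Table 8:
# model = build_deep_lstm()
# model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
#           batch_size=64, epochs=100)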

Table 9 The error rate of the prediction results on the train/valid/test data sets.

              MSE       RMSE
Train Error   0.01754   0.13209
Valid Error   0.00957   0.09782
Test Error    0.14848   0.38533


Figure 28: Prediction results of the Deep LSTM multi-feature model for three time series samples. The x-axis shows dates in the test data and the y-axis shows the predicted and expected values
