Evaluation of Neural Networks for Predictive Maintenance: A Volvo Penta Study


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2021 | LIU-IDA/LITH-EX-A--21/024--SE


Swedish title: Utvärdering av Neuronnät för Prediktivt Underhåll

Andreas Nordberg

Supervisor: Rouhollah Mahfouzi

Examiner: Petru Eles


Copyright (Swedish notice, translated)

This document is kept available on the Internet, or its future replacement, for a period of 25 years from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the consent of the copyright holder. Technical and administrative solutions are in place to guarantee authenticity, security, and accessibility.

The author’s moral rights include the right to be named as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author’s literary or artistic reputation or distinctive character.

For further information about Linköping University Electronic Press, see the publisher’s home page http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

As part of Volvo Penta’s initiative to further the development of predictive maintenance in their field test environments, this thesis compares neural networks in an effort to predict the occurrence of three common diagnostic trouble codes using field test data. To quantify the neural networks’ performance for comparison, a number of evaluation metrics were used. By training a multitude of differently configured feedforward neural networks with the processed field test data and evaluating the resulting models, it was found that the resulting models perform better than a baseline classifier. As such, it is possible to use Volvo Penta’s field test data along with neural networks to achieve predictive maintenance. It was also found that Long Short-Term Memory (LSTM) networks with methodically selected hyperparameters were able to predict the diagnostic trouble codes with the greatest performance among all the tested neural networks.


Acknowledgments

This thesis would not have been possible without the fantastic support of family, friends, thesis supervisors and people at Volvo Penta. A special thanks goes to Viktor Palmqvist Berntsson, Elias Sonnsjö Lönegren, and Simon Anwell, all of whom were absolutely critical for the realization and success of this thesis.


List of Figures

2.1 A simple perceptron

2.2 A multilayer perceptron with one hidden layer

2.3 A backpropagation illustration

2.4 An unfolded RNN network

2.5 A confusion matrix

2.6 Example ROC curves for four different models

2.7 Example PR curves for four different models

3.1 Exhaust Temperature [°C]

3.2 Temperature after DPF [°C]

3.3 Exhaust mass flow [kg/s]

3.4 Pressure difference before and after DPF [kPa]

3.5 Example signal measurements for the selected Soot DTC signals

3.6 An illustration of the sample windowing

4.1 FFN Crystallization Network #1 Confusion Matrix

4.2 FFN Crystallization Network #1 Loss-Epoch

4.3 FFN Crystallization Network #1 Accuracy-Epoch

4.4 FFN Soot Network #4 Confusion Matrix

4.5 FFN Soot Network #4 Loss-Epoch

4.6 FFN Soot Network #4 Accuracy-Epoch

4.7 FFN AFC Network #5 Confusion Matrix

4.8 FFN AFC Network #5 Loss-Epoch

4.9 FFN AFC Network #5 Accuracy-Epoch

4.10 FFN Crystallization ROC AUC plot

4.11 FFN Crystallization Precision-Recall AUC plot

4.12 FFN Soot ROC AUC plot

4.13 FFN Soot Precision-Recall AUC plot

4.14 FFN AFC ROC AUC plot

4.15 FFN AFC Precision-Recall AUC plot

4.16 LSTM Crystallization Network #10 Confusion Matrix

4.17 LSTM Crystallization Network #10 Loss-Epoch

4.18 LSTM Crystallization Network #10 Accuracy-Epoch

4.19 LSTM Soot Network #10 Confusion Matrix

4.20 LSTM Soot Network #10 Loss-Epoch

4.21 LSTM Soot Network #10 Accuracy-Epoch

4.22 LSTM AFC Network #10 Confusion Matrix

4.23 FNN AFC Network #5 Confusion Matrix

4.24 LSTM AFC Network #10 Loss-Epoch

4.25 LSTM AFC Network #10 Accuracy-Epoch

4.26 LSTM AFC Network #5 Loss-Epoch

4.27 LSTM AFC Network #5 Accuracy-Epoch

4.29 LSTM Crystallization Precision-Recall AUC plot

4.30 LSTM Soot ROC AUC plot

4.31 LSTM Soot Precision-Recall AUC plot

4.32 LSTM AFC ROC AUC plot

4.33 LSTM AFC Precision-Recall AUC plot

4.34 FFN #1 Crystallization Test Dataset Prediction Distribution

4.35 FFN #4 Soot Test Dataset Prediction Distribution

4.36 FFN #5 Air Filter Clogged Test Dataset Prediction Distribution

4.37 LSTM Network #10 Crystallization Test Dataset Prediction Distribution

4.38 LSTM Network #10 Soot Test Dataset Prediction Distribution

4.39 LSTM Network #10 Air Filter Clogged Test Dataset Prediction Distribution

4.40 LSTM Crystallization with Overlap Confusion Matrix

4.41 LSTM Soot with Overlap Confusion Matrix

4.42 LSTM AFC with Overlap Confusion Matrix

4.43 FFN Soot Impartial Test Dataset Confusion Matrix


List of Tables

3.1 Selected signals for predicting DTCs

3.2 Feedforward Network Random Search Hyperparameter Ranges

3.3 LSTM Grid Search Hyperparameter Sets

4.1 FFN Crystallization Hyperparameter Random Search Results

4.2 FFN Soot Hyperparameter Random Search Results

4.3 FFN AFC Hyperparameter Random Search Results

4.4 LSTM Crystallization Hyperparameter Grid Search Results

4.5 LSTM Soot Hyperparameter Grid Search Results

4.6 LSTM Air Filter Clogged Hyperparameter Grid Search Results

4.7 LSTM Crystallization with Overlap Classification Report

4.8 LSTM Soot with Overlap Classification Report

4.9 LSTM AFC with Overlap Classification Report

4.10 FFN Soot Impartial Test Dataset Classification Report


1 Introduction

Volvo Penta is a world-leading supplier of diesel engines and power solutions for marine and industrial applications, and a part of the Volvo Group. During the development of an engine, testing is performed both in test cells and in customer field tests, which produces large amounts of structured data. While collecting this data is costly for Volvo Penta, it is currently not used as feedback to improve engine operations and their tests to the extent that Volvo Penta would like. An emerging method of maintenance that Volvo Penta is looking to expand into is predictive maintenance, in which machine learning algorithms are used to predict engine faults. This thesis project investigates any correlation between the collected field data measurements and the on-board Diagnostics Trouble Codes (DTCs) produced by the engines during field tests. If intervals of the field test data are found to contain some pattern recognizable by neural networks, a trained network model could determine whether a measurement interval precedes a DTC or not. Such a model could then predict the need for maintenance given an input interval, so that engine errors can be prevented from occurring in the first place.

1.1 Motivation

All field-tested engines are equipped with data loggers monitoring the signals collected from the EMS (Engine Management System) CAN-bus (Controller Area Network-bus). Depending on the field test and engine, a varying number of signals are monitored and stored by a data logger. Usually, 60-100 signals are continuously logged with sample rates varying between 1 and 50 Hz. In addition to these continuous signal measurements, the DTCs are periodically logged and stored in a file. There are hundreds of different DTCs defined, each of which is triggered by some set of conditions involving sensor and signal criteria. When a DTC condition is fulfilled, it will be listed on the next periodic fetch of all active DTCs.

While some DTCs have more trivial trigger conditions than others, most of them cannot be easily predicted by linear extrapolation and require a less direct approach to foresee. Regardless of how trivial they are to foresee, the on-board diagnostics of today are used reactively to handle the errors. The test engineers are not alerted of the errors until they have already happened. What Volvo Penta is looking to explore is the possibility of predicting DTCs using deep neural network models based on field test data collected over several years. After successfully training models to predict the trouble codes, new field test data can be continuously input into the model and the EMS could proactively alert the test engineers when the engine is likely to require maintenance. A future Volvo Penta project could deploy successfully trained models on the field test engines themselves, where the engine sensor data is fed directly into a neural network model which continuously evaluates the need for maintenance.

1.2 Aim

This thesis aims to use signals and features extracted from Volvo Penta field test data related to three different diagnostics trouble codes to train feedforward and recurrent neural network models with differently configured hyperparameters, and compare the models to each other. The best model, according to the defined evaluation metrics, will serve as proof-of-concept of how Volvo Penta field test data can be used for predictive maintenance.

1.3 Research questions

The problem brought up and motivated in the previous sections, namely whether neural networks and sampled field test data can be used for predictive maintenance, raises two specific questions:

1. Does the field test data from Volvo Penta preceding selected DTCs contain any kind of pattern that a deep neural network model can be trained on in order to predict engine failures?

2. Which alternative neural networks and configurations, if any, can achieve the best results in predicting the need for maintenance, as defined by the highest Receiver Operating Characteristic area under curve (ROC AUC) value?

1.4 Delimitations

Volvo Penta collects hundreds of different diagnostic trouble codes for every field test along with corresponding signal data. To combat the limited frequency of the available DTCs, and consequently the limited amount of data available for training neural networks, three DTCs were selected for predictive maintenance on the basis of their high frequency count. These DTCs relate to the following engine faults: a clogged air filter, urea-induced crystallization in the engine aftertreatment process, and high diesel particulate filter (DPF) soot levels.

To answer the second research question an arbitrary number of neural networks of different types and configurations were chosen for comparison. These include traditional feedforward and recurrent neural networks (RNNs). Available thesis time was the limiting factor for the number of neural network configurations compared. While alternative approaches to predictive maintenance exist, such as statistical and model-based ones [7, p. 2], this study is limited to the machine learning approach due to time constraints. Comparing the thesis results to other machine learning models would have given an interesting insight into their comparative performance. Ultimately, there was no time left to include any other types of algorithms. Neural networks were selected for the study due to the preconception that they were the best choice of machine learning model for the task and the provided data.


2 Theory

The sections below provide the prior knowledge and theoretical background used in the following chapters. These include a brief explanation of the neural networks to be used in the study along with related theory and the evaluation metrics used in Chapter 4 that provide the means of quantitative comparison between the trained neural network models. Information is also provided to explain how the engine works and what components are involved in the areas that pertain to the selected DTCs, to give an understanding of what conditions trigger the diagnostic errors.

2.1 Machine Learning

Machine learning is an application of artificial intelligence in which the program learns from experience by repeatedly performing some task and evaluating its performance, with the aim to iteratively increase its performance measure by repeating the task. The machine learning algorithm is tasked with approximating some function $\hat{f}$ that maps $x$ to an output $y$, such that $y \approx \hat{f}(x)$. For the specific machine learning task of classification, $f : \mathbb{R}^n \to \{1, \dots, k\}$, where $f$ maps inputs of dimension $\mathbb{R}^n$ to one of the defined classes in $\{1, \dots, k\}$ and $k$ is the number of defined classes [8, p. 98-100]. The inputs in the context of machine learning algorithms are commonly referred to as features. After the function $f$ has been approximated with a training dataset during the training process, it can be used to predict what class arbitrary input features correspond to.

There are numerous types of machine learning, one of them being supervised machine learning. In contrast to alternative types such as unsupervised machine learning and reinforcement learning, supervised learning algorithms are exposed during the training process to the correct answer of every prediction, namely the input’s corresponding target or label. The input $x$ and the label output $y$ make up a tuple, $(x, y)$. The machine learning algorithm uses the input, the label, and its own prediction to learn from example. The prediction and the actual value are passed into an objective function or loss function that quantifies their difference, which the algorithm aims to iteratively minimize over the training samples by some optimization algorithm.


Dataset splitting

To train a machine learning model, a dataset for the algorithm to learn from is required. In the case of supervised machine learning, a sample consists of an input vector and an output vector, or label, mapped to the input. The machine learning algorithm then uses these samples to learn from examples.

So as not to evaluate the model on the same data it was trained with, the dataset is split into either two or three different subsets. By separating the training dataset from the other subsets, the trained model can be evaluated with unbiased samples that were not part of the training process. A trained model is evaluated using a test dataset after the training. Another type of evaluation subset is the validation dataset, which is used to evaluate the performance of the model during the training, by some selected evaluation metric.

The data subsets are usually selected at random from the original dataset’s samples. Preferably, however, the test dataset is completely separate from the original dataset used for training and validation, rather than being selected from it at random.
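As an illustration of the random three-way split described above, a minimal Python sketch follows. The function name `split_dataset`, the split fractions, and the toy samples are assumptions made for the example and are not taken from the thesis implementation.

```python
import random

def split_dataset(samples, val_fraction=0.2, test_fraction=0.1, seed=0):
    """Shuffle the samples and split them into training, validation and test subsets."""
    if not 0.0 <= val_fraction + test_fraction < 1.0:
        raise ValueError("validation and test fractions must sum to less than 1")
    shuffled = list(samples)               # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical toy samples: (input vector, label) tuples.
samples = [([float(i), float(i) / 2.0], i % 2) for i in range(100)]
train, val, test = split_dataset(samples)
```

When a completely separate test dataset is available, as the text prefers, `test_fraction` could be set to 0 and the separate dataset used in place of the random test slice.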

2.2 Feedforward Neural Networks

The idea of feedforward neural networks (FNNs), also known as fully connected neural networks or simply neural networks, is based on a basic machine learning algorithm called the perceptron. The perceptron algorithm takes a real-valued input vector $x$ and calculates the output $\hat{y}$ using variable weights $w$ and bias $b$, stored in the perceptron model, by using the dot product $w^\top \cdot x$ [14].

$$\hat{y} = f(x) = \begin{cases} 0 & \text{if } x \cdot w + b \le 0 \\ 1 & \text{if } x \cdot w + b > 0 \end{cases} \tag{2.1}$$

The mathematical model shown in Equation 2.1 can be interpreted as: for any input $x$, the model’s weights $w$ adjust the significance of every input element $x_i$ to determine whether the input belongs to one of two binary outputs. The bias increases or decreases the likelihood of the perceptron firing, meaning it outputs a value of 1, regardless of weight distribution. The perceptron learning process will adjust the bias and weights, and therefore the importance of every input value, to reduce the difference between predicted outputs and actual outputs in the training dataset. The function transforming the expression $w^\top \cdot x + b$ to either 0 or 1 depending on the expression’s value is commonly referred to as an activation function, expressed as $H$ in Figure 2.1.
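The perceptron of Equation 2.1 can be sketched in a few lines of Python. The prediction function follows the equation directly. The training loop uses the classic perceptron learning rule, which is an assumption here since the text does not spell out the exact update, and the logical AND samples are a toy dataset chosen for illustration.

```python
def predict(x, w, b):
    """Equation 2.1: output 1 if x . w + b > 0, otherwise 0."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=50, lr=0.1):
    """Classic perceptron learning rule (assumed; not taken from the thesis)."""
    n_inputs = len(samples[0][0])
    w, b = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(x, w, b)  # -1, 0 or +1
            # Move the weights and bias towards the correct output.
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Logical AND is linearly separable, so a single perceptron can learn it.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_samples)
```

After training, `predict(x, w, b)` should reproduce the AND labels for all four inputs, since the perceptron rule converges on linearly separable data.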


The Multilayer Perceptron algorithm expands on the idea of perceptrons and is synonymous with deep feedforward neural networks. Instead of having only an input layer and a single output node, the input layer is mapped to numerous different nodes. While each of these nodes individually functions as in the perceptron algorithm described above, together they form a hidden layer in the case of multilayer perceptron. If there is just one hidden layer in the neural network the hidden layer nodes are mapped to the output layer node(s), and if there are more layers the hidden layer nodes are mapped to the subsequent hidden layers’ nodes. Intuitively, hidden layer nodes can weigh outputs from previous layers differently and in doing so can find any correlations between them, and these correlations are in turn weighed into the next layer. The type of hidden layer that calculates its output using linear operations as in Equation 2.1 is usually called a linear or dense layer.

Figure 2.2: A multilayer perceptron with one hidden layer

Given the explanations so far, it follows that the final approximated function $f$ for the trained model is made up of a network of individual functions, where every layer represents a function. The depth of the network can be increased with more hidden layers, which in the case of classification allows the network to learn more complex boundaries between the classes that combinations of feature inputs belong to. This depth aspect of feedforward networks leads to the use of the deep learning terminology when referring to machine learning algorithms involving multilayer perceptrons [8, p. 168]. The networks are considered "neural" because a node in the neural network arguably operates similarly to a biological neuron, which may fire depending on the input from other layers of connected neurons.
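To illustrate why a hidden layer allows more complex class boundaries, the following Python sketch builds a two-layer network of perceptron-style nodes that computes XOR, a function no single perceptron can represent. The weights and biases are hand-picked for the example rather than trained, and the helper names `step`, `dense` and `xor_network` are hypothetical.

```python
def step(z):
    """Step activation H from Equation 2.1."""
    return 1 if z > 0 else 0

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every node computes activation(w . x + b)."""
    return [activation(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

def xor_network(x1, x2):
    # Hidden layer (hand-picked weights): node 1 behaves as OR, node 2 as AND.
    hidden = dense([x1, x2],
                   weights=[[1.0, 1.0], [1.0, 1.0]],
                   biases=[-0.5, -1.5],
                   activation=step)
    # Output node fires when OR is true but AND is not, which is XOR.
    return dense(hidden, weights=[[1.0, -1.0]], biases=[-0.5], activation=step)[0]

outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The output node combines the two hidden node outputs, which is the sense in which correlations found in one layer are weighed into the next.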

Activation functions

As seen in Equation 2.1, a fully connected node’s calculated value $z_i$, where $z_i$ is assumed to be one among many nodes in the network such that $\hat{y} = z_i$, has a linear relationship to the previous layer, $z_i = w^\top \cdot x + b$. Recall from the previous section that a hidden layer can be thought of as an inner function in the deep network of functions. For the network to learn more complex non-linear functions using one or more hidden layers, the layers must not be purely linear functions of each other, since a composition of linear functions is itself linear. To break this linearity, every hidden node value is input to a non-linear function called the activation function. The activation function is applied to every node in a hidden layer before the value is forwarded to the next layer. For feedforward networks a recommended activation function for hidden layers is the rectified linear unit (ReLU) [8, p. 170]:

$$g_i(z_i) = \begin{cases} 0 & \text{if } z_i \le 0 \\ z_i & \text{if } z_i > 0 \end{cases} \tag{2.2}$$

This activation is computationally efficient as it is in practice just a max() function, returning 0 for negative values of $z_i$ or $z_i$ itself if it is positive-valued. ReLU is piecewise linear with two linear parts but non-linear as a whole, which fulfills the previously mentioned requirement of non-linearity for the activation function.

Different activation functions have different properties in addition to providing non-linearity. For the network to return the probability of either of two outcomes, the output node needs to output a probability value. The sigmoid activation function takes any $z \in \mathbb{R}$ and outputs a value in the $[0, 1]$ range [8, p. 67], according to:

$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.3}$$

Yet another activation function is the hyperbolic tangent (tanh) activation function. It is a common activation function that can be derived from the sigmoid function. It can be thought of as a rescaled and shifted sigmoid function with outputs in the range $[-1, 1]$ [8, p. 195].

$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1 \tag{2.4}$$

The tanh activation function is a common choice in some of the gate layers of Long Short-Term Memory neural network layers as described in Section 2.3.
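The three activation functions in Equations 2.2 to 2.4 can be written directly in Python. The sketch below is a plain transcription for moderate input values (very large magnitudes would overflow `math.exp`), and the loop numerically checks the identity $\tanh(z) = 2\sigma(2z) - 1$ from Equation 2.4 for a few arbitrarily chosen inputs.

```python
import math

def relu(z):
    """Equation 2.2: 0 for non-positive z, z itself otherwise."""
    return max(0.0, z)

def sigmoid(z):
    """Equation 2.3: squashes any real z into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Equation 2.4, written out with exponentials."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

# Check that tanh is a rescaled and shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0, abs_tol=1e-12)
```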

The Cross-Entropy Loss Function

Shannon entropy, or simply entropy in information theory, is a measure of how many bits are required to transmit a message over a binary communication channel [15, p. 11]. More intuitively, it can be thought of as the smallest number of yes-or-no questions required to guess all outcomes, given the outcomes’ probabilities. Given a random variable $X$ over some alphabet $\mathcal{X}$, where the outcome $X = x$ occurs with probability $f(x)$, Equation 2.5 defines the entropy $H$ [15, p. 11].

$$H(X) = -\sum_{x \in \mathcal{X}} f(x) \cdot \log_2(f(x)) \tag{2.5}$$

A trivial example would be picking one ball from a sack containing equally many balls of two different colors. The entropy is calculated to answer how many questions are required to find out the color of the ball:

$$-\left(\tfrac{1}{2} \cdot \log_2\left(\tfrac{1}{2}\right) + \tfrac{1}{2} \cdot \log_2\left(\tfrac{1}{2}\right)\right) = -\log_2\left(\tfrac{1}{2}\right) = 1 \tag{2.6}$$

While this problem was selected for its integer answer for intuition’s sake, entropy can assume any value $H(X) \ge 0$.
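Equation 2.5 translates into a short Python function, sketched below. Skipping zero-probability outcomes is an assumption following the usual convention that $0 \cdot \log_2 0 = 0$. The two-color example of Equation 2.6 and a hypothetical four-color variant are used as inputs.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits (Equation 2.5); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

two_colors = entropy([0.5, 0.5])                  # the ball example of Equation 2.6
four_colors = entropy([0.25, 0.25, 0.25, 0.25])   # hypothetical variant with four colors
skewed = entropy([0.9, 0.1])                      # an uneven sack is easier to guess
```

An uneven distribution yields a lower entropy than an even one over the same number of outcomes, which matches the intuition that fewer questions are needed on average.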

Cross-entropy is another type of entropy and is calculated similarly to Shannon entropy [15, p. 13]. While popular in the context of machine learning classification, the cross-entropy method can in general be used to solve a great number of estimation and optimization problems [15, p. 29]. Cross-entropy $D$ accepts two probability distributions and returns what can be thought of as the sum of distances between all elements in the probability distributions [15, p. 13]:

$$D(p, q) = -\sum_{x \in \mathcal{X}} p(x) \cdot \log(q(x)) \tag{2.7}$$

The "log" is defined as the natural logarithm rather than the binary or base-10 logarithm. For machine learning algorithms $p(x)$ is usually defined as the probability of a "true label" and $q(x)$ the probability of a "predicted label". For example, if the actual chance of a binary outcome is $p = 90\,\%$ but a classifier estimates the chance of that outcome to be $q = 55\,\%$, the cross-entropy is calculated as:

$$-0.9 \cdot \log(0.55) - (1 - 0.9) \cdot \log(1 - 0.55) \approx 0.62 \tag{2.8}$$

A larger difference between the probabilities increases the cross-entropy. The binary cross-entropy function is in the case of machine learning optimization often used as a loss function. In the specific case of two possible outcomes for cross-entropy, the binary cross-entropy loss function for a pair of probabilities is defined as $L$ in Equation 2.9.

$$L(p, q) = -p \cdot \log(q) - (1 - p) \cdot \log(1 - q) \tag{2.9}$$

Here it is assumed that $p$ and $q$ are simple probabilities rather than distributions.
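The binary cross-entropy of Equation 2.9 can be checked numerically against the worked example in Equation 2.8. The Python sketch below assumes that $q$ lies strictly between 0 and 1, since the logarithm is undefined at the endpoints. The function name is chosen for the example.

```python
import math

def binary_cross_entropy(p, q):
    """Equation 2.9 with the natural logarithm; q must lie strictly between 0 and 1."""
    return -p * math.log(q) - (1.0 - p) * math.log(1.0 - q)

example = binary_cross_entropy(0.9, 0.55)        # the worked example of Equation 2.8
perfect_estimate = binary_cross_entropy(0.9, 0.9)
poor_estimate = binary_cross_entropy(0.9, 0.1)
```

Rounded to two decimals, `example` gives the 0.62 stated in Equation 2.8, and the loss grows as the estimate $q$ moves away from $p$.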

During the training process of machine learning, the selected optimization algorithm will attempt to optimize the machine learning model parameters in the process of minimizing the loss function.

Gradient Descent

An optimizer is selected to adjust the network weights and biases in order to minimize the loss function values calculated during the training process. One of the most popular black-box optimization algorithms, and by far the most common for neural networks, is gradient descent [16]. There are different kinds of gradient descent algorithms, but their common goal is to update all the network’s weights and biases with respect to how they affect the cost function, by a process called backpropagation. Specifically, backpropagation is the part of gradient descent where the gradient is calculated. There are numerous methods of generating initial weights for the networks, a common one being the Glorot Uniform Initializer [18], which returns a random value from a uniform distribution within a range that depends on the number of input and output nodes in a layer. The formula for updating a network’s weights can be summarized as:

$$W' = W - \alpha \nabla L \tag{2.10}$$

where

$W'$: network weights after an update

$W$: network weights before an update

$\alpha$: an arbitrary step size or learning rate

$\nabla L$: gradient of the loss function with respect to network parameters

While Equation 2.10 refers to the update of all network weights, consider the update of one weight $w_i$ connected to the output layer, and the equation becomes:

$$w_i' = w_i - \alpha \frac{\partial L}{\partial w_i} \tag{2.11}$$

The partial derivative $\frac{\partial L}{\partial w_i}$ is a measure of how the loss function $L$ changes when the weight $w_i$ changes. By iteratively subtracting from $w_i$ the partial derivative of $L$ with respect to $w_i$, the network training loss should decrease. The partial derivative is multiplied by a learning rate to amplify or reduce the magnitude of the subtraction. Intuitively, this magnitude represents the length of a "movement step" that is made every iteration towards the loss function minimum. To calculate this partial derivative the chain rule can be applied to transform the expression into several known factors, as Equation 2.12 shows.

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial g} \cdot \frac{\partial g}{\partial z} \cdot \frac{\partial z}{\partial w_i} \tag{2.12}$$

where

$\frac{\partial L}{\partial g}$: derivative of the loss function $L$ with regard to the activation function $g$

$\frac{\partial g}{\partial z}$: derivative of the activation function $g$ with regard to the total net input $z$

$\frac{\partial z}{\partial w_i}$: derivative of the net input $z$ with regard to network weight $w_i$

The derivative of the binary cross-entropy loss function in Equation 2.9 becomes:

$$\frac{\partial L}{\partial q} = \frac{\partial}{\partial q}\left(-p \cdot \log(q) - (1 - p) \cdot \log(1 - q)\right) = \frac{1 - p}{1 - q} - \frac{p}{q} \tag{2.13}$$

where $p$ is treated as a constant as it does not change across training iterations, and $g = q$ as the activation function output $g$ is used as the input $q$ for the loss function.

Assuming tanh is used as the activation function $g$ for the network layer, then $\frac{\partial g}{\partial z}$ becomes:

$$\frac{\partial g}{\partial z} = \frac{\partial}{\partial z}\left(\tanh(z)\right) = 1 - \tanh^2(z) \tag{2.14}$$

Note that for backpropagation to work, Equation 2.14 requires the activation function to be differentiable. Also note that the derivative of the ReLU activation function in Equation 2.2 is not defined for $z_i = 0$. This special case is usually solved by letting the ReLU derivative return 0 when its argument is 0.

Finally, the net input derivative with regard to the weight to be updated can be calculated as:

$$\frac{\partial z}{\partial w_i} = \frac{\partial}{\partial w_i}\left(w_1 \cdot q_1 + \dots + w_i \cdot q_i + b\right) = q_i \tag{2.15}$$

With the partial derivative $\frac{\partial L}{\partial w_i}$ calculated according to Equation 2.12, all that remains to update the network weight $w_i$ is to multiply the partial derivative by the step size $\alpha$ and subtract the result from the prior weight value $w_i$. An illustration of how a weight $w_1$ is updated by backpropagation can be seen in Figure 2.3.

Figure 2.3: A backpropagation illustration.

This is the basic idea of backpropagation and how a neural network updates its weights to improve its performance. Goodfellow et al.’s book Deep Learning [8] describes how hidden layer weights and biases are updated, how backpropagation is generalized into vector operations, and more.
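The chain rule of Equation 2.12 and the update of Equation 2.11 can be traced for a single output node in Python. Note one deliberate swap in this sketch: the sigmoid replaces tanh as the output activation, so that the output $q$ stays between 0 and 1 as the binary cross-entropy requires. Its derivative $q(1 - q)$ takes the place of Equation 2.14. The input values, initial weights and step size are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, inputs):
    """Net input z and activation output q of one output node."""
    z = sum(wi * xi for wi, xi in zip(w, inputs)) + b
    return z, sigmoid(z)

def loss(p, q):
    """Binary cross-entropy, Equation 2.9."""
    return -p * math.log(q) - (1.0 - p) * math.log(1.0 - q)

def gradient(w, b, inputs, p):
    """dL/dw_i = dL/dq * dq/dz * dz/dw_i, following Equation 2.12."""
    _, q = forward(w, b, inputs)
    dL_dq = (1.0 - p) / (1.0 - q) - p / q         # Equation 2.13
    dq_dz = q * (1.0 - q)                         # sigmoid derivative (swapped in for tanh)
    return [dL_dq * dq_dz * xi for xi in inputs]  # Equation 2.15: dz/dw_i is the i-th input

# Hypothetical previous-layer outputs, true label, weights, bias and step size.
inputs, p = [0.5, -1.0, 2.0], 1.0
w, b, alpha = [0.1, 0.2, -0.3], 0.0, 0.5

loss_before = loss(p, forward(w, b, inputs)[1])
grad = gradient(w, b, inputs, p)
w_new = [wi - alpha * gi for wi, gi in zip(w, grad)]  # Equation 2.11 for every weight
loss_after = loss(p, forward(w_new, b, inputs)[1])
```

With these values a single update step lowers the loss, and a finite-difference estimate of the gradient can be used to confirm the analytic derivative.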

Deeper feedforward networks are susceptible to a problem called exploding and vanishing gradients. During backpropagation, gradients near the input layer are calculated by multiplication of other gradients between that layer and the output layer. As such, gradients used to update weights and biases can grow (explode) or shrink (vanish) rapidly for layers near the input layer if the network is deep [8, p. 290].
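The effect of repeated multiplication can be made concrete with a deliberately simplified Python sketch in which every layer contributes the same factor to the chained gradient. This is an assumption for illustration only, as real networks multiply layer-specific derivatives and weights. The factor 0.25 is the largest value the sigmoid derivative can take, while 1.5 is a hypothetical factor above 1.

```python
def chained_gradient(per_layer_factor, depth):
    """Product of `depth` identical per-layer factors, as the chain rule would multiply them."""
    gradient = 1.0
    for _ in range(depth):
        gradient *= per_layer_factor
    return gradient

depths = (1, 5, 20)
vanishing = [chained_gradient(0.25, depth) for depth in depths]  # shrinks towards zero
exploding = [chained_gradient(1.5, depth) for depth in depths]   # grows without bound
```

At a depth of 20 the first product is below $10^{-10}$ while the second exceeds 3000, which is why weight updates near the input layer can become either negligible or unstable.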


Adam Optimization

The Adam algorithm is another take on the optimization of stochastic objective functions such as the loss function. Adam is an evolution of several other gradient descent optimization algorithms such as AdaGrad and RMSProp [11]. Instead of just using a single learning rate across the entire training process for all weight updates as with the original gradient descent algorithms, each parameter maintains exponentially decaying averages of the previous mean $m_t$ and uncentered variance $v_t$ of the gradient [11]. Intuitively, this means that the gradient used to calculate the weight in any previous iteration is factored into the next weight calculation of that parameter, and that this momentum is decreased by every training iteration. The update to some weight $w_t$ for an iteration $t$ is made as follows [11]:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \frac{\partial L}{\partial w_t}$$

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \left(\frac{\partial L}{\partial w_t}\right)^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_t = w_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Here $\beta_1$ and $\beta_2$ are exponential decay rates, and $\beta_1^t$ and $\beta_2^t$ denote these rates raised to the power $t$ as in [11]. $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected moment estimates, as the mean and variance are biased towards zero at the start of the training process. The variable $\epsilon$ is a small constant to avoid division by zero and $\alpha$ is the selected learning rate.

The use of uncentered variance of the gradient is a method derived from AdaGrad and RMSProp [8, p. 307]. The authors of the Adam algorithm empirically show that Adam makes fast progress in terms of approaching convergence in the least amount of iterations and in wall-clock time [11]. The publication also shows that Adam consistently outperforms alternative gradient descent algorithms in their tests.
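A single-parameter version of the Adam update can be sketched in Python as follows. The default values for `alpha`, `beta1`, `beta2` and `eps` are those suggested in the Adam publication [11]. The bias correction divides by $1 - \beta^t$ as in that publication, and the quadratic toy objective with its larger learning rate is an assumption made for the example.

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a single weight; t is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * grad       # decaying mean of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # decaying uncentered variance
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective (assumed for illustration): L(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.05)
```

Because of the bias correction, the very first step has a magnitude close to `alpha` regardless of the gradient's scale, and the weight then settles near the minimum at 3.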

2.3 Recurrent Neural Networks

A traditional feedforward neural network can learn any non-linear function using snapshots of input data points, which the universal approximation theorem defines formally [8, p. 198]. However, there are cases where the previous network input(s) can help the current input make a prediction, such as predicting the next word in a sentence. Recurrent Neural Networks (RNNs) excel in this task by using the output from one iteration as part of the input for the next iteration. By looping the RNN layer output to itself the network can share information across time, a concept known as parameter sharing. Another method of parameter sharing is to flatten the sequence of inputs into one single feedforward input by an operation called convolution, but this is considered shallow compared to the RNNs’ method of parameter sharing [8, p. 374]. Convolutional neural networks are yet another type of neural network that excels in object and image classification, but they are not covered in this paper.

The behavior of RNNs is illustrated in Figure 2.4. RNN layers are often conceptualized with the loop unrolled as a series of feedforward layers, representing how one layer output is shared with the next time step. An RNN layer accepting a sequence input with five elements would in its unrolled representation consist of five fully connected layers. In the RNN-specific backpropagation algorithm called backpropagation through time (BPTT), training the network becomes increasingly complex and prone to exploding gradients with longer sequences, which correspond to a deeper unrolled network of feedforward layers.


Figure 2.4: An unfolded RNN network

LSTM Networks

Alternative recurrent neural network types with more complex designs exist that counteract some of the disadvantages of simple recurrent neural networks, such as the exploding and vanishing gradient issues. One of these is the Long Short-Term Memory (LSTM) recurrent neural network, which was originally presented by Hochreiter et al. in a 1995 publication [10]. In LSTM network modules the cell state is modified over several different steps per iteration with the use of so-called gates or gate layers. Some of the gates simply feed information forward with an activation function, as described for feedforward layers in Section 2.2. This is in contrast to simple RNNs, which calculate the output directly using the cell state and also modify the cell state in the process.

One of the gates is called the forget gate, which uses a sigmoid activation function to modify the significance of module inputs incoming from the previous iteration. The forget gate also mitigates the effects of exploding and vanishing gradients. Another gate called the input gate calculates the input that should be added to the forget gate output, resulting in the new cell state to be passed into the next iteration of the LSTM module. Finally, after the cell state is modified, the cell output is calculated in the output gate using the original module input, the new cell state resulting from the forget and input gates, and some activation functions. This output and the cell state are passed into the next iteration of the layer. The output is also forwarded to the next network layer. There are numerous variations of the LSTM architecture but this is the basic idea of the recurrent network subtype [9].
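To make the order of the gate operations concrete, the following is a minimal sketch of one iteration of a single-unit LSTM cell. It assumes the commonly described LSTM formulation with forget, input, and output gates plus a tanh candidate value; the scalar weights in `params` are hypothetical and the sketch is not the implementation used for the networks in this thesis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One iteration of a single-unit LSTM cell.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))          # how much old state to keep
    write = sigmoid(pre_activation("input"))            # how much new content to add
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content
    output = sigmoid(pre_activation("output"))          # how much state to expose

    c_new = forget * c_prev + write * candidate  # updated cell state
    h_new = output * math.tanh(c_new)            # cell output
    return h_new, c_new

# Hypothetical weights, identical for every gate to keep the example small.
gate_names = ("forget", "input", "candidate", "output")
params = {name: (0.5, 0.5, 0.0) for name in gate_names}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    # Both the output and the cell state are passed into the next iteration.
    h, c = lstm_step(x, h, c, params)
```

Because the cell state is updated additively (`forget * c_prev + write * candidate`) rather than being recomputed from scratch, the forget gate can let earlier information pass through many iterations largely unchanged.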

2.4 Hyperparameter Tuning

Hyperparameters are defined as the configurations set to control the learning algorithm behavior before the machine learning training is started [8, p. 120]. In the case of neural networks, hyperparameters include but are not limited to the number of hidden layers, the number of nodes per layer, the number of samples considered and averaged for each update to network parameters (commonly known as batch size), the learning rate, the number of training iterations, the activation functions, connections, and more.

Unlike the dynamic network parameters updated in every training iteration, hyperparameters are completely static during the entire training process and have to be intuitively selected before training the neural network, with regard to the network input. Some of the trade-offs to be considered and balanced when selecting hyperparameters include underfitting, overfitting, training time, and computer resources. An underfitted network model is considered too simple or not trained enough to make accurate predictions. This could be corrected by a greater network depth, breadth, or more training iterations. An overfitted network model is considered to be trained too much on the training data, resulting in an inability to generalize its predictions to unknown data not included in the training dataset.


While selecting hyperparameters by trial-and-error could yield satisfactory results and is a common practice when it comes to neural networks, there are structured means of selecting them as explained in the sections below.

Grid Search

Grid search is a structured method of selecting combinations of hyperparameters. For every hyperparameter, a small number of values is selected based on prior experience such that the optimal hyperparameter value is likely to be within this range [8, p. 432]. Elements in the range are usually selected on a logarithmic scale. An example grid search value set for the learning rate hyperparameter is $\{0.1, 10^{-2}, 10^{-3}, 10^{-4}\}$. For every hyperparameter combination, the network is evaluated by some performance metric, for example loss. Usually, one hyperparameter is adjusted between tests. If the lowest loss is achieved with the maximum or minimum value in a range, the range might need to be shifted to achieve even better performance. Similarly, if the best performance is found in some inner part of a range, the range can be shrunk to hyperparameter values closer to the best performing value in the previous range.

One flaw of this method is that the number of tests grows exponentially with the number of hyperparameters, at least if all combinations are tested. If only one hyperparameter is changed between tests, and the best value in that range is used while evaluating the next range of hyperparameters, the computation cost grows linearly with the number of considered hyperparameters. However, because the different hyperparameters might depend on each other, fixing one hyperparameter early could rule out a better combination of hyperparameters later in the grid search. Because of this, evaluating all combinations can result in a better-performing network.
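The difference in cost between the two strategies can be illustrated with a short sketch. The grid values, the function names, and the `toy_loss` function are hypothetical; `toy_loss` only stands in for training and evaluating a network.

```python
from itertools import product

grid = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "hidden_layers": [1, 2, 3],
}

def full_grid(grid):
    """Every combination of values: cost is the product of the range sizes."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]

def one_at_a_time(grid, evaluate):
    """Tune one hyperparameter at a time, keeping the best value found so far.

    Cost is the sum of the range sizes, but interactions between
    hyperparameters can be missed.
    """
    best = {name: values[0] for name, values in grid.items()}
    tests = 0
    for name, values in grid.items():
        scores = {}
        for value in values:
            scores[value] = evaluate({**best, name: value})
            tests += 1
        best[name] = min(scores, key=scores.get)  # lowest loss wins
    return best, tests

def toy_loss(config):
    # Hypothetical stand-in for training and validating a network.
    return abs(config["learning_rate"] - 1e-3) + abs(config["hidden_layers"] - 2)

all_configs = full_grid(grid)
best_full = min(all_configs, key=toy_loss)
best_seq, n_tests = one_at_a_time(grid, toy_loss)
print(len(all_configs), n_tests)  # 36 tests versus 10
```

With ranges of 4, 3, and 3 values the full grid needs 4 · 3 · 3 = 36 tests, while the one-at-a-time strategy needs 4 + 3 + 3 = 10. In this toy example both find the same optimum because the hyperparameters do not interact, which is not guaranteed for real networks.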

Random Search

Random search works similarly to grid search except that only a minimum and maximum value are set for the possible hyperparameters, again based on prior experience. The randomly selected hyperparameter values are drawn from a categorical distribution for discrete hyperparameters and from a logarithmic distribution for real-valued parameters [8, p. 434]. After every test, the performance metric and the used hyperparameters are saved. It can be shown that random search can find as good or better-performing networks in a fraction of the time it takes for grid search [5].
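A minimal sketch of this sampling scheme is given below. The hyperparameter names, the ranges, and `toy_loss` are assumptions made for the example; `toy_loss` is a hypothetical stand-in for training a network and measuring its loss.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def random_search(evaluate, n_trials, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        config = {
            "learning_rate": sample_log_uniform(rng, 1e-4, 1e-1),  # real-valued
            "hidden_layers": rng.choice([1, 2, 3]),                # discrete
        }
        # Save the performance metric together with the settings used.
        history.append((evaluate(config), config))
    best = min(history, key=lambda item: item[0])
    return best, history

def toy_loss(config):
    # Hypothetical stand-in for training a network and measuring its loss.
    return (abs(math.log10(config["learning_rate"]) + 3)
            + abs(config["hidden_layers"] - 2))

(best_loss, best_config), history = random_search(toy_loss, n_trials=20)
```

Sampling the learning rate on a logarithmic scale spends as many trials between 10^-4 and 10^-3 as between 10^-2 and 10^-1, mirroring the logarithmic spacing used for grid search ranges.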

2.5 Regularization

Regularization is a common name for the many different strategies available to counteract overfitting and therefore reduce the test dataset error [8, p. 228]. One method of regularization is to add randomized noise to the samples in a training dataset. Another strategy known as dropout randomly ignores nodes in a network over the training session, with the ignored nodes changing every training iteration [8, p. 258]. Yet another regularization method is called early stopping, where some metric is monitored over the training session. If the network does not improve the monitored metric by some defined amount within a defined number of epochs, the training is stopped early.
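The early stopping rule can be sketched as follows. The parameter names `min_delta` and `patience`, their default values, and the example loss values are illustrative assumptions rather than the settings used in this thesis.

```python
def early_stopping_epoch(val_losses, min_delta=0.01, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the monitored loss has failed to improve on the best
    value seen so far by at least min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss >= min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the last epoch

# Hypothetical monitored losses, one per epoch.
losses = [0.90, 0.70, 0.60, 0.595, 0.60, 0.58, 0.57]
stop = early_stopping_epoch(losses)
print(stop)  # 4
```

In this example the loss stops improving by the required amount after epoch 2, so training ends at epoch 4 even though later epochs would have brought small further improvements. This is the trade-off controlled by the two parameters.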

Unless the test data and all future data are very similar to the training data, using some regularization for machine learning training is good practice in general.

2.6 Evaluation Metrics

There are several evaluation metrics that can be used to quantify the trained neural network models’ performances, each providing a different insight into a machine learning model’s abilities [8, p. 422]. A model scoring high in one metric might not score as well with another metric. Some optimization algorithms for neural networks can be set to optimize for a specific metric, which can be chosen with regard to what application the model will have. The ones used in this thesis are included in the subsections below. Each of their usefulness with regard to predictive maintenance is discussed in Chapter 5.

Accuracy

Accuracy is arguably the most simple and widely used measure of prediction performance. In the context of binary classification it is defined as the ratio of correctly classified predictions (true positives, true negatives) out of all predictions made, as shown in Equation 2.16.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2.16}$$

The error rate, or 0-1 loss, is sometimes used instead. The error rate is defined as the ratio of incorrect predictions out of all predictions.

The accuracy metric values true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) equally. Sometimes, however, a false positive can be more acceptable in practice than a false negative, which the accuracy value does not take into account. An example of this would be a car braking automatically if it detects objects or people in front of it. In such a case, tolerating more false positives in order to reduce false negatives could prevent more disasters than if all outcomes were valued equally.

For predicting a rare event, a naive binary classifier that never predicts the rare event will achieve almost perfect accuracy, even if this classifier is completely useless. For these kinds of tasks with imbalanced datasets, other performance measures provide better insight into the machine learning model’s performance [8, p. 423].

Precision, Recall, F-score

A predictor’s recall is the fraction of correctly classified "positive events", TP, out of all positive events, TP+FN. Precision is the fraction of events classified as positive that actually were positive events. F-score is the harmonic mean of precision and recall, which weights smaller values more and larger values less than the "regular" arithmetic mean.

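$$\text{Recall} = \frac{TP}{TP + FN} \tag{2.17}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2.18}$$

$$\text{F-score} = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{2.19}$$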

Reconsider the prediction of a rare event in the previous section. The naive classifier that never predicts the rare event would achieve a precision value of almost 1 with regard to predicting the common event, but a recall value of 0 with regard to predicting the rare event [8, p. 423]. For classification tasks, the output is usually a vector of probability values y representing the probability p(y|x) of an output belonging to some class for an input x. A probability threshold t, usually defined as 0.5, can be used in the case of binary classification as a cut-off point to decide which class the prediction belongs to. For example, it might be desirable to predict one of the classes if and only if it has a very high probability value. In these cases, t can be changed to trade recall for precision, as discussed in Section 5.1.
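Equations 2.17 to 2.19 and the effect of the threshold t can be illustrated with a small sketch. The labels and probabilities are made-up example values, and returning 0 when a denominator is zero is a convention assumed here, not one stated in the thesis.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN for binary labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall_f(y_true, y_prob, threshold=0.5):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    # Assumed convention: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Made-up example: three positive and five negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

default = precision_recall_f(y_true, y_prob, threshold=0.5)
strict = precision_recall_f(y_true, y_prob, threshold=0.8)
```

Raising the threshold from 0.5 to 0.8 removes the single false positive in this example, so precision rises from 2/3 to 1, while recall drops from 2/3 to 1/3 because only one of the three positive samples is still detected.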


Confusion Matrix

A confusion matrix is an evaluation plot of a classifier’s predictions. One axis represents the "true labels" y while the other axis represents the "predicted labels" ŷ. In the case of binary classification, the matrix shows all TP, FP, TN, and FN of the predictions, as Figure 2.5 shows. The sample counts in the confusion matrix are usually normalized to sum up to 1 over either rows, columns, or over the entire matrix. In this thesis, they are normalized to sum up to 1 over the rows. Confusion matrices can also be used to evaluate multiclass predictions. They are commonly used to visualize the distribution of predictions among all available classes.
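As a small illustration of the row normalization, the sketch below builds a confusion matrix with true labels along the rows and predicted labels along the columns. The row and column orientation and the example labels are assumptions made for the example.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Count predictions: rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, predicted in zip(y_true, y_pred):
        matrix[truth][predicted] += 1
    return matrix

def normalize_rows(matrix):
    """Scale every row to sum to 1 (rows without samples stay all zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Made-up example: four samples of class 0 and two samples of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

counts = confusion_matrix(y_true, y_pred)
rates = normalize_rows(counts)
print(counts)  # [[3, 1], [1, 1]]
print(rates)   # [[0.75, 0.25], [0.5, 0.5]]
```

With row normalization each row reads as "of all samples that truly belong to this class, which fraction was predicted as each class", so the diagonal of the binary case holds the true negative rate and the recall, independently of how imbalanced the classes are.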

Figure 2.5: A confusion matrix

ROC and PR curve, AUC

As mentioned earlier in this section, the prediction threshold can be adjusted with various trade-offs, increasing some performance metrics at the cost of others. One way of evaluating machine learning models’ performances is to plot two metric values against each other for different probability threshold values, as a means of visualizing these trade-offs. The Receiver Operating Characteristic (ROC) curve plots recall against the false-positive rate ($\text{FPR} = \frac{FP}{FP + TN}$). The Precision-Recall (PR) curve plots recall against precision. The difference between the ROC and PR plots is the x-axis, where the PR curve’s precision axis gives a minority class’s false positives more emphasis and the majority class less emphasis. The ROC curve is more often used for balanced datasets and can better show the rate of false positives in the predictions with its false positive rate x-axis. Example ROC and PR curves are provided in Figures 2.6 and 2.7. An ideal ROC or PR curve has a recall value of 1 for all probability thresholds. For the ROC curve plot, the dashed line represents a classifier making as many correct classifications as false classifications, which could be considered a classifier’s baseline performance.


Figure 2.6: Example ROC curves for four different models.

Figure 2.7: Example PR curves for four different models.

These kinds of graphs can be used to extract another performance metric: the area under curve (AUC). AUC can be interpreted as a metric of how robust the classifier is. A classifier has the maximum possible AUC value of 1 if the classifier has 100 % accuracy regardless of the probability threshold. A likelier scenario is that most of the probability values in a set of predictions are near 0 and 1. In such a case the predictions will not change as much with a varying probability threshold, which would provide a relatively high AUC value. Therefore, a network could be considered robust if the AUC remains high for a large number of probability thresholds, and if the selected threshold is insignificant for the network’s overall performance.

The AUC can be artificially increased by simply removing any "uncertain predictions" near the default probability threshold. This is further discussed in Section 5.1. The threshold decreases from 1 to 0 from left to right in the ROC plot in Figure 2.6, and increases from 0 to 1 from left to right in the PR plot in Figure 2.7.
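The construction of a ROC curve and its AUC can be sketched as below. This is an illustration under stated assumptions: the function names and example data are hypothetical, both classes must be present in the labels, and the area is approximated with the trapezoidal rule.

```python
def roc_points(y_true, y_prob):
    """Return (FPR, recall) pairs for thresholds swept from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over points ordered by their x-coordinate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up example with three positive and three negative samples.
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, y_prob)
auc = area_under_curve(points)
```

For this example the AUC is 8/9: of the nine possible pairs of one positive and one negative sample, eight are ranked correctly, and only the negative sample with probability 0.7 is ranked above the positive sample with probability 0.6.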

The Zero Rule Baseline Model

The Zero Rule (Zero-R) model is a simple baseline classification model [19]. It is commonly used to calculate an initial performance score to improve on with trained machine learning models. Consider a dataset with two classes. The Zero-R classifier will then always predict the same class regardless of what the input is, usually the class with the most samples in the dataset. If a dataset of 100 samples has 90 samples with class 0 and 10 samples with class 1, the classifier will always classify inputs as class 0. In this case, the Zero-R classifier will have an accuracy of 90 %. Even though there is no predictive power in Zero-R, it can still achieve decent accuracy depending on the dataset class distribution. As such, beating the Zero-R classifier can be considered a good baseline performance for machine learning classifiers.
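The 90/10 example above can be reproduced with a few lines of Python. The function names are chosen for this sketch only.

```python
from collections import Counter

def zero_r_fit(labels):
    """Return the most frequent class; Zero-R ignores all input features."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0

labels = [0] * 90 + [1] * 10          # 90 samples of class 0, 10 of class 1
majority = zero_r_fit(labels)         # class 0
predictions = [majority] * len(labels)

baseline_accuracy = accuracy(labels, predictions)
baseline_recall = recall(labels, predictions, positive=1)
print(baseline_accuracy, baseline_recall)  # 0.9 0.0
```

The baseline reaches 90 % accuracy while detecting none of the class 1 samples, which is why accuracy alone is not sufficient for judging classifiers on imbalanced datasets.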

2.7 Diesel Engine Domain Knowledge

This section will summarize the parts of the Volvo Penta diesel engine that relate to the diagnostic trouble codes selected for this thesis.

Turbocharger Air Filter

A turbocharger is the component of a combustion engine that works to increase the engine’s power output [13]. In comparison to a naturally aspirated engine that passively supplies air to the combustion engine with the use of atmospheric pressure, the turbo creates additional pressure to force more air into the combustion chamber. The additional air pressure allows for a proportionally larger amount of fuel per combustion, increasing the engine’s power output. To create this increased air pressure external air is taken from the engine’s surroundings, which can contain dirt and debris. Unless filtered, the polluted air can carry unwanted particles into the closed system and cause major damage to the engine.

Over time the air filter will become clogged and reduce its ability to filter and pass through clean air to the inlet valve. Sensors measuring the pressure before and after the air filter can help detect if it is time to change the filter.

Diesel Engine Aftertreatment System

The Volvo Penta aftertreatment system of every engine consists of several different components that work sequentially to reduce emissions and fulfill legislation set in different parts of the world¹. The current EU legislation for emission criteria in diesel engines as of this thesis is referred to as Stage V, which Volvo Penta’s diesel engine aftertreatment is made to comply with. Stage V introduced a number of stricter requirements for emissions compared to the previous Stage IV regulations [6].

The first part of the aftertreatment system is the Diesel Oxidation Catalyst (DOC) which turns some of the exhaust gases, hydrocarbon (HC), nitrogen monoxide (NO), and carbon monoxide (CO), into water, carbon dioxide (CO2), and nitrogen dioxide (NO2).

The next part of the aftertreatment system is the Diesel Particulate Filter (DPF). The DPF collects and reduces the number of soot particles by an oxidation process called regeneration, using the nitrogen dioxide from the DOC, with carbon dioxide as a by-product [4]. Regeneration is therefore a continuous process in the DPF. However, some soot is still collected by the DPF and over time the soot levels can reach problematic levels. Sensors measuring pressure near the DPF can detect abnormal soot levels. Active regeneration, by heating the DPF to above a certain level, can lower the DPF soot levels.

The rest of the aftertreatment system works to reduce nitrogen monoxide and nitrogen dioxide from the emissions. Urea (CH4N2O) is injected into the system before the Selective Catalytic Reduction (SCR) catalyst. The SCR catalyst provides the conditions required for the nitrogen oxides to turn into water and nitrogen gas (N2). The use of urea also has some less beneficial properties, as it can cause crystals to form in the aftertreatment system under sub-optimal conditions [17]. The crystals can cause issues such as abnormal sensor values and require regular maintenance. The abnormal sensor values and a reduced nitrogen oxide conversion rate can help detect the presence of urea crystals.

Finally, at the end of the aftertreatment system, the Ammonia Slip Catalyst turns any leftover urea into water and nitrogen gas.

2.8 Related Work

Predictive maintenance is an emerging method of maintenance for industrial applications and is a hot topic of research alongside the advancement of machine learning algorithms. Volvo Penta has previously initiated a project to ascertain whether predictive maintenance is applicable for their needs². This project, however, used regression analysis to construct artificial ideal signals and compare them to measured signals to find any deviations, which would hint at an upcoming error. This is in contrast to classifying preceding periods of signals, a completely different approach. In the project’s conclusion, it is said that predictive maintenance for Volvo Penta field test data can be achieved, but no practical examples of predictive maintenance are provided. One of the diagnostic error codes analyzed in this project is urea crystallization, which is one of the DTCs this thesis aims to predict.

While the present thesis studies the probability of requiring predictive maintenance by classification, there is another approach to predictive maintenance called "time to failure" by regression analysis. Korvesis et al. try in [12] to predict equipment failure in an aviation setting. By analyzing previously recorded events, the machine learning model output is set to the time remaining to a failure instead of a probability of whether a fault is upcoming or not. This is an interesting approach to predictive maintenance that is further discussed in Section 5.2.

¹ A Volvo Penta video describing their aftertreatment system: https://www.youtube.com/watch?v=6Nkw9U3F-0c (viewed 2021-05-14)

Yet another machine learning thesis study has been conducted at Volvo Penta by Alexandersson et al. [1]. In this study, regression neural networks are used to virtually generate some of the Penta engines’ sensor signals, as part of Volvo Penta’s initiative to virtualize some of the engine testing that is made in-house.

A highly related study by Aydin et al. [2] attempts to predict engine failure with neural networks. Engine sensor measurements are input to LSTM networks and the output is a value in the range [0, 1], where 0 indicates complete engine failure and 1 indicates a completely healthy engine. For the training set the output is calculated as

$$\frac{\text{time\_to\_failure} - \text{current\_age}}{\text{time\_to\_failure}}$$

This approach is in contrast to predicting several different kinds of engine faults using different sets of sensor signals.
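As a worked example of this target value, the sketch below evaluates the expression for a hypothetical engine. It assumes that time_to_failure denotes the total running time until failure and current_age the running time so far, both in the same unit; the function name, the range check, and the numbers are illustrative and not taken from [2].

```python
def health_target(time_to_failure, current_age):
    """Training target in [0, 1]: 1 is fully healthy, 0 is complete failure."""
    if time_to_failure <= 0:
        raise ValueError("time_to_failure must be positive")
    if not 0 <= current_age <= time_to_failure:
        raise ValueError("current_age must lie between 0 and time_to_failure")
    return (time_to_failure - current_age) / time_to_failure

# Hypothetical engine that fails after 100 hours of operation.
targets = [health_target(100, age) for age in (0, 25, 100)]
print(targets)  # [1.0, 0.75, 0.0]
```

The target decreases linearly from 1 for a new engine to 0 at the moment of failure, which makes it a regression target rather than a class label.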

Carvalho et al. provide a systematic literature review of machine learning algorithms for predictive maintenance in [7]. The review also mentions that in addition to machine learning algorithms, predictive maintenance is also achievable using model-based and statistical approaches. A model-based approach aims to predict faults by constantly monitoring the equipment with mechanistic knowledge of it. It is noted that the methodology to achieve predictive maintenance is highly dependent on the application it is used in. The review finds that real data is used more frequently than synthetically created data in predictive maintenance studies. The review also finds that some of the covered projects use classification to predict maintenance while others use regression. All of the applications aim to increase the time of use of the equipment, reduce the need for preventative maintenance on a set schedule, and reduce costs and the required labor.


3 Method

The practical part of this thesis can be considered to consist of four parts. This chapter will go into the details of the project pipeline; from how the field data was fetched and parsed, to how the neural network models were designed, trained, and evaluated.

3.1 Data Understanding

To learn what features to extract from the raw field test data as inputs to neural networks, it is important to understand the data in order to know how to process it. During the training process, neural networks are able to learn the importance of every input feature for classifying a set of samples and adjust the network parameters accordingly. If a feature’s connection weights are low relative to those of other features after the model has finished training, that input will generally affect the trained model’s predictions less than the other features. For a network to learn the significance of features, a large enough dataset is required.

The sample size extracted from the field tests was found to be small. Removing insignificant features will always be beneficial and reduce the neural network load, but the small sample size made it critical to manually select the most significant signals with the help of diagnostics engineers at Volvo Penta. With the help of the engineers, three different DTCs among hundreds were selected, and they could confirm the importance for Volvo Penta of being able to predict the selected DTCs. The first selected DTC is triggered when the diesel engine’s air filter is clogged. The second DTC is triggered when the DPF requires regeneration. As for urea crystallization occurring during the NOx reduction process between the DPF and the SCR catalyst, there is no single DTC and respective trigger for this occurrence, but instead several related DTCs that are likely triggered because of crystallization.

Several dozen high-frequency signals in total are sampled from a total of 33 field tests. These field tests include a number of different operations related to forestry, mining, material handling, and more. For each DTC the diagnostic engineers selected a set of signal measurements most likely to contain information that neural networks can use to predict each of the DTCs. These are described in the subsections below and listed in Table 3.1.


Urea Crystallization

Urea crystallization can occur during the nitrogen oxide emission reduction process of the aftertreatment system under sub-optimal conditions [17]. The diagnostic engineers determined that signals measuring NOx values could be valuable to include, as the NOx reduction process can be negatively affected after crystallization has occurred, which can be noticed from NOx signals. The amount of urea injected for the reduction process at any time is also measured and included. The correct temperature is critical to the chemical reaction for the reactants to turn into the desired products, and so numerous signals measuring temperature in the aftertreatment system are included. Finally, the signal measuring the total flow of the exhaust mass is included.

High DPF Soot Levels

This DTC was selected for predictive maintenance due to its relatively high frequency in field tests. It consequently had one of the largest numbers of samples to train neural networks with among all DTCs. The DTC is simply triggered when the diesel particulate filter’s soot levels reach a point where regeneration is required for the aftertreatment system to function as intended. The signals best indicating high soot levels are the ones measuring a pressure difference before and after the filter. Another signal selected to predict this condition is the exhaust mass flow. The Volvo Penta engineers also suggested signals measuring temperature around the problem area. To illustrate what these signal measurements can look like while the engine is running, Figure 3.5 shows approximately 2.5 hours of measurements for these four signals from an arbitrarily selected field test and day. Due to corporate secrecy, the signals are re-scaled to a [0, 1] range, which does not change the relative appearance of the plots. The noisy exhaust mass flow and pressure difference signals in Figure 3.5 are noteworthy. They are proportional to the torque output of the engine, which varies a lot during a field test.

Air Filter Clogged

A filter is used to purify the air before it is compressed by the turbocharger and used for combustion. A DTC will trigger when this air filter is clogged. As with high DPF soot levels, a difference in pressure appears before and after the filter during congestion. Unfortunately, no sensor measuring pressure was available right after the filter. Instead, there were pressure measuring sensors before the air filter and after the pressure-increasing turbocharger located after the air filter, and the signals of both were included. The control signals for the turbocharger were available and included. The signal measuring the entire engine torque value was also included. While a pressure sensor directly after the air filter could have helped detect a pressure difference before and after the air filter directly, these other signals acted as substitutes for the absent sensor and the neural network models used these substitutes to predict the DTC.

Table 3.1: Selected signals for predicting DTCs

| Crystallization Signals | Unit | Soot Signals | Unit | Air Filter Clogged Signals | Unit |
|---|---|---|---|---|---|
| Exhaust temperature | °C | Exhaust temperature | °C | Air pressure before filter | kPa |
| Temperature after DPF | °C | Temperature after DPF | °C | Air pressure after turbo | kPa |
| Exhaust mass flow | kg/s | Exhaust mass flow | kg/s | Air pressure after turbo setpoint | kPa |
| Temperature inside SCR | °C | Pressure difference before and after DPF | kPa | Air pressure after turbo, setpoint and measured difference | % |
| Temperature after DOC | °C | | | Engine speed | rpm |
| Amount of urea to inject | g/s | | | Engine torque | Nm |
| Incoming NOx levels | ppm | | | | |


Figure 3.1: Exhaust temperature [°C]

Figure 3.2: Temperature after DPF [°C]

Figure 3.3: Exhaust mass flow [kg/s]

Figure 3.4: Pressure difference before and after DPF [kPa]

Figure 3.5: Example signal measurements for the selected Soot DTC signals.

3.2 Data Processing

With data carefully selected for analysis, the next phase of the study was to process the raw signal data to acquire samples ready to be used with the neural networks. The raw test data was structured by engine name and sample time range. The engine types used to create the field test data were the ones in the Volvo Penta Stage V engine range¹. The data from the different engines were not distinguished from each other, as the engine types’ sensor placements did not differ significantly. Training network models for each engine type could increase their respective performances, but it was decided that the neural network models should be able to be used across the engine types. The sampled DTC statuses were separate from the signal measurements.

With this, the first practical task of the study was to create scripts that would extract data, concatenate segmented data files, and map the signal periods to their corresponding DTCs by their timestamps. While all of the samples used for the different neural networks originated from the same source, they were processed to appropriate formats for their respective networks.

Extracting Sample Class Types

For all types of samples and neural networks the strategy to achieve predictive maintenance was to take periods of signal data preceding a DTC and label them as periods of signals leading up to some DTC, or a "DTC period" for short. The other class of samples was then selected to be periods of signal data when the engine did not have any upcoming error: an error-free period, or a "non-DTC period". To achieve the greatest possible contrast between the differently labeled samples, non-DTC periods were defined to be the signal data shortly following the clearing of the DTC in question. After the DTC was cleared, it was assumed that the engine error was resolved. Using this approach every reported engine error provided up to two samples of different classes if a period’s signals were not erroneous. The engine statuses and active DTCs are updated in periods of a few hours during the field tests. An active DTC means the error is present in the engine, and these statuses produced up to two samples as previously mentioned. There are also status updates that guarantee that the error is not currently present in the engine. To increase the sample size in the dataset, periods of data preceding these "okay statuses" were also extracted as non-DTC period samples, which created some imbalance in samples between the classes. Each of the DTC and non-DTC periods was initially set to contain a maximum of two days of signal data, but as field tests require manual labor and are not run all the time, these periods usually ended up with a shorter length.

¹ URL: https://www.volvopenta.com/industrial/off-road/off-road-engine-range/#/

Because the sampled signals in the set of signals for every DTC did not always start and end at the same time, the signals in every sample were lined up using linear interpolation. During interpolation, the signals were also downsampled to 1 Hz from varying, higher frequency rates. The signals were also trimmed so that every signal in the period started at the same time as the signal with the latest starting time, and so that every signal ended at the same time as the earliest ending signal. Some data was collected when the engine was turned off or inactive and was considered erroneous. To remove this data, all data points where the engine speed was below 500 rpm were filtered out. In addition to removing data points from all signals if the engine speed was too low, samples were also filtered out if any measured signal had an invalid or missing value. All signals were rescaled using the Python library scikit-learn and its MinMaxScaler² class such that the minimum and maximum values of every signal became 0 and 1, respectively. Rescaling all signals to the same range is important to ensure that the network parameters are not optimized with regard to the signals’ magnitude instead of their significance during the neural network training.
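The alignment, filtering, and rescaling steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions and not the thesis scripts: the signal names, the example measurements, and all function names are hypothetical, and `min_max_scale` is a plain stand-in for the scikit-learn scaler mentioned above.

```python
from bisect import bisect_right

def interpolate(times, values, query_times):
    """Linearly interpolate (times, values) at each query time; times are sorted."""
    result = []
    for t in query_times:
        i = bisect_right(times, t)
        if i == 0:
            result.append(values[0])       # before the first measurement
        elif i == len(times):
            result.append(values[-1])      # at or after the last measurement
        else:
            t0, t1 = times[i - 1], times[i]
            v0, v1 = values[i - 1], values[i]
            result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return result

def align_to_1hz(signals):
    """signals maps a name to (sorted timestamps in seconds, values)."""
    start = max(times[0] for times, _ in signals.values())   # latest start
    end = min(times[-1] for times, _ in signals.values())    # earliest end
    grid = [start + k for k in range(int(end - start) + 1)]  # one point per second
    aligned = {name: interpolate(times, values, grid)
               for name, (times, values) in signals.items()}
    return grid, aligned

def drop_low_speed(aligned, speed_key="engine_speed", min_rpm=500.0):
    """Remove the data points of all signals where the engine speed is too low."""
    keep = [i for i, rpm in enumerate(aligned[speed_key]) if rpm >= min_rpm]
    return {name: [values[i] for i in keep] for name, values in aligned.items()}

def min_max_scale(values):
    """Rescale so that the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical measurements with different start times, end times and rates.
signals = {
    "engine_speed": ([0.0, 2.0, 4.0, 6.0], [100.0, 800.0, 1200.0, 1000.0]),
    "exhaust_temperature": ([1.0, 3.0, 5.0], [200.0, 300.0, 400.0]),
}
grid, aligned = align_to_1hz(signals)
running = drop_low_speed(aligned)
scaled = {name: min_max_scale(values) for name, values in running.items()}
```

In this example the common time range is 1 to 5 seconds, the first grid point is dropped because the interpolated engine speed there is 450 rpm, and the remaining values of both signals are mapped to the range [0, 1].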

The data fetching script was made to not overlap any fetched samples. For the DTC periods, the script confirmed that the preceding days of data did not contain any other DTC sampling of the same type which would cause an overlap.

Feature Extraction

The result from the previous section was samples with up to two days of raw signal data per sample, containing different sets of signals depending on the DTC in question, as motivated in Section 3.1. For the feedforward network described in Section 3.3, the input was decided to be a vector of scalar values derived from the two-day signal periods. The question arose of how to turn the time series data into vectors of scalar values representing the two days. For example, snapshots of raw signal values could be extracted at some interval and made into samples with N features, where N is the number of selected sensor signals for some DTC. Instead, to increase the amount of information every sample contained, a number of different values were extracted from every hour in the two-day periods for every signal listed in Table 3.1. This resulted in N × V features per hourly sample, with V being the number of values extracted from every hour. The extracted features were values such as the maximum, minimum, mean, and standard deviation of every hourly segment created from the full-length signals. For this feature extraction, the Python library Time Series Feature Extraction Library (TSFEL) [3] was used. The library provides a function that accepts a time series signal and a window size, and outputs [signal_length]/[window_size] samples, each containing V features. The "statistics" subset of feature extractions in the TSFEL library was used as it includes a basic set of features to represent a signal sequence, in contrast to the alternative "temporal domain" and "spectral domain" subsets with more specialized features. A total of 17 features were extracted for every signal. All feature extractions included in the statistical domain that were used as features in the feedforward networks can be found in [3, p. 5].

² URL: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.
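The structure of the resulting feature vectors can be illustrated with a small sketch. It does not call TSFEL. It computes only five hand-picked statistics (V = 5) with the Python standard library, whereas the study extracted 17 statistical features per signal. The function name, the signal names, and the tiny window size are hypothetical.

```python
import statistics

# Assumption: five illustrative statistics, not the TSFEL feature set.
STATS = {
    "max": max,
    "min": min,
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "median": statistics.median,
}


def windowed_features(signals, window_size=3600):
    """Turn N equally long 1 Hz signals into rows of N * V features.

    Each row summarises one window (one hour for window_size=3600).
    Trailing points that do not fill a whole window are discarded.
    """
    length = min(len(vals) for vals in signals.values())
    rows = []
    for w in range(length // window_size):
        lo, hi = w * window_size, (w + 1) * window_size
        row = []
        for name in sorted(signals):  # fixed signal order across rows
            segment = signals[name][lo:hi]
            row.extend(stat(segment) for stat in STATS.values())
        rows.append(row)
    return rows


# Hypothetical toy input: N = 2 signals, 8 points, window size 4.
toy = {
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "b": [1.0] * 8,
}
rows = windowed_features(toy, window_size=4)
```

With 8 data points and a window size of 4, the sketch yields 8/4 = 2 samples, each holding N × V = 2 × 5 = 10 features.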

Windowing

Recurrent neural networks are especially useful for analyzing sequenced data (see Section 2.3). One input format that the field data signals can assume is vectors of some arbitrary length, containing several consecutive signal measurement samples each. This number of consecutive measurements per input is called the window size.

As mentioned in Section 3.2, the field data signals were downsampled to 1 Hz. The neural networks in this study were mainly trained with segmented signal sequences between 10 and 300 seconds, equating to window sizes between 10 and 300. For most of the LSTM networks, the script was designed to create this dataset of windows so that there is no overlap between the samples. This is achieved by using a stride equal to the window size between the samples. This is illustrated in Figure 3.6, where there are 8 measurement data points and a window size of 4 is used.

After finding the best LSTM network without overlapping samples, the same hyperparameters were used to train LSTM networks with overlapping samples and a stride value of one, as shown in Figure 3.6. This significantly increased the sample count and the computer resources required to train the networks.

Figure 3.6: An illustration of the sample windowing.
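The two windowing schemes can be sketched in a few lines. The function name and the toy signal are hypothetical. The window size of 4 and the 8 data points follow the example in Figure 3.6.

```python
def make_windows(signal, window_size, stride=None):
    """Cut a signal into windows of window_size consecutive points.

    stride is the offset between the starts of consecutive windows.
    It defaults to window_size, which gives non-overlapping windows.
    """
    if stride is None:
        stride = window_size
    last_start = len(signal) - window_size
    return [signal[i:i + window_size] for i in range(0, last_start + 1, stride)]


points = [1, 2, 3, 4, 5, 6, 7, 8]  # 8 measurement data points

non_overlapping = make_windows(points, window_size=4)        # stride 4
overlapping = make_windows(points, window_size=4, stride=1)  # stride 1
```

For a signal of length L, window size W, and stride S, the number of windows is floor((L - W) / S) + 1. In this example that gives 2 windows without overlap and 5 windows with a stride of one, which shows why overlapping samples increase the sample count.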

3.3 Designing the Neural Network Models

With the data pipeline complete, the next objective was to design at least one neural network per DTC able to beat a Zero-R baseline classifier, to prove that predictive maintenance is possible using neural networks trained with historical Volvo Penta field test data. Completing this objective would answer the first research question of the thesis. Basic feedforward neural networks were used to prove this. After ascertaining whether or not the field test data can be used for predictive maintenance, recurrent neural networks including LSTM layers
