Identifying Machine States and Sensor Properties for a Digital Machine Template


time series cluster analysis

Jakob Viking

Master Thesis — Computer Engineering MA, Final Project
Main field of study: Computer Engineering
Credits: 30 hp
Semester, year: Spring, 2021
Supervisors: Luca Beltramelli, luca.beltramelli@miun.se; Johan Deckmar, johan.deckmar@knightec.se
Examiner: Mikael Gidlund, mikael.gidlund@miun.se


Abstract

Digital twins have become a large part of new cyber-physical systems, as they allow for the simulation of a physical object in the digital world. In addition to these new approaches, machines have become more intelligent, allowing them to produce more data than ever before. Within the area of digital twins, there is a need for a less complex approach than a fully optimised digital twin, one that is more like a digital shadow of the physical object. Therefore, the focus of this thesis is to study machine states and the statistical distributions of all sensors in a machine. Whereas the majority of studies in the literature focus on generating data from a digital twin, this study focuses on what characteristics a digital twin has. The solution is to define a term named digital machine template, which contains the states and the statistical properties of each sensor in a given machine. The primary approach is to create a proof-of-concept application that uses traditional data mining technologies and clustering to analyze how many states there are in a machine and how the sensor data is structured. The result is a digital machine template with all of the information mentioned above: all the states a machine might have and the possible statistical distributions of each sensor in each state. The digital machine template opens the possibility of using it as a basis for creating a digital twin, allowing the development time to be shorter than that of a regular digital twin. More research still needs to be done, as the less complex approach may lead to information being missed or interpreted incorrectly. It still shows promise as a less complex way of looking at digital twins, which may become necessary as digital twins grow even more complex by the day.

Keywords: Digital twin, Digital shadow, Unsupervised learning, Cluster


Acknowledgements

First, I want to thank my supervisor at the university, Luca Beltramelli, for the help and support during the work and also for giving me tips on how to write.


Table of Contents

List of Figures
List of Tables
Terminology
1 Introduction
   1.1 Background
   1.2 Problem motivation
   1.3 Overall aim
   1.4 Research questions
   1.5 Scope
   1.6 Concrete and verifiable goals
   1.7 Outline
2 Theory
   2.1 The concepts of digital twins
      2.1.1 Digital model
      2.1.2 Digital shadow
      2.1.3 Digital twin
   2.2 Finite state machines
   2.3 Clustering
      2.3.1 Hierarchical clustering
      2.3.2 Centroid-based clustering
      2.3.3 Distribution-based clustering
      2.3.4 Density-based clustering
      2.3.5 Grid-based clustering
      2.3.6 HDBSCAN
      2.3.7 PCA
   2.4 Internal cluster evaluation
      2.4.1 Silhouette coefficient
      2.4.2 Davies–Bouldin index
      2.4.3 Variance ratio criterion
   2.5 Approaches to clustering multivariate time series data
   2.6 Chi-squared goodness of fit test
   2.7 Related works
      2.7.1 Comparison of Time Series Clustering Algorithms for Machine State Detection
      2.7.2 Data Driven Modelling of Nuclear Power Plant Performance Data as Finite State Machines
      2.7.3 A review on time series data mining
      2.7.4 Digital twin as a service (DTaaS) in Industry 4.0: An Architecture Reference Model
      2.7.5 Automatic Log Analysis using Machine Learning
      2.7.6 Analysis of Clustering Algorithms in Machine Learning for Healthcare Data
      2.7.7 Pattern Recognition in Multivariate Time Series: Towards an Automated Event Detection Method for Smart Manufacturing Systems
3 Method
   3.1 Workflow
      3.1.1 Study of research area
      3.1.2 Software design
      3.1.3 Testing
      3.1.4 Evaluation
      3.1.5 Proposed solution
   3.2 Development of digital template generator
      3.2.1 Recording machine data
      3.2.2 Choice of algorithm in implementation
      3.2.3 Recognize the states in the machine
      3.2.4 Evaluation of cluster result
      3.2.5 Analyze statistical properties of sensors
      3.2.6 Digital machine template
4 Construction
   4.1 Overview
      4.1.1 Toolchain
   4.2 Sensor data recording module
   4.3 Sensor data clustering module
   4.4 Sensor analysing module
      4.4.1 Formatting the data
5 Result
   5.1 Digital machine template generator
   5.2 State clusters
   5.3 Sensor analysis
   5.4 Digital machine template
6 Discussion
   6.1 Framework analysis
   6.2 Development of application
   6.4 Applications of a DMT
   6.5 Ethical and social aspects
   6.6 Future work
7 Conclusions
   7.1 Research questions
   7.2 Validity
   7.3 Contribution
References
A Appendix A: Detailed output file generated from application
B Appendix B: Results from the sensor analysis of gas sensor array data set
C Appendix C: Results from the sensor analysis of IoT telemetry data set
D Appendix D: Results from the sensor analysis of C8 production data set
E Appendix E: Results from the sensor analysis of C9 production data set

List of Figures

1 Digital model, shadow and twin
2 Turnstile as a state machine
3 Three time series clustering approaches: (a) raw-data-based, (b) feature-based, (c) model-based [19]
4 Flowchart displaying the workflow
5 Three-dimensional digital twin model [5]
6 Five-dimensional digital twin model [29]
7 Detailed digital machine template program flow
8 Flow chart over the DMT generation software
9 Clusters in each data set
10 Machine states from each data set

List of Tables

1 Results showing clusters and noise points
2 Results showing internal cluster evaluation
3 Results of sensor distribution analysis from the IoT telemetry dataset
4 Results of sensor distribution analysis from sensors in the gas sensor array data set
5 Results of sensor distribution analysis from sensors in the IoT telemetry data set
6 Results of sensor distribution analysis from sensors in the C8 production data set
7 Results of sensor distribution analysis from sensors in the C9 production data set

Terminology

CRISP-DM Cross-Industry Standard Process for Data Mining

DBSCAN Density-Based Spatial Clustering of Applications with Noise

DMT Digital Machine Template

FSM Finite-State Machine

GDPR General Data Protection Regulation

HCTSA Highly Comparative Time-Series Analysis

HDBSCAN Hierarchical Density-Based Spatial Clustering of Applications with Noise

IIoT Industrial Internet of Things

JSON JavaScript Object Notation

OPC-UA Open Platform Communications-Unified Architecture

OPTICS Ordering Points To Identify the Clustering Structure

1 Introduction

This thesis covers the area of digital twins. A digital twin is a virtual representation that serves as the real-time digital counterpart of a physical object or process. Furthermore, digital twins allow the testing of “what-if” scenarios against business objectives [1]. Digital twins have opened up possibilities of simulating complex processes from aerospace engineering [2] to healthcare [3]. In addition, the use of digital twins enables new approaches like predictive maintenance and improved production line simulation. This chapter gives a brief introduction and presents the concrete goals this thesis aims to achieve.

1.1 Background

In recent years industry has become smarter and more intelligent [4], with machines connected to the industrial internet of things (IIoT), talking to each other and sending information across the network. Modern communication technology has allowed machines to become smart, because the new generation of machines has the ability to communicate and send information about its current production. This allows for an enormous data lake with data about all of the machines in a factory. This data lake contains information about each sensor value in each machine as well as the machine's status, which enables the possibility of using this data to create a digital copy of the machine.

As stated previously, a digital twin is a virtual representation that serves as the real-time digital counterpart of a physical object. The first white paper published on the topic was in 2014 by Michael Grieves [5]. He describes how the digital twin can be used to model and simulate a machine or a process. This can even extend to the digital twin becoming a complete virtual copy of the physical machine or device that runs in parallel to it. The idea of having a complete model which runs simultaneously with the real machine opens up new possibilities to analyze the machine or device. According to Zheng, Yang and Cheng [6], the use of digital twins has recently accelerated, and they are mainly used to monitor a complete system of machines and from that create a model which is used to simulate and possibly predict how the system behaves.


that the connection between virtual and physical devices, and the second is the processing of data from either the physical or the virtual device. The digital twin frameworks [8] currently in use differ widely, which creates room for adaptation and customization depending on the purpose of the digital twin.

Finite state machines have also affected the digital twins which are currently being developed [9]. Combining state machines with digital twins greatly enriches what a digital twin can express, since the digital twin gains the capacity to mimic real-life machine states [10].

This leads to the combination of digital twins and finite state machines to create a lightweight version of a digital twin that solves trivial problems in less time. The proposed solution is a digital machine template that acts as this lightweight digital twin and contains enough information to characterize the sensor data. Its purpose is not to replace digital twins, since they still outperform a simpler system through their high simulation accuracy. As digital twins are already fairly well explored, this thesis can instead explore the possibility of a less complex digital twin while keeping all of the wanted characteristics of a digital twin.

1.2 Problem motivation

The ever-increasing development of digital twins pushes software developers into creating digital twins for their physical machines. This development creates a need for easy and straightforward descriptions of machines. Machine descriptions allow software developers to create fully functioning data generation software that can mimic a physical machine. This is preferable to a digital twin, since a digital twin requires enormous amounts of data to accurately simulate a machine or production line. Therefore, there is a need to understand how a machine outputs data and how each sensor behaves during the machine's run time. The study of sensor data is the main subject covered and introduced in this thesis.

1.3 Overall aim

1.4 Research questions

The main research questions in this thesis are as follows:

• Q1: Can cluster analysis be used to find the characteristics of an industrial machine?

• Q2: How can digital twins and finite state machines be used to create an understanding of how a machine works?

These research questions aim to investigate how a digital twin can be reduced in complexity and size while still retaining the key characteristics of what a digital twin is.

1.5 Scope

Within the company, there is a need for a way to analyze and replicate data from machines in a simpler way than a digital twin does. The company has previously developed a platform used to generate machine data that is supposed to look like real data from a real machine, mainly for the company's research and development of anomaly detection software and predictive maintenance. With this in mind, the main interest lies in analysing the sensor data in the machine. The data which the implementation is supposed to analyse is only numerical time series data; this simplifies the preprocessing of the data and removes the difficult task of parsing text data. The digital machine template will not be able to generate data by itself, as the task of generating data is already handled by previous work done by the company. This also means that no actual digital twin will be created, since a digital twin should be able to generate data by itself. In addition, recognizing transitions between states in the machine is also not required, since that is too close to the purpose of the software previously created by the company.

1.6 Concrete and verifiable goals

In the following subsection the concrete and verifiable goals are presented:

1. Define what characteristics a DMT should have.
   (a) It should contain characteristics about the machine:
      i. The number of states in the machine.
      ii. What type of sensors the machine uses.
      iv. For each state, what statistical properties the values of each sensor have.
   (b) Explain the application and use cases of a DMT.

2. Create a program that generates a DMT.
   (a) Read the data from a machine from the recorded data.
   (b) Analyze the data to extract features from the recorded data.
   (c) Display these features as a DMT.

3. Verify and evaluate the integrity and reliability of the DMT.

1.7 Outline

In chapter 2, the theory is presented. The purpose is to cover the relevant information needed to understand the implementation, what the results show, and to follow the discussion in the final chapters.

In chapter 3, the methodology is presented. Here the research method and planning are shown. This chapter covers the way to achieve the concrete and verifiable goals mentioned in section 1.6.

In chapter 4, a more detailed description of the construction of the developed application is given. In addition, the framework and concept behind the implementation are presented, as they form the basis for the result.

In chapter 5, the results are presented. This covers both the core definition of what a digital machine template is and the output of the implementation that generated the digital machine template.

In chapter 6, a deeper discussion about the possibilities of a digital machine template, further improvements and future use cases is presented.

2 Theory

In this chapter the relevant theory needed to understand the results and analysis is presented. This includes explanations of digital twins, finite state machines, cluster analysis, evaluation of clustering results and the chi-squared goodness of fit test. In addition, a number of related works are presented at the end of the chapter.

2.1 The concepts of digital twins

A digital twin is a representation of a physical device in virtual form. The role of a digital twin is to take data from the physical object, analyse it and return suggestions on how the physical object should improve. The purposes of a digital twin can therefore be many, from aerospace [11] to healthcare [12], [13]. The digital twin is often used to simulate a process or machine in the digital world. There exist three levels of implementation: a digital model, a digital shadow and a digital twin. These three levels are characterized by how the data flow is handled between the physical world and the digital world. These different levels of integration are key to understanding what a digital twin is and what the two levels below it are capable of producing. Each of the implementations is shown in figure 1 [14].

Figure 1: Digital model, shadow and twin

2.1.1 Digital model

A digital model is a digital copy that does not update itself with new data or supply the existing machine with new data. It only exists to model an already constructed machine or device. This means that there are no automatic data exchanges between the digital model and the physical object; changes made to a digital model do not affect the physical object at all.

2.1.2 Digital shadow

The digital shadow is the next level of digital representation: it takes data automatically from the physical machine but does not return data. This creates a one-way data exchange which updates the digital shadow based on what the physical object outputs. A digital shadow is therefore an advanced model which knows what the machine is doing, but since it requires manual data exchanges back to the physical machine, no automation can be achieved.

2.1.3 Digital twin

The third iteration is the most inclusive: it takes data from the physical machine, runs the simulation and, from the output data, generates possible improvements which the physical device takes into account the next time it encounters similar data. This is done automatically, so the physical object and the digital object exist in a closed loop. This loop feeds itself and uses the information to correct any wrong information gathered from the physical object. The definition of a digital twin is:

"A Digital Twin is a set of virtual information that fully describes a potential or actual physical production from the micro atomic level to the macro geometrical level." Zhang, 2019 [7]

This means that the digital twin is far more advanced than either of the previous two, and from a usability standpoint also quite demanding. Each part of the physical object is described, from a single sensor to the whole machine, as per the previously stated definition.

2.2 Finite state machines

A finite state machine is an abstract construction which contains each possible state of a machine and the probability of transitioning to the other connected states. Figure 2 shows an example of a turnstile written as a state diagram.

Figure 2: Turnstile as a state machine

As mentioned above, finite state machines contain states, where a state is a description of the status of a system. An FSM can transition between different states; a transition is a set of actions that is executed when a condition is fulfilled or when an event is received.
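To make the mechanics concrete, the following is a minimal sketch in Python of the turnstile in figure 2; the state and event names are assumptions based on the classic turnstile example, not taken from the thesis.

```python
# Minimal finite state machine sketch of a coin-operated turnstile.
# States: "locked" and "unlocked"; events: "coin" and "push".
TRANSITIONS = {
    ("locked", "coin"): "unlocked",    # paying unlocks the turnstile
    ("locked", "push"): "locked",      # pushing while locked does nothing
    ("unlocked", "push"): "locked",    # passing through locks it again
    ("unlocked", "coin"): "unlocked",  # extra coins are ignored
}

def step(state: str, event: str) -> str:
    """Return the next state for a given state/event pair."""
    return TRANSITIONS[(state, event)]

state = "locked"
for event in ["push", "coin", "push"]:
    state = step(state, event)
    print(event, "->", state)  # push -> locked, coin -> unlocked, push -> locked
```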

2.3 Clustering

In data mining there exist two kinds of learning algorithms: supervised and unsupervised. Supervised learning is often referred to as classification or regression, as the purpose of the algorithm is to predict the label or class of the input data. In contrast, unsupervised learning is when the algorithm does not have any classes or labels to refer to; this includes clustering, anomaly detection and neural networks. Here, the main focus is on clustering and how it is used.

There exist different ways of performing clustering: hierarchical clustering, centroid-based clustering, distribution-based clustering, density-based clustering and grid-based clustering [17]. Since there are a lot of different clustering algorithms, only a brief description of each family is provided.

2.3.1 Hierarchical clustering

Hierarchical clustering builds on the core idea that nearby objects are more related to each other than objects further away. This allows the algorithms to form clusters from objects that lie in dense regions of the data set. General agglomerative hierarchical algorithms have a complexity of O(n^3) and the divisive versions have a complexity of O(2^(n-1)). By themselves, these are not good enough to produce results in a reasonable time; instead they are the basis for more advanced clustering algorithms.

2.3.2 Centroid-based clustering

In centroid-based clustering the selected parameter is how many centroids are supposed to be the centers of the clusters. Formally, it is defined as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized. Distance is thus the main comparison that decides which cluster any given data point belongs to.
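As a brief sketch of the idea, the following uses scikit-learn's KMeans on made-up two-dimensional data; the points and the choice k = 2 are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points; the number of centroids k is fixed up front.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# KMeans minimizes the squared distances from each point to its nearest centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per point
print(km.cluster_centers_)  # the two fitted centroids
```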

2.3.3 Distribution-based clustering

Distribution-based clustering is most closely related to statistics, as it is based on distribution models. Clusters can then be defined as objects most likely belonging to the same distribution, which means the structure of the data points is compared to actual statistical information. While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult.

2.3.4 Density-based clustering


2.3.5 Grid-based clustering

Grid-based clustering uses a grid and compares the data points in each grid section, called a cell. Grid-based clustering is fast, has low computational complexity and excels at handling multi-dimensional data sets. These algorithms partition the data space into a finite number of cells to form a grid structure and then form clusters from the cells in that grid.

2.3.6 HDBSCAN

One notable algorithm is HDBSCAN, or Hierarchical Density-Based Spatial Clustering of Applications with Noise [18], which combines the strengths of both hierarchical clustering and density-based clustering. The algorithm works in a similar way to DBSCAN while also utilizing the hierarchical structure of a hierarchical clustering algorithm: it extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. First, the algorithm transforms the space, which simplifies the calculations in the later parts of the algorithm. Second, it constructs a minimum spanning tree using Prim's algorithm, which is the basis for selecting the denser regions as cluster centers. Third, the cluster hierarchy is constructed from the minimum spanning tree and then condensed. The condensed cluster tree is finally used to select which clusters are extracted, by selecting the leaf nodes in the tree unless there is a denser area further up.
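A minimal usage sketch with the hdbscan Python library follows; the synthetic data and the min_cluster_size value are illustrative, not the settings used in the thesis.

```python
import numpy as np
import hdbscan

# Synthetic stand-in for preprocessed sensor data: two dense regions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (200, 3)),
               rng.normal(4.0, 0.3, (200, 3))])

# min_cluster_size sets how small a dense region may be and still count as a
# cluster; the number of clusters itself is not specified in advance.
clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
labels = clusterer.fit_predict(X)  # label -1 marks noise points
print(sorted(set(labels)))
```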

2.3.7 PCA

2.4 Internal cluster evaluation

To be able to evaluate the clustering results, a couple of metrics are needed. This part goes into detail about three approaches for evaluating a clustering result. These are then used to pick the right number of clusters in each result.

2.4.1 Silhouette coefficient

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette measures how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess the result. The measure has a range of $[-1, 1]$. Silhouette coefficients near $+1$ indicate that the sample is far away from the neighboring clusters. A value of $0$ indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. Values nearing $+1$ are therefore considered the best, while values near $-1$ are the worst and are often considered wrongly classified. The full definition of the metric is given in equations 1, 2, 3 and 4.

For a data point $i$ in cluster $C_i$, where $d(i, j)$ is the distance between data points $i$ and $j$, let

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j) \qquad (1)$$

$$b(i) = \min_{k \neq i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) \qquad (2)$$

The silhouette can then be defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \quad \text{if } |C_i| > 1 \qquad (3)$$

and

$$s(i) = 0 \quad \text{if } |C_i| = 1 \qquad (4)$$

The values from equations 3 and 4, averaged over all points, give the final score for the clustering result provided by the algorithm.

2.4.2 Davies–Bouldin index

Clusters which are farther apart and less dispersed result in a better score. The minimum score is zero, with lower values indicating better clustering.

The index is defined as the average similarity between each cluster $C_i$, for $i = 1, \ldots, k$, and its most similar cluster $C_j$; in this context the similarity is defined as $R_{ij}$. Let $S_i$ be the average distance between each point of cluster $i$ and the centroid of that cluster, also called the cluster diameter, and let $M_{ij}$ be the distance between cluster centroids $i$ and $j$. This leads to the measure $R_{ij}$ being defined as

$$R_{ij} = \frac{S_i + S_j}{M_{ij}} \qquad (5)$$

which is used to define $D_i$ as

$$D_i \equiv \max_{j \neq i} R_{ij} \qquad (6)$$

which in turn defines the DBI as

$$DBI \equiv \frac{1}{k} \sum_{i=1}^{k} D_i \qquad (7)$$

The result depends on the data as well as the algorithm. $D_i$ automatically picks the worst case, the value of $R_{ij}$ for the cluster most similar to $C_i$. Therefore, the minimal value calculated by the Davies–Bouldin index is the way to choose the best clustering result.

2.4.3 Variance ratio criterion

The variance ratio criterion, also called the Calinski–Harabasz score, relates the between-cluster variance to the within-cluster variance. It is computed from $SS_B$, the between-cluster sum of squares, and $SS_W$, the within-cluster sum of squares, each normalized by its degrees of freedom. For $n$ samples and $k$ clusters it is defined as in equation 8:

$$VRC_k = \frac{SS_B / (k - 1)}{SS_W / (n - k)} \qquad (8)$$
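All three internal evaluation metrics are available in scikit-learn, so a clustering result can be scored directly from the data and its labels. A small sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(5.0, 0.5, (100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # best near +1
print(davies_bouldin_score(X, labels))     # lower is better, minimum 0
print(calinski_harabasz_score(X, labels))  # higher is better
```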


2.5 Approaches to clustering multivariate time series data

Since clustering is the prominent solution for finding the states in the machine, the different ways of finding them also matter. In the article written by T. Warren Liao [19], three approaches to finding the clusters are mentioned: raw-data-based, feature-based and model-based. Each has strengths and weaknesses, which means that they work in similar ways but perform better or worse depending on the data and the requirements on the result.

Figure 3: Three time series clustering approaches: (a) raw-data-based, (b) feature-based, (c) model-based [19]

As shown in figure 3, the first is the raw-data-based approach, where the sensor data is input directly into the clustering algorithm, which then outputs the states or patterns representing the states found in the machine. This approach takes the time series data and adapts the clustering algorithm so that it can cluster time series data; the change consists of customizing the distance calculations in the clustering algorithm. Some clustering algorithms might struggle with this approach, since the more computation-heavy algorithms need to perform a larger number of calculations. Algorithms which operate iteratively may also struggle, since they need an iterative search to find a near-optimal choice of cluster parameters.

The second approach is feature-based: the time series is converted to a feature vector, and by using a number of different distance equations it is possible to compute the clusters in the time series data. This requires even more customization of the clustering algorithms.

The third and final approach is model-based clustering. It is quite similar to the feature-based approach, but with the added customization that the features are converted to model parameters. It works by first calculating the model from the features and later performing the cluster analysis on that model.

2.6 Chi-squared goodness of fit test

The main way to estimate the statistical distribution of an unknown stream of data is to perform a chi-squared goodness of fit test [20]. The test consists of two parts. First, the sum of differences between the observed values and the expected frequency of values is calculated, as shown in equation 9:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \qquad (9)$$

where $O_i$ is the observed count for bin $i$ and $E_i$ is the expected count for bin $i$. The expected count $E_i$ is calculated by equation 10:

$$E_i = \left( F(Y_u) - F(Y_l) \right) N \qquad (10)$$

where $F$ is the cumulative distribution function of the probability distribution being tested, $Y_u$ is the upper limit for class $i$, $Y_l$ is the lower limit for class $i$, and $N$ is the sample size. Equations 9 and 10 are used together to form the basis for deciding which statistical distribution fits best.
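A small sketch of such a test in Python with SciPy; the sample data, the candidate distribution and the fixed bin count are made up for illustration (the thesis's own implementation uses dynamic bin widths, see section 4.4).

```python
import numpy as np
from scipy import stats

# Stand-in for one sensor's values in one state.
rng = np.random.default_rng(2)
data = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# Bin the observations: equation 9 needs observed counts per bin.
edges = np.histogram_bin_edges(data, bins=20)
observed, _ = np.histogram(data, bins=edges)

# Fit a candidate distribution, then compute the expected counts per bin
# from its CDF as in equation 10: E_i = (F(Y_u) - F(Y_l)) * N.
params = stats.lognorm.fit(data)
cdf = stats.lognorm.cdf(edges, *params)
expected = np.diff(cdf) * data.size

chi2 = np.sum((observed - expected) ** 2 / expected)
print(chi2)  # a lower value means the candidate distribution fits better
```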

2.7 Related works

In this section, articles with particular relevance for this work are presented. Since the work is multidisciplinary, articles with different outlooks are included.

2.7.1 Comparison of Time Series Clustering Algorithms for Machine State Detection

In this article the authors compare how different algorithms can be used to cluster time series data. They proposed using a feature extraction method named HCTSA, and the clustering method they proposed was an algorithm named K-medoids. It was also mentioned that each implementation requires its own clustering method; there cannot simply be one pair of feature extraction and clustering algorithm that solves all problems. The article also mentions clustering algorithms like agglomerative clustering and DBSCAN, which are both connected to this thesis.

The similarity between that article and this thesis is the customization of the clustering capabilities for time series data, using clustering algorithms similar to those that the main clustering algorithm of this thesis is based on.

2.7.2 Data Driven Modelling of Nuclear Power Plant Performance Data as Finite State Machines

In the article “Data Driven Modelling of Nuclear Power Plant Performance Data as Finite State Machines” by Kshirasagar Naik, Mahesh D. Pandey and Anannya Panda, the concept of modeling a machine as a finite state machine is discussed [22]. The article proposes a methodology combining machine learning principles with a finite state machine. In addition, the finite state machine representation supports identification of normal and abnormal operation of the plant, suggesting that the given approach captures the anomalous behaviour of the plant.

The part which applies both to this work and to the article is the idea of recognizing the states of a machine using both a finite state machine and clustering. The method proposed is also similar in that the authors use feature reduction tools like PCA. Their result is that the finite state machine was able to successfully show how the nuclear power plant behaved during the whole period the data was collected. They also conclude that without PCA, or a similar way of reducing the dimensionality of the data set, the result could not have been produced.

In short, this article relates to this thesis through its use of a clustering method to identify states in a large data set, and it shows the possibility of using simple machine learning principles to do so.

2.7.3 A review on time series data mining

The review characterizes time series data as large in data size, high in dimensionality and in need of continuous updating. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. The primary objective of the article is to serve as an overall picture of current time series data mining development and to identify potential research directions for further investigation.

This article further developed the understanding of time series clustering and its challenges. In addition, it informed which criteria a clustering algorithm needs to fulfil to successfully cluster time series data.

2.7.4 Digital twin as a service (DTaaS) in Industry 4.0: An Architecture Reference Model

In the article written by Aheleroff, Xu and Zhong [24], the authors describe how the concept of a digital twin can be turned into software as a service. This presented interesting approaches, since the design of a digital twin as a service leads to many changes and introduces a new framework. A digital twin reference architecture was developed and applied in an industrial case. The findings indicate that there is a significant relationship between digital twin capabilities as a service and mass individualization. The article is relevant to this thesis in how the different integrations of the framework could be designed, which could have improved on the regular digital twin framework referenced in this thesis.

2.7.5 Automatic Log Analysis using Machine Learning

In the thesis by Weixi Li [25], the need for an automatic log analysis method using machine learning is studied. The tool presented is the second iteration of an automatic log analysis software whose purpose is to automatically separate anomalous logs from ordinary data logs. The detection was performed using clustering; the methods used are K-means, DBSCAN and the neural network approach of Self-Organizing Feature Maps (SOFM). The results concluded that a mixture of SOFM and DBSCAN produced the best F-score of 0.950.


2.7.6 Analysis of Clustering Algorithms in Machine Learning for Healthcare Data

In the article by Ambigavathi, M. and Sridharan, D. [26], the authors compare different clustering algorithms for the purpose of clustering healthcare data. Clustering algorithms such as K-means and K-medoids are shown to perform differently on smaller and larger datasets. The authors concluded that even with the help of evaluation methods like the silhouette coefficient or the Dunn index, it was still very hard to evaluate which algorithm is the best for clustering large amounts of healthcare data.

Since the main idea here is also to cluster large data sets, the article gave further insight into evaluating clustering algorithms. This led to the discovery of internal cluster evaluation and, most importantly, the use of the silhouette coefficient.

2.7.7 Pattern Recognition in Multivariate Time Series: Towards an Automated Event Detection Method for Smart Manufacturing Systems

In the article written by Vadim Kapp, Marvin Carl May, Gisela Lanza and Thorsten Wuest [27], the authors propose a method for automatically recognizing states and events in smart manufacturing systems. The proposed framework is mainly pattern recognition software which identifies patterns by utilizing a sliding window method to find large discrepancies in the sensor data. This approach led the authors to discover that the result is heavily affected by standard PCA analysis. Their conclusion was that, with the right preprocessing methods and optimal hyperparameter selection in their clustering approach, the anomalous sensor data was identified correctly.


3 Method

In this chapter the methodology used to achieve the concrete and verifiable goals in section 1.6 is presented.

3.1 Workflow

In the following section the workflow is presented, to motivate the choices made in the following chapters. Figure 4 shows the overall workflow of this study.

Figure 4: Flowchart displaying the workflow

3.1.1 Study of research area

The study of digital twins and time series clustering in this thesis is done by reading articles relevant to the research areas. Most of these articles were gathered from Google Scholar and from online databases provided by the university. The articles cover digital twin frameworks, different approaches to developing digital twins and possible use cases of a digital twin.

In addition, some articles about finite state machines are needed to understand how a machine behaves. This is necessary to understand how the states are connected to each other. These two areas are presented in more detail in chapter 2, and the results of this research are presented in section 2.7.


3.1.2 Software design

The literature study gives the required information on how the design of the software should look; the design should handle the required specifications identified by the literature study. The system is not expected to generate results that can easily be compared with existing systems, since that would only create a less optimized version of a digital twin: if a DMT is capable of producing data it should be representative of the machine, but without the training that a digital twin uses, the results would not be usable. Instead, a system more like the digital shadow, as mentioned in section 2.1.2, should be constructed. It would not have all the features a full digital twin has, but it should significantly reduce the time needed to develop it. Another aspect is the ability to extract the characteristics of a relatively small dataset. More detailed methods of how the software is designed can be read in section 3.2.

3.1.3 Testing

In this part of the workflow the software is tested to check that it works and is able to produce results. The testing mainly ensures that each of the modules works the way it is intended, and determines which clustering algorithm the finalized solution should use. For example, the cluster analysis should try a number of clustering algorithms before arriving at a conclusion about which is the most suitable one. The comparison of the algorithms is done by comparing the specifications in the documentation and the type of data the algorithms can cluster. The most important requirement is that the clustering algorithm should not need the number of cluster centroids specified before clustering.

3.1.4 Evaluation

The evaluation part of the workflow checks that the internal construction and the produced results fulfil the goals mentioned in section 1.6. If the goals are not fulfilled, the process restarts from the software design. The clause of "good enough" is mainly based on the evaluation metrics mentioned in section 2.4 and on how the result is produced; for example, if only one cluster is identified, the state analysis cannot be done and the process is therefore repeated.

3.1.5 Proposed solution


3.2 Development of digital template generator

The following part of the methodology describes how the development of the implementation is designed. The system is not expected to generate results that can easily be compared with existing systems, but is built as a proof of concept to show what a less complex version of a digital twin is capable of. The design is partly based on the three-dimensional model of digital twins. This model is found in multiple articles, but since there is no set standard, the methodology takes a couple of models into account: the three-dimensional digital twin model seen in figure 5 and the more detailed five-dimensional digital twin model in figure 6. These two models are the basis of the implementation. By combining the theory of both models, a core structure of the implementation is created, dividing the four areas of the implementation among the relevant parts of the three-part model framework. These frameworks also appear in [28].

Figure 5: Three-dimensional digital twin model [5]

Figure 6: Five-dimensional digital twin model [29]

3.2.1 Recording machine data

To record machine data, the solution used is opcua-asyncio from GitHub [30]. The recorded data is sent to the next module, where the data set is analyzed. One issue is that access to live industrial machines was not available, which leads to the need to import multivariate time series datasets of different lengths. These imported datasets are used to test the sensor data clustering and sensor analysis parts of the implementation.
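A minimal sketch of how sensor values could be polled with the opcua-asyncio (asyncua) library; the endpoint URL, node id and sampling interval are placeholders, not the thesis's actual configuration.

```python
import asyncio
from asyncua import Client

async def record_values(url: str, node_id: str, n_samples: int) -> list:
    """Poll one OPC-UA node and collect a small time series."""
    values = []
    async with Client(url=url) as client:
        node = client.get_node(node_id)
        for _ in range(n_samples):
            values.append(await node.read_value())
            await asyncio.sleep(1.0)  # 1 Hz polling, an arbitrary choice
    return values

# Placeholder endpoint and node id; a real machine would provide its own.
data = asyncio.run(record_values("opc.tcp://localhost:4840", "ns=2;i=2", 10))
```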

3.2.2 Choice of algorithm in implementation

The choice of clustering algorithm in the implementation is important. There exist many clustering algorithms which operate differently, as discussed in detail in section 2.3.

The chosen algorithm is HDBSCAN, as described in section 2.3.6. It is a hybrid that has the strengths of both a hierarchical clustering algorithm and a density-based clustering algorithm. Algorithms like K-means cannot be used, because they require the number of clusters to be stated in advance, which is not acceptable since the point is to find the number of clusters based on density. The other density-based algorithms either cannot handle the large data sets used, or their complex structure increases the computational time exponentially. If only a hierarchical algorithm was used, it would encounter similar problems with the large data sets, as such algorithms have a computational time of O(n^3) and a memory requirement of Ω(n^2).

3.2.3 Recognize the states in the machine

The next part of the digital template generator is the state recognizer. This part reads the sensor data from the data set created by the data recorder and then finds the states by clustering the time series data [31]. Clustering is preferred because of the nature of the sensor data: the sensor datasets are not labeled, which means that classification methods cannot be used. In addition, the data sets are quite large, which means that the chosen clustering algorithm needs to handle data efficiently and that a certain amount of preprocessing is needed. This module therefore handles all of the data preprocessing in addition to the clustering, as the result should be a labeled data set.

3.2.4 Evaluation of cluster result

The cluster result is evaluated with three internal evaluation methods, see section 2.4. This part is important, since the basis of the state recognition lies in these three metrics. The evaluation is also part of the traditional data mining process, and the CRISP-DM model states that one of the most important parts is to evaluate the results.

3.2.5 Analyze statistical properties of sensors

After the states are recognized, and as part of the information processing layer of the framework, each sensor in each state needs to be analyzed. This is mainly due to the need for a more advanced understanding of how the data of each sensor behaves. It is done by implementing a chi-squared goodness of fit test and using the chi-square value from each test to choose the most fitting statistical distribution for the chosen sensor. This has to be done iteratively, since each sensor appears in each state. After the module has produced the statistical distribution for each sensor in each state, it does the same with the distribution parameters and extreme values.

3.2.6 Digital machine template

The last part of the implementation is, as shown in figure 5, the virtual part of the framework. This part is mostly formatting and visualization, but the main purpose of the syntax in the file is to be compatible with the data generation software already produced by the organization. Since this is the final result of the digital machine template generator, a simple JSON file is needed to summarize the two analyses done in the previous parts.


4 Construction

This chapter presents the details of how the implementation works and what each part of the application is responsible for.

4.1 Overview

Figure 7: Detailed digital machine template program flow

The sensor data clusterer and the sensor data analyzer are both part of the information processing layer, in which the data is processed so that it can be used later.

4.1.1 Toolchain

The main technologies used for this implementation are data mining methods and Python programming. The implementation is meant to be a proof of concept for the company, and therefore no full application is needed. It is separated into four modules, each with a singular purpose.

The Python libraries used are the following:

• Pandas version 1.2.4 [32]
• Scikit-learn version 0.24.2 [33]
• Numpy version 1.20.0 [34]

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Pandas serves the purpose of inputting the data into the system and manipulating the data. Scikit-learn is a free machine learning library for the Python programming language. It features various classification, regression and clustering algorithms and is designed to operate alongside other numerical and scientific libraries. Scikit-learn has machine learning algorithms, data processing and evaluation methods integrated into one library for ease of use. Numpy is a library for the Python programming language, adding support for large multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Numpy is also a large part of the system, since the numerical analysis is only possible with the help of methods from the numpy library.

4.2 Sensor data recording module


4.3 Sensor data clustering module

In this part the sensor data is clustered. The module is based on the CRISP-DM [35] model, where the data is processed to create the optimal result, and the result is later used as the basis for how many states the machine has. The module begins by loading the data from the data set generated by the sensor data recording module. This data is a multivariate time series where each sensor is represented as a column. The module takes this data set and begins preprocessing it; this includes normalizing the values, checking for highly correlated features and applying PCA dimension reduction to reduce the number of dimensions for easier clustering. Due to hardware limitations there was a limit on how big the data set could be. Therefore, if the data set had more than 100 000 rows of data points, it was reduced in size while keeping the characteristics of the data set as a whole. The method used for this is systematic sampling, which simply takes every tenth row of the data set, effectively reducing the size of the data set by 90 %. If the dataset exceeds 1 million data entries, the sampling method switches to random sampling, selecting a maximum of 50 000 entries. This is done with the argument that the data set contains enough data to create a sufficiently diverse sample. After that the data is processed further before clustering. The data is checked for highly correlated features: if the correlation between two features is large enough, one column is removed to reduce the number of dimensions that will be clustered. Then the data is scaled down to a reasonable range so it can be used in a dimension reduction method called principal component analysis, or PCA for short. The purpose of PCA is to reduce the dimensions and combine the columns to improve the result even more. When the PCA is done, the module moves to the second part of the CRISP-DM model and begins the clustering.

The chosen clustering algorithm is HDBSCAN, which calculates, based on the density of the time series data, how many clusters or states there are in the given data set. In addition to calculating the clusters, the module also calculates how good the clustering result is, which is the last part of the clustering module. To judge the quality of the clustering, three metrics are produced: the silhouette coefficient, the Davies–Bouldin index and the variance ratio criterion. These three all gauge whether the result is good enough.
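A condensed sketch of the module's chain as described above; the sampling thresholds follow the text, while the correlation threshold, the number of PCA components and min_cluster_size are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import hdbscan
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def cluster_states(df: pd.DataFrame, corr_threshold: float = 0.95):
    # Systematic sampling above 100 000 rows (every tenth row); random
    # sampling capped at 50 000 rows above 1 million entries.
    if len(df) > 1_000_000:
        df = df.sample(n=50_000, random_state=0)
    elif len(df) > 100_000:
        df = df.iloc[::10]

    # Drop one column of every highly correlated feature pair.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    df = df.drop(columns=drop)

    # Scale, reduce the dimensions with PCA, then cluster with HDBSCAN.
    X = StandardScaler().fit_transform(df)
    X = PCA(n_components=3).fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(X)

    # Internal evaluation on the non-noise points (assumes >= 2 clusters found).
    mask = labels != -1
    scores = (silhouette_score(X[mask], labels[mask]),
              davies_bouldin_score(X[mask], labels[mask]),
              calinski_harabasz_score(X[mask], labels[mask]))
    return labels, scores
```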

4.4 Sensor analysing module

The candidate distributions chosen for the comparison are the following:

• argus
• beta
• crystalball
• expon
• f
• gamma
• gausshyper
• kappa3
• kappa4
• lognorm
• norm
• pearson3
• triang
• uniform
• weibull_max
• weibull_min

In this module the labeled data set is first filtered so that only points with cluster labels, and not the noise, are included in the goodness of fit test. A more detailed explanation of how the goodness of fit test works is given in section 2.6. As shown in figure 8, the sensor analysis takes place after the data has been clustered. In each state there is a number of sensors whose data has been labeled, and the labeled data is then analysed via a chi-squared goodness of fit test with dynamic bin lengths. The dynamic bin lengths are computed with the Freedman-Diaconis rule [36], whose purpose is to calculate how big the bins should be before the chi-squared goodness of fit test is done. This is repeated once for every sensor in every state.
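A sketch of this per-sensor loop, using SciPy's distribution objects for the candidates listed above; the helper name best_fit and the candidate subset shown are hypothetical.

```python
import numpy as np
from scipy import stats

CANDIDATES = [stats.lognorm, stats.norm, stats.expon,
              stats.gamma, stats.weibull_min]  # subset of the candidates above

def best_fit(values: np.ndarray):
    """Return (distribution name, chi-square) for the best-fitting candidate."""
    # The Freedman-Diaconis rule ('fd' in numpy) decides the bin widths.
    edges = np.histogram_bin_edges(values, bins="fd")
    observed, _ = np.histogram(values, bins=edges)

    results = {}
    for dist in CANDIDATES:
        params = dist.fit(values)
        expected = np.diff(dist.cdf(edges, *params)) * values.size
        expected = np.clip(expected, 1e-9, None)  # guard against empty bins
        results[dist.name] = np.sum((observed - expected) ** 2 / expected)
    return min(results.items(), key=lambda kv: kv[1])

# Called once per sensor column per state, skipping noise (label -1).
```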

4.4.1 Formatting the data

5 Result

In this chapter the final results are presented. It covers the resulting implementation, a description of a DMT and the results from running the program on the different data sets.

5.1 Digital machine template generator

This flow chart represents the final construction of the DMT generator.

Figure 8: Flow chart over the DMT generation software

The first layer, which is the input of the DMT generator, requires a multivariate time series data set. As input, a selection of publicly available data sets was used to simulate a machine recording. The data sets chosen are a gas sensor array data set [37], IoT telemetry data [38], C9 production data [39] and C8 production data [40].

The second layer is the processing layer, where all of the data preprocessing, clustering and analysis is performed. This is explained in detail in section 4.3, and the results from this module are presented in section 5.2. The results from this module are the clustering of the states and the deeper sensor analysis of each sensor in each state. These two are the core parts of what makes up the DMT.

sensor later when using the DMT for data generation purposes. In addition to data generation, the DMT can be used to support anomaly detection software and predictive maintenance software, mainly through the ability to recreate specific scenarios in the machine's lifetime. This might be when a new sensor is installed or when an old one needs replacing. These are edge cases where the help of a DMT might speed up the process of finding when the threshold is met for either sensor anomaly detection or predictive maintenance of the machine.

5.2 State clusters

(a) Gas sensor array data set (b) IoT Telemetry data set

(c) C8 Production data set (d) C9 Production data set

Figure 9: Clusters in each data set

Figure 9 shows the clusters found by the DMT generator. The different colors indicate different clusters, which are the identified states of a given data set or machine. Note that each point in the scatter plots is not a singular measurement in the data set: the scatter plots are both scaled and dimensionally reduced in order to produce a simple visualisation that can be shown in three dimensions. Tables 1 and 2 show all the results from the clustering part. As mentioned in section 2.4, the target number for the silhouette coefficient is 1.0, as it ranges from -1.0 to 1.0. The Davies–Bouldin index uses a real number as its metric, and a lower value means the clustering is better. Finally, the variance ratio criterion also uses a real number, where a higher value means the clustering is better.

Figure 9a has the most clusters and also the most data points. This creates a denser clustering environment, which means that the results are more sensitive to smaller changes in the data. This figure also has the greatest amount of noise relative to the number of labeled points. Both the high noise ratio and the dense clusters therefore result in quite mediocre metrics: only 0.084 for the silhouette coefficient, 2.393 for the Davies-Bouldin index and 2658.688 for the variance ratio criterion. This data is then used to motivate the number of clusters found.

In figure 9b the clusters are more pronounced than in the previous figure, and they have a more unusual shape. This clustering also has the lowest noise-to-data-point ratio, with only 0.126 %. The final cluster evaluation is a silhouette coefficient of 0.379, a Davies-Bouldin index of 1.736 and a variance ratio criterion of 13442.468. Figures 9c and 9d show the clusters found in the production data sets. Since both contain failure data, they are quite similar: both have three clusters and a similar noise ratio. In figure 9c it is clearly shown that one large cluster has a region which is quite separated from the rest of that same cluster label. Regardless, both show quite good results in the evaluation of their clustering: figure 9c has a noise ratio of 4.557 % and figure 9d has a noise ratio of 4.059 %. As shown in table 2, one notable mention is that, due to the larger density of the clusters, figure 9d has a better variance ratio criterion even though the data sets are quite similar.

Table 1: Results showing clusters and noise points

Data set Clusters Noise points Noise ratio

Gas Sensor Array 5 4457 15.072 %

IoT Telemetry 3 51 0.126 %

C8 Production 3 720 4.557 %


Table 2: Results showing internal cluster evaluation

Data set Silhouette Davies-Bouldin Variance Ratio Criterion

Gas Sensor Array 0.084 2.393 2658.688

IoT Telemetry 0.379 1.736 13442.468

C8 Production 0.508 0.570 9903.594

C9 Production 0.440 0.884 16612.350

These clusters are then the basis for the second result, which is to plot each of the clusters against the time in the data set. This produces a plot which shows the transitions between states, shown in figure 10, where -1 indicates noise and the remaining cluster labels are machine states.
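Reproducing such a state-transition plot is straightforward once every row has a cluster label; a sketch with matplotlib on stand-in data (the column name and time index are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for a clustered data set: a time index and per-row cluster labels.
df = pd.DataFrame({"state": np.repeat([0, 1, 2], 100)},
                  index=pd.date_range("2021-01-01", periods=300, freq="min"))

plt.scatter(df.index, df["state"], s=2)
plt.xlabel("time")
plt.ylabel("cluster label (machine state)")
plt.show()
```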

(a) Gas sensor array data set (b) IoT telemetry data set

(c) C8 production data set (d) C9 production data set

Figure 10: Machine states from each data set

the figure shows there is a state transition; the noise is also notable, but since this is the largest data set, more noise is expected.

Figure 10b also shows overlapping clusters; as in the previous figure, the machine alternates between different working states during the operational time in which the data was recorded. The noise is also notably smaller because of the more efficient clustering.

A more detailed look at figure 10c shows that at first the machine is in state zero. The result cannot describe what the machine does during this state; the only thing it can show is that at a certain point the machine begins a transition period to state two. This is shown by the gradual shift in states from cluster label zero to one to two, which could indicate that state one is a transition period or transition state where the machine moves to another state. This may be due to an increased or decreased load on a couple of sensors. A similar result is shown in figure 10d, since the two data sets have similar structures.

Figures 10a, 10b, 10c and 10d paint a clear picture of the different states and transitions in each graph. This is used both as a visualisation of how the machine behaves during its run time and to show that the clusters have a connection to each other. Following the time on the x-axis, the states shift from one to another based on the sensor data which was previously clustered to find these states. This result is then used to separate the original sensor values into the states for the next module, where the distribution analysis is done.

5.3 Sensor analysis

The results show the sensor distributions for each sensor in each state. As mentioned in section 4.4, the purpose of the sensor analyzer is to estimate the most likely statistical distribution of each sensor's time series data in each state by using a chi-squared goodness of fit test. Since some tables are too big to show, only the results from the IoT telemetry dataset are shown in table 3. The main focus of these results is the chi-square values, since they decide which statistical distribution is the most probable in any given test. The p-value is used when two distributions are similar; the better p-value then decides which statistical distribution is chosen.

Table 3 represents the distribution information about the analyzed data. This data is only one of multiple series of information which the application produces. As this is the second part of the DMT, it is important to pick the distribution which best represents that particular time series data.


Table 3: Results of sensor distribution analysis from the IoT telemetry data set

Sensor     State 1: Distr. [chi2 / p]     State 2: Distr. [chi2 / p]   State 3: Distr. [chi2 / p]
CO         lognorm [4753 / 3.3e-54]       f [786 / 5.1e-23]            weibull_min [5698 / 1.1e-68]
Humidity   crystalball [2656 / 9.2e-23]   lognorm [217 / 3.1e-26]      pearson3 [97571 / 0]
LPG        f [4039 / 5.7e-52]             lognorm [976 / 1.5e-23]      pearson3 [6046 / 1.8e-108]
Smoke      lognorm [4337 / 1.6e-54]       lognorm [978 / 1.2e-23]      weibull_min [5631 / 1.2e-73]
Temp       f [3725 / 6.2e-93]             lognorm [1079 / 5.6e-54]     lognorm [123340 / 0]

Note the difference between the states and the chosen statistical distribution: in state one the CO sensor has a lognorm distribution, which changes to an f distribution in state two and ends up as a weibull_min distribution in state three.

This process is then repeated for every other sensor in the data set, and the whole procedure is later carried out for each of the other data sets. Since some data sets have between 15 and 20 sensors, only the smallest data set is included in the thesis for visual purposes. Appendices B, C, D and E show the full tables containing the results of the sensor statistical analysis.
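The thesis application itself is not reproduced here, but a minimal SciPy sketch of such a per-sensor, per-state fit could look as follows. The candidate list, the bin count and the degrees-of-freedom handling are illustrative assumptions rather than the thesis's exact choices:

import numpy as np
import scipy.stats as st

# Illustrative candidate set; the application tests a larger number
# of SciPy distributions.
CANDIDATES = ["lognorm", "f", "weibull_min", "pearson3", "crystalball"]

def best_distribution(samples, n_bins=50):
    """Return (name, chi2, p, params) for the best-fitting candidate:
    lowest chi-square statistic, ties broken by the higher p-value."""
    observed, edges = np.histogram(samples, bins=n_bins)
    best = None
    for name in CANDIDATES:
        dist = getattr(st, name)
        params = dist.fit(samples)                 # maximum-likelihood fit
        expected = len(samples) * np.diff(dist.cdf(edges, *params))
        expected = np.clip(expected, 1e-9, None)   # guard against empty bins
        chi2 = np.sum((observed - expected) ** 2 / expected)
        p = st.chi2.sf(chi2, df=n_bins - 1 - len(params))
        if best is None or (chi2, -p) < (best[1], -best[2]):
            best = (name, chi2, p, params)
    return best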

5.4 Digital machine template

The final result gathered from this thesis is the DMT itself, whose overall structure is shown in code listing 1. It is supposed to represent all the vital information from the analyzed machine: the states, the sensor name from the data set, the maximal and minimal values of the sensor data, the analyzed statistical distribution, and the parameters required for that distribution to reproduce results similar to those of the machine. The main part of the structure is the state fields, which determine how many states are contained in the DMT. Each state field contains a number of sensor fields holding all the information about each sensor in that state. As an example, the DMT of the IoT telemetry data set is shown in Appendix A.

Listing 1: Structure example of DMT


{
    "state_1": {
        "sensor_1": {
            "name": "sensor_name",
            "max_value": 1,
            "min_value": 0,
            "stat_dist": "distribution_name",
            "dist_parameters": []
        },
        "sensor_N": {
            "name": "sensor_name",
            "max_value": 1,
            "min_value": 0,
            "stat_dist": "distribution_name",
            "dist_parameters": []
        }
    },
    "state_N": {
        "sensor_1": {
            "name": "sensor_name",
            "max_value": 1,
            "min_value": 0,
            "stat_dist": "distribution_name",
            "dist_parameters": []
        },
        "sensor_N": {
            "name": "sensor_name",
            "max_value": 1,
            "min_value": 0,
            "stat_dist": "distribution_name",
            "dist_parameters": []
        }
    }
}

The data set which produced this digital machine template is the IoT Telemetry Data set [38]. This data set has five sensors, and each has its own distinct representation in the file, clearly showing what each sensor produced. The rest of the file repeats this structure for each state found, as every sensor is analyzed in each state separately to understand how that particular sensor behaves in isolation in that particular state.
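A sketch of how such a file could be assembled and written is shown below; the intermediate tuple format assumed to come from the sensor analyzer is hypothetical, not the thesis's actual interface:

import json

def build_dmt(state_results):
    """Assemble the DMT dictionary from per-state sensor results.

    state_results maps state names to sensors, each sensor being a
    (name, max_value, min_value, dist_name, dist_params) tuple.
    """
    return {
        state: {
            key: {
                "name": name,
                "max_value": max_v,
                "min_value": min_v,
                "stat_dist": dist_name,
                "dist_parameters": list(params),
            }
            for key, (name, max_v, min_v, dist_name, params) in sensors.items()
        }
        for state, sensors in state_results.items()
    }

example = {
    "state_1": {
        "sensor_1": ("Temp", 1, 0, "lognorm", [0.5, 0.0, 1.2]),
    },
}
with open("dmt.json", "w") as fh:
    json.dump(build_dmt(example), fh, indent=4)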


6 Discussion

In this section a number of points surrounding the work are discussed, and future work is featured at the end.

6.1 Framework analysis

The constructed framework is the core of the thesis's idea. By utilizing two models from previous research, a more streamlined and customized framework was created to identify the required information for the DMT. The framework is based on the models introduced in section 3 and, with the added knowledge of the CRISP-DM model, was designed with simplicity in mind. With sensor data as the physical part and a plain JSON file as the digital part, the focus lies on processing and handling the data. This means the layer between the physical part and the digital part carries a lot of information, and this layer is also the basis of how the DMT is created. The framework therefore has an internal processing module which takes the raw data and extracts the required information from it. The result is quite similar to the three-dimensional model proposed by Grieves [5] and to the five-dimensional model mentioned in a survey about digital twins [29]. Neither model fit the idea of this work exactly; instead, the three-dimensional model was improved upon by adding a processing layer between the physical layer and the digital layer, which allows the information to be processed further before it is used inside the digital device.

6.2 Development of application

During the development of the application some things were clear from the start: creating a production-ready application was not in the scope. Since the application is only intended for programmers to use, each module was kept separate to allow easy modification and customization. This also led to some shifts in focus while developing the application.


The search for both the clustering algorithm and the correct parameters for the state analysis was an interesting part, as the definition of a DMT states that it should recognize the number of states in a machine by clustering. A number of algorithms could not be used, as they were either too slow or too simple for the multivariate time series data in the data sets, which is both irregular and of varying density. Since the definition of the DMT quickly narrowed down the number of usable algorithms, the problem was instead to find algorithms which do not require the number of clusters to be stated beforehand. One such pipeline is sketched below.
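HDBSCAN, described in section 2.3.6, is one algorithm that satisfies this requirement. A minimal sketch of such a pipeline, with PCA for dimensionality reduction and an illustrative min_cluster_size (the thesis does not state its exact parameters), could look as follows:

import hdbscan                                   # pip install hdbscan
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def find_states(sensor_matrix, min_cluster_size=100):
    """Cluster multivariate sensor data without fixing the number
    of clusters beforehand; label -1 marks noise points."""
    X = StandardScaler().fit_transform(sensor_matrix)
    X = PCA(n_components=0.95).fit_transform(X)  # keep 95 % of the variance
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)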

The use of a chi-square goodness-of-fit test for the sensor analysis was identified during the research phase, but comparing seemingly random data sequences requires a lot of computational time. For this simple implementation that is not a problem, but if the recorded or gathered data grows too large the application might become slow and stop yielding results in reasonable time. Imagine a data set with 25 sensors where the clustering algorithm identifies six states in the data: the number of chi-square tests needed to find the statistical distributions increases drastically with the number of distributions the application aims to test.
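To make the scaling concrete: SciPy ships on the order of a hundred continuous distributions, so 25 sensors and six states would already mean roughly 25 × 6 × 100 = 15 000 fits if all of them were candidates. The thesis does not state its exact candidate set, but a quick count illustrates the growth:

import scipy.stats as st

n_sensors, n_states = 25, 6                   # the example from the text
candidates = [name for name in dir(st)
              if isinstance(getattr(st, name), st.rv_continuous)]
print(len(candidates), "candidate distributions")
print(n_sensors * n_states * len(candidates), "fits and tests in total")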

In addition, the application needed to produce metrics used for evaluation, which is not something that should be included in a final application release.

6.3 Evaluation of results

The results involve two different evaluation methods, both used to find the best possible representation of either the number of states or the best statistical distribution. These two are the main focus of the evaluation of the implementation's results.


The way the chi-square goodness-of-fit test operates creates a good suggestion of what statistical distribution the sensor's data might have.

These results are therefore quite good, since the implementation is designed around the CRISP-DM model and thereby produces results which can be motivated by concrete sources and by choices made during the research and development stages. It clearly shows that the method used to create the lightweight application is viable. The results also show that the resulting understanding of how the machine behaves is almost on par with a digital twin, with the exception of more advanced capabilities.

6.4 Applications of a DMT

The two most prominent use cases of a DMT are i) helping the development of predictive maintenance algorithms and ii) helping the development of data log anomaly detection algorithms. The DMT would let developers generate data similar to what the machine produced, which can then be used to develop anomaly detection or predictive maintenance algorithms more efficiently. This creates an environment which acts similarly to a digital twin, but with the added perk of a more detailed description of the behaviour of each sensor, leading to more precise predictions and thereby more successful algorithms for the problems stated above.

The second use is that the DMT could serve as a template for designing digital twins capable of reproducing the recorded machine data, with the added support of software which takes the sensor and state information together with the labeled data set to recreate the data. This was one of the specifications given by the company. A sketch of such data generation is shown below.
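The sketch assumes a DMT file with the structure of listing 1; sample_sensor is a hypothetical helper, not part of the thesis application:

import json
import numpy as np
import scipy.stats as st

def sample_sensor(entry, n=1000):
    """Draw n synthetic readings for one sensor in one state from the
    distribution stored in the DMT, clipped to the recorded range."""
    dist = getattr(st, entry["stat_dist"])
    values = dist(*entry["dist_parameters"]).rvs(size=n)
    return np.clip(values, entry["min_value"], entry["max_value"])

with open("dmt.json") as fh:
    dmt = json.load(fh)
synthetic = sample_sensor(dmt["state_1"]["sensor_1"])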

6.5 Ethical and social aspects

Since the area of digital twins is quite new, the issues surrounding it are not fully realised either. The core problem is that if a digital twin is compromised there could be large ramifications, since all of its data becomes available for abuse. This could even lead to complete failure of the digital twin if the data is modified in order to shut down or damage either the physical or the digital device.


There is also the risk that sensitive data from the machine could be leaked.

Security is also an aspect, as whole systems contain a lot of exploitable data. The data needed to create the DMT is capable of disclosing the identity of a company, so a lot of focus would be needed on the security aspects of both encryption and safe storage of the data.

As each of the parts affects the production line simulation, the most critical aspect is, as mentioned above, the way the data is used and handled; these are the terms under which a digital twin remains successful. If the digital twin is not optimized, it might produce results which contradict the settings of the real production line, meaning some data may be conflicting.

6.6 Future work

The technology in the area of digital twins is expanding by the day, and new and exciting approaches appear constantly; this thesis has only scratched the surface. The task of creating a full digital twin is an enormous undertaking and, by the definition used in this thesis, was not required. The idea was to use properties from digital twins to analyze the sensor data in order to create a template intended as a basis for data generation.

First of all, the work done in this thesis is preliminary, and the specifications did not require the time information to be embedded into the DMT; the time information was only specified to exist in the pairing with the other part which the company already has. This means the resulting DMT should be improved upon by adding the temporal dimension to each sensor in each state.

One source of inspiration for the idea of recognizing states is the study of automata theory and the creation of finite state machines. Such finite state machines are a natural next step, for example by using hidden Markov models as in the article "Hidden markov model-based digital twin construction for futuristic manufacturing systems" by Angkush Kumar Ghosh, AMM Sharif Ullah and Akihiko Kubo [41], who discuss generating a digital twin from the concept of a hidden Markov model.
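As a sketch of this direction, the hmmlearn library (one possible choice; the cited article does not prescribe a library) can fit a Gaussian HMM whose hidden states would play the role of machine states:

import numpy as np
from hmmlearn import hmm                      # pip install hmmlearn

rng = np.random.default_rng(0)
sensor_matrix = rng.normal(size=(500, 5))     # placeholder for real sensor data

# One hidden state per machine state; emissions are the sensor vectors.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
model.fit(sensor_matrix)
states = model.predict(sensor_matrix)         # most likely state sequence
print(model.transmat_)                        # learned state-transition matrix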


7 Conclusions

In this section the conclusions are summarized and presented, together with a short discussion about the validity of the results. Section 1.6 presented the concrete goals, which therefore need to be fulfilled; this was done as explained in section 3. By analyzing the final results, a number of conclusions can be drawn.

Goal one was to define what a DMT is by studying several research articles, and identifying what a machine produces was the first part of solving this. Secondly, the DMT needed to explain how the data behaves, which makes the machine states and the sensor analysis a large part of describing the machine's behavior: if the states and the sensor value distributions are known, one could replicate the data from the original machine. This understanding of what kind of data the machine is capable of producing is therefore central to the definition of the DMT. The end product is a collection of statistical distributions for each sensor in each state.

Goal number two was to create a program that generates a DMT. This part also served as a testing environment for the later stage where verification of the DMT was needed. Creating both a testbed and a production-ready environment was challenging, but the result served both purposes. The finalized application was able to perform each task specified by the goals, meaning it can produce a file containing all of the relevant information the DMT definition requires.

The third and final goal was to analyze and verify the DMT as a whole. This is quite challenging, since no established model exists for evaluating how well a representation captures the intended device or machine. Instead of evaluating the whole DMT, parts of it were evaluated: the clustering evaluation and the sensor analysis evaluation together serve as the evaluation of the DMT. This is enough to validate the results, as the evaluation methods are gathered from trusted sources.

7.1 Research questions

References
