Data mining techniques for modeling the operating behaviors of smart building control valve systems


Master of Science in Computer Science June 2020


AMIRMOHAMMAD EGHBALIAN

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden


The thesis is equivalent to 20 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Author(s): AMIRMOHAMMAD EGHBALIAN
E-mail: ameg18@student.bth.se

University advisors:
Prof. Veselka Boeva
Shahrooz Abghari
Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Background. One of the challenges with smart control valve systems is processing and analyzing sensor data to extract useful information. This information can be used to detect deviating behaviors, which can be an indication of faults and issues in the system. Outlier detection is the process of finding these deviating behaviors.

Objectives. First, perform a literature review to gain insight into the machine learning (ML) and data mining (DM) techniques that can be applied to extract patterns from time-series data. Next, model the operating behaviors of the control valve system using appropriate machine learning and data mining techniques. Finally, evaluate the proposed behavioral models on real-world data.

Methods. A literature review is conducted to gain a better understanding of the different ML and DM techniques for extracting patterns from time-series data and for fault detection and diagnosis of building systems. Subsequently, an unsupervised learning approach is proposed for modeling the typical operating behaviors and detecting the deviating operating behaviors of the control valve system. Additionally, the proposed method provides supplementary information for domain experts to help them in their analysis.

Results. The outcomes of modeling and monitoring the operating behaviors of the control valve system are analyzed. The evaluation of the results by the domain experts indicates that the method is capable of detecting deviating or unseen operating behaviors of the system. Moreover, the proposed method provides additional useful information for better understanding the obtained results.

Conclusions. The main goal of this study was achieved by proposing a method that can model the typical operating behaviors of the control valve system. The generated model can be used to monitor newly arrived daily measurements and detect deviating or unseen operating behaviors of the control valve system. It also provides supplementary information that can help domain experts and reduce the time needed for analysis.

Keywords: Machine learning, Data mining, Outlier detection, Time-series, HVAC&R


Acknowledgments

First of all, I would like to thank my supervisors, Professor Veselka Boeva and Shahrooz Abghari, for their trust, support, and valuable feedback. Without your guidance and great patience, I would not have been able to do this.

I would like to thank Farhad Basiri, the CEO of iquest company and his colleague Otto Sandstrom who gave me this opportunity and provided the materials for this study.

Last but not least, I would like to thank my dear family, who supported me both emotionally and financially.


Abstract

Acknowledgments

1 Introduction
1.1 Problem Statement
1.2 Research Questions
1.3 Aims and Objectives
1.4 Background
1.4.1 Machine Learning
1.4.2 Clustering Analysis
1.4.3 Data Mining
1.4.4 Outlier Detection
1.4.5 District Heating System
1.5 Outline

2 Related Work
2.1 Search Terminology
2.2 Literature Review

3 Methods and proposed approach
3.1 Methods
3.2 Proposed Approach

4 Experiment and Evaluation
4.1 Data
4.2 Data Preprocessing
4.3 Experimental Setup and Implementation

5 Results and Analysis

6 Discussion

7 Conclusions and Future Work
7.1 Answering Research Questions
7.2 Future Work

References

A.2 Operating Modes

Chapter 1 Introduction

Nowadays, countless devices are equipped with smart sensors that can communicate with each other and can be accessed via the Internet. This is what we call the Internet of Things (IoT). Driven by progress in fields such as computing, communication, and electronics, it has made our lives, cities, and premises more modern. The smart environment is a key concept in IoT, and one of the domains it covers is smart buildings. IoT-based solutions have had a great effect on decreasing the energy waste caused by suboptimal management and human activities. Examples of IoT-based automation systems are SmartThings, Vera, Microsoft Lab of Things (LoT), openHAB, Ninjablocks, Twine, and the CASAS Smart Home project [25]. To encourage energy consumers to use these technologies, their comfort should be considered [17].

Heating, Ventilation, Air Conditioning, and Refrigeration (HVAC&R) is a system designed to meet the thermal needs and requirements of different kinds of buildings, such as residential and industrial ones. There are different types of HVAC&R systems, all of which can be categorized into two main groups: central and local systems. Put simply, the main task of an HVAC&R system is to heat or cool the outdoor air to the desired temperature and then draw it into the building.

One of the important parts of an HVAC&R system is the control system, whose role is to regulate the operation and performance of the HVAC&R system. Energy management and safety are further capabilities that we expect from modern control systems. Energy management means that the system should perform its main tasks in the most efficient way. Safety is a function that protects people and the HVAC&R system from damage; limiting the temperature to prevent overheating or freezing is an example of a safety function. The control loop of each control system consists of five parts: a sensing element, a transmitter, a controller, a final control element, and the process. For a long time, the control valve has been used as the main final control element in the control systems of many types of HVAC&R equipment [23, 34]. A control valve is a type of valve used to control the flow and pressure of liquids or gases. One of the technological advances in smart buildings is the advent of smart control valves, which are equipped with sensors that can collect diagnostic data such as valve position or performance.


1.1 Problem Statement

One of the challenges with smart control valves is processing and analyzing the data collected from the sensors. The sensors in these valves collect a large amount of data, which makes the analysis and the extraction of useful information difficult. The extracted information is of great importance because it allows us to recognize deviating operational behaviors of the system. Such behaviors can be an indication of faults such as cavitation or misconfiguration and, in some cases, of an unsuitably sized control valve. Detecting faults at an early stage can reduce maintenance costs, improve energy efficiency, and, more importantly, reduce carbon dioxide emissions. Finally, better system performance will bring more comfort to the customers.

1.2 Research Questions

With respect to the characteristics of the time-series data and the problem at hand, this thesis addresses the following research questions:

• RQ1: What algorithms are suitable to extract patterns from time-series data?

Motivation: The reason for studying RQ1 is that many algorithms can be used for time-series pattern extraction. The algorithm for this purpose cannot be selected randomly and selecting the right algorithms needs analysis of the data. Pattern extraction is one of the prerequisites of modeling the operating behaviors of the system.

• RQ2: What types of ML and DM methods are suitable for fault detection and diagnosis (FDD) in control valve systems?

Motivation: The study of RQ2 will give an insight into the domain. Additionally, we gain knowledge about the types of ML and DM methods used in the domain for FDD in control valve systems.

• RQ3: How can the knowledge gained from RQ1 and RQ2 be used to propose a method for modeling and monitoring control valve system behaviors?

Motivation: RQ3 is tightly related to RQ1 and RQ2. The answer to RQ3 can serve as evidence of the suitability of the algorithms and methods that were selected in RQ1 and RQ2.

To answer RQ1 and RQ2, a literature study will be performed. To answer RQ3, the knowledge gained from RQ1 and RQ2 will be used to propose a method for detecting deviating behaviors of the control valve system. Finally, an experiment will be performed to show the applicability of the proposed method to real-world data.


1.3 Aims and Objectives

The aim of this thesis is to use different data mining and knowledge discovery techniques for modeling, analyzing, and understanding the operating behavior of control valve systems in smart buildings. The data used in this thesis is in the form of time series.

The objectives of this thesis are:

1. Performing a literature study to gain insight into data mining techniques and algorithms that can be used for extracting patterns from time-series data and for building and visualizing behavioral models from unlabeled data.

2. Modeling the operating behavior of the control valve system by applying suitable data mining techniques.

3. Evaluating the proposed behavioral models on real-world datasets and discussing the obtained results with the domain experts.

1.4 Background

1.4.1 Machine Learning

Machine learning (ML) is a subcategory of artificial intelligence (AI) that enables computers to learn and improve without being explicitly programmed. Peter Flach’s book gives the following general definition of ML [21]: “Machine learning is the systematic study of algorithms and systems that improve their knowledge or performance with experience.” In ML, the main effort is to select a proper set of features, create suitable models, and use the built models to accomplish the right tasks. Classification, regression, clustering, and descriptive modeling are some of these tasks. Data can be labeled or unlabeled and, based on this distinction, machine learning methods can be classified into two major groups:

Supervised Learning

In supervised learning tasks, input and output variables are used to learn a function that maps the inputs to the outputs. The main goal is to find a mapping function that can predict the output for new input data. Classification and regression are the two main groups of supervised learning tasks. In classification problems, the outputs are categorical (or discrete) variables, whereas in regression problems the outputs are numerical (or continuous) variables. Classifying the input data from a heating system as faulty or non-faulty is an example of a classification problem, and predicting the valve openness is an example of a regression problem. Linear regression, random forest, and support vector machines (including support vector regression) are some popular examples of supervised learning algorithms.


Unsupervised Learning

In this type of learning, only the input variables are available. The aim of unsupervised learning algorithms is to learn more about the data by extracting patterns and detecting hidden structures in it. Clustering is an example of unsupervised learning and is explained in the next section.

1.4.2 Clustering Analysis

Clustering is an unsupervised learning task in which the data are split into different groups (clusters). Within each cluster, the similarity between the data points is high, while the similarity between the clusters is as low as possible. Clustering algorithms can be classified into five groups: partitioning-based, hierarchical-based, model-based, density-based, and grid-based algorithms. More information about each group can be found in [19]. In this study, k-means (see Section 3.1), which is a popular partitioning-based algorithm, is applied.

Multi-view Clustering

Most common clustering methods use a single set of features (a view) to cluster the data points, which is called single-view clustering. Multi-view clustering (MVC) [14] is a type of clustering in which the data points are represented using different feature sets (views). In MVC, the data points should be clustered into the same group regardless of the view. Multi-view data can be seen in many real-world applications such as multimedia content. Each segment of multimedia content can be described from two different views: one view can be the video signals from video recording devices and the other the audio signals from audio recording devices.

1.4.3 Data Mining

Data mining is the process of extracting information from data using computerized techniques. It is mostly used in exploratory analysis, where we look for new, non-trivial patterns and information. Through this process, we can combine our knowledge of formulating problems with computers’ strong search abilities to attain the best outcomes [27]. Prediction and description are the two main goals of data mining. In prediction, the main effort is to predict the future values of specific variables using the available data. Descriptive models are used to extract patterns that describe the data in human-understandable ways.

Data Preprocessing

Lack of tight control over data collection is one of the reasons that make data preprocessing important in data mining. Noise in the data is one of the obstacles that complicate the data mining process. In many cases, the existence of irrelevant data makes the results of the analysis unreliable. Data preprocessing is a process that transforms or encodes the features so that the machine can parse them easily. Some of the data preprocessing techniques are:


• Feature Selection: Creating the model based on irrelevant features can reduce its accuracy. Feature selection is a manual or automated process that helps to select the features most relevant to the output we are interested in.

• Data Cleaning: A process in which we try to correct or remove inconsistent data. Duplicate data points and missing values are examples of inconsistent data that affect the quality of the data and the results. Generally, one of the following tasks may be applied to handle inconsistent data: 1) removal, 2) correction, or 3) imputation.

• Data Discretization: A process in which continuous data, such as time-series data, is transformed into discrete data, such as nominal attributes.

• Feature Scaling: A process that normalizes the independent features to a specific range. This technique can make the comparison of the independent features easier and, in some cases, it will increase the speed and performance of the machine learning algorithms.
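As an illustration of data cleaning, feature scaling, and data discretization, the following minimal Python sketch imputes a missing value, scales the readings to the range [0, 1], and maps them to nominal levels. The function names, the mean-imputation strategy, the equal-width bins, and the sample readings are assumptions made for this example and are not taken from the experiment in this thesis.

```python
from typing import List, Optional


def clean(values: List[Optional[float]]) -> List[float]:
    """Impute each missing reading (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values: List[float], labels: List[str]) -> List[str]:
    """Map scaled values in [0, 1] to equal-width bins named by `labels`."""
    n = len(labels)
    return [labels[min(int(v * n), n - 1)] for v in values]


# Hypothetical valve-opening readings in percent, with one missing value.
readings = [10.0, None, 50.0, 90.0]
cleaned = clean(readings)        # missing value replaced by the mean
scaled = min_max_scale(cleaned)  # values now lie in [0, 1]
levels = discretize(scaled, ["low", "medium", "high"])
print(cleaned, scaled, levels)
```

Mean imputation is only one of several options; removing the affected data points or correcting them from another source are equally valid choices depending on the data.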

1.4.4 Outlier Detection

The patterns in the data that do not correspond to the expected behaviors are called anomalies (outliers), and the act of detecting these patterns is called outlier detection [13]. Detecting anomalies in the data is of great importance because interpreting them often gives us vital information in different domains. A simple way to detect outliers is to identify a region that describes normal behavior in the data; data points that lie outside this region can be considered anomalies. Several factors make this simple approach quite complicated. One is that defining a region that includes all possible normal behaviors is not an easy process. Also, the boundary between normal and deviating behavior is often not precise.

Anomalies fall into three main classes: point anomalies, contextual anomalies, and collective anomalies. Data points that can be viewed as outliers with regard to the other data points in the dataset are called point anomalies. A collective anomaly is a collection of related data points that, because they occur together, are considered anomalous with regard to the other data points in the dataset. Data points considered outliers in a particular context are called contextual anomalies (conditional anomalies). The definition of the context varies with the structure of the dataset and should be determined when the problem is defined. A data point can be described by two sets of features: contextual attributes and behavioral attributes. Attributes used to define the context of the data points are called contextual attributes. For example, in the heating system domain, outdoor temperature can be considered a contextual attribute. On the other hand, attributes used to specify the non-contextual properties of the data are called behavioral attributes. For example, the monthly average valve openness or the monthly standard deviation of the valve openness in a specific building can be considered behavioral attributes in heating systems.
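The simple approach mentioned in this section, in which a region of normal behavior is defined and data points outside it are flagged, can be sketched as follows. The sketch assumes, purely for illustration, that the normal region is the mean of some reference data plus or minus three standard deviations. The function names and sample values are hypothetical, and this is not the method proposed in this thesis.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normal_region(reference: List[float], width: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds: mean +/- width * standard deviation."""
    mu = mean(reference)
    sigma = pstdev(reference)
    return mu - width * sigma, mu + width * sigma


def find_outliers(values: List[float], region: Tuple[float, float]) -> List[int]:
    """Return the indices of the values that fall outside the normal region."""
    lower, upper = region
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical reference values describing normal behavior.
reference = [40.0, 42.0, 38.0, 41.0, 39.0]
region = normal_region(reference)

# Newly observed values: the second one lies far outside the region.
new_values = [40.5, 75.0, 39.0]
outlier_idx = find_outliers(new_values, region)
print(region, outlier_idx)
```

Such a fixed band only captures point anomalies. Contextual anomalies would require the bounds to depend on a contextual attribute such as the outdoor temperature.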


Fault Detection and Diagnosis

Fault Detection and Diagnosis (FDD) is the process in which we try to detect faults and understand their causes in a physical system. HVAC&R is an example of a physical system in buildings. Automated Fault Detection and Diagnosis (AFDD) equipment comprises the tools and technologies that can be applied to automate the FDD process [9]. In general, FDD techniques can be classified into three main groups: 1) quantitative model-based methods, 2) qualitative model-based methods, and 3) process history-based methods.

Quantitative model-based methods are a collection of quantitative mathematical relations relying on the underlying physics of the processes. The growth of control systems’ complexity and of computer usage has increased the importance of quantitative model-based FDD systems [24, 37, 43]. These methods are more complex and computationally intensive, but also more accurate and reliable, than qualitative model-based and process history-based methods. Modeling the transient behaviors of the system is a task that quantitative model-based methods can perform better than other modeling techniques. These methods can be classified into detailed physical models and simplified physical models. In physical model-based approaches, a set of measured inputs such as temperature is used to predict or estimate the behaviors or outputs of the system. To detect faults, these estimated values are then compared with the measured outputs. The main idea of detailed physical models is to use detailed information about the physical connections and features of all the parts in a system. Simplified physical models, on the other hand, usually apply simpler approaches that need fewer computations than detailed physical models.

Qualitative model-based methods are a collection of qualitative relations inferred from knowledge of the underlying physics. They can be divided into rule-based models and qualitative physics-based models. Both use causal knowledge for fault diagnosis in the system. In rule-based models, if-then-else rules are created from a priori knowledge, and conclusions are drawn from these rules. Qualitative physics-based models are able to draw conclusions about a system’s state in situations where the available knowledge is uncertain or incomplete.

Process history-based (data-driven) methods only use measurement data coming from the processes to create behavioral models. In process history-based models, feature extraction is a technique in which input and output data are converted and applied as a priori knowledge. The main goal of these methods is to connect measured inputs and outputs using mathematical relationships. Process history-based models are classified into black box and gray box models. In black box models, fault detection is based on parameter estimation; in most situations, the parameter deviation does not have a physical significance. Gray box models are a combination of physics-based models and data-driven models. In these models, the physics-based models use data to learn the parameter estimates (e.g., coefficients) of the equations. Linear regression and artificial neural networks (ANNs) are some examples of process history-based methods [28].
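To illustrate how estimated outputs can be compared with measured outputs, the following sketch fits the coefficients of a simple linear relation to fault-free historical data and flags a measurement as faulty when its residual exceeds a threshold. The assumed linear relation between outdoor temperature and supply temperature, the threshold, and all names and values are hypothetical choices made for the example.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def detect_faults(x: List[float], y: List[float],
                  coeffs: Tuple[float, float], threshold: float) -> List[bool]:
    """Flag each measurement whose residual |measured - estimated| is too large."""
    a, b = coeffs
    return [abs(yi - (a * xi + b)) > threshold for xi, yi in zip(x, y)]


# Hypothetical fault-free history: outdoor temperature -> supply temperature.
train_x = [-10.0, 0.0, 10.0, 20.0]
train_y = [70.0, 60.0, 50.0, 40.0]
coeffs = fit_line(train_x, train_y)  # learned coefficients (slope, intercept)

# New measurements: the second one deviates strongly from the estimate.
flags = detect_faults([-5.0, 5.0], [65.0, 80.0], coeffs, threshold=5.0)
print(coeffs, flags)
```

The threshold controls the trade-off between missed faults and false alarms and would have to be chosen with domain knowledge in practice.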


1.4.5 District Heating System

A district heating system is used to deliver heat and hot water from a central boiler, through a piping system, to buildings located in a limited geographical area. The main parts of a district heating system are:

1. Production unit, in which the required heat and hot water are produced.

2. Distribution unit, which delivers the produced heat and hot water from the production unit to the consumption unit.

3. Consumption unit, in which the buildings are located and which receives the heat and hot water produced in the production unit.

The parts mentioned above form the primary side (circuit) of the district heating system. Additionally, there is a secondary side (circuit), which refers to the heating system located in the consumption unit. The main components of the secondary side are a sub-station with a heat exchanger, a piping system that circulates the water, and equipment such as radiators that transfers the heat to the room’s space. A sub-station is a component that connects the primary side to the secondary side. Moreover, it is responsible for setting a suitable pressure and temperature for the supply water.

Figure 1.1: District Heating System

1.5 Outline

The remaining parts of this study are structured as follows:

• Chapter 2: Reviews the studies related to FDD of the building systems and clustering of time-series data.

• Chapter 3: Presents a method for modeling and monitoring the operating behaviors of the control valve system.

• Chapter 4: Conducts an experiment by applying the proposed method to real-world data.

• Chapter 5: Presents the results of the experiment and their analysis.

• Chapter 6: Presents a discussion about the proposed method, the obtained results, and threats to validity.

• Chapter 7: Presents the conclusions of this study, answers the research questions, and introduces an idea for future work.

Chapter 2 Related Work

2.1 Search Terminology

The papers and resources were mostly found in and collected from digital libraries such as Scopus. First, reviews and surveys related to FDD methods for smart building systems and to extracting patterns from time-series data were selected. Next, to collect more literature, the backward snowballing technique was applied to the initial papers. The literature was collected using the following search strings:

• "Extract patterns" + "Time-series data"

• "Extract patterns" + "Time series data"

• "Fault detection" + "Smart building"

• "Fault detection" + "Smart building" + "control Valve"

To select the most related papers the following inclusion and exclusion criteria are used:

Inclusion Criteria:

• Published in English language

• Related to the topic, i.e., title, abstract and conclusion sections of the literature are checked to understand whether they are related to the topic or not.

Exclusion criteria:

• Published in a language other than English

• The full-text of the literature is not available

2.2 Literature Review

Time-series data have some specific characteristics that make pattern extraction difficult, such as high dimensionality, high correlation between the features, and a high number of noisy data points. The review of the studies related to time-series clustering from the last decade shows that partitioning algorithms are the ones most commonly used. k-means is an example of a partitioning algorithm that is widely applied for clustering time-series data [32]. The fast response of these algorithms is one of the reasons they are relatively popular. In these algorithms, the number of clusters must be specified before the algorithm is applied, which is one of the reasons that make the application of partitioning algorithms in real-world cases difficult. Additionally, these algorithms are suitable in situations where we are working with equal-length time series [8].

Active research on Fault Detection and Diagnosis (FDD) in HVAC&R systems started in the 1980s, and since then FDD and data mining techniques have matured considerably [10, 42]. In 1987 and 1989, automated Fault Detection and Diagnosis (AFDD) methods for vapor-compression-based refrigeration were studied by McKellar [33] and Stallard [40], respectively. In the 1990s, the majority of the applications related to FDD concentrated on vapor-compression devices and air-handling units (AHUs). Generally, these applications used temperature and/or pressure measurements for general fault detection and diagnosis. The International Energy Agency (IEA) conducted a project (Annex 25) at the beginning of the 1990s on real-time simulation of HVAC&R systems. Annex 25 was able to detect general issues in different types of HVAC&R systems [26]. After Annex 25, another study (Annex 34) was conducted by the IEA to show the application of FDD systems in buildings [16]. From 1998 to 2001, the U.S. Department of Energy (DOE) conducted and supported several projects, such as developing a diagnostic tool for the whole building, FDD for outdoor-air ventilation [12, 29, 30], simplified physical-model-based FDD for air-handling units [37], and FDD for centrifugal chiller systems [38].

Until 2018, around 200 publications related to AFDD for building systems were published. Around 62% of these publications relate to process history-based AFDD methods, 26% to qualitative model-based methods, and 12% to quantitative model-based methods. There are two reasons why process history-based methods are the most widely applied. The first is that these methods use historical data to create the model. The other is that the modeling complexity of these methods is lower. Of the studies that applied process history-based methods, 72% were based on black box models, 12% on gray box models, and the remaining 16% on a combination of the two. Black box models can be classified into statistical, ANN, and pattern recognition techniques.

Some of the studies that used pattern recognition techniques are [22, 35, 36, 39]. Ren et al. [36] used a black box support vector machine (SVM) model to classify the patterns in a refrigeration system. These classified patterns were used to investigate whether an issue exists in the system. In this study, with respect to seven operating patterns (a normal state and six fault states), SVM was used to recognize the pattern that best matches the faults. Najafi et al. [35] presented an AFDD approach for air-handling unit diagnosis. In this approach, a Bayesian network was used to analyze and compare the current behavioral patterns of the system with the faulty behavioral patterns produced by system faults, in order to select the pattern most similar to the current behavior of the system. Han et al. [22] presented a hybrid method that combines SVM with a multi-label (mL) technique to detect and diagnose multiple-simultaneous faults (MSF) automatically. The application of the proposed hybrid strategy was shown on a building chiller system. Srivastav et al. [39] presented a Gaussian Mixture Regression (GMR) method to model building energy usage. This method showed better prediction accuracy and local confidence estimation than multivariate linear regression and another method that was proposed in [41].

Most of the studies that applied black box models used statistical techniques. Cui and Wang [15] proposed an on-line AFDD technique to indicate the health state of a centrifugal chiller system. For this purpose, the authors applied a polynomial regression black box model. The presented model is simple in both structure and application, but it has limitations in terms of statistical modeling. In another study, the authors [46] proposed an approach for fault detection in AHUs using an Autoregressive Moving Average (ARMA) model, which is a kind of black box statistical model. The method uses a threshold on the performance degradation caused by an existing fault to decide whether the system needs service. Armstrong et al. [11] proposed a fifth-order AR black box model that is able to find faults such as compressor valve leakage in rooftop units (RTUs) from a single input. A PCA-based AFDD method for AHU systems was presented by Xiao et al. [44]. Using this method, it is possible to monitor the status of the sensors in an AHU system in real time. Its inability to detect complex sensor faults is one of the drawbacks of the proposed method. Du et al. [18] introduced a black box model-based method using joint angle analysis (JAA) that can detect faults in variable-air-volume (VAV) systems. Xu et al. [45] introduced a method based on black box models for centrifugal chiller systems that decides whether the operation of the system is normal or abnormal. It is mentioned that, for the purpose of fault diagnosis, another method such as JAA is needed.

Some studies used black box ANN models in their AFDD methods. Kim et al. [31] presented an AFDD method based on a black box ANN technique for the air-conditioning system of a residential building. In this method, the probabilities of the normal and faulty states of the system are calculated, and when the probability of the faulty state is higher than that of the normal state, the system is marked as faulty. Fan et al. [20] presented an AFDD method for AHUs that is based on black box ANN models and wavelet analysis. In this method, a threshold is selected for the normal operation of the system, and faults are detected when the residuals are higher than the threshold.

In general, the review of the studies related to Fault Detection and Diagnosis of building systems shows that process history-based methods are the most widely applied. The methods presented in the studies mostly focused on refrigeration systems, chiller systems, variable-air-volume systems, compressor valves in rooftop units, and air-conditioning systems. The control valve is one of the important parts of the control system of many HVAC&R systems. Hence, the performance of the control valve can affect the performance and efficiency of the HVAC&R system. Since we could not find any approach that can be applied directly to our problem, we propose a method that is capable of modeling the typical behaviors and detecting the deviating behaviors of the control valve system. The proposed method is explained in the next chapter.


Chapter 3 Methods and proposed approach

3.1 Methods

In this section, the machine learning and data mining techniques and the distance measure used in this study are explained.

k-means

k-means is an iterative machine learning algorithm that groups the data points into k distinct groups (clusters) such that each data point belongs to exactly one cluster. It is one of the simplest and most popular clustering algorithms. Each cluster has a centroid, which is the arithmetic mean of all the data points in that cluster. The algorithm seeks to minimize the sum of the squared distances between the data points assigned to a cluster and the centroid of that cluster. First, the number of clusters (k) is specified. Next, the algorithm shuffles the dataset and selects k data points randomly and without replacement as the initial centroids. Each data point is then assigned to its nearest centroid, and each centroid is recomputed as the mean of the points assigned to it. These two steps are repeated until the assignments no longer change, at which point the sum of squared distances between the data points and their centroids has reached a (local) minimum. To find a suitable value of k, methods such as the elbow method and the silhouette score can be used.
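The k-means iteration can be sketched in a few lines of plain Python. This is only a minimal illustration of the standard assign-and-update loop, not the implementation used in this study; the function name `kmeans`, the fixed random seed, and the toy points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical two-dimensional toy data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2)
```

When the loop stops because the assignments no longer change, every point is labeled with the centroid closest to it, which is the convergence condition of k-means.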

Elbow Method

The elbow method is one of the most common methods for finding the optimal value of k. A range of values for k is selected and the model is fitted once for each value. The resulting clustering error (for k-means, typically the sum of squared distances to the centroids) is plotted against k as a line chart, and the point where the curve bends (the elbow) is selected as the optimal value. In some cases the elbow is hard to identify, so it is a good idea to use other methods as a complement [1].

Silhouette Score

The silhouette score measures how close each point in one cluster is to the points in the adjacent clusters. The mean nearest-cluster distance (m) and the mean intra-cluster distance (n) are the two parameters needed to calculate the silhouette score (SC). The mean nearest-cluster distance is the mean distance between a data point and the points of the nearest cluster that the data point does not belong to. The mean intra-cluster distance is the mean distance between a data point and the other instances in the same cluster. The silhouette score uses the following formula:

\[ SC = \frac{m - n}{\max(n, m)} \tag{3.1} \]

The possible values of SC lie in the range from -1 to 1. Values close to 1 mean that the instances are far from the adjacent clusters. Values close to -1 mean that the instances have been placed in the wrong clusters. Values close to zero show that the clusters are overlapping [5].
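As an illustration of Equation 3.1, the following sketch computes the mean silhouette score of a labeled set of points with the Python standard library. The function name and the toy data are assumptions made for the example, and giving a single-member cluster a score of 0 is a common convention rather than something stated in this study.

```python
import math


def silhouette_score(points, labels):
    """Mean of SC = (m - n) / max(n, m) over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # n: mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, so it does not affect the sum).
        n = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # m: mean distance to the members of the nearest other cluster.
        m = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((m - n) / max(n, m))
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # points placed in wrong clusters
```

In this toy case the sensible grouping yields a score close to 1, while the mixed-up grouping yields a negative score, which matches the interpretation of the SC range given above.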

z-score

The z-score is a technique that scales the features so that each feature has a mean of zero and a standard deviation of one. The following equation gives the z-score of a feature value:

\[ z = \frac{F_i - \mu}{\sigma} \tag{3.2} \]

In this equation, z is the scaled version of the feature's value, F_i is the feature's value, μ is the mean of the feature, and σ is the standard deviation of the feature.
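A small worked example of Equation 3.2 is given below. The function name `z_scale` and the sample values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was applied.

```python
import statistics


def z_scale(values):
    """Scale one feature to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("a constant feature cannot be z-scaled")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values with mean 5 and standard deviation 2.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = z_scale(feature)
```

After scaling, features measured in different units (for example °C and kW) contribute on a comparable scale to the Euclidean distance used for clustering.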

Euclidean Distance Measure

The Euclidean distance between two points x and y is the length of the straight line segment that connects them. The points are represented as vectors in Euclidean space. The Euclidean distance between two n-dimensional points can be calculated using the following equation:

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{3.3} \]

3.2 Proposed Approach

The analysis of the system and the data shows that the data is multi-view in nature. The features represent different characteristics of the studied phenomenon, which is the control valve system and its behaviors. As a result, a multi-view data analysis approach is proposed in this study for understanding and analyzing the behaviors of the control valve system. The measurements of the control valve system can describe different characteristics of the system such as performance, context, and so on. Therefore, it is possible to group together the measurements that represent a specific characteristic of the system. Suppose we have a dataset with N measurements, where each measurement has n data points in the form of daily profiles (average daily values). Given this, the main steps of the proposed method are as follows:

1. Categorizing control valve system measurements

Analyze and categorize the measurements based on the information they provide about the characteristics of the control valve system. As an example, we can group the measurements of the control valve system as follows:

(a) measurements that provide information about the operating behaviors of the control valve system (operating behavior view).

(b) measurements that provide information about the performance of the control valve system (performance view).

(c) measurements that provide information about the context of the control valve system (context view).

2. Modeling typical operating behaviors (typical operating modes) of control valve system

(a) Analyze and select the daily profiles of the features (measurements) that provide information about the operating behaviors of the control valve system and cluster them. For convenience, we call this set of features F_1. The k-means algorithm can be used to cluster the daily profiles of the features in F_1, and the Euclidean distance measure can be used to quantify the similarity or dissimilarity of the daily profiles.

(b) Label the clusters (typical operating modes) with the features in F_1, that is, with the features that represent the operating behaviors of the control valve system. To do that, for each cluster (typical operating mode):

i. Select all the features in F_1.

ii. For each feature, calculate the average value of the daily profiles.

iii. Label the cluster (typical operating mode) with the calculated average values.

3. Label each cluster (typical operating mode) with the performance view's data

(a) Analyze and select the daily profiles of the features (measurements) that provide information about the performance of the control valve system (performance indicators). For convenience, we call this set of features F_2.

(b) Label the clusters (typical operating modes) with the features in F_2, that is, with the performance indicators of the control valve system. To do that, for each cluster (typical operating mode):

i. Select all the features in F_2.

ii. For each feature, calculate the average value of the daily profiles.

iii. Label the cluster (typical operating mode) with the calculated average values.

4. Label each cluster (typical operating mode) with the context view's data

(a) Analyze and select the daily profiles of the features (measurements) that provide information about the context of the control valve system (context indicators). For convenience, we call this set of features F_3.

(b) Label the clusters (typical operating modes) with the features in F_3, that is, with the context indicators of the control valve system. To do that, for each cluster (typical operating mode):

i. Select all the features in F_3.

ii. For each feature, calculate the average value of the daily profiles.

iii. Label the cluster (typical operating mode) with the calculated average values.

5. Analyze and understand the control valve system behaviors with multi-instance cluster analysis

So far, a model has been built that describes the typical operating modes of the control valve system together with their context and expected performance. To gain a deeper insight into the control valve system and its behaviors, the typical operating modes can be categorized based on different criteria such as performance or context indicators. It is also possible to represent each typical operating mode with zoomed views of the features.

A zoomed view is a 24-hour label in which each hour holds the average of the hourly values of all the data points in a specific cluster (typical operating mode). The zoomed views can be used to compare different operating modes and to understand the similarity or dissimilarity between them.

6. Monitoring control valve system’s operating behaviors

Using the knowledge gained in the previous step and the model at hand, we can monitor the operating modes of the control valve system on a daily basis. In this way, new daily profiles can be labeled with specific modes. To label new daily profiles, three scenarios can be considered:

(a) The values of the operating and contextual features of the new daily data are similar to a typical operating mode and the values of its performance indicators are as expected.

(b) The values of the operating and contextual features of the new daily data are similar to a typical operating mode but the values of its performance indicators are not as expected. This can be an indication of deviating behaviors in the system.

(c) The values of the operating and contextual features of the new daily data are not similar to any typical operating mode. In that case, further analysis is needed to determine whether the new daily data is an indication of deviating behaviors or not.
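The labeling in steps 2 to 4 amounts to averaging the daily profiles of a feature set over the members of each cluster. A minimal sketch is shown below; the function name `label_clusters`, the dictionary-based data layout, and all numeric values are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean


def label_clusters(profiles, cluster_ids, features):
    """Average the given features over the daily profiles of each cluster."""
    grouped = defaultdict(list)
    for profile, cid in zip(profiles, cluster_ids):
        grouped[cid].append(profile)
    return {
        cid: {f: mean(p[f] for p in members) for f in features}
        for cid, members in grouped.items()
    }


# Hypothetical daily profiles (one dict per day) and their cluster assignments.
profiles = [
    {"VOM": 18.0, "SE": 0.98, "OT": -2.0},
    {"VOM": 20.0, "SE": 1.00, "OT": -4.0},
    {"VOM": 12.0, "SE": 0.90, "OT": 10.0},
]
cluster_ids = [0, 0, 1]

performance_labels = label_clusters(profiles, cluster_ids, ["VOM", "SE"])  # F_2
context_labels = label_clusters(profiles, cluster_ids, ["OT"])             # F_3
```

The same helper can be reused for each view by passing a different feature list, so the clustering itself only needs to be run once on the operating behavior view.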


Chapter 4 Experiment and Evaluation

An HVAC&R system can consist of different systems such as a heating system, a cooling system, a tap water system, and so on. Each of these systems may have several control valve systems. After discussion with the iquest domain experts, the heating system of a specific building was recommended for this study. Analysis of the building's heating system showed that it has three control valve systems. One of them is the main control valve, which receives the heat from the primary side first. The other two valves, located on different sides of the building, receive the heat from the main valve. This experiment focuses on the main valve of the building's heating system.

4.1 Data

The data used in this experiment are unlabeled sensor data in the form of time series. Additionally, the building's data were anonymized by iquest to protect the customer's privacy. For some features (measurements), such as outdoor temperature, the data are collected and stored in the database every 10 or 15 minutes. For other features, such as VOM, a new value is stored only if it differs from the last stored value by more than 0.5. The data were extracted from the database with an hourly time window. Following the proposed method, the daily profiles (average daily values) are used to model the behaviors of the control valve system, and the hourly profiles are used to create the zoomed views of the features (measurements). The collected data are concatenated and stored in a pandas DataFrame for analysis. The selected time period is from 1 January 2019 to 1 January 2020.

4.2 Data Preprocessing

First, the collected data were checked for missing values. The last four days of 2019 (from the 28th to the 31st of December) were removed because the system was shut down on these dates. To select the heating seasons, each week's OT (weekly outdoor temperature) is compared with the heating threshold.

The heating threshold used in this experiment is 10°C, which was selected in discussion with the domain experts. As a result, the weeks with an OT lower than 10°C are selected as heating seasons.
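The selection of heating weeks can be sketched as follows. The text does not state how weeks are delimited or how the weekly OT is aggregated, so the use of ISO calendar weeks and of the mean of the daily values are assumptions, as are the function name and the sample temperatures.

```python
from collections import defaultdict
from datetime import date

HEATING_THRESHOLD_C = 10.0  # heating threshold reported in the text


def heating_weeks(daily_ot):
    """Return the (ISO year, ISO week) pairs whose mean OT is below the threshold.

    daily_ot maps a date to the mean outdoor temperature of that day in °C.
    """
    by_week = defaultdict(list)
    for day, ot in daily_ot.items():
        iso = day.isocalendar()
        by_week[(iso[0], iso[1])].append(ot)
    return {
        week
        for week, temps in by_week.items()
        if sum(temps) / len(temps) < HEATING_THRESHOLD_C
    }


# Hypothetical daily outdoor temperatures: two winter days and two summer days.
daily_ot = {
    date(2019, 1, 7): 1.0,
    date(2019, 1, 8): 3.0,
    date(2019, 7, 1): 18.0,
    date(2019, 7, 2): 20.0,
}
weeks = heating_weeks(daily_ot)
```

Only the days that fall in the returned weeks would then be kept for modeling and monitoring.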


Table 4.1 shows the list of all the features in this study with their units. Based on the collected measurements, four new features are added. These new features can provide important information about the operating behaviors, performance, and context of the control valve system. They can also help us to perform a deeper analysis of the control valve system. The four new features are:

1. PD: The difference between PST and PRT

\[ PD = PST - PRT \tag{4.1} \]

2. SE: The efficiency of the sub-station, which uses features (measurements) from the primary side and the secondary side to show how well the sub-station is working. The SE of a sub-station can be calculated using the following equation:

\[ SE = \frac{PD}{PST - SRT} \tag{4.2} \]

In this equation, PD is the difference between PST and PRT, PST is the primary supply temperature, and SRT is the secondary return temperature [7].

3. RWB: The ratio of weekends and business days

4. RCA: The ratio of number of days in each cluster to total number of days available in data set
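Equations 4.1 and 4.2 can be evaluated directly from the temperature measurements. The sketch below uses hypothetical temperature values, and the function names are illustrative. Multiplying the ratio by 100 to obtain the percentage unit listed for SE in Table 4.1 is an assumption.

```python
def primary_delta(pst, prt):
    """PD = PST - PRT (Equation 4.1), all values in °C."""
    return pst - prt


def substation_efficiency(pst, prt, srt):
    """SE = PD / (PST - SRT) (Equation 4.2), returned as a ratio."""
    return primary_delta(pst, prt) / (pst - srt)


# Hypothetical daily averages in °C.
pst, prt, srt = 80.0, 42.0, 40.0
pd_value = primary_delta(pst, prt)               # 38.0 °C
se_ratio = substation_efficiency(pst, prt, srt)  # 38 / 40 = 0.95
se_percent = 100.0 * se_ratio                    # 95 %, assuming SE is reported in percent
```

The closer the primary return temperature is to the secondary return temperature, the closer SE gets to 1.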

Table 4.1: List of the features

No.  Acronym  Feature name                                          Unit
1    PST      Primary Supply Temperature                            °C
2    PRT      Primary Return Temperature                            °C
3    SST      Secondary Supply Temperature                          °C
4    SRT      Secondary Return Temperature                          °C
5    PHL      Primary Heat Load                                     kW
6    OT       Outdoor Temperature                                   °C
7    VOM      Valve Openness Mean                                   %
8    VOS      Valve Openness Standard deviation                     %
9    PF       Primary Flow                                          m³/h
10   PE       Primary Energy                                        MWh
11   PV       Primary Volume                                        m³
------------------------------------------------------------------------
12   PD       Primary Delta                                         °C
13   SE       Sub-station Efficiency                                %
14   RWB      The ratio of weekends and business days               %
15   RCA      The ratio of number of days in each cluster to        %
              total number of days available in data set

Note. The features from No. 1 to No. 11 (above the horizontal line) are the measurements collected from the sensors. The remaining features are the ones that were created and added to the table.


4.3 Experimental Setup and Implementation

The tools used in this experiment are mentioned below:

1. Python: [6] is an open-source, general-purpose, interpreted programming language that was created by Guido van Rossum in the late 1980s. Python is well suited to rapid development and is widely used in data science and machine learning.

2. Pandas: [3] is a powerful Python library developed for data analysis and manipulation.

3. Matplotlib: [2] is a comprehensive Python library used for visualization.

4. Sklearn: [4] is a free, open-source machine learning library for Python that can be used for predictive data analysis.

The data are divided into two groups. The first group represents our historical data (historical dataset), which is used to create the model. The second group represents the newly arriving data (monitored dataset), which is used for monitoring the behaviors of the control valve system. Since each month has its own nature and characteristics, 70% of the days in each month are assigned to the historical data and the remaining 30% to the monitoring data. The days for each group are selected by random sampling. To mitigate bias in our data, this process is applied 5 times to the same data to create 5 different dataset pairs (historical and monitored dataset).
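One way to realize such a per-month random split is sketched below. The function name `split_by_month`, the rounding rule for the 70% share, and the use of five fixed seeds to obtain five dataset pairs are assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def split_by_month(days, historical_share=0.7, seed=0):
    """Randomly assign about 70% of the days of every month to the historical set."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for d in days:
        by_month[(d.year, d.month)].append(d)
    historical, monitored = [], []
    for month_days in by_month.values():
        shuffled = rng.sample(month_days, len(month_days))  # random order
        cut = round(historical_share * len(shuffled))
        historical += shuffled[:cut]
        monitored += shuffled[cut:]
    return sorted(historical), sorted(monitored)


# Hypothetical example: all days of January and February 2019.
days = [date(2019, 1, 1) + timedelta(n) for n in range(59)]
pairs = [split_by_month(days, seed=s) for s in range(5)]  # 5 dataset pairs
```

Splitting within each month, rather than over the whole year at once, keeps every month represented in both the historical and the monitored dataset.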

The domain experts were able to provide us with several days that show abnormal behaviors. The dates are between the 12th of March and the 29th of April. These days are added to the monitored dataset to validate and test our method. Since we do not have additional information about days with abnormal behaviors, we assume that all the days used to create our model show normal behaviors. In this section, to demonstrate the application of the method, the first dataset pair is selected (dataset0).

As mentioned earlier in the proposed method, three views have been selected for the control valve system: operating behaviors, performance, and context. The features for each view are selected by analyzing the data and through discussion with the domain experts. The selected features for each view are explained below:

1. PHL, SST, and SRT are selected as the features that represent the typical operating behaviors of the control valve system. These features are selected because the secondary supply and return temperatures represent the operating behaviors of the secondary side of the heating system and PHL represents the operating behaviors of the primary side. PD is another candidate that could represent the primary side. Some of the reasons that make PHL a better representation of the primary side are:

• PHL has a higher correlation with PF, PE, PV, and VOM (Figure 4.1)

• PHL has a high negative correlation with OT (Figure 4.1)

2. SE, VOM, and VOS are the features selected as the ones that represent the performance of control valve system.

3. OT, RWB, and RCA are the features selected as the ones that represent the context of control valve system.

Figure 4.1: Heat map that shows the correlation of the features collected from the database

To model the typical operating behaviors of the control valve system, a clustering technique is applied. The features are scaled using the z-score technique before clustering. The average daily values (daily profiles) of the features that represent the typical operating behaviors are clustered using the k-means algorithm, with the Euclidean distance as the similarity measure between daily profiles. Since we have 5 different datasets and the optimal value of k may differ between them, the elbow and silhouette score techniques were applied to all the datasets and an average value was selected as the optimal k, which is 5. Figure 4.2 displays the clusters (typical operating modes) and their data points. With this value of k, we obtain 5 clusters with their linked context and expected performance (Table 5.1).


Figure 4.2: Clustering plot

To obtain a clearer view of the control valve system behaviors, zoomed views of the performance and context indicators are created (Table 5.2). The typical operating modes are categorized and compared using different criteria such as VOM, OT, SE, VOS, RWB, and RCA. Table 5.3 shows the groups that have been created for the typical operating modes using these criteria.
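A zoomed view, as defined in the proposed approach, is the hour-by-hour average over all days of one typical operating mode. A minimal sketch, assuming each day is available as a list of 24 hourly values (the function name and the numbers are hypothetical):

```python
from statistics import mean


def zoomed_view(hourly_profiles):
    """Average each of the 24 hours over all days that belong to one cluster.

    hourly_profiles is a list with one 24-value list per day.
    """
    return [mean(day[hour] for day in hourly_profiles) for hour in range(24)]


# Hypothetical hourly VOM values (%) for two days of the same cluster.
day_1 = [10.0] * 24
day_2 = [14.0] * 24
day_2[3] = 18.0  # a higher value at 03:00 on the second day

view = zoomed_view([day_1, day_2])
```

The resulting list has one value per hour and can be shown as a table or plotted as a time series.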

Finally, each daily profile in the monitoring dataset is compared with the typical operating modes (TOMs). The purpose of the comparison is to find the TOM whose behaviors are most similar to the monitored daily profile. To measure the similarity between the TOMs and the daily profile, four features are selected: SST, SRT, PHL, and OT. The distance between the TOM's features and the daily profile's features is calculated using the Euclidean distance measure. The TOM with the shortest distance is the most similar TOM to the daily profile. Next, the distances between the VOM and SE of the selected TOM and those of the daily profile are calculated using the Euclidean distance measure. If the calculated distance for VOM or SE is more than 2 times the standard deviation of the selected TOM's VOM or SE, respectively, the daily profile is considered a day that shows deviating behaviors.

It is possible that none of the typical operating modes (TOMs) is similar to the daily profile. To investigate this, the distance between the selected TOM and the daily profile is compared with the distance to the furthest data point that belongs to the selected TOM. If the distance between the selected TOM and the daily profile is more than 2 times the distance to the selected TOM's furthest data point, the daily profile is not really close to the selected TOM. As a result, the model is updated with a new operating mode, which is called a NUOM (Newly observed Unseen Operating Mode). In subsequent iterations, the distance between the daily profile and the most similar TOM is compared with the distance between the daily profile and each NUOM found so far. If a NUOM is found that is closer to the daily profile than the most similar TOM, the daily profile is considered a day with unseen behaviors.
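The decision rule described in the last two paragraphs can be sketched as follows. This is a simplified illustration under several assumptions: the dictionary layout and the field names (`centroid`, `radius`, `mean`, `std`) are hypothetical, `radius` is taken to be the distance from the TOM centroid to its furthest member, the unseen check is performed before the deviation check, the bookkeeping of NUOMs is left out, and all numbers are made up. In practice the operating features would be z-scaled first.

```python
import math


def classify_day(day, toms, factor=2.0):
    """Label one daily profile as 'typical', 'deviating', or 'unseen'.

    day:  {'operating': (SST, SRT, PHL, OT), 'performance': {'VOM': .., 'SE': ..}}
    toms: {tom_id: {'centroid': (...), 'radius': float,
                    'mean': {'VOM': .., 'SE': ..}, 'std': {'VOM': .., 'SE': ..}}}
    """
    # Most similar TOM: smallest Euclidean distance on the operating features.
    tom_id = min(toms, key=lambda t: math.dist(day["operating"], toms[t]["centroid"]))
    tom = toms[tom_id]
    distance = math.dist(day["operating"], tom["centroid"])
    # Too far from even the closest TOM: unseen behavior.
    if distance > factor * tom["radius"]:
        return "unseen", tom_id
    # Close to a TOM, but a performance indicator is off: deviating behavior.
    for name in ("VOM", "SE"):
        gap = abs(day["performance"][name] - tom["mean"][name])
        if gap > factor * tom["std"][name]:
            return "deviating", tom_id
    return "typical", tom_id


# Hypothetical model with two TOMs.
toms = {
    0: {"centroid": (46.7, 38.3, 38.2, 1.8), "radius": 5.0,
        "mean": {"VOM": 17.5, "SE": 98.5}, "std": {"VOM": 1.0, "SE": 1.5}},
    1: {"centroid": (35.1, 32.6, 14.0, 10.5), "radius": 4.0,
        "mean": {"VOM": 11.9, "SE": 90.0}, "std": {"VOM": 1.0, "SE": 2.0}},
}
typical_day = {"operating": (46.0, 38.0, 37.0, 2.0),
               "performance": {"VOM": 17.9, "SE": 98.0}}
deviating_day = {"operating": (46.0, 38.0, 37.0, 2.0),
                 "performance": {"VOM": 21.0, "SE": 98.0}}
unseen_day = {"operating": (60.0, 50.0, 70.0, -15.0),
              "performance": {"VOM": 30.0, "SE": 99.0}}

results = [classify_day(d, toms) for d in (typical_day, deviating_day, unseen_day)]
```

A day labeled as unseen would additionally be stored as a new operating mode (NUOM) so that later days can be compared against it.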


Chapter 5 Results and Analysis

The results of applying the proposed method to dataset0 are presented in this chapter. Table 5.1 displays the typical operating modes (TOMs) of the control valve system with their performance and context indicators. Table 5.2 shows the zoomed views of the performance and context indicators for TOM 0. There are no zoomed views of VOS, RWB, and RCA. For VOS the reason is that, more often than not, only one value of VOM is collected in an hour, so the hourly VOS is zero. Comparing different typical operating modes using a zoomed view of VOS would therefore not be a fair comparison. RWB and RCA are not time-series data, so no hourly data are available for creating zoomed views of these features.

The zoomed views make it possible to understand the changes happening in each TOM in more depth. As an example, for TOM 1 it can be seen that the sub-station efficiency (SE) increases from 03:00 A.M. to 04:00 A.M. One possible interpretation is that the heat demand is high in the early morning, for reasons such as residents taking showers. Since the zoomed views of the performance and context indicators are in the form of time series, they can be visualized to gain a deeper insight into their patterns (Figure 5.1). Depending on the user's preference, the zoomed views can be displayed as tables or as plots.

The comparison of the typical operating modes (TOMs) shows that when the outdoor temperature (OT) decreases, the valve openness mean (VOM), valve openness standard deviation (VOS), sub-station efficiency (SE), and primary heat load (PHL) increase. For example, TOM 1 has the lowest OT, so it has the highest VOM, VOS, SE, and PHL. On the other hand, a comparison of TOM 2 and TOM 4 shows that TOM 2 has a lower OT than TOM 4 but its VOM is slightly lower than that of TOM 4. One possible reason is that the standard deviation of OT in TOM 4 is higher than in TOM 2 (Table 5.4). Another possible reason is that the number of days in TOM 4 is slightly higher than in TOM 2 (Table 5.5). Additionally, a pattern can be seen between SE and RWB: when the outdoor temperature is above zero, an increase in RWB is accompanied by an increase in SE. This may mean that on weekends, when people use more heat and hot water, the SE improves as well (Table 5.1).


Table 5.1: TOMs (clusters) with their linked context and expected performance

TOM  SST (°C)  SRT (°C)  PHL (kW)  VOM (%)  VOS (%)  SE (%)  OT (°C)  RWB (%)  RCA (%)
0    46.713    38.260    38.175    17.493   1.515    98.5    1.787    36.4     19.6
1    52.118    40.976    48.718    18.763   2.280    100.0   -2.909   15.8     17.0
2    45.407    39.579    23.190    13.566   1.123    96.9    3.204    34.5     25.9
3    35.104    32.577    13.999    11.942   1.014    90.0    10.497   25.0     10.7
4    41.949    36.621    23.072    13.909   1.017    96.3    6.146    30.0     26.8

Note. The features are showing average values for each TOM

Table 5.2: Zoomed views of the performance and context indicators for TOM 0

Time   VOM (%)  SE (%)  OT (°C)
0:00   14.430   92.3    1.426
1:00   14.423   92.0    1.276
2:00   14.344   92.1    1.261
3:00   17.825   92.2    1.192
4:00   17.856   97.7    1.378
5:00   18.093   97.1    1.298
6:00   18.141   98.3    1.593
7:00   17.781   96.8    1.703
8:00   17.873   98.0    1.896
9:00   17.735   97.6    2.075
10:00  18.002   97.9    2.249
11:00  17.796   101.2   2.383
12:00  17.848   98.1    2.636
13:00  18.023   99.5    2.451
14:00  17.939   100.1   2.281
15:00  17.983   99.7    2.096
16:00  18.001   97.4    1.904
17:00  17.959   96.2    1.686
18:00  18.287   97.4    1.575
19:00  18.344   96.0    1.522
20:00  18.454   96.0    1.419
21:00  18.615   94.8    1.395
22:00  18.703   94.9    1.370
23:00  14.850   92.8    1.206

Note. The features are showing average values for each time


Figure 5.1: Time-series plots of the performance and context indicators’ zoomed views

For a better understanding, it is possible to categorize the typical operating modes (TOMs) based on different criteria. As an example, based on outdoor temperature we can categorize the TOMs into 3 groups:

(a) TOMs with outdoor temperature below 0°C

(b) TOMs with outdoor temperature around 0°C

(c) TOMs with outdoor temperature above 0°C

Table 5.3 shows the classification of the TOMs based on OT, SE, VOM, VOS, RWB, and RCA, respectively. More detailed information about each TOM (cluster) is given in Table 5.5. This table shows, for example, which months are included in each TOM and how many days of each month are included in each TOM.

Table 5.3: Comparison of the typical operating modes using OT, SE, VOM, VOS, RWB, RCA

Features  Categories   TOM
OT        Below 0°C    1
          Around 0°C   None
          Above 0°C    0,2,3,4
SE        Below 96%    3
          Around 96%   4
          Above 96%    0,1,2
VOM       Below 14%    3
          Around 14%   2,4
          Above 14%    0,1
VOS       Below 1.5%   0,3
          Around 1.5%  2
          Above 1.5%   1
RWB       Below 25%    1
          Around 25%   3
          Above 25%    0,2,4
RCA       Below 20%    3
          Around 20%   0,1
          Above 20%    2,4

Table 5.4: Comparison of the typical operating modes using OT standard deviation

TOM  OT standard deviation (°C)
0    ± 1.119
1    ± 1.950
2    ± 1.091
3    ± 1.887
4    ± 1.297

Table 5.5: Month, number of days, and average OT of each month for all TOMs

TOM  Month     Number of days  OT (°C)
0    January   11              0.987
     February  6               1.012
     March     5               1.538
     Total     22
1    January   11              1.542
     February  4               3.649
     March     2               1.842
     December  2               0.905
     Total     19
2    October   3               1.367
     November  13              1.046
     December  13              1.073
     Total     29
3    April     1               4.939
     May       2               3.002
     October   9               1.301
     Total     12
4    February  10              1.446
     March     1               2.119
     May       2               1.933
     October   5               1.644
     November  8               1.077
     December  4               0.406
     Total     30

Note. The Total shows the total number of days in each TOM (cluster)

The results from monitoring the daily profiles in the monitoring dataset are shown in Table 5.6. Some examples are given below to explain the information in the table:

(a) No. 1 shows a day with deviating behaviors. The reason is that the VOS of this day is 2 times higher than the VOS of the most similar TOM. Since the outlier type is deviating, it can be an indication of faults or issues in the system. As a result, this day needs further analysis and should be diagnosed (Analyze and Diagnose).

(b) No. 3 is an example of a day with unseen behaviors. This means that the distance between the operating features of the daily profile (SST, SRT, PHL, and OT) and the operating features of the closest TOM is greater than 2 times the distance of the furthest data point in the cluster. In this case, the suitable action is to update the model with a new operating mode (NUOM). Later on, expert consideration is needed to determine whether the daily profile is an indication of deviating behaviors or not (Expert Consideration).

(c) It is also possible that the daily profile is similar to an existing unseen operating mode. In that case, it is added as an AUOM. No. 4 is an example of this scenario. In this situation, the domain experts need to analyze this day (Expert Consideration) and decide whether it shows deviating or typical behaviors.

The results from monitoring the daily profiles show that, for the days between the 12th of March and the 14th of April, no similar TOM was found, so the method labeled them as days with unseen behaviors that need further analysis. The other days are considered days with deviating behaviors because the performance indicators of these daily profiles differ from the performance indicators of the TOM selected as the most similar operating mode. Additionally, the obtained results were analyzed and evaluated by the iquest domain experts. Their evaluation shows that between the 12th of March and the 29th of April the system was not working properly due to a malfunction on the primary side. As a result, these days show deviating behaviors.

Table 5.6: Results from monitoring the daily profiles using the proposed method

No.  Date        Day  OT (°C)  Outlier type  Reason  Action
1    2019-02-23  Sat  3.875    Deviating     VOS     Analyze and Diagnose
2    2019-03-12  Tue  -1.510   Deviating     VOS     Analyze and Diagnose
3    2019-03-13  Wed  2.171    Unseen        NUOM    Update model, Expert Consideration
4    2019-03-14  Thu  4.795    Unseen        AUOM    Expert Consideration
5    2019-03-15  Fri  3.522    Unseen        AUOM    Expert Consideration
6    2019-03-16  Sat  2.842    Unseen        AUOM    Expert Consideration
7    2019-03-17  Sun  4.097    Unseen        AUOM    Expert Consideration
8    2019-03-18  Mon  3.850    Unseen        AUOM    Expert Consideration
9    2019-03-19  Tue  3.235    Unseen        AUOM    Expert Consideration
10   2019-03-20  Wed  4.403    Unseen        AUOM    Expert Consideration
11   2019-03-21  Thu  7.522    Unseen        AUOM    Expert Consideration
12   2019-03-22  Fri  6.713    Unseen        AUOM    Expert Consideration
13   2019-03-23  Sat  8.295    Unseen        AUOM    Expert Consideration
14   2019-03-24  Sun  7.364    Unseen        AUOM    Expert Consideration
15   2019-03-25  Mon  5.201    Unseen        AUOM    Expert Consideration
16   2019-03-26  Tue  3.999    Unseen        AUOM    Expert Consideration
17   2019-03-27  Wed  6.691    Unseen        AUOM    Expert Consideration
18   2019-03-28  Thu  9.138    Unseen        NUOM    Update model, Expert Consideration
19   2019-03-30  Sat  9.507    Unseen        NUOM    Update model, Expert Consideration
20   2019-03-31  Sun  5.735    Unseen        AUOM    Expert Consideration
21   2019-04-01  Mon  5.903    Unseen        AUOM    Expert Consideration
22   2019-04-02  Tue  6.149    Unseen        AUOM    Expert Consideration
23   2019-04-03  Wed  7.150    Unseen        AUOM    Expert Consideration
24   2019-04-04  Thu  8.806    Unseen        AUOM    Expert Consideration
25   2019-04-05  Fri  6.188    Unseen        AUOM    Expert Consideration
26   2019-04-07  Sun  7.635    Unseen        AUOM    Expert Consideration
27   2019-04-08  Mon  2.901    Unseen        AUOM    Expert Consideration
28   2019-04-09  Tue  1.885    Unseen        AUOM    Expert Consideration
29   2019-04-10  Wed  2.567    Unseen        AUOM    Expert Consideration
30   2019-04-11  Thu  3.302    Unseen        AUOM    Expert Consideration
31   2019-04-12  Fri  3.868    Unseen        AUOM    Expert Consideration
32   2019-04-13  Sat  4.832    Unseen        AUOM    Expert Consideration
33   2019-04-14  Sun  4.716    Unseen        AUOM    Expert Consideration
34   2019-12-19  Thu  1.482    Deviating     SE      Analyze and Diagnose
35   2019-12-26  Thu  1.833    Deviating     VOS     Analyze and Diagnose

Note. The OT values are showing the average outdoor temperature of each day


Chapter 6 Discussion

Although detecting deviating behaviors in the system is crucial, providing additional information related to these deviating behaviors is even more important. Due to the complexity of the control valve system and the high number of features, finding the cause of deviating behaviors is extremely difficult and time-consuming for the domain experts. The interpretable results from the model can serve as recommendations for the domain experts to facilitate the analysis process and reduce the time of analysis. The additional information may also give the domain experts the opportunity to better understand, and gain additional knowledge about, the operating behaviors of the control valve system. Another aspect of the proposed method is that, although the operating behaviors of the control valve system are modeled using the historical data, the model is not static. The monitoring step of the method is a continuous learning process in which the model is updated whenever days with unseen behaviors appear. The days with unseen behaviors need further analysis to determine whether they represent deviating or typical behaviors of the control valve system. Last but not least, we keep the domain experts in the loop. In this way, we have the opportunity to obtain additional information from the domain experts, which can be used to refine the labels during the monitoring step and to improve the learning process.

Validity Threats

Implementation faults and errors are the main issues that can threat the validity of the results of the experiment. To reduce the eﬀect of programming faults and errors, the implementation has been tested several times with diﬀerent test set couples.

Another threat that may arise is that the algorithm selected for clustering is not suitable. There are also some parameters in the experiment that can affect the results that we get from monitoring the behaviors of the control valve system. The threshold that has been chosen to check the similarity between the selected TOM and the newly arrived day is an example of these parameters. In this study, the thresholds used in the monitoring section of the proposed method have been selected through discussion with domain experts and by testing different values. One of the important factors in an experiment is that the data should not be biased. To prevent that issue, in this experiment, the method has been tested with 5 different test set couples. Also, to select the days for each group in the test set couples, a random sampling technique has been used. Selecting an appropriate population is another important factor in an experiment. To do that, suitable inclusion and exclusion criteria should be defined. In this experiment, since the focus was on the heating system, a specific technique was defined to select the heating seasons.
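As an illustration of the random sampling mentioned above, the following sketch draws several test set couples from a pool of days. It rests on assumptions that are not stated in this section: each couple is taken to be a pair of disjoint, equally sized groups of days, and the function name, group size, and seed are hypothetical.

```python
import random


def sample_test_set_couples(days, n_couples=5, group_size=10, seed=0):
    """Draw `n_couples` pairs of disjoint day groups by simple random sampling.

    `days` is any list of day identifiers. The pair structure and the sizes
    are assumptions made for this sketch, not the thesis's exact setup.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    couples = []
    for _ in range(n_couples):
        # Sample without replacement, then split the draw into two groups.
        picked = rng.sample(days, 2 * group_size)
        couples.append((picked[:group_size], picked[group_size:]))
    return couples
```

Using a seeded generator makes it possible to rerun the experiment on the same couples, which helps separate the effect of a changed threshold from the effect of a different sample.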

Chapter 7 Conclusions and Future Work

In this study, we achieved the main goal by modeling the operating behaviors of the control valve system using machine learning and data mining techniques. The behavioral models were tested using real-world data from iquest, and the obtained results were evaluated by domain experts. The results show that the proposed method is capable of detecting deviating behaviors that occur in the control valve system. The main contribution of the proposed method is the additional information provided besides detecting the deviating behaviors. This information is very helpful for domain experts in the industry: it can give them a better understanding of the system behaviors and additional knowledge about the system. Moreover, this supplementary information can make the analysis of the system behaviors easier and reduce the time it takes to perform the analysis.

7.1 Answering Research Questions

• RQ1: What algorithms are suitable to extract patterns from time-series data?

To answer RQ1, studies related to time-series data clustering have been reviewed. The review shows that partitioning algorithms are the most common type of algorithm applied for clustering time-series data. Among the partitioning algorithms, k-means is the most popular one.

• RQ2: What type of ML and DM methods are suitable for fault detection and diagnosis (FDD) in control valve system?

The relevant studies about FDD in HVAC&R and building systems have been reviewed to answer RQ2. Around 62% of the studies applied process history-based (data-driven) methods and, among those, approximately 72% used black-box models.

• RQ3: How can the gained knowledge from RQ1 and RQ2 be used to propose a method for modeling and monitoring control valve system behaviors?

By reviewing the studies related to FDD in building systems, we could not find any solution that can be applied directly to our problem. As a result, by considering the nature of our data and the gained knowledge from RQ1 and RQ2, we proposed a method for modeling and monitoring the operating

Abstract

Keywords: Machine learning, Data mining, Outlier detection, Time-series, HVAC&R

Acknowledgments

First of all, I would like to thank my supervisors, Professor Veselka Boeva and Shahrooz Abghari, for their trust, support, and valuable feedback. Without your guidance and great patience, I would not have been able to do this.

I would like to thank Farhad Basiri, the CEO of iquest, and his colleague Otto Sandstrom, who gave me this opportunity and provided the materials for this study.

Last but not least, I would like to thank my dear family, who supported me both emotionally and financially.

Contents

Abstract i

Acknowledgments iii

1 Introduction 1

1.1 Problem Statement . . . . 2

1.2 Research Questions . . . . 2

1.3 Aims and Objectives . . . . 3

1.4 Background . . . . 3

1.4.1 Machine Learning . . . . 3

1.4.2 Clustering Analysis . . . . 4

1.4.3 Data Mining . . . . 4

1.4.4 Outlier Detection . . . . 5

1.4.5 District Heating System . . . . 7

1.5 Outline . . . . 7

2 Related Work 9

2.1 Search Terminology . . . . 9

2.2 Literature Review . . . . 9

3 Methods and proposed approach 13

3.1 Methods . . . . 13

3.2 Proposed Approach . . . . 14

4 Experiment and Evaluation 17

4.1 Data . . . . 17

4.2 Data Preprocessing . . . . 17

4.3 Experimental Setup and Implementation . . . . 19

5 Results and Analysis 23

6 Discussion 31

7 Conclusions and Future Work 33

7.1 Answering Research Questions . . . . 33

7.2 Future Work . . . . 34

References 35

A.2 Operating Modes . . . . 39

Chapter 1

Introduction

One of the important parts of HVAC&R is the control system. The role of the control system in HVAC&R is to regulate its operation and performance.

1.1 Problem Statement

1.2 Research Questions

With respect to the characteristics of the time-series data and the problem at hand, this thesis addresses the following research questions:

• RQ1: What algorithms are suitable to extract patterns from time-series data?

• RQ2: What type of ML and DM methods are suitable for fault detection and diagnosis (FDD) in control valve system?

Motivation: The study of RQ2 will give insight into the domain. Additionally, we gain knowledge about the types of ML and DM methods used in the domain for FDD in control valve system.

• RQ3: How can the gained knowledge from RQ1 and RQ2 be used to propose a method for modeling and monitoring control valve system behaviors?

Motivation: RQ3 is tightly related to RQ1 and RQ2. The answer to RQ3 can serve as evidence of the suitability of the algorithms and the methods that were selected in RQ1 and RQ2.

To answer RQ1 and RQ2, a literature study will be performed. To answer RQ3, the gained knowledge from RQ1 and RQ2 will be used to propose a method for detecting deviating behaviors of control valve system. Finally, an experiment will be performed to show the applicability of the proposed method on the real world data.

1.3 Aims and Objectives

The aim of this thesis is to use different data mining and knowledge discovery techniques for modeling, analysing and understanding control valve system operating behavior in smart buildings. The data used in this thesis is in the form of time series.

The objectives of this thesis are:

1. Performing a literature study to gain insight into data mining techniques and algorithms that can be used for extracting patterns from time-series data and for building and visualizing behavioral models in the case of unlabeled data.

2. Modeling the operating behavior of the control valve system by applying suitable data mining techniques.

3. Evaluating the proposed behavioral models on real-world datasets and discussing the obtained results with the domain experts.

1.4 Background

1.4.1 Machine Learning

Supervised Learning

In supervised learning tasks, input and output variables are used to learn the function that can map them to each other. The main goal is to find a mapping function that can predict the output of the new input data. Classification and regression are two