Master of Science in Computer Science June 2020
Data mining techniques for modeling the operating behaviors of smart
building control valve systems
AMIRMOHAMMAD EGHBALIAN
Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden
The thesis is equivalent to 20 weeks of full time studies.
The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.
Contact Information:
Author(s): AMIRMOHAMMAD EGHBALIAN E-mail: ameg18@student.bth.se
University advisors:
Prof. Veselka Boeva Shahrooz Abghari
Department of Computer Science
Faculty of Computing Internet : www.bth.se
Blekinge Institute of Technology Phone : +46 455 38 50 00
SE–371 79 Karlskrona, Sweden Fax : +46 455 38 50 57
Abstract
Background. One of the challenges with smart control valve systems is processing and analyzing sensor data to extract useful information. This information can be used to detect deviating behaviors, which can indicate faults and issues in the system. Outlier detection is the process of finding these deviating behaviors.
Objectives. First, perform a literature review to gain insight into the machine learning (ML) and data mining (DM) techniques that can be applied to extract patterns from time-series data. Next, model the operating behaviors of the control valve system using appropriate machine learning and data mining techniques. Finally, evaluate the proposed behavioral models on real-world data.
Methods. A literature review is conducted to gain a better understanding of the different ML and DM techniques for extracting patterns from time-series data and for fault detection and diagnosis of building systems. An unsupervised learning approach is then proposed for modeling the typical operating behaviors and detecting the deviating operating behaviors of the control valve system. Additionally, the proposed method provides supplementary information for domain experts to help them in their analysis.
Results. The outcomes of modeling and monitoring the operating behaviors of the control valve system are analyzed. The evaluation of the results by the domain experts indicates that the method is capable of detecting deviating or unseen operating behaviors of the system. Moreover, the proposed method provides additional information that helps in understanding the obtained results.
Conclusions. The main goal of this study was achieved by proposing a method that can model the typical operating behaviors of the control valve system. The generated model can be used to monitor newly arrived daily measurements and detect deviating or unseen operating behaviors of the control valve system. It also provides supplementary information that can help domain experts and reduce the time needed for analysis.
Keywords: Machine learning, Data mining, Outlier detection, Time-series, HVAC&R
Acknowledgments
First of all, I would like to thank my supervisors, Professor Veselka Boeva and Shahrooz Abghari, for their trust, support, and valuable feedback. Without your guidance and great patience, I would not have been able to do this.
I would like to thank Farhad Basiri, the CEO of iquest company, and his colleague Otto Sandstrom, who gave me this opportunity and provided the materials for this study.
Last but not least, I would like to thank my dear family, who supported me both emotionally and financially.
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Problem Statement
1.2 Research Questions
1.3 Aims and Objectives
1.4 Background
1.4.1 Machine Learning
1.4.2 Clustering Analysis
1.4.3 Data Mining
1.4.4 Outlier Detection
1.4.5 District Heating System
1.5 Outline
2 Related Work
2.1 Search Terminology
2.2 Literature Review
3 Methods and proposed approach
3.1 Methods
3.2 Proposed Approach
4 Experiment and Evaluation
4.1 Data
4.2 Data Preprocessing
4.3 Experimental Setup and Implementation
5 Results and Analysis
6 Discussion
7 Conclusions and Future Work
7.1 Answering Research Questions
7.2 Future Work
References
A.2 Operating Modes
Chapter 1
Introduction
Nowadays, countless devices are equipped with smart sensors that can communicate with each other and can be accessed via the Internet. This is known as the Internet of Things (IoT), and it has made our lives, cities, and premises more modern through progress in fields such as computing, communication, and electronics. Smart environment is a key phrase in IoT, and one of the domains it covers is smart buildings. IoT-based solutions have had a great effect on decreasing the energy waste caused by suboptimal management and human activities. Examples of IoT-based automation systems include SmartThings, Vera, Microsoft Lab of Things (LoT), openHAB, Ninjablocks, Twine, and the CASAS Smart Home project [25]. To encourage energy consumers to use these technologies, their comfort should be considered [17].
Heating, Ventilation, Air Conditioning, and Refrigeration (HVAC&R) systems are designed to meet the thermal needs and requirements of different kinds of buildings, such as residential and industrial buildings. There are different types of HVAC&R systems, and they can all be categorized into two main groups: central and local systems. Put simply, the main task of an HVAC&R system is to heat or cool the outdoor air to the required temperature and then draw it into the building.
One of the important parts of an HVAC&R system is the control system, whose role is to regulate the operation and performance of the HVAC&R system.
Energy management and safety are other capabilities that we expect from modern control systems. Energy management in HVAC&R means that these systems should perform their main tasks in the most efficient way. Safety is a function that protects people and the HVAC&R system from damage. Limiting the temperature to prevent overheating or freezing is an example of a safety function in an HVAC&R system. There are five elements in the control loop of each control system: a sensing element, a transmitter, a controller, a final control element, and a process. For a long time, the control valve has been used as the main final control element in the control systems of many types of HVAC&R equipment [23, 34]. A control valve is a type of valve used to control the flow and pressure of liquids or gases. One of the technological advances in smart buildings is the advent of smart control valves. Smart control valves are equipped with sensors that can collect diagnostic data such as valve position or performance.
1.1 Problem Statement
One of the challenges with smart control valves is processing and analyzing the data collected from the sensors. The sensors in these valves collect a large amount of data, which makes analysis and the extraction of useful information difficult. The extracted information is of great importance because it allows us to identify deviating operational behaviors of the system. Such deviating behaviors can be an indication of faults such as cavitation or misconfiguration, and in some cases of an unsuitably sized control valve. Detecting faults at an early stage can reduce maintenance costs, improve energy efficiency, and, more importantly, reduce carbon dioxide emissions. Finally, better system performance brings more comfort to the customers.
1.2 Research Questions
Given the characteristics of time-series data and the problem at hand, this thesis addresses the following research questions:
• RQ1: What algorithms are suitable to extract patterns from time-series data?
Motivation: Many algorithms can be used for time-series pattern extraction. The algorithm cannot be selected arbitrarily; selecting the right algorithms requires an analysis of the data. Pattern extraction is one of the prerequisites for modeling the operating behaviors of the system.
• RQ2: What types of ML and DM methods are suitable for fault detection and diagnosis (FDD) in control valve systems?
Motivation: Studying RQ2 gives insight into the domain. Additionally, we gain knowledge about the types of ML and DM methods used in the domain for FDD in control valve systems.
• RQ3: How can the knowledge gained from RQ1 and RQ2 be used to propose a method for modeling and monitoring control valve system behaviors?
Motivation: RQ3 is tightly related to RQ1 and RQ2. The answer to RQ3 can serve as evidence of the suitability of the algorithms and methods selected in RQ1 and RQ2.
To answer RQ1 and RQ2, a literature study will be performed. To answer RQ3, the knowledge gained from RQ1 and RQ2 will be used to propose a method for detecting deviating behaviors of the control valve system. Finally, an experiment will be performed to show the applicability of the proposed method to real-world data.
1.3 Aims and Objectives
The aim of this thesis is to use different data mining and knowledge discovery techniques for modeling, analyzing, and understanding the operating behavior of control valve systems in smart buildings. The data used in this thesis are in the form of time series.
The objectives of this thesis are:
1. Performing a literature study to gain insight into data mining techniques and algorithms that can be used for extracting patterns from time-series data and for building and visualizing behavioral models when the data are unlabeled.
2. Modeling the operating behavior of the control valve system by applying suitable data mining techniques.
3. Evaluating the proposed behavioral models on real-world datasets and discussing the obtained results with the domain experts.
1.4 Background
1.4.1 Machine Learning
Machine learning (ML) is a subcategory of artificial intelligence (AI) that enables computers to learn and improve without being explicitly programmed. Peter Flach gives the following general definition of ML [21]: “Machine learning is the systematic study of algorithms and systems that improve their knowledge or performance with experience.” In ML, the main effort is to select a proper set of features to create suitable models and to use the built models to accomplish the right tasks. Classification, regression, clustering, and descriptive modeling are some of these tasks. Data can be labeled or unlabeled, and based on this distinction, machine learning methods can be classified into two major groups:
Supervised Learning
In supervised learning tasks, input and output variables are used to learn a function that maps the inputs to the outputs. The main goal is to find a mapping function that can predict the output for new input data. Classification and regression are the two main groups of supervised learning tasks. In classification problems, the outputs are categorical (or discrete) variables, whereas in regression problems the outputs are numerical (or continuous) variables. Classifying the input data from a heating system as faulty or non-faulty is an example of a classification problem, and predicting the valve openness is an example of a regression problem. Linear regression, random forests, and support vector machines (including support vector regression) are some popular examples of supervised learning algorithms.
Unsupervised Learning
In this type of learning, only the input variables are available. The aim of unsupervised learning algorithms is to learn more about the data by extracting patterns and detecting hidden structures in it. Clustering is an example of unsupervised learning and is explained in the next section.
1.4.2 Clustering Analysis
Clustering is an unsupervised learning task in which the data are split into different groups (clusters). Within each cluster, the similarity between the data points is high, while the similarity between clusters is as low as possible. Clustering algorithms can be classified into five groups: partitioning-based, hierarchical-based, model-based, density-based, and grid-based algorithms. More information about each group can be found in [19]. In this study, k-means (see Section 3.1), a popular partitioning-based algorithm, is applied.
Multi-view Clustering
Most common clustering methods use a single set of features (a view) to cluster the data points, which is called single-view clustering. Multi-view clustering (MVC) [14] is a type of clustering in which the data points are represented by several different feature sets (views). In MVC, a data point should be assigned to the same group regardless of the view. Multi-view data can be seen in many real-world applications, such as multimedia content. Each segment of multimedia content can be described from two different views: one view can be the video signals from video recording devices, and the other can be the audio signals from audio recording devices.
1.4.3 Data Mining
Data mining is the process of extracting information from data using computerized techniques. It is mostly used in exploratory analysis, where we look for new, non-trivial patterns and information. Through this process, we can combine our knowledge of formulating problems with computers' strong search abilities to attain the best results [27]. Prediction and description are the two main goals of data mining. In prediction, the main effort is to predict the future values of specific variables using the available data. Descriptive models are used to extract patterns that describe the data in human-understandable ways.
Data Preprocessing
Lack of tight control over data collection is one of the reasons that data preprocessing is important in data mining. Noise in the data is one of the obstacles that complicate the data mining process. In many cases, the presence of irrelevant data makes the results of the analysis unreliable. Data preprocessing is a process that transforms or encodes the features so that the machine can parse them easily. Some of the data preprocessing techniques are:
• Feature Selection: Creating a model based on irrelevant features can reduce the accuracy of the model. Feature selection is a manual or automated process that helps to select the features most relevant to the output of interest.
• Data Cleaning: A process in which we try to correct or remove inconsistent data. Duplicate data points and missing values are examples of inconsistent data that affect the quality of the data and the results. Generally, inconsistent data are handled in one of the following ways: 1) removal, 2) correction, or 3) imputation.
• Data Discretization: A process in which continuous data, such as time-series data, are transformed into discrete data, such as nominal attributes.
• Feature Scaling: A process that helps to normalise the independent features in a specific range. This technique can make the comparison of the independent features easier and also in some cases it will increase the speed and performance of the machine learning algorithms.
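As a rough illustration of the techniques listed above, the following Python sketch applies mean imputation, min-max scaling, and discretization to a small list of readings. The function names, the example readings, and the bin edges are hypothetical and chosen only for illustration; they are not taken from the dataset or the implementation used in this thesis.

```python
from statistics import mean

def impute_missing(values):
    """Data cleaning by imputation: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Feature scaling: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant feature carries no spread
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, edges, labels):
    """Discretization: map each value to the label of the first bin whose
    upper edge is >= the value; values above all edges get the last label."""
    out = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                out.append(label)
                break
        else:
            out.append(labels[-1])
    return out

# Hypothetical valve-opening readings in percent, with one missing value.
readings = [10.0, None, 30.0, 50.0]
cleaned = impute_missing(readings)
scaled = min_max_scale(cleaned)
levels = discretize(cleaned, edges=[20.0, 40.0], labels=["low", "medium", "high"])
```

Mean imputation is only one of several possible choices; for time-series data, interpolation between neighboring measurements may be more appropriate.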
1.4.4 Outlier Detection
Patterns in the data that do not correspond to the expected behaviors are called anomalies, and the act of detecting these patterns is called outlier detection [13]. Detecting anomalies is of great importance because interpreting them often gives us vital information in different domains. A simple way to detect outliers is to identify a region that describes normal behavior in the data; data points outside this region can be considered anomalies. Several factors complicate this simple approach. One is that identifying a region that includes all possible normal behaviors is not easy. Another is that the boundary between normal and deviating behavior is often not precise.
Anomalies can be classified into three main types: point anomalies, contextual anomalies, and collective anomalies. Data points that can be viewed as outliers with regard to the other data points in the dataset are called point anomalies. A collective anomaly is a collection of related data points that, because they occur together, are considered anomalous with regard to the other data points in the dataset. Data points that are considered outliers in a particular context are called contextual anomalies (conditional anomalies). The definition of the context varies with the structure of the dataset and should be determined when the problem is defined. A data point can be defined by two sets of features: contextual attributes and behavioral attributes. Attributes used to define the context of the data points are called contextual attributes. For example, in the heating system domain, the outdoor temperature can be considered a contextual attribute. Attributes used to specify the non-contextual properties of the data, on the other hand, are called behavioral attributes. For example, the monthly average valve openness or the monthly standard deviation of the valve openness in a specific building can be considered behavioral attributes in heating systems.
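The difference between point and contextual anomalies can be sketched in a few lines of Python. The sketch below assumes, purely for illustration, that the "normal region" is the interval of the mean plus or minus three standard deviations of some reference data; the function names, the threshold width, and the valve-openness values are hypothetical and are not the method proposed in this thesis.

```python
from statistics import mean, pstdev

def normal_region(reference, width=3.0):
    """Return (low, high) bounds: mean +/- width * standard deviation."""
    mu, sigma = mean(reference), pstdev(reference)
    return mu - width * sigma, mu + width * sigma

def point_anomalies(values, region):
    """Values that fall outside a single global normal region."""
    low, high = region
    return [v for v in values if v < low or v > high]

def contextual_anomalies(samples, reference_by_context, width=3.0):
    """samples: list of (context, value) pairs.
    A value is flagged if it lies outside the region of its own context."""
    flagged = []
    for context, value in samples:
        low, high = normal_region(reference_by_context[context], width)
        if value < low or value > high:
            flagged.append((context, value))
    return flagged

# Hypothetical valve openness (%) under two outdoor-temperature contexts.
reference = {
    "cold": [70.0, 72.0, 68.0, 71.0, 69.0],
    "mild": [20.0, 22.0, 18.0, 21.0, 19.0],
}
pooled = reference["cold"] + reference["mild"]

new_samples = [("cold", 70.0), ("mild", 70.0), ("mild", 20.0)]
flagged = contextual_anomalies(new_samples, reference)
globally_flagged = point_anomalies([70.0], normal_region(pooled))
```

With these numbers, a valve openness of 70% is unremarkable when all reference data are pooled, but it is flagged when it occurs in the "mild" context. This is the sense in which the contextual attribute (outdoor temperature) changes how the behavioral attribute (valve openness) is judged.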
Fault Detection and Diagnosis
Fault Detection and Diagnosis (FDD) is the process of detecting faults in a physical system and understanding their causes. HVAC&R is an example of a physical system in buildings. Automated Fault Detection and Diagnosis (AFDD) tools and technologies can be applied to automate the FDD process [9]. In general, FDD techniques can be classified into three main groups: 1) quantitative model-based methods, 2) qualitative model-based methods, and 3) process history-based methods.
Quantitative model-based methods are a collection of quantitative mathematical relations that rely on the underlying physics of the processes. The growing complexity of control systems and the increasing use of computers have raised the importance of quantitative model-based FDD systems [24, 37, 43]. These methods are more complex and computationally intensive, but also more accurate and reliable, than qualitative model-based and process history-based methods. Modeling the transient (temporary) behaviors of the system is a task that quantitative model-based methods can perform better than other modeling techniques. These methods can be classified into detailed physical models and simplified physical models. In physical model-based approaches, a set of measured inputs such as temperature is used to predict or estimate the behaviors or outputs of the system. To detect faults, these estimated values are then compared with the measured outputs. The main idea of detailed physical models is to use detailed information about the physical connections and features of all the parts in a system. Simplified physical models, on the other hand, usually apply simpler approaches that need fewer computations than detailed physical models.
Qualitative model-based methods are a collection of qualitative relations inferred from knowledge of the underlying physics. Qualitative model-based approaches can be divided into rule-based models and qualitative physics-based models. In both, causal knowledge is used for fault diagnosis in the system. In rule-based models, if-then-else rules are created from a priori knowledge, and conclusions are drawn from these rules. Qualitative physics-based models are able to draw conclusions about a system's state even when the available knowledge is uncertain or incomplete.
Process history-based (data-driven) methods use only measurement data coming from the processes to create behavioral models. In process history-based models, feature extraction is a technique in which input and output data are transformed and applied as a priori knowledge. The main goal of these methods is to relate measured inputs and outputs through mathematical relationships. Black box and gray box models are the two classes of process history-based models. In black box models, fault detection is based on parameter estimation, and in most situations the parameter deviations do not have a physical significance. Gray box models are a combination of physics-based and data-driven models: the physics-based models use data to learn the parameter estimates (e.g., coefficients) of the equations. Linear regression and artificial neural networks (ANNs) are examples of process history-based methods [28].
1.4.5 District Heating System
A district heating system delivers heat and hot water from a central boiler through a piping system to buildings located in a limited geographical area. The main parts of a district heating system are:
1. A production unit, in which the required heat and hot water are produced.
2. A distribution unit, which delivers the produced heat and hot water from the production unit to the consumption unit.
3. A consumption unit, in which the buildings are located and which receives the heat and hot water produced in the production unit.
The parts mentioned above form the primary side (circuit) of the district heating system. Additionally, there is a secondary side (circuit), which refers to the heating system located in the consumption unit. The main components of the secondary side are a sub-station with a heat exchanger, a piping system that circulates the water, and equipment such as radiators that transfer the heat to the rooms. A sub-station connects the primary side to the secondary side. Moreover, it is responsible for setting a suitable pressure and temperature for the supply water.
Figure 1.1: District Heating System
1.5 Outline
The remaining parts of this study are structured as below:
• Chapter 2: Reviews the studies related to FDD of the building systems and clustering of time-series data.
• Chapter 3: Presents a method for modeling and monitoring the operating behaviors of the control valve system.
• Chapter 4: Conducts an experiment using the proposed method with real-world data.
• Chapter 5: Presents the results of the experiment and their analysis.
• Chapter 6: Presents a discussion about the proposed method, the obtained results, and threats to the validity.
• Chapter 7: Presents the conclusions of this study, answers the research questions, and introduces an idea for future work.
Chapter 2
Related Work
2.1 Search Terminology
The papers and resources were mostly found and collected from digital libraries such as Scopus. First, reviews and surveys related to FDD methods for smart building systems and to extracting patterns from time-series data were selected. Next, to collect more literature, the backward snowballing technique was applied to the initial papers. The literature was collected using the following search strings:
• "Extract patterns" + "Time-series data"
• "Extract patterns" + "Time series data"
• "Fault detection" + "Smart building"
• "Fault detection" + "Smart building" + "control Valve"
To select the most related papers the following inclusion and exclusion criteria are used:
Inclusion Criteria:
• Published in English language
• Related to the topic, i.e., title, abstract and conclusion sections of the literature are checked to understand whether they are related to the topic or not.
Exclusion criteria:
• Published in a language other than English
• The full-text of the literature is not available
2.2 Literature Review
Time-series data have specific characteristics that make pattern extraction difficult, such as high dimensionality, high correlation between features, and a large number of noisy data points. A review of the time-series clustering studies from the last decade shows that partitioning algorithms are the most widely used. k-means is an example of a partitioning algorithm that is widely applied for clustering time-series data [32]. The fast response of these algorithms is one of the reasons for their popularity. With these algorithms, the number of clusters must be specified before the algorithm is applied, which is one of the reasons that applying partitioning algorithms to real-world cases is difficult. Additionally, these algorithms are suited to situations where the time series are of equal length [8].
Active research on Fault Detection and Diagnosis (FDD) in HVAC&R systems started in the 1980s, and since then FDD and data mining techniques have matured considerably [10, 42]. In 1987 and 1989, automated Fault Detection and Diagnosis (AFDD) methods for vapor-compression refrigeration were studied by McKellar [33] and Stallard [40], respectively. In the 1990s, the majority of FDD applications concentrated on vapor-compression devices and air-handling units (AHUs). Generally, these applications used temperature and/or pressure measurements for general fault detection and diagnosis. The International Energy Agency (IEA) conducted a project (Annex 25) at the beginning of 1990 on real-time simulation of HVAC&R systems. Annex 25 was able to detect general issues in different types of HVAC&R systems [26]. After Annex 25, the IEA conducted another study (Annex 34) to show the application of FDD systems in buildings [16]. From 1998 to 2001, the U.S. Department of Energy (DOE) conducted and supported several projects, such as developing a diagnostic tool for the whole building, FDD for outdoor-air ventilation [12, 29, 30], simplified physical-model-based FDD for air-handling units [37], and FDD for centrifugal chiller systems [38]. Until 2018, around 200 publications related to AFDD for building systems were published. Around 62% of these publications relate to process history-based AFDD methods, 26% to qualitative model-based methods, and 12% to quantitative model-based methods. There are two reasons why process history-based methods are the most widely applied. The first is that these methods use historical data to create the model. The second is that the modeling complexity is reduced in these methods. Of the studies that applied process history-based methods, 72% were based on black box models, 12% on gray box models, and the remaining 16% on a combination of these methods.
Black box models can be classified into statistical, ANN, and pattern recognition techniques.
Some of the studies that used pattern recognition techniques are [22, 35, 36, 39].
Ren et al. [36] used a black box support vector machine (SVM) model to classify the patterns in a refrigeration system. These classified patterns were used to investigate whether an issue exists in the system. In this study, with respect to seven operating patterns (a normal state and six fault states), SVM was used to recognize the pattern that best matches the faults. Najafi et al. [35] presented an AFDD approach for air-handling unit diagnosis. In this approach, a Bayesian network was used to compare the current behavioral patterns of the system with the faulty behavioral patterns produced by system faults, in order to select the pattern most similar to the current behavior of the system. Han et al. [22] presented a hybrid method combining SVM and a multi-label (mL) technique to detect and diagnose multiple-simultaneous faults (MSF) automatically. The applicability of the proposed hybrid strategy was shown on a building chiller system. Srivastav et al. [39] presented a Gaussian Mixture Regression (GMR) method to model building energy usage. This method showed better prediction accuracy and local confidence estimation than multivariate linear regression and another method proposed in [41].
Most of the studies that applied black box models used statistical techniques. Cui and Wang [15] proposed an on-line AFDD technique to indicate the health state of a centrifugal chiller system. For this purpose, the authors applied a polynomial regression black box model. The presented model is simple in both structure and application, but it has limitations in terms of statistical modeling. In another study, the authors [46] proposed an approach for fault detection in AHUs using an Autoregressive-Moving-Average (ARMA) model, which is a kind of black box statistical model. The method uses a threshold on the performance degradation caused by an existing fault to determine whether the system needs service. Armstrong et al. [11] proposed a fifth-order AR black box model with a single input that is able to find faults, such as compressor valve leakage, in rooftop units (RTUs). Xiao et al. [44] presented a PCA-based AFDD method for AHU systems. Using this method, it is possible to monitor the status of the sensors in an AHU system in real time. One drawback of the proposed method is its inability to detect complex sensor faults. Du et al. [18] introduced a black box model-based method using joint angle analysis (JAA) that can detect faults in variable-air-volume (VAV) systems. Xu et al. [45] introduced a black box model-based method for centrifugal chiller systems that decides whether the operation of the system is normal or abnormal. The authors mention that another method, such as JAA, is needed for fault diagnosis.
Some studies used black box ANN models in their AFDD methods. Kim et al. [31] presented an AFDD method based on a black box ANN technique for the air-conditioning system of a residential building. In this method, the probabilities of the normal and faulty states of the system are calculated, and when the probability of the faulty state is higher than that of the normal state, the system is marked as faulty. Fan et al. [20] presented an AFDD method for AHUs based on black box ANN models and wavelet analysis. In this method, a threshold is selected for the normal operation of the system, and faults are detected when the residuals exceed the threshold.
In general, the review of the studies related to Fault Detection and Diagnosis of building systems shows that process history-based methods are the most widely applied. The methods presented in these studies mostly focused on refrigeration systems, chiller systems, variable-air-volume systems, compressor valves in rooftop units, and air-conditioning systems. The control valve is one of the important parts of the control system of many HVAC&R systems. Hence, the performance of the control valve can affect the performance and efficiency of the HVAC&R system. Since we could not find any approach that could be applied directly to our problem, we propose a method that is capable of modeling the typical behaviors and detecting the deviating behaviors of the control valve system. The proposed method is explained in the next chapter.
Chapter 3
Methods and proposed approach
3.1 Methods
In this section, the machine learning and data mining techniques and the distance measure used in this study are explained.
k-means
k-means is an iterative machine learning algorithm that groups the data points into k distinct groups (clusters) such that each data point belongs to exactly one cluster. It is one of the simplest and most popular clustering algorithms. Each cluster has a centroid, which is the arithmetic mean of all the data points in that cluster. The algorithm aims to minimize the sum of the squared distances between the data points assigned to a cluster and the centroid of that cluster. First, the number of clusters (k) is specified. Next, the algorithm shuffles the dataset and randomly selects k data points, without replacement, as the initial centroids. Each data point is then assigned to its nearest centroid, and each centroid is recomputed as the mean of the points assigned to it. These two steps are repeated until the assignments no longer change, at which point the sum of squared distances between the data points and the centroids has reached a (local) minimum. To find a suitable value of k, methods such as the elbow method and the silhouette score can be used.
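The procedure described above can be sketched in plain Python as follows. This is a minimal illustration of the standard k-means iteration (often called Lloyd's algorithm), not the implementation used in the experiments; the function names `kmeans` and `wcss`, the fixed random seed, and the example points are assumptions made for the sketch.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups.
    Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centroids: k data points drawn at random without replacement.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances (the quantity k-means reduces)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
total = wcss(points, labels, centroids)
```

Because the initial centroids are random, different seeds can lead to different local minima; in practice the algorithm is usually run several times and the result with the lowest within-cluster sum of squares is kept.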
Elbow Method
The elbow method is one of the most common methods for finding a suitable value of k. A range of values for k is selected, and the model is fitted for each value. A measure of fit, typically the within-cluster sum of squared distances, is then plotted against k in a line chart, and the value of k at which the curve bends (the elbow) is selected. In some cases, the elbow is hard to identify, so it can be a good idea to use other methods as a complement [1].
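The elbow is usually read off the chart by eye, but a simple numerical heuristic can make the idea concrete. The sketch below assumes that the within-cluster sum of squares has already been computed for consecutive values of k, and picks the k with the largest second difference, that is, the point where the curve flattens most sharply. Both the heuristic and the values are illustrative assumptions, not the procedure used in this thesis.

```python
def elbow_k(ks, wcss_values):
    """Return the k at the sharpest bend of the curve.
    Assumes ks are consecutive integers and wcss_values decrease with k."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = wcss_values[i - 1] - wcss_values[i]
        drop_after = wcss_values[i] - wcss_values[i + 1]
        bend = drop_before - drop_after  # discrete second difference
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical within-cluster sums of squares for k = 1..6.
ks = [1, 2, 3, 4, 5, 6]
wcss_values = [1000.0, 700.0, 300.0, 260.0, 240.0, 230.0]
chosen = elbow_k(ks, wcss_values)
```

For these values the largest bend occurs at k = 3: adding a third cluster still reduces the error substantially, while further clusters bring only small improvements.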
Silhouette Score
The silhouette score measures how close each point in one cluster is to the points in the adjacent clusters. The mean nearest-cluster distance (m) and the mean intra-cluster distance (n) are the two quantities needed to calculate the silhouette score (SC). The mean nearest-cluster distance is the mean distance between a data point and the points of the nearest cluster that the data point does not belong to. The mean intra-cluster distance is the mean distance between a data point and the other instances in the same cluster. The silhouette score is calculated as follows:
\[ SC = \frac{m - n}{\max(n, m)} \tag{3.1} \]
The possible values of SC lie in the range from -1 to 1. Values close to 1 mean that the instances are far from the adjacent clusters. Values close to -1, on the other hand, mean that the instances have been placed in the wrong clusters. Values close to zero indicate that the clusters overlap [5].
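To make the roles of m and n concrete, the following Python sketch computes the silhouette score of Equation 3.1 for each point and averages the result over all points. It assumes Euclidean distance, at least two clusters, and not all points identical; the function name and the one-dimensional example data are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette score over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # n: mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, hence the division by len - 1).
        n = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # m: mean distance to the members of the nearest other cluster.
        m = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((m - n) / max(n, m))
    return sum(scores) / len(scores)

# Hypothetical one-dimensional data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups match the data
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across the data
```

With the matching assignment the score is close to 1 (about 0.90), whereas the deliberately wrong assignment gives a negative score (-0.45), in line with the interpretation given above.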
z-score
The z-score is a technique that scales the features so that each feature has a mean of zero and a standard deviation of one. The z-score of a feature value is given by the following equation:
\[ z = \frac{F_i - \mu}{\sigma} \tag{3.2} \]
In this equation, $z$ is the scaled version of the feature's value, $F_i$ is the feature's value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.
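As a small worked example of Equation 3.2, the following Python sketch scales a single feature. It assumes the population standard deviation is used for σ (the text does not specify population versus sample), and the feature values are made up for illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Scale a feature to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("a constant feature cannot be z-scored")
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values with mean 5 and standard deviation 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = z_scores(feature)
```

Here the value 9.0 maps to a z-score of 2.0, meaning it lies two standard deviations above the feature's mean, and the scaled feature as a whole has mean zero and standard deviation one.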
Euclidean Distance Measure
The length of the line segment that connects point x to point y is called the Euclidean distance. The positions of points in Euclidean space are represented by Euclidean vectors. The Euclidean distance between two points can be calculated using the following equation:
d(x, y) =
n
i=1