
International Master’s Thesis

Self-organized Selection of Features for Unsupervised

On-board Fault Detection

Ahmed Mosallam

Technology

Studies from the Department of Technology at Örebro University örebro 2010


© Ahmed Mosallam, 2010

Title: Self-organized Selection of Features for Unsupervised On-board Fault Detection


Abstract

This work presents an algorithm for self-organized selection of “interesting” features for subsequent fault detection. Feature selection in this scenario is hard because there is no information to evaluate against: we do not know what faulty behaviour looks like. The assumption in this thesis is that interesting features are those between which non-random relations exist.

In general, feature selection methods can be grouped into three main families: wrapper, embedded and filter methods. In this work we use filter methods and provide an empirical comparison of three different correlation methods. These methods belong to two different groups: information theory and correlation metrics. The algorithm computes a feature cluster quality using the self-organizing map algorithm.

The experiments are performed on both synthetic and real data sets. The data sets exhibit different interesting relations, to show that the algorithm indeed finds the relationships it is designed to find. The algorithm also finds relations in the charge and discharge characteristics of lithium-ion batteries that can be used to predict the health status of the batteries.


Acknowledgements

I’d like to thank Thorsteinn Rögnvaldsson for all the guidance and support he provided throughout the work on this thesis. Also, for lots of inspiring and interesting discussions.

I’d also like to thank Stefan Byttner for the many useful comments and suggestions.


Contents

1 Introduction
  1.1 The task
  1.2 Outlines of the thesis
  1.3 Outline of this document

2 Background
  2.1 Introduction
  2.2 Feature selection in general
  2.3 Feature selection models
    2.3.1 Filter Methods
    2.3.2 Wrapper Methods
    2.3.3 Embedded Methods
  2.4 Feature selection learning algorithms
    2.4.1 Supervised feature selection
    2.4.2 Semi-supervised feature selection
    2.4.3 Unsupervised feature selection

3 The Method
  3.1 Introduction
  3.2 Similarity Measures
    3.2.1 Pearson's Correlation
    3.2.2 Spearman's Rank Correlation
    3.2.3 Symmetrical Uncertainty Correlation
  3.3 Distance Measures
  3.4 Clustering
    3.4.1 Linkage Methods
  3.5 Cluster Extraction
    3.5.1 Cluster validity
  3.6 Cluster assessment (ranking)
    3.6.1 K-means Clustering Methods
    3.6.2 Self-organizing maps
    3.6.3 Normalized distortion measure for cluster ranking

4 The Data
  4.1 The organization of data
    4.1.1 Data Sets
  4.2 Pre-processing of data
    4.2.1 Scaling
    4.2.2 Mean Centring

5 Results and Conclusion
  5.1 Results
    5.1.1 Data set 1 results
    5.1.2 Data set 2 results
    5.1.3 Data set 3 results
    5.1.4 Data set 4 results
    5.1.5 Data set 5 results
    5.1.6 Data set 6 results
    5.1.7 Data set 7 results
    5.1.8 Data set 8 results
    5.1.9 Li-ion battery ageing data set
    5.1.10 Results summary

List of Figures

3.1  Pearson's correlation for six different relationships
3.2  Spearman's correlation for six different relationships
3.3  Symmetrical uncertainty's correlation for six different relationships
3.4  The dendrogram for the eight-dimensional data set 7
3.5  The merging of clusters under the single linkage criteria
3.6  The merging of clusters under the complete linkage criteria
3.7  The merging of clusters under the average linkage criteria
3.8  On the left, variables X, Y and Z have a non-random relationship; on the right they have a random relationship
3.9  Any value for d_cutoff within the range between the two red lines produces the right answer
3.10 On the left, the DB index for each level (the DB index for the last level is infinity); on the right, the clustering output using the DB index on data set 7
3.11 On the left, the modified DB index for each level (the modified DB index for the last level is infinity); on the right, the clustering output using the modified DB index on data set 7
3.12 Number of clusters generated at each level vs. the distance
3.13 Fitting two lines to the points to detect a knee
3.14 The L method
3.15 Results of applying the L method on data set 7: dendrogram on the left and number of clusters vs. distance graph on the right
3.16 Results of applying the modified L method on data set 7: dendrogram on the left and number of clusters vs. distance graph on the right
3.17 K-means distortion measure for different relationships
3.18 Self-organizing map neurons arranged in a hexagonal and a rectangular grid
3.19 SOM distortion measure for different relationships
3.20 Quality measure using K-means and SOM
4.1  The right answer on data set 1
4.2  The right answer on data set 2
4.3  The right answer on data set 3
4.4  The right answer on data set 4
4.5  The right answer on data set 5
4.6  The right answer on data set 7
4.7  The right answer on data set 8
5.1  The average quality and standard deviations for the right answer on data set 1, computed 10 times; the left bar represents the SOM and the right bar the K-means
5.2  Exhaustive quality distribution search on data set 1, SOM on the left and K-means on the right
5.3  The average quality and standard deviations for the right answer on data set 2, computed 10 times; the left bar represents the SOM and the right bar the K-means
5.4  Exhaustive quality distribution search on data set 2, SOM on the left and K-means on the right
5.5  The average quality and standard deviations for the right answer on data set 3, computed 10 times; the left bar represents the SOM and the right bar the K-means
5.6  Exhaustive quality distribution search on data set 3, SOM on the left and K-means on the right
5.7  The average quality and standard deviations for the right answer on data set 4, computed 10 times; the left bar represents the SOM and the right bar the K-means
5.8  Exhaustive quality distribution search on data set 4, SOM on the left and K-means on the right
5.9  The average quality and standard deviations for the right answer on data set 5, computed 10 times; the left bar represents the SOM and the right bar the K-means
5.10 Exhaustive quality distribution search on data set 5, SOM on the left and K-means on the right
5.11 The average quality and standard deviations for the right answers on data set 6 (features 4, 5 and 6 on the left; 12, 13 and 14 on the right), computed 10 times; the left bar in each image represents the SOM and the right bar the K-means
5.12 Exhaustive quality distribution search on data set 6, SOM on the left and K-means on the right
5.13 The average quality and standard deviations for the right answers on data set 7 (features 1, 4 and 8 on the left; 3, 5 and 7 on the right), computed 10 times; the left bar represents the SOM and the right bar the K-means
5.14 Exhaustive quality distribution search on data set 7, SOM on the left and K-means on the right
5.15 The average quality and standard deviations for the right answers on data set 8 (features 1, 6 and 7 on the left; 2, 3, 4 and 8 on the right), computed 10 times; the left bar represents the SOM and the right bar the K-means
5.16 Exhaustive quality distribution search on data set 8, SOM on the left and K-means on the right
5.17 The dendrogram using the SU measure for the Li-ion battery ageing data set; it suggests that F2, F7 and F11 are correlated, and that F16, F10 and F3 could also be added
5.18 The average quality and standard deviations for the right answer on the battery data set, computed 10 times; the left bar represents the SOM and the right bar the K-means
5.19 Exhaustive quality distribution search on the Li-ion battery ageing data set


List of Tables

5.1  Quality results on data set 1. The correct answer is that F1 and F2 are related
5.2  Exhaustive search on data set 1. This table shows only the best 5 combinations. The right answer for this data set is F1 and F2
5.3  Quality results on data set 2. The correct answer is that F1 and F2 are related
5.4  Exhaustive search on data set 2. This table shows only the best 5 combinations. The right answer for this data set is F1 and F2
5.5  Quality results on data set 3. The correct answer is that F4, F5 and F6 are related
5.6  Exhaustive search on data set 3. This table shows only the best 5 combinations. The right answer for this data set is F4, F5 and F6
5.7  Quality results on data set 4. The correct answer is that F3, F4 and F5 are related
5.8  Exhaustive search on data set 4. This table shows only the best 5 combinations. The right answer for this data set is F3, F4 and F5
5.9  Quality results on data set 5. The correct answer is that F1, F3 and F5 are related
5.10 Exhaustive search on data set 5. This table shows only the best 5 combinations. The right answer for this data set is F1, F3 and F5
5.11 Quality results on data set 6. The correct answer is that features 4, 5 and 6 are related and that features 12, 13 and 14 are also related
5.12 Exhaustive search on data set 6. This table shows only the best 5 combinations. The right answers for this data set are F4, F5 and F6 in one cluster and F12, F13 and F14 in another cluster
5.13 Quality results on data set 7. The correct answer is that features 1, 4 and 8 are related and that features 3, 5 and 7 are also related
5.14 Exhaustive search on data set 7. This table shows only the best 5 combinations. The right answers for this data set are F1, F4 and F8 in one cluster and F3, F5 and F7 in another cluster
5.15 Quality results on data set 8. The correct answer is that features 1, 6 and 7 are related and that features 2, 3, 4 and 8 are also related
5.16 Exhaustive search on data set 8. This table shows only the best 5 combinations. The right answers for this data set are F1, F6 and F7 in one cluster and F2, F3, F4 and F8 in another cluster
5.17 Quality results on the Li-ion battery ageing data using the SU similarity measure. F2, F7 and F11 are expected to be the right answer; adding F16, F10 and F3 to the cluster that contains F2, F7 and F11 could also be considered a right answer
5.18 Exhaustive search on the Li-ion battery ageing data set. This table shows only the best 5 combinations. The right answer for this data set is unknown
5.19 Results summary of applying the proposed algorithm on the data sets


Chapter 1

Introduction

To do fault detection, fault isolation and diagnostics on a system, it is common to use a model-based approach. This means having a reference model of the system behaviour, a reference model that is compared to the actual observed system behaviour.

To model a system, one has to gather knowledge about the system, in the form of data and/or expert knowledge, and then build a reference model. This approach requires extensive experimentation and model verification. However, it will be reliable once the model is built, and will work efficiently until the system is significantly upgraded or changed.

An alternative approach, much less explored, is to do data mining on the system once it is operational. The monitoring system learns the normal behaviour and looks for deviations from it. This is straightforward to do if one has access to the full status of the system, i.e. whether it is faulty or not, can observe all aspects of the system, i.e. measure all possible signals, and if it is possible to do lots of computation on board the system and/or the system is connected to a computer server via an available broadband network. However, if the communication bandwidth is very limited, the computational resources on board are very limited, not all possible signals are available and the status of the system is unknown, then the task becomes much more difficult. One will then have to look for features that are potentially useful for the task; features that can be expected to be related to the system operation and the system status.

One way of doing this is to say that relationships that are not random are potentially interesting, i.e. signals between which there exists a non-random relationship are interesting to monitor. Examples are signals that exhibit linear or non-linear correlations. Other examples are signals that cluster in signal space.


1.1 The task

Extract “interesting” subset(s) of signals from a larger set of signals without prior information. Signals between which there exist non-random relationships are interesting to monitor, i.e. signals that exhibit linear or non-linear correlations or that cluster in signal space. The combinations of signals should be assessed so that the level of interestingness can be evaluated and different combinations of signals can be ranked. The primary application intended for this work is equipment monitoring and fault detection, but it should be possible to use it for other surveillance problems too.

1.2 Outlines of the thesis

The thesis aims at providing a new method for feature selection based on the relationships between features. The three major elements of the method are as follows:

1 Using the symmetrical uncertainty as a measure of non-linear relationships and using it to cluster signals, and evaluating the symmetrical uncertainty against other measures of relationships.

2 Using a modified L method to determine optimal cluster sizes.

3 Using two clustering methods, self-organizing maps (SOM) and K-means, to evaluate (rank) the final feature sets.

1.3 Outline of this document

In chapter 2 we describe related work done on feature selection. Chapter 3 contains an explanation of the proposed algorithm. Chapter 4 explains the organization of data and pretreatment procedures. The last chapter contains the results and concluding remarks.


Chapter 2

Background

2.1 Introduction

Feature selection is a process for choosing a subset of features (variables, observations) according to a certain criterion. Feature selection is primarily performed to select relevant features. However, it can have other motivations, including general data reduction, feature set reduction, performance improvement and data understanding. Such problems are found in a wide variety of application domains, ranging from engineering applications in robotics and pattern recognition (speech, handwriting, face recognition), to Internet applications (text categorization) and medical applications (diagnosis, prognosis, drug discovery).

Feature selection is most often done for classification or regression purposes. That is, the purpose is to find a subset of features that is suitable for solving a classification or regression problem. In these cases the feature selection criterion is how good the classifier or regressor is. Another possible use of feature selection is to select features that may be useful for a later task but where this is not guaranteed. For example, choosing principal components is done to preserve data, i.e. the selection criterion is data variance. The belief is that features that keep as much data variance as possible may be useful for later classification or regression tasks.

This thesis is about the latter reason for feature selection: to find features that may be useful in a subsequent classification task but where there is no guarantee that this will be the case. The selection criterion is the strength of the relationship between the selected features, in the belief that the relationship itself may be a useful feature for later classification or regression. A brief overview of the domain is given in this chapter.

2.2 Feature selection in general

The selection of features can be achieved in two ways, feature ranking or feature subset selection [18]:


Feature ranking

In feature ranking, features are ranked according to some criterion and the top k ranked features are selected, where the number of features to select is specified by the user [19] or analytically determined [33]. It makes use of scoring functions, correlation based or information based criteria, computed from the input and the output. It is usually used as a preprocessing method because of its computational simplicity [1] [5] [13] [38]. Ranking methods are commonly used in microarray analysis to find genes that discriminate between healthy and sick patients [16], and they often evaluate genes in isolation without considering the gene-to-gene correlation.

Feature subset selection

Feature subset selection algorithms determine the number of features automatically and group them in subsets. The rapid advances of different research fields introduce the need for choosing the interesting features among high-dimensional data sets [23] [32]. Feature subset selection methods can be divided into filter, wrapper and embedded methods (explained in the next section).

2.3 Feature selection models

2.3.1 Filter Methods

Filters are algorithms that filter out features that have little chance of being useful, without direct feedback from predictors. Filter methods are computationally less expensive than wrappers or embedded methods [35]. The filter algorithm is a function that returns a relevance index R(A|B) that evaluates, given the data B, how relevant a given feature subset A is for the task Y. These indices are usually known as feature selection metrics, which vary from simple correlation functions and information-based functions to algorithmic procedures such as decision trees [21] [2] [32] [34]. Empirical comparisons of the influence of different indices are difficult because the results depend on the task; what works well for document categorization may not be the best for bioinformatics data. Filter methods determine feature relevance (interestingness) and/or redundancy without applying learning algorithms (classifiers or clustering) to the data.

2.3.2 Wrapper Methods

Wrapper algorithms are wrapped around predictors [26]. Basically, a wrapper consists of:

1- a feature evaluation criterion,

2- a predictor (the learning machine the wrapper is wrapped around),

3- a search component.

Once an evaluation method for the variable subsets has been defined, the algorithm can start the search for the best subset. A search strategy, such as exhaustive search, best-first, simulated annealing, genetic algorithms or branch and bound, defines the order in which the variable subsets are evaluated (see [26] for a review). Wrapper approaches aim at improving the results of the specific predictors or clustering algorithms they work with. Wrapper methods use a learning machine to assess the quality of subsets of features without using knowledge about the specific structure of the learning machine's function, and can thus be combined with any learning machine.

2.3.3 Embedded Methods

Embedded methods are algorithms in which feature selection is built into the predictor. In embedded methods the learning and the feature selection parts cannot be separated or used with other learning algorithms. An example of such a model is the decision tree induction algorithm, in which a feature has to be selected at each branching node [4].

2.4 Feature selection learning algorithms

Feature selection algorithms can be grouped according to the type of learning into supervised, unsupervised or semi-supervised.

2.4.1 Supervised feature selection

Feature selection in supervised learning has been extensively studied [6]. Supervised feature selection algorithms depend on measures that take the class information into account. In essence, supervised feature selection algorithms try to find features that help separate data of different classes. In the case of regression, feature selection is conducted by choosing the variables that most reduce the residual sum of squares, as in forward stepwise selection, or by minimizing a penalized criterion [36].

2.4.2 Semi-supervised feature selection

When a small number of instances are labelled but the majority are not, semi-supervised feature selection is designed to take advantage of both the large number of unlabelled instances and the labelling information as in semi-supervised learning [40]. Intuitively, the additional labelling information should help constrain the search space of unsupervised feature selection.


2.4.3 Unsupervised feature selection

Unsupervised feature selection aims to find a subset of features according to a certain criterion without prior information. Unsupervised feature selection can be divided into three categories:

Redundancy based

The idea in this category is to eliminate redundancy among the input features. Dy and Brodley define a redundant feature in [12]. Correlation methods and mutual information are used to detect redundant variables. P. Mitra et al. [29] proposed a maximal information compression index for measuring similarity between features; the proposed method is based on the lowest eigenvalue of the covariance matrix of two variables. Self-organizing maps (SOM) have also been used as a visualization method for detecting correlation between features in [37] [20].

Entropy based

Entropy-based methods can be used as an evaluation method for this criterion. Entropy methods for variable comparison have also been studied in the clustering literature [14] [15]. In this category a relevant subset is one that forms clusters in the data space, since uniformly distributed features do not provide any useful information for clustering [7] [6].

Clustering based

In this category, clustering quality assessment methods are used as the evaluation method. The Davies-Bouldin validity index [8] was proposed as a subset evaluation criterion in [17], where it was assumed that the features are normally distributed. In Dy and Brodley [12] [11], scatter matrices and separability criteria were used as stopping criteria for the feature selection search process.


Chapter 3

The Method

3.1 Introduction

The idea is to explore a data set and look for non-random relationships. The intention is that such relationships may hold important information about the data, information that can be used to classify the data or build models for the process that the data describes. Obviously, time constraints or the need to do this on board embedded hardware make it impossible to search through all possible groupings of variables¹. This chapter presents our approach for grouping or clustering variables according to their relationships, where no assumptions are made concerning the number of groups or the group structure. The clustering is based on pairwise similarities between variables. If we have N variables in total, an N×N similarity matrix (affinity matrix) can be formed. We then perform a hierarchical cluster analysis based on the pairwise similarities. In this way, the algorithm searches for good, but not necessarily the best, groupings in a reasonable time. We present in this chapter different methods for measuring the similarity between variables. After this we present different methods for choosing a reasonable number of clusters in hierarchical clustering.

3.2 Similarity Measures

In this section we study three similarity measures.

3.2.1 Pearson’s Correlation

A commonly used similarity measure is the sample linear correlation coefficient (or Pearson's product-moment correlation coefficient) [25]. The sample correlation coefficient for two variables X and Y is defined by:

¹ The number of ways of sorting n objects into k non-empty groups is a Stirling number of the second kind, given by \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^{n}. Adding these numbers for k = 1, 2, \ldots, n groups, we obtain the total number of possible ways to sort n objects into groups.


R_{XY} = \frac{S_{XY}}{\sqrt{S_{XX}}\,\sqrt{S_{YY}}} = \frac{\sum_{j=1}^{M}(X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j=1}^{M}(X_j - \bar{X})^2}\;\sqrt{\sum_{j=1}^{M}(Y_j - \bar{Y})^2}}    (3.1)

where M is the number of observations. The correlation has a magnitude bounded between −1 and +1. The value +1 means complete linear association between the two variables, and −1 means linear association in the negative direction. The value of R remains unchanged if the measurements of the variables X and Y are changed linearly. Figure 3.1 shows different sets of two variables, X and Y, with the Pearson's correlation coefficient for each set.

Figure 3.1: Pearson’s correlation for six different relationships
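As an illustration of Equation 3.1, the coefficient can be computed directly from two signals. The following Python sketch is not part of the original thesis; it is a minimal illustration using NumPy, and the example data are made up.

```python
import numpy as np

def pearson_correlation(x, y):
    """Sample linear correlation coefficient R_XY of Equation 3.1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()                     # centred observations
    yc = y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

# A noisy linear relationship gives a magnitude close to 1
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.1 * rng.normal(size=1000)
print(pearson_correlation(x, y))          # close to +1
print(np.corrcoef(x, y)[0, 1])            # NumPy's built-in, for comparison
```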

3.2.2 Spearman’s Rank Correlation

The linear Pearson's correlation does not reveal all there is to know about the relation between two variables. Non-linear relationships can exist which are not revealed by R_XY. Spearman's rank correlation [28] measures the statistical dependence between two variables by evaluating how well the relationship between them can be described by a monotonic function. It is defined by:

\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}    (3.2)

where N is the number of observations and d_i is the difference between the ranks of the i-th pair of observations. Figure 3.2 shows different sets of two variables, X and Y, with the Spearman's correlation coefficient for each set.

Figure 3.2: Spearman’s correlation for six different relationships

3.2.3 Symmetrical Uncertainty Correlation

Symmetrical uncertainty is an information-theory-based method for variable comparison. In this section we review some of the fundamental concepts of information theory and then show how those concepts can be used for assessing the relationship between variables.

The information entropy of a random variable X that takes on possible values in the domain \mathcal{X} = \{x_1, x_2, \ldots, x_n\} is defined by:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)    (3.3)

The joint entropy of two random variables X and Y is defined by:

H(X, Y) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x, y)    (3.4)

The mutual information between two random variables X and Y with respective domains \mathcal{X} and \mathcal{Y} is defined by:

I(X, Y) = H(X) + H(Y) − H(X, Y) (3.5)

The mutual information is a symmetric measure that quantifies the mutual dependence between two random variables, or the information that X and Y share. It measures how much knowing one variable reduces the uncertainty about the other. The mutual information thus measures the information shared by two variables and, in that sense, their similarity. It is a non-negative quantity, upper bounded by both entropies H(X) and H(Y), i.e. I(X, Y) ≤ min{H(X), H(Y)}. If we want to use the mutual information as a similarity measure, its value has to be normalized. The normalized version of the mutual information is called the symmetrical uncertainty [39], defined by:

SU(X, Y) = \frac{2\, I(X, Y)}{H(X) + H(Y)}    (3.6)

A feature Y is regarded as more similar to feature X than to feature Z if SU(X, Y) > SU(Z, Y). Furthermore, SU is normalized to the range [0, 1], with the value 1 indicating that knowledge of the value of either variable completely predicts the value of the other, and the value 0 indicating that X and Y are independent. In addition, it treats a pair of features symmetrically. Figure 3.3 shows different sets of two variables, X and Y, with the symmetrical uncertainty of the two variables for each set.
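The quantities above can be estimated from sampled signals. The sketch below is illustrative only (it is not the thesis implementation); it discretizes continuous signals into a fixed number of histogram bins, which is an assumption, since the thesis does not specify the discretization.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y, bins=20):
    """SU(X, Y) = 2 I(X, Y) / (H(X) + H(Y)), estimated via histograms (Eqs. 3.3-3.6)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint distribution p(x, y)
    px = pxy.sum(axis=1)                  # marginal p(x)
    py = pxy.sum(axis=0)                  # marginal p(y)
    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    mi = hx + hy - hxy                    # I(X, Y) = H(X) + H(Y) - H(X, Y)
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
print(symmetrical_uncertainty(x, np.sin(3 * x)))              # non-linear relation: high SU
print(symmetrical_uncertainty(x, rng.uniform(-1, 1, 5000)))   # independent signals: near 0
```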

3.3 Distance Measures

Usually, a distance (proximity) matrix is the input parameter for the clustering process. The correlation coefficient cannot be applied directly as a distance measure because it can be negative. One way to convert the similarity matrix into an adequate distance matrix for clustering is

D_{ij} = 1 - |S_{ij}|    (3.7)


Figure 3.3: Symmetrical uncertainty’s correlation for six different relationships

3.4 Clustering

We can hardly examine all grouping possibilities by brute-force search. Instead, we use the hierarchical clustering algorithm based on pairwise similarity [9]. Hierarchical clustering finds reasonable clusters without having to look at all possibilities. It proceeds by either successive mergers or successive divisions.

1 Agglomerative hierarchical: initially, it generates as many clusters as there are variables. The most similar variables are grouped first, and these initial groups are then merged according to their distance. As the distance increases, all subgroups are eventually fused into a single cluster.

2 Divisive hierarchical: initially, it generates one single cluster with all the variables in it. This cluster is then divided into two clusters such that the variables in one cluster are far from those in the other. Eventually, each variable ends up in a cluster of its own. Agglomerative hierarchical clustering has been applied in this work. The final hierarchical structure is represented by a tree called a dendrogram, see Figure 3.4, where the Y axis represents the distances between the items being connected and the X axis represents the items to be clustered.

The dendrogram illustrates the mergers or divisions that have been made at each level. In this section we briefly review the agglomerative techniques, in particular the single linkage method, which has been used in this work.


Figure 3.4: The dendrogram for the eight dimensional data set 7

3.4.1 Linkage Methods

Linkage methods are suitable for clustering observations and variables. Linkage methods are divided into three categories:

1 Single linkage: clusters are fused according to the distance between their nearest members, see Figure 3.5

2 Complete linkage: clusters are fused according to the distance between their farthest members, see Figure 3.6

3 Average linkage: clusters are fused according to the average distance between pairs of members in the respective sets, see Figure 3.7

Figure 3.5: The merging of clusters under the single linkage criteria

The following are the steps for grouping N variables in the agglomerative hierarchical clustering algorithm:


Figure 3.6: The merging of clusters under the complete linkage criteria

Figure 3.7: The merging of clusters under the average linkage criteria

1 Start with N clusters, each containing one variable.

2 Search the distance matrix for the nearest pair of clusters say A and B, and let the distance between them be dAB.

3 Merge clusters A and B and label the newly formed cluster AB.

4 Delete the rows and columns that correspond to these two clusters.

5 Add a new row and column giving the distances between the newly formed cluster AB and the remaining clusters.

6 Repeat steps 2 to 5 a total of N − 1 times until all the objects end up in one big cluster.

7 After the algorithm stops, record the clusters that have been merged and the distances at which the merging took place.

Single Linkage

Clusters are formed from the individual entities by merging nearest neighbours, i.e. at the smallest distance. For step 3 of the general algorithm explained previously, the distance between AB and any other cluster C is computed by:


d_{(AB)C} = \min\{d_{AC}, d_{BC}\}    (3.8)

where d_{AC} and d_{BC} are the distances between the nearest neighbours of clusters A and C and of clusters B and C, respectively.
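The chain similarity matrix → distance transform (Equation 3.7) → single-linkage agglomeration → cluster extraction at a cutoff can be sketched with SciPy as follows. This is an illustrative reimplementation, not the thesis code, and the toy similarity matrix and cutoff value are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(S, d_cutoff):
    """S: N x N symmetric similarity matrix (e.g. |R|, |rho| or SU values).
    Returns one cluster label per feature from single-linkage clustering."""
    D = 1.0 - np.abs(S)                        # Equation 3.7: D_ij = 1 - |S_ij|
    np.fill_diagonal(D, 0.0)
    condensed = squareform(D, checks=False)    # condensed form expected by linkage
    Z = linkage(condensed, method='single')    # agglomerative, single linkage
    return fcluster(Z, t=d_cutoff, criterion='distance'), Z

# Toy example: features 0 and 1 are strongly related, feature 2 is not
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
labels, Z = cluster_features(S, d_cutoff=0.5)
print(labels)    # features 0 and 1 share a label, feature 2 gets its own
```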

3.5 Cluster Extraction

Agglomerative hierarchical clustering is one of the most frequently used approaches in unsupervised clustering. However, the algorithm does not by itself create clusters; it only computes a hierarchical representation of the data set. This makes it suitable as an automatic preprocessing step for other algorithms that operate on the hierarchical representation and decide on the natural clusters. Before we discuss the methods for determining the number of clusters, we need to define what a cluster is. In this work we aim to group the features that exhibit non-random relationships into one cluster, see Figure 3.8.

Figure 3.8: On the left, variables X,Y and Z have non random relationship. On the right they have random relationship

An example is the eight-dimensional data set 7 (see Chapter 4), which consists of two three-dimensional interesting subgroups, while the rest of the variables are random. The input parameter for the hierarchical clustering process is the symmetrical uncertainty distance matrix. The output from the clustering process is the dendrogram illustrated in Figure 3.4. In this section we study the resulting dendrogram in order to select the two interesting subgroups automatically and to neglect the random variables. As explained before, the hierarchical clustering starts by creating a separate cluster for each variable. In the next step it searches the distance matrix for the nearest pair of clusters and then merges this pair. Each merging step is called a level. The hierarchical clustering uses the distance between the merged features to represent the level at which the merging happened. It repeats this procedure until all the objects are represented in a single tree. In order to construct clusters from the hierarchical cluster tree generated by the linkage function, we need to choose a cutoff distance, d_cutoff. Clusters are formed when a node and all of its subnodes have distance values less than d_cutoff. For data set 7 we have only eight possible outcomes from the hierarchical clustering process. Any value for d_cutoff between 0.76 and 0.92 will produce the right answer, see Figure 3.9.

Figure 3.9: Any value for d_cutoff within the range between the two red lines produces the right answer

We desire an algorithm that efficiently determines a reasonable distance to return from the hierarchical clustering without the need for setting any parameters.

3.5.1 Cluster validity

Cluster validation is a very important step in clustering analysis. The result of the clustering procedure needs to be validated in order to choose the right clustering output. In most clustering algorithms, the number of clusters is set by the user. However, this is not the case with hierarchical clustering. We seek an algorithm that can determine a reasonable distance to return from the hierarchical clustering. There are many algorithms for finding the best number of clusters [8] [30] [10] [24] [31]. We have studied two variants and chosen one for our problem.

Davies-Bouldin Validity Index

The Davies-Bouldin (DB) validity index [8] is a function of the ratio of the sum of within cluster scatter to between-cluster separation.

DB = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} \left\{ \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right\}    (3.9)

where

• N is the number of clusters,
• S_n(Q_i) is the average distance of all objects in cluster Q_i to the cluster centre,
• S(Q_i, Q_j) is the distance between the centres of clusters Q_i and Q_j.

This ratio is small if the clusters are compact and far from each other. Hence, the DB index will have a small value for a good clustering. Let us recall data set 7. For each level the DB index is calculated, giving one value per level (the number of levels is one less than the number of variables). See Figure 3.10.

Figure 3.10: On the left, DB index for each level. Note that the DB index for the last level is infinity. On the right the clustering output using DB index on data set 7


The DB index is zero for the first level. That is because every cluster then contains a single variable, and the distance from a single variable to its mean is zero. The DB index increases when we increase d_cutoff. Hence, the clustering output will always be the first level, which is not the right clustering output.
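For reference, a direct implementation of Equation 3.9 is sketched below. It is illustrative Python, not the thesis code; it evaluates one candidate clustering of points, and in the procedure above it would be called once per dendrogram level.

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index of Equation 3.9 for data X (rows = objects) and integer cluster
    labels; lower values indicate compact, well-separated clusters."""
    clusters = np.unique(labels)
    centres = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # S_n(Q_i): average distance of the members of cluster i to its centre
    scatter = np.array([np.linalg.norm(X[labels == c] - centres[i], axis=1).mean()
                        for i, c in enumerate(clusters)])
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centres[i] - centres[j])
                  for j in range(len(clusters)) if j != i]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))

# Two well-separated blobs give a small DB index
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print(davies_bouldin(X, np.array([0] * 50 + [1] * 50)))
```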

Modifying The Davies-Bouldin Validity Index

The DB index always looks for compact clusters that are well separated. By default it chooses the first level, and this is not always the right answer. Also, choosing the first level means choosing the level with maximum number of clusters. We tried to add a penalty on the number of the generated clusters, in the hope of getting a reasonable number of compact clusters:

DB = N \left( \sum_{i=1}^{N} \max_{j \neq i} \left\{ \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right\} + 1 \right)    (3.10)

In this way, the more clusters we have, the larger the DB index will be. Let us calculate the modified DB index for each level of data set 7. Figure 3.11 illustrates the results. The index has a high value at the first level and decreases when we increase d_cutoff. Hence, the clustering output will always be the final level. Again, this is not the right clustering output.

Figure 3.11: On the left, the modified DB index for each level. Note that the modified DB index for the last level is infinity. On the right, the clustering output using the modified DB index on data set 7


The L Method

The DB index method is based on the ratio of the sum of within-cluster scatter to between-cluster separation. It is not able to detect the right distance for grouping clusters with relationships. We are looking for a method that can increase the distance so as to group levels that are close to each other, since levels that are close together contain similar variables. On the other hand, the method has to stop when the distance between the levels increases, so as not to merge non-similar clusters. We need to study the relation between the number of clusters generated at each level and the distance between the levels to see how we can choose the right distance. Figure 3.12 shows this relationship, where the y-axis is the distance and the x-axis is the number of clusters.

Figure 3.12: Number of clusters generated at each level vs. the distance

The appropriate distance lies in the range between 0.76 and 0.92, i.e. at the curved transition area of the graph where the number of generated clusters equals 4. Salvador et al. [31] proposed an algorithm, the L method, that detects the curved transition area of a graph, or what is called the “knee”. This method can find a knee in the graph without the need to set any parameters. It is based on the assumption that the regions to the left and right of the knee are approximately linear. By fitting two lines to the left and right sides, the intersection of these two lines gives the knee, see Figure 3.13.

Each line must contain at least two points and start at either end of the data. Both lines together should cover all of the data points, so if one line is short, the other line is long enough to cover the remaining data points.


Figure 3.13: Fitting two lines to the points to detect a knee

The total number of line pairs is N − 3, where N is the number of variables, see Figure 3.14.

The algorithm starts the search from the left side of the graph and iteratively calculates the total root mean squared error, RMSE_c, for each possible pair of lines:

RMSE_c = l_L \times RMSE_L + l_R \times RMSE_R    (3.11)

where l_L is the number of points on the left line divided by the total number of points in the whole graph, and similarly for l_R. We seek the pair of lines that minimizes RMSE_c. We then calculate the intersection between the two lines with the minimum RMSE_c, which gives the appropriate d_cutoff and a reasonable number of clusters. Figure 3.15 shows the results of applying the L method on data set 7. As we can see, the method indeed finds the first knee in the graph. The output clusters are correct, since variables 3, 5 and 7 are related and variables 1, 4 and 8 are also related. But this is not the only answer we seek. We want the L method to look further for more knees in order to group more features that may have relationships.

Modifying the L Method

Figure 3.14: The L method

The L method indeed returns one knee. However, the resulting clusters are not the complete answer we desire. We seek a result at a lower distance, i.e. at the next knee. We want to force the L method to look for more knees. We have modified the algorithm as explained in Algorithm 1.

Algorithm 1: Modified L method
  Data: number of clusters vs. distance graph
  Result: suggested distance(s) knees
  1  currentKnee = FindKnee(graph);
  2  knees = [currentKnee];
  3  while enough points remain in graph to fit two lines do
  4      graph = graph with currentKnee and all points of lower distance removed;
  5      currentKnee = FindKnee(graph);
  6      knees = [knees, currentKnee];

The algorithm calls the L method iteratively. Each call detects a knee and then removes the detected knee as well as the points before it, i.e. the points with lower distances. The algorithm stops when it reaches the highest distance in the graph. The result of applying the modified L method on data set 7 is illustrated in Figure 3.16.
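A minimal Python sketch of the L method (Equation 3.11) and of the modified version in Algorithm 1 is given below. It is an illustrative reimplementation, not the thesis code: the graph is assumed to be given as points (number of clusters, merge distance) ordered by increasing distance, and the helper names and the stopping rule min_points are assumptions.

```python
import numpy as np

def fit_rmse(x, y):
    """Root mean squared error of the best least-squares line through (x, y)."""
    coeffs = np.polyfit(x, y, 1)
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

def find_knee(x, y):
    """L method: split index minimizing RMSE_c = l_L * RMSE_L + l_R * RMSE_R
    (Equation 3.11); each side of the split must contain at least two points."""
    n = len(x)
    best_idx, best_cost = None, np.inf
    for c in range(2, n - 1):               # left = points [0, c), right = points [c, n)
        cost = (c / n) * fit_rmse(x[:c], y[:c]) + ((n - c) / n) * fit_rmse(x[c:], y[c:])
        if cost < best_cost:
            best_idx, best_cost = c, cost
    return best_idx

def modified_l_method(x, y, min_points=4):
    """Algorithm 1 (sketch): repeatedly find a knee, then drop the knee and all
    points of lower distance, collecting one suggested knee per iteration."""
    knees = []
    while len(x) >= min_points:
        c = find_knee(x, y)
        knees.append((x[c], y[c]))           # (number of clusters, distance) at the knee
        x, y = x[c + 1:], y[c + 1:]          # keep only the higher-distance part
    return knees

# Illustrative merge distances (made-up values), ordered by increasing distance
distance = np.array([0.05, 0.10, 0.20, 0.30, 0.76, 0.92, 1.00])
num_clusters = np.arange(len(distance), 0, -1)
print(modified_l_method(num_clusters, distance))
```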

The modified L method returned three knees when applied to data set 7. The first knee was not exactly the result we looked for; we cannot call it a wrong answer, since the clustered features are indeed related, but it was not the complete answer. The second knee returned exactly what we are looking for.


Figure 3.15: Results of applying the L method on data set 7: dendrogram on the left and number of clusters vs. distance graph on the right

3.6 Cluster assessment (ranking)

So far, the algorithm groups the features that have relationships. However, it does not give any feedback about how good a relationship is. This section shows how the K-means clustering algorithm and the SOM can be used to rank the resulting clusters.

3.6.1 K-means Clustering Methods

K-means clustering is designed to group items, rather than variables, into K clusters [3]. The number of clusters, K, may be specified in advance or determined as part of the clustering procedure. One way of starting the clustering procedure is to randomly select seed points from among the items or to randomly partition the items into initial groups. The clustering procedure is composed of three steps:

1 Partition the items into K initial clusters.

2 Proceed through the list of items, assigning each item to the cluster whose centroid is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

3 Repeat Step 2 until no more reassignments take place.

We can define an objective function, sometimes called the distortion measure, which the K-means algorithm minimizes in Step 2:


Figure 3.16: Results of applying the modified L method on data set 7: dendrogram on the left and number of clusters vs. distance graph on the right

J = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2    (3.12)

where:

1 x_n is the n-th data point,

2 \mu_k is the centroid of cluster k,

3 r_{nk} ∈ {0, 1} is the assignment indicator, equal to 1 if data point n is assigned to cluster k and 0 otherwise,

4 N is the number of items and K is the number of clusters.

The distortion can be used as an assessment measure for the relation between variables. A non-random cluster should have a low distortion value, while a random cluster should have a high distortion value. Figure 3.17 shows assessment results for different clusters that contain features exhibiting random and non-random relationships. From the figure we can see that the highest distortion value was obtained for the second cluster in the first row, which contains a random relationship.
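A sketch of this assessment with scikit-learn is shown below. It is illustrative only: the thesis does not state the implementation or the number of clusters K, so K = 10 and the example data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_distortion(X, k=10, seed=0):
    """Distortion J of Equation 3.12: mean squared distance of each item to its
    assigned centroid (inertia_ is the sum over items, so divide by N)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.inertia_ / X.shape[0]

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 1000)
related = np.column_stack([np.cos(t), np.sin(t)])     # two related features (a circle)
random_pair = rng.uniform(-1, 1, (1000, 2))           # two unrelated features
print(kmeans_distortion(related), kmeans_distortion(random_pair))  # structured pair is lower
```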

3.6.2 Self-organizing maps

A self-organizing map (SOM) is an artificial neural network that is trained using unsupervised learning to produce a discrete low-dimensional representation of the input space of the training samples. It consists of neurons organized on a two- or three-dimensional rectangular or hexagonal grid, see Figure 3.18.


Figure 3.17: K-means distortion measure for different relationships

Each neuron is a D-dimensional weight vector (codebook vector, prototype vector), where D is equal to the dimension of the input vectors. The neurons are connected to adjacent neurons by a neighbourhood relation, which dictates the topology of the map. The procedure for placing a vector from data space onto the SOM is to find the node with the weight vector closest to the vector taken from data space, and to assign the map coordinates of this node to the vector. The distortion measure the neurons minimize is defined as:

E = \frac{1}{2N} \sum_{n=1}^{N} \sum_{k=1}^{K} \Lambda_k[x(n)] \, \| x(n) - w_k \|^2    (3.13)

where w_k is the k-th prototype vector of the SOM, x(n) is the n-th data vector, N is the total number of items and \Lambda_k is the neighbourhood function. The distortion can be used as an assessment measure for the cluster quality: a non-random relationship should have a low distortion value and a random one a high distortion value. Figure 3.19 shows assessment results for different clusters that contain features exhibiting random and non-random relationships. From the figure we can see that the highest distortion value was obtained for the second cluster in the first row, which contains a random relationship. For a more complete description of the SOM refer to [27].
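The sketch below trains a small SOM from scratch and evaluates the distortion of Equation 3.13 with a Gaussian neighbourhood. It is illustrative only: the thesis does not state the SOM implementation, map size, grid shape or training schedule, so a rectangular grid and simple online training are assumed here.

```python
import numpy as np

def train_som(X, rows=6, cols=6, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Online training of a rows x cols SOM on a rectangular grid."""
    rng = np.random.default_rng(seed)
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)], dtype=float)
    W = rng.normal(size=(rows * cols, X.shape[1]))            # codebook vectors
    for t in range(iters):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))        # best matching unit
        lr = lr0 * np.exp(-t / iters)                         # decaying learning rate
        sigma = sigma0 * np.exp(-t / iters)                   # decaying neighbourhood width
        h = np.exp(-np.linalg.norm(grid - grid[bmu], axis=1) ** 2 / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)                        # neighbourhood update
    return W, grid

def som_distortion(X, W, grid, sigma=1.0):
    """Distortion E of Equation 3.13 with a Gaussian neighbourhood Lambda_k[x(n)]."""
    E = 0.0
    for x in X:
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))
        lam = np.exp(-np.linalg.norm(grid - grid[bmu], axis=1) ** 2 / (2 * sigma ** 2))
        E += np.sum(lam * np.linalg.norm(W - x, axis=1) ** 2)
    return E / (2 * len(X))

rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 500)
related = np.column_stack([np.cos(t), np.sin(t)])             # structured pair of features
random_pair = rng.uniform(-1, 1, (500, 2))                    # random pair of features
W1, grid = train_som(related)
W2, _ = train_som(random_pair)
print(som_distortion(related, W1, grid), som_distortion(random_pair, W2, grid))
```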


Figure 3.18: Self-organizing map neurons arranged in a hexagonal and a rectangular grid

3.6.3 Normalized distortion measure for cluster ranking

In this work a good relationship is a relationship with a low distortion value for the K-means or the SOM. To use this value as an assessment measure it should be bounded and normalized to give quantitative and descriptive values. The normalized distortion measure for both K-means and SOM is defined by:

Q = \frac{J_a}{J_b}    (3.14)

where J_a is the distortion of the cluster under assessment and J_b is the distortion of a cluster containing random variables. The ranking procedure is as follows: after the algorithm has clustered the features, it ranks the clusters that contain more than one feature. For each cluster to be ranked, the algorithm generates a cluster that contains as many random features as the cluster under assessment, and then calculates Q. A low Q value means that the distortion in the numerator is lower than the distortion in the denominator; hence, the cluster contains an interesting relationship. A Q value close to unity indicates that the cluster contains no interesting relationships, see Figure 3.20. From that figure we notice that there is a difference between the two quality assessment measures for the same relationship. The SOM method gives a higher quality for most of the relationships than the K-means, whereas the K-means gave a higher quality for the linear relationship than the SOM. Both methods gave equal assessment values for the random relationship.

Figure 3.19: SOM distortion measure for different relationships
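A minimal sketch of the ranking step is given below. It is illustrative only: it uses the K-means distortion for both J_a and J_b, and the number of clusters and the uniform distribution of the random reference cluster are assumptions not specified at this level of detail in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def distortion(X, k=10, seed=0):
    """K-means distortion (Equation 3.12): inertia_ divided by the number of items."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_ / len(X)

def quality(X_cluster, seed=0):
    """Q = J_a / J_b (Equation 3.14): distortion of the candidate feature cluster divided
    by the distortion of an equally sized cluster of random features."""
    rng = np.random.default_rng(seed)
    X_random = rng.uniform(-1, 1, X_cluster.shape)     # same number of items and features
    return distortion(X_cluster) / distortion(X_random)

# Q well below 1 suggests an interesting (non-random) relationship
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 1000)
print(quality(np.column_stack([np.cos(t), np.sin(t)])))    # structured pair: Q < 1
print(quality(rng.uniform(-1, 1, (1000, 2))))              # random pair: Q near 1
```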


Chapter 4

The Data


This chapter presents the experimental setup used to test the proposed algorithm.

4.1 The organization of data

The data sets are organized in the form of spreadsheet or table data. Each row of the table is one data sample. The columns of the table are the variables of the data set, as follows:

            Feature 1   Feature 2   ...   Feature n
Item 1      x_11        x_12        ...   x_1n
Item 2      x_21        x_22        ...   x_2n
...         ...         ...         ...   ...
Item m      x_m1        x_m2        ...   x_mn

where x_mn is the measurement of the n-th feature on the m-th item.

4.1.1 Data Sets

To test the proposed algorithm, eight synthetic data sets were created. These data sets vary in size and in the number of interesting clusters. Some data sets contain only spherical relationships and some contain linear relationships. The eight data sets that were used are:

Data Set 1

This is a simple clustering data set to test if the method can detect a non-random cluster distribution. There are 1500 observations and 4 features. The first two features are related through clusters, see Figure 4.1. The other two features are random. The answer we expect after applying our algorithm on this data set is that features 1 and 2 exhibit a non-random relationship with each other, whereas the other two have no relationship with any other feature.

Figure 4.1: The right answer on data set 1

Data Set 2

This is a simple linear relationship, see Figure 4.2. The data set has 1500 observations and 4 features, where features 1 and 2 are perfectly linearly correlated and the others are just random numbers. The answer we expect after applying our algorithm here is that feature 1 and feature 2 exhibit a non-random relationship with each other and the other two have no relationships to any features.
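For illustration, a data set with the same structure as Data Set 2 (two perfectly linearly correlated features plus two random ones) could be generated as below. The exact generator used in the thesis is not documented, so the distributions and coefficients here are assumptions.

```python
import numpy as np

def make_dataset2_like(n_obs=1500, seed=0):
    """Four features: F1 and F2 perfectly linearly related, F3 and F4 random."""
    rng = np.random.default_rng(seed)
    f1 = rng.uniform(-1, 1, n_obs)
    f2 = 2.0 * f1 + 1.0                   # exact linear relationship with F1
    f3 = rng.uniform(-1, 1, n_obs)        # unrelated noise
    f4 = rng.normal(size=n_obs)           # unrelated noise
    return np.column_stack([f1, f2, f3, f4])

X = make_dataset2_like()
print(X.shape)                                # (1500, 4)
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])    # exactly 1.0 for F1 vs F2
```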

Data Set 3

This is a much more difficult situation than the previous data sets. The data set has 500 observations and 9 features. Features 4, 5 and 6 exhibit a non linear relationship; they lie on the surface of a 3-dimensional sphere (see Figure 4.3). The other features are random numbers. The answer we expect after applying our proposed algorithm on this data set is that feature 4, feature 5 and feature 6 have a mutual relation and that other variables have no relationships with each other. We expect feature selection measures that build on linear relationships to fail on this data set.


Figure 4.2: The right answer on data set 2

Data Set 4

This data set has 900 observations and 7 features. Features 3, 4 and 5 have a non-linear relationship through Bessel functions, e.g. the Matlab logo, see Figure 4.4. The other four features are random numbers. This problem is non linear but not as difficult as the half-sphere (Data Set 3). The answer we expect after applying our algorithm on this data set is that features 3, 4 and 5 have a mutual relationship and that other variables exhibit no relationships. We expect methods that build on rankings to be successful here, perhaps also linear methods. The observations are not randomly sampled but sampled on a grid (the Matlab logo script was used to generate the data), which may create "artificial” relationships between the features.

Data Set 5

This data set has 900 observations and 6 features. It is very similar to Data Set 4 except that Bessel functions on a higher order are used. Features 1, 3 and 5 are related (i.e. are points on the surface of the "logo”), see Figure 4.5, whereas the remaining three features are random numbers. The answer we expect after applying the algorithm on this data set is that feature 1, feature 3 and feature 5 are related but that other features are not related to anything.


Figure 4.3: The right answer on data set 3

Data Set 6

This data set is a combination of Data Set 3 and Data Set 4. The idea of combining them is to see how well the methods are able to detect multiple relationships in a data set and group these relationships correctly. The first 500 observations from Data Set 4 were merged with Data Set 3. The answer we expect with our method is that there are two groups of features that exhibit mutual relationships: features 4, 5 and 6; and features 12, 13 and 14. The other features have no relationships to anything.

Data Set 7

This data set has 4000 observations and 8 features. It is a combination of features from real data sets and some random number features. Features 1, 4 and 8 come from a chemical process plant and have a non linear relationship. Features 3, 5 and 7 are signals collected on board a heavy duty truck driving on the road. Figure 4.6 illustrates both relationships. Features 3, 5 and 7 have been rotated in their subspace to make it more difficult to spot the relationships with linear or ranking methods (i.e. relationships are not obvious if you plot two features against each other). The other features (2 and 6) are random numbers. The answer we expect after applying our algorithm on this data set is two groups of related features: 1, 4 and 8 in one group; and 3, 5 and 7 in another group.


Figure 4.4: The right answer on data set 4

Data Set 8

This data set has 10000 observations and 9 features. It is similar to Data Set 7, i.e. it is a combination of data from real processes. Features 1, 6 and 7 are signals collected on a heavy duty truck driving on the road. However, the signals have been transformed so that they form an almost cylindrical volume, to make the problem more difficult. Features 2, 3, 4 and 8 are different rotation symmetry features extracted from images. These have also been rotated to make it a bit more difficult to spot the relationship. The image features are not very correlated to each other and it is questionable whether it is possible to see any relationships here. The remaining features (5 and 9) are random numbers. The answer we expect after applying the algorithm to this data set is that there are two separate groups of features: features 1, 6 and 7; and features 2, 3, 4 and 8 (see Figure 4.7). Of course, subsets of the features in these groups are also partially correct answers.

Li-ion Battery Data set

This data set has 169 observations and 17 features. It has been collected from Li-ion batteries at the NASA Ames Prognostics Center of Excellence (PCoE). For the source and a detailed explanation of the data refer to [22]. For this data set we do not know what the right answer is.


Figure 4.5: The right answer on data set 5

4.2 Pre-processing of data

Prior to measuring the correlation between the variables, the data are often pre-treated. Pre-processing is done in order to transform the data into the most suitable form for analysis. In this section, we describe two common pre-processing methods: scaling of the data and mean centring.

4.2.1 Scaling

Usually, the variables come from different sources and may therefore have different numerical ranges. A variable with a large numerical range will get a large variance, whereas a variable with a small numerical range will get a low variance. A variable with a larger variance then has a better chance of being expressed in the modelling than a low-variance variable. One way to scale the data is unit variance (UV) scaling. For each variable one calculates the standard deviation S_n and forms the scaling weight as the inverse of the standard deviation, 1/S_n. Each column is then multiplied by 1/S_n. This ensures that each scaled variable has unit variance:

\sigma_n^2 = \frac{1}{m - 1} \sum_{i=1}^{m} (x_{in} - \mu_n)^2 = 1, \quad \forall n    (4.1)


Figure 4.6: The right answer on data set 7

4.2.2 Mean Centring

Mean centring is the second pre-processing procedure. With mean centring, the average value of each variable is calculated and then subtracted from the data, so that

\mu_n = \frac{1}{m} \sum_{i=1}^{m} x_{in} = 0, \quad \forall n    (4.2)
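Together, the two pre-processing steps amount to standardizing each column. A short illustrative sketch of Equations 4.1 and 4.2 (not the thesis code):

```python
import numpy as np

def preprocess(X):
    """Mean-centre each feature (Equation 4.2) and scale it to unit variance
    (UV scaling, Equation 4.1). Rows are items, columns are features."""
    mu = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)             # standard deviation S_n with the 1/(m-1) factor
    return (X - mu) / s

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[0.1, 10.0, 3.0], size=(200, 3))
Z = preprocess(X)
print(Z.mean(axis=0).round(6))            # approximately 0 for every feature
print(Z.var(axis=0, ddof=1).round(6))     # 1 for every feature
```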


Chapter 5

Results and Conclusion

In this chapter, we empirically evaluate the efficiency of the proposed method on eight synthetic data sets and on the Li-ion battery ageing data set. The evaluation is done in two ways:

1- Evaluation of similarity measures: for each data set, we apply the proposed algorithm using the different similarity measures to show empirically how they differ in revealing the right answer.

2- Evaluation of the ranking procedure: the clusters that contain the desired answers should be ranked highly by the SOM and K-means ranking methods (a sketch of such an exhaustive ranking is given after this list).

Note that any cluster containing a single feature is neglected.
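The ranking evaluation amounts to an exhaustive search over all feature combinations of size two or more, where each combination is scored with the normalized distortion quality measure from Chapter 3 (lower is better). A minimal Python sketch of this procedure is shown below; it assumes that the data is a NumPy array (observations × features) and that cluster_quality is a hypothetical placeholder for the SOM- or K-means-based quality measure:

```python
from itertools import combinations

def exhaustive_ranking(data, feature_names, cluster_quality):
    """Rank every feature combination of size >= 2 by its quality measure.

    cluster_quality(subset_data) is assumed to return the normalized
    distortion (SOM- or K-means-based); values below 1 indicate a
    non-random, potentially interesting group of features.
    """
    ranked = []
    n = len(feature_names)
    for size in range(2, n + 1):                  # single-feature clusters are neglected
        for subset in combinations(range(n), size):
            q = cluster_quality(data[:, list(subset)])
            ranked.append((q, [feature_names[i] for i in subset]))
    ranked.sort(key=lambda item: item[0])         # best (lowest) quality first
    return ranked
```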

5.1 Results

5.1.1 Data set 1 results

The results of applying the algorithm on data set 1 are shown in Table 5.1, and Figure 5.1 shows the variance of the quality measure for the right answer. The right answer for this data set is that F1 and F2 have a relationship; F3 and F4 are random. The table is divided into three parts, for the symmetrical uncertainty (SU), Pearson's correlation and Spearman's correlation methods.

Evaluation of similarity measures

• Symmetrical uncertainty: the algorithm returned one knee and one cluster containing F1 and F2, which is the right answer. The SOM method returned a higher rank (lower value) for the right answer than K-means. Both ranks are less than one, which means that this cluster contains non-random data.


Symmetrical uncertainty
Knee   Clustering output   SOM Q   K-means Q
1      F1, F2              0.65    0.88

Pearson's correlation
Knee   Clustering output   SOM Q   K-means Q
1      F1, F4              0.88    0.91

Spearman's correlation
Knee   Clustering output   SOM Q   K-means Q
1      F1, F4              0.90    0.92

Table 5.1: Quality results on data set 1. The correct answer is that F1 and F2 are related.

• Pearson's correlation: the algorithm returned one knee and one cluster containing F1 and F4, which is the wrong answer. The SOM method returned a higher rank (lower value) for this answer than K-means.

• Spearman's correlation: also could not detect the right answer. It returned one knee and one cluster containing F1 and F4, which is the wrong answer. The SOM method returned a higher rank (lower value) for this answer than K-means.

Evaluation of the ranking procedure

Table 5.2 shows the best five feature combinations ranked by the SOM and K-means methods.

• SOM ranking: the cluster containing the right answer is ranked as the best cluster.

• K-means ranking: the cluster containing the right answer is ranked as the third best cluster.

Clustered features   SOM Q      Clustered features   K-means Q
F1, F2               0.64       F1, F2, F3           0.69
F1, F2, F3           0.67       F1, F2, F4           0.70
F1, F2, F4           0.69       F1, F2               0.78
F1, F2, F3, F4       0.74       F1, F2, F3, F4       0.83
F2, F3               0.86       F2, F3               0.89

Table 5.2: Exhaustive search on data set 1. The table shows only the best 5 combinations. The right answer for this data set is F1 and F2.


Figure 5.1: The average quality and standard deviations for the right answer on data set 1, computed 10 times. The left bar represents the SOM and the right bar represents the K-means.

To summarize, the symmetrical uncertainty revealed the right answer where both of the other similarity measures could not. The SOM gave a higher ranking for the right answer and ranked it as the best answer; K-means, on the other hand, ranked it among the best 27% of answers. Finally, Figure 5.2 presents the distribution of the quality for all possible combinations using the SOM and K-means ranking methods.

5.1.2 Data set 2 results

The results of applying the algorithm on data set 2 are shown in Table 5.3, and Figure 5.3 shows the variance of the quality measure for the right answer. As previously mentioned, the right answer for this data set is that F1 and F2 have a relationship and F3 and F4 are random. The table is divided into three parts, for the SU, Pearson's correlation and Spearman's correlation methods.

Evaluation of similarity measures

The algorithm returned the same answer using all three similarity measures. It returned one knee and one cluster containing F1 and F2, which is the right answer. The K-means method returned a higher rank (lower value) for the right answer than the SOM. Both ranks are less than one, which means that this cluster contains non-random data.


Figure 5.2: Quality distribution for the exhaustive search on data set 1; SOM on the left and K-means on the right.

Evaluation of the ranking procedure

Table 5.4 shows the best five feature combinations ranked by the SOM and K-means methods.

• SOM ranking: the cluster containing the right answer is ranked as the third best cluster.

• K-means ranking: the cluster containing the right answer is ranked as the best cluster.

To summarize, the algorithm revealed the right answer using all three correlation methods. That was expected, as the relation between the two variables is perfectly linear. The SOM could not rank the right answer as the best answer; instead it ranked it among the best 27% of answers. K-means ranked the right answer as the best answer. Finally, Figure 5.4 presents the distribution of the quality for all possible combinations using the SOM and K-means ranking methods.

5.1.3 Data set 3 results

The results of applying the algorithm on data set 3 are shown in Table 5.5, and Figure 5.5 shows the variance of the quality measure for the right answer. The right answer for this data set is that F4, F5 and F6 have a relationship; the rest of the features are random. The table is divided into three parts, for the symmetrical uncertainty (SU), Pearson's correlation and Spearman's correlation methods.


Symmetrical uncertainty
Knee   Clustering output   SOM Q   K-means Q
1      F1, F2              0.61    0.52

Pearson's correlation
Knee   Clustering output   SOM Q   K-means Q
1      F1, F2              0.61    0.52

Spearman's correlation
Knee   Clustering output   SOM Q   K-means Q
1      F1, F2              0.62    0.53

Table 5.3: Quality results on data set 2. The correct answer is that F1 and F2 are related.

Clustered features   SOM Q      Clustered features   K-means Q
F1, F2, F4           0.49       F1, F2               0.53
F1, F2, F3           0.50       F1, F2, F3           0.69
F1, F2               0.67       F1, F2, F4           0.70
F1, F2, F3, F4       0.69       F1, F2, F3, F4       0.77
F2, F4               0.97       F2, F3, F4           0.97

Table 5.4: Exhaustive search on data set 2. The table shows only the best 5 combinations. The right answer for this data set is F1 and F2.

Evaluation of similarity measures

• Symmetrical uncertainty: the algorithm returned one knee and two clusters: the first cluster contains F1 and F7, and the second cluster contains F4, F5 and F6. The second cluster is the right answer. The SOM method returned a higher rank (lower value) for the right answer than K-means. Both ranks are less than one, which means that this cluster contains non-random data. Both ranking methods returned values ≈ 1 for the first cluster.

• Pearson's correlation: the algorithm returned three knees, with two clusters in each of the first and second knees and one cluster in the third knee. None of the returned clusters contains the right answer. Both ranking methods returned values ≈ 1 for all the returned clusters.

• Spearman's correlation: also could not detect the right answer. It returned one knee and one cluster containing F3 and F4, which is the wrong answer. Both the SOM method and K-means returned a rank ≈ 1 for this answer.


Figure 5.3: The average quality and standard deviations for the right answer on data set 2, computed 10 times. The left bar represents the SOM and the right bar represents the K-means.

Evaluation of the ranking procedure

Table 5.6 shows the best five feature combinations ranked by the SOM and K-means methods.

• SOM ranking: the cluster containing the right answer is ranked as the best cluster.

• K-means ranking: the cluster containing the right answer is ranked as the 18th best cluster.

To summarize, only the symmetrical uncertainty measure revealed the right answer; Pearson's and Spearman's correlations did not. In the exhaustive search, the SOM ranked the right answer as the best answer, while K-means ranked it among the best 3.6% of answers. Finally, Figure 5.6 presents the distribution of the quality for all possible combinations using the SOM and K-means ranking methods. Even though the right answer did not come out as the best answer in the K-means ranking, it was partially detected. The best cluster in the K-means exhaustive search contains two of the correlated features, F5 and F6, but adds to them a random variable, F9. The same holds for the second and third configurations. The fourth and fifth clusters contain the right answer, F4, F5 and F6, but some noisy variables also exist within these clusters.


Figure 5.4: Quality distribution for the exhaustive search on data set 2; SOM on the left and K-means on the right.

5.1.4 Data set 4 results

The results of applying the algorithm on data set 4 are shown in Table 5.7, and Figure 5.7 shows the variance of the quality measure for the right answer. The right answer for this data set is that F3, F4 and F5 have a relationship; the rest of the features are random. The table is divided into three parts, for the symmetrical uncertainty (SU), Pearson's correlation and Spearman's correlation methods.

Evaluation of similarity measures

The algorithm returned the same answer using all three similarity measures. It returned one knee and one cluster. The cluster contains F3, F4 and F5, which is the right answer. Both the SOM and K-means ranking methods returned the same rank for the answer. Both ranks are less than one, which means that this cluster contains non-random data.

Evaluation of the ranking procedure

Table 5.8 shows the best five feature combinations ranked by the SOM and K-means methods.

• SOM ranking: the cluster containing the right answer is ranked as the 3rd best cluster.


Symmetrical uncertainty
Knee   Clustering output        SOM Q   K-means Q
1      F1, F7                   0.95    1.0
       F4, F5, F6               0.61    0.83

Pearson's correlation
Knee   Clustering output        SOM Q   K-means Q
1      F1, F2, F3, F4, F9       1.0     1.0
       F6, F8                   0.86    0.91
2      F2, F3, F4, F9           0.97    0.96
       F6, F8                   0.84    0.90
3      F3, F4                   0.86    0.96

Spearman's correlation
Knee   Clustering output        SOM Q   K-means Q
1      F3, F4                   0.98    0.95

Table 5.5: Quality results on data set 3. The correct answer is that F4, F5 and F6 are related.

Clustered features   SOM Q      Clustered features    K-means Q
F4, F5, F6           0.60       F5, F6, F9            0.82
F4, F6               0.70       F1, F4, F6            0.83
F1, F4, F5, F6       0.71       F1, F2, F5, F6        0.83
F5, F6               0.73       F4, F5, F6, F7, F8    0.83
F2, F4, F5, F6       0.73       F4, F5, F6, F9        0.84

Table 5.6: Exhaustive search on data set 3. The table shows only the best 5 combinations. The right answer for this data set is F4, F5 and F6.

• K-means ranking: the cluster containing the right answer is ranked as the 2nd best cluster.

To summarize, all three similarity measure methods were successful here; the algorithm revealed the right answer using all three correlation methods. The SOM could not rank the right answer as the best answer; instead it ranked it among the best 2.5% of answers. However, the first and second best answers were subsets of the right answer, which means the algorithm partially detected it. K-means ranked the right answer among the best 1.7% of answers, and similarly the first answer was part of the right answer. Finally, Figure 5.8 presents the distribution of the quality for all possible feature combinations using the SOM and K-means ranking methods.


Figure 5.5: The average quality and standard deviations for the right answer on data set 3, computed 10 times. The left bar represents the SOM and the right bar represents the K-means.

5.1.5 Data set 5 results

The results of applying the algorithm on data set 5 are shown in Table 5.9, and Figure 5.9 shows the variance of the quality measure for the right answer. The right answer for this data set is that F1, F3 and F5 have a relationship and the rest of the features are random. The table is divided into three parts, for the symmetrical uncertainty (SU), Pearson's correlation and Spearman's correlation methods.

Evaluation of similarity measures

The algorithm returned the right answer for all similarity measure methods. The clusters containing the right answer were ranked with a value less than 1, which means that these clusters contain non-random relationships. The SOM ranked the right answer with a higher value than K-means.

Evaluation of the ranking procedure

Table 5.10 shows the best five feature combinations ranked by the SOM and K-means methods.

• SOM ranking: the cluster containing the right answer is ranked as the best cluster.

• K-means ranking: the cluster containing the right answer is ranked as the 3rd best cluster.
