Graph-based Multi-view Clustering for Continuous Pattern Mining


Master of Science in Computer Science June 2021

Graph-based Multi-view Clustering for Continuous Pattern Mining

Christoffer Åleskog

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden


The thesis is equivalent to 20 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Author(s):

Christoffer Åleskog

E-mail: chal16@student.bth.se

University advisor:

Prof. Veselka Boeva

Department of Computer Science

University co-advisor:

Vishnu Manasa Devagiri

Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Background. In many smart monitoring applications, such as smart healthcare, smart buildings, autonomous cars, etc., data are collected from multiple sources and contain information about different perspectives/views of the monitored phenomenon, physical object, or system. In addition, in many of those applications the availability of relevant labelled data is often low or even non-existent. Inspired by this, in this thesis study we propose a novel algorithm for multi-view stream clustering. The algorithm can be applied for continuous pattern mining and labeling of streaming data.

Objectives. The main objective of this thesis is to develop and implement a novel multi-view stream clustering algorithm. In addition, the potential of the proposed algorithm is studied and evaluated on two datasets: synthetic and real-world. The conducted experiments study the new algorithm's performance compared to a single-view clustering algorithm and to an algorithm without transferring knowledge between chunks. Finally, the obtained results are analyzed, discussed and interpreted.

Methods. Initially, we study the state-of-the-art multi-view (stream) clustering algorithms. Then we develop our multi-view clustering algorithm for streaming data by implementing a transfer-of-knowledge feature. We present and explain the developed algorithm in detail, motivating each choice made during the algorithm design phase. Finally, the algorithm configuration, the experimental setup and the datasets chosen for the experiments are presented and motivated.

Results. Different configurations of the proposed algorithm have been studied and evaluated under different experimental scenarios on two different datasets: synthetic and real-world. The proposed multi-view clustering algorithm has demonstrated higher performance on the synthetic data than on the real-world dataset. This is mainly due to the comparatively low quality of the used real-world data.

Conclusions. The proposed algorithm has demonstrated higher performance results on the synthetic dataset than on the real-world dataset. It can generate high-quality clustering solutions with respect to the used evaluation metrics. In addition, the transfer-of-knowledge feature has been shown to have a positive effect on the algorithm's performance. A further study of the proposed algorithm on other, richer and more suitable datasets, e.g., data collected from numerous sensors used for monitoring some phenomenon, is planned as future work.

Keywords: Machine Learning, Unsupervised Learning, Multi-view Clustering, Data Stream Mining, Pattern Mining



Acknowledgments

I would like to thank my two supervisors, Professor Veselka Boeva and Vishnu Manasa Devagiri, for helping to formulate and conceptualize the proposed algorithm and for being patient with my questions, knowing that I am fairly new to this field of research.

Lastly, I would like to thank my family, who supported me emotionally while I was working on this thesis and encouraged me to continue when the results were not as expected.



Contents

Abstract i

Acknowledgments iii

1 Introduction 1

1.1 Background . . . 1

1.1.1 Cut Clustering Algorithm . . . 2

1.1.2 Convex Non-negative Matrix Factorization . . . 3

1.1.3 Evaluation Metrics . . . 4

1.2 Problem Description and Motivation . . . 5

1.3 Aim and Objectives . . . 5

1.3.1 Objectives . . . 5

1.3.2 Research Questions . . . 6

1.4 Contribution . . . 6

1.5 Outline . . . 7

2 Related Work 9

2.1 Multi-view stream clustering algorithms . . . 9

2.2 Multi-view clustering algorithms . . . 10

2.3 Single-view stream clustering algorithms . . . 10

2.4 Graph-based clustering algorithms . . . 11

3 Method 13

3.1 Proposed MVCC algorithm . . . 13

3.1.1 Cluster views . . . 14

3.1.2 Evaluate views’ clustering results . . . 15

3.1.3 Consensus matrix . . . 16

3.1.4 Cluster the consensus matrix . . . 17

3.1.5 Transfer of knowledge . . . 19

3.1.6 Labeling original data . . . 19

3.2 Computational Complexity analysis . . . 21

3.3 Experiment setup . . . 22

3.3.1 MVCC implementation . . . 24

4 Results and Discussion 25

4.1 Configuration of MVCC . . . 25

4.2 Answering RQs . . . 28

4.2.1 RQ1 . . . 28


4.3 Discussion of MVCC . . . 32

5 Conclusions and Future Work 35

A Abbreviations 43

B Algorithms 45

C Different thresholds for Cover-Type 47



List of Figures

1.1 An example of the CC algorithm. (a) A complete graph with the artificial node A. The thickness of the edges determines their weights. (b) The Minimum Cut Tree found from the graph in Figure 1.1a. (c) The clusters produced when the artificial node is removed from the graph in Figure 1.1b. . . 2

3.1 High-level overview of MVCC for the three chunks A, B, and C. . . 14

3.2 Overview of the core MVCC algorithm for one chunk. . . 14

3.3 One column in the consensus matrix, where one medoid (Column C1) is built up with features from all views that correspond to that data point. . . 17

3.4 Example of how Gt could look before and after being formatted for PL. (a) Before formatting of Gt. (b) Gt formatted for usage in PL. . . 20

3.5 Example of how patterns Pt, created from Gt, could look after empty sets have been added to the formatted Gt. (a) Patterns Pt after empty sets have been added to Gt. (b) The expanded form of the first pattern in 3.5a. . . 20

4.1 MVCC with BN and LE on the Cover-Type dataset with a threshold of -0.2, where the x axis is the chunks and the y axis is the ARI. . . 26

4.2 MVCC with BN and LE on the Cover-Type dataset with a threshold of -0.2, showing the number of clusters found in the global model and the first view, where the black line indicates the correct number of clusters for each chunk. (a) Number of clusters found in the first view (view 0) for both BN and LE. (b) Number of clusters found in the global model for both BN and LE. . . 26

4.3 Standard configuration of MVCC on the Cover-Type dataset with the threshold set to -0.2. Shows SC over time in each local model. . . 27

4.4 STC on Cover-Type with and without transfer of knowledge between data chunks. . . 29

4.5 STC on Cover-Type presenting the number of clusters found in the global model. . . 30

4.6 STC on Cover-Type with 4 views and one view. . . 31

4.7 STC on Cover-Type presenting the number of clusters found in the global model. . . 32

C.2 Standard configuration of MVCC on the Cover-Type dataset with the threshold set to 0.2. Shows SC over time in each local model. . . 47

C.3 Standard configuration of MVCC on the Cover-Type dataset with the threshold set to 0.4. Shows SC over time in each local model. . . 48

C.4 Standard configuration of MVCC on the Cover-Type dataset with the threshold set to 0.4. Shows average ARI over time. . . 48


List of Tables

3.1 Different datasets used in the experiments. . . 24

4.1 Metrics results on Dim32 with the two labeling methods, PL and CNMF. MVCC executed with the threshold parameter set to 0.0, with the LE method. The dataset is divided into two chunks, the first with a size of 614 and the second of 410 instances. . . 25

4.2 SC values for the experiment with Dim32, with the CNMF and LE methods. . . 27

4.3 Metrics of experiments on Cover-Type with different chunk sizes. . . 28

4.4 STC of MVCC for the respective datasets. . . 28

4.5 STC results on Cover-Type with and without transfer of knowledge between data chunks. . . 30

4.6 STC results on Cover-Type with 4 and 1 views. . . 31

A.1 All abbreviations in the thesis. . . 43



List of Algorithms

1 Kruskal’s algorithm . . . 15

2 Boundary Nodes . . . 17

3 Longest Edges . . . 18

4 Pattern-Labeling algorithm . . . 20

5 Map data points in V to medoids in the views . . . 45

6 Decision if it matches a pattern or not . . . 45

7 Matching a pattern to a data point . . . 46



Chapter 1

Introduction

In today's society, data is the core building block used by many applications and companies. As described in the book "Data Mining" by Witten et al. [53], raw data is interpreted and analyzed to yield hidden or unknown information. Specifically, large datasets are explored to find interesting, unexpected, or valuable structures or patterns. These can then generate new insights, support better decisions, and help predict future trends [53]. This field of study, known as Data Mining, has been studied intensively in the past decades [25][53].

A field in Data Mining entitled Clustering (or Classification) is the process of grouping or classifying data into different groups based on similarity, correlation, or other statistical means, with the help of algorithms and techniques [29]. Traditional techniques cluster whole datasets, implying that all data in a dataset needs to be kept in memory to cluster it correctly, which, as can be expected, causes problems with larger datasets. A solution is to stream the data into the clustering algorithm in blocks, also known as "chunks" [22]. This way, the clustering algorithms do not need to keep all data in memory simultaneously.

Data in many applied domains in the real world is often composed of data from different sources that describe the same physical object, phenomenon, or concept [54]. For instance, an image is composed of different features; pictures on the web have tags and descriptions associated with them; news comes from multiple news organizations; sensor information is composed of frequency and time. These instances are all part of a subset of Big Data known as Multi-view data or Multi-source data [5]. A view could be the same data seen from a different perspective, hence multi-view, or from a different type of sensor, hence multi-source. These types of data exhibit heterogeneous properties, but connections between these properties can still be found. Hence, improvements to traditional clustering algorithms have been made to cluster multi-view data.

A novel multi-view stream clustering algorithm, based on Minimum Spanning Trees and Pattern Mining, is proposed in this work. The proposed algorithm is evaluated on two different datasets: a synthetic dataset and a real-world dataset. The obtained results show high-quality clustering solutions on the synthetic dataset and less satisfactory solutions on the real-world dataset.

1.1 Background

In Data Mining, the process of finding interesting or useful information usually has to do with clustering or classifying data points into different clusters or groups.



The field of clustering has many different algorithms for clustering data into groups based on different criteria, e.g., Support Vector clustering [3], k-means clustering [36], Mean-Shift clustering [55], and DBSCAN [17], to name a few. All the mentioned algorithms cluster unlabeled data; this type of clustering is known in the field as unsupervised learning.

In unsupervised learning, the goal is to cluster data without knowing which clus- ters each data point belongs to. There are variations of unsupervised learning where a small number of the data points are labeled and can guide the clustering in the right direction, termed semi-supervised learning [56]. Note that in this thesis, the focus is on unsupervised learning techniques suitable for streaming multi-view scenarios.

As mentioned before, there are many different algorithms for clustering unlabeled data, but two algorithms are in focus in this thesis. The first is a graph-based algorithm [6], specifically based on Minimum Spanning Trees (MST), entitled Cut Clustering [18]. A more detailed description of it is provided in Section 1.1.1.

The second algorithm is based on matrix factorization, described in more detail in section 1.1.2.

1.1.1 Cut Clustering Algorithm

The Cut Clustering (CC) algorithm is an MST-based algorithm described by Flake et al. [18], and it has been used in other clustering algorithms [21][44]. As mentioned in the paper by Flake et al., it works by incorporating an artificial node in the graph that is connected to all other nodes, with a specified α value as the weight of the edge between them. All other nodes are also connected with each other, but use the distance between them as the weights of the edges. Therefore, a complete graph is created from the data points and the artificial node, with a specified α. In this thesis, the Euclidean distance is used as the distance function between nodes.

To cluster the data points (non-artificial nodes), a Minimum Cut Tree is found for the complete graph, and the artificial node is removed. Thereby the tree has been divided into sub-trees that can be perceived as clusters of the nodes. These three steps of the algorithm can be observed from Figure 1.1.

Figure 1.1: An example of the CC algorithm. (a) A complete graph with the artificial node A. The thickness of the edges determines their weights. (b) The Minimum Cut Tree found from the graph in Figure 1.1a. (c) The clusters produced when the artificial node is removed from the graph in Figure 1.1b.
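To make these steps concrete, the following is a minimal Python sketch of the CC procedure, assuming NetworkX's Gomory–Hu tree routine as the minimum cut tree computation; the function name cut_clustering and the data layout (one data point per row of X) are illustrative choices rather than the thesis implementation.

```python
from itertools import combinations

import networkx as nx
import numpy as np


def cut_clustering(X, alpha):
    """Sketch of the CC algorithm: cluster the rows of X for a given alpha."""
    n = len(X)
    G = nx.Graph()
    # Complete graph over the data points, weighted by Euclidean distance.
    for i, j in combinations(range(n), 2):
        G.add_edge(i, j, weight=float(np.linalg.norm(X[i] - X[j])))
    # Artificial node connected to every data point with weight alpha.
    artificial = "A"
    for i in range(n):
        G.add_edge(artificial, i, weight=alpha)
    # Minimum cut tree (Gomory-Hu tree); removing the artificial node
    # splits it into subtrees, which are the clusters.
    tree = nx.gomory_hu_tree(G, capacity="weight")
    tree.remove_node(artificial)
    return [sorted(component) for component in nx.connected_components(tree)]
```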


Alpha value

One parameter is used in the CC algorithm, known as α. The value of this parameter has an impact on how many clusters the algorithm will find. In essence, the value of α determines the weights of all edges from the artificial node and therefore affects the shape of the Minimum Cut Tree found for the complete graph. As can be expected, there are many ways to decide which value α should take, two of which are described in the papers [18] and [9].

The first method decides the α value by using a binary search algorithm, executing the clustering for each candidate value and choosing the most stable α value.

The second method is based on an equation for calculating the mean α value from the data points that build up the graph, calculated by the formula:

$$\alpha = \frac{1}{m} \sum_{i=1}^{m} \sum_{j \neq i}^{m} \frac{w_{ij}}{\deg(v_i)}, \qquad (1.1)$$

where m is the number of nodes in the graph, w_ij is the weight of the edge connecting nodes i and j, and deg(v_i) is the degree of node i, i.e., the number of edges connected to it.
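As a concrete illustration, a minimal Python sketch of Equation 1.1 is given below. It assumes the complete graph is represented by a symmetric weight matrix W (an illustrative data layout, not necessarily the thesis implementation) and uses the fact that every node of a complete graph on m nodes has degree m − 1.

```python
import numpy as np


def mean_alpha(W):
    """Mean alpha value (Eq. 1.1) from a symmetric weight matrix W of the
    complete graph over the data points; diagonal entries are ignored."""
    m = W.shape[0]
    row_sums = W.sum(axis=1) - np.diag(W)  # sum of w_ij over all j != i
    degree = m - 1                         # degree of every node in a complete graph
    return float((row_sums / degree).mean())
```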

1.1.2 Convex Non-negative Matrix Factorization

Traditional Non-negative Matrix Factorization (NMF) [31] approximates a non-negative matrix X ∈ R^{n×m}_+ by two non-negative matrices W ∈ R^{n×k}_+ and H ∈ R^{m×k}_+, resulting in a minimization problem with the objective X_+ ≈ W_+ H_+^T. Formerly used to save space in memory, the algorithm can be used for clustering according to Ding et al. [13] and is equivalent to k-means if the constraint H H^T = I is additionally imposed on the objective function. Therefore, when NMF is used for clustering, k is the specified number of clusters to be found, the columns of W hold the centroids of the clusters, and the position of the maximum value in the row of H corresponding to a data point indicates which cluster that data point belongs to.

Convex Non-negative Matrix Factorization (CNMF) [14] is a modification of the traditional NMF that results in better centroids in W. Unlike NMF, CNMF approximates a mixed-sign data matrix X by a matrix W and a non-negative matrix H. W is further divided into a mixed-sign data matrix S and a non-negative matrix L, where L is the labeling matrix and S is a data matrix. An extensive explanation of S and how it can be used for missing data can be found in [24]. However, since missing data is not considered in this thesis, S = X. The factorization then takes the form F = X_± L_+ and X_± ≈ F H_+^T. CNMF imposes the constraint ||L||_1 = 1 to lift the scale indeterminacy between L and H, so that F_i is precisely a convex combination of the elements of X. The approximation is quantified by a cost function constructed from distance measures; often the square of the Euclidean distance, also known as the Frobenius norm, is used [38]. The objective function is therefore as follows:

$$\min_{L,H} \; \|X - XLH^{T}\|_F^2 \quad \text{s.t.} \quad X \ge 0,\; L \ge 0,\; H \ge 0,\; \|L\|_1 = 1, \qquad (1.2)$$


where X ∈ R^{n×m} is the data to be clustered, L ∈ R^{m×k} and H ∈ R^{m×k} are non-negative matrices, n is the dimensionality of the feature space, m is the number of data points, and k is the dimension to reduce to. The symbols ||·||_F^2 and ||·||_1 denote the squared Frobenius norm and the Manhattan norm [11] of their argument, respectively.
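For intuition about how a factorization yields cluster labels, the sketch below uses plain NMF from scikit-learn; this is an illustration of the NMF-as-clustering idea from [13], not the CNMF variant defined above, it requires non-negative data, and it follows scikit-learn's convention of one data point per row (the transpose of the notation used here). The function name nmf_cluster is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import NMF


def nmf_cluster(X, k, seed=0):
    """Cluster the rows of a non-negative matrix X into k groups with plain NMF."""
    model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=seed)
    W = model.fit_transform(X)      # (n_samples, k): soft cluster memberships
    H = model.components_           # (k, n_features): cluster "centroid" vectors
    labels = np.argmax(W, axis=1)   # hard assignment: strongest component per data point
    return labels, H
```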

1.1.3 Evaluation Metrics

A variety of metrics can be used for the evaluation of clustering algorithms. In this work, four metrics are considered for evaluating the clustering solutions generated by the proposed multi-view stream clustering algorithm: Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), Homogeneity, and Completeness. These four were chosen to evaluate different characteristics of the produced clustering solutions, allowing the evaluation to be more realistic.

The four metrics are described below.

Adjusted Rand Index

The ARI is used for evaluating a clustering solution and is adjusted for chance [28]. It works by calculating the Rand Index (RI) [41], i.e., computing the similarity between two clustering solutions by counting pairs of samples assigned to the same or to different clusters, essentially calculating the number of agreeing pairs divided by the total, and then adjusting it for chance, as shown in Equation 1.3. The ARI has a value of 1 if the clusterings are identical and a value close to 0 for random labeling, independently of the number of clusters and samples.

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}, \qquad (1.3)$$

where n_ij is the number of samples assigned to cluster i in one solution and to cluster j in the other, a_i and b_j are the corresponding row and column sums of that contingency table, and n is the total number of samples. The sum over the n_ij terms plays the role of the RI, the subtracted term is the expected RI under random labeling, and the first term of the denominator is the maximum value the RI can take.

Adjusted Mutual Information

Like ARI, the AMI score is a chance-adjusted version of the Mutual Information (MI) [47]. It is also a measure of similarity between two clustering solutions, and the adjustment is analogous: the RI is replaced by the MI, and the maximum RI is replaced by the maximum of the entropies of the predicted and true labels. The calculation of the expected MI is rather involved and is therefore not shown here; a detailed description of how it works is given in [47]. AMI scores a clustering solution from 0 to 1, with 1 being a perfect match and 0 corresponding to random partitions.

Homogeneity and Completeness

Homogeneity and Completeness are two further metrics for evaluating a given clustering solution, both components of the V-Measure. As described in the


paper by Rosenberg [42], they are defined as, "A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster." (p. 411). As with the expected MI, the calculations for both of these metrics are complex and, therefore, will not be shown in this section. A detailed explanation of how they work can be found in [42].
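All four metrics are available in scikit-learn, which is also used in the implementation (see Section 3.1.2). The snippet below is a small illustration of how the scores could be computed; the function name external_scores is an illustrative choice.

```python
from sklearn import metrics


def external_scores(true_labels, predicted_labels):
    """Compute the four external evaluation metrics used in this thesis."""
    return {
        "ARI": metrics.adjusted_rand_score(true_labels, predicted_labels),
        "AMI": metrics.adjusted_mutual_info_score(true_labels, predicted_labels),
        "Homogeneity": metrics.homogeneity_score(true_labels, predicted_labels),
        "Completeness": metrics.completeness_score(true_labels, predicted_labels),
    }
```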

1.2 Problem Description and Motivation

Multi-view clustering (MVC) and clustering of streaming data have been studied intensively over the past decades. However, the combination of both, i.e., multi-view clustering of streaming data, is quite a new field of study; it has only been studied in the past half-decade [27][12][45]. During this time many different methods to cluster such data have been proposed, e.g., generating Support Vectors to cluster different views in a common kernel space (global model) [27], creating a global model and updating it with each new chunk [12], and multi-view NMF-based algorithms that update a consensus matrix at each chunk [45]. Knowledge of the views' correlation is imparted into the global model and often transferred to the subsequent chunks' local models.

The transferred knowledge represents the patterns extracted about the correlation among the different views of the current data chunk, which are then used in the subsequent data chunk. Here, the data points have a correlation between views because different views describe the same object. To illustrate this, an example is given below:

Example 1.1. Assume we have a scenario in which data about a studied plant/flower are periodically collected and arrive. In addition, the gathered data contain information about two different views, each describing a flower by using different features.

The first view describes the length of the petals and the stem, while the second view describes the color of the petals and the stem. In this scenario, we know that there is some connection between the color and the length because they represent the same object. We want to extract this knowledge about the two views' correlation and incorporate it in the analysis of the subsequent data chunk.

Therefore, the focus of this thesis is to develop a novel multi-view stream clustering algorithm that retains the correlations among the views from the previous data chunk in the next data chunk. Continuous pattern mining and MST form the core part of the novel algorithm.

1.3 Aim and Objectives

This thesis aims to develop a new MVC algorithm for streaming data. The algorithm will incorporate the knowledge extracted from the global model in the current data chunk when calculating the local models in the next data chunk.

1.3.1 Objectives


• Implement a new multi-view stream clustering algorithm, with and without transfer of knowledge between global and local models for different data chunks.

• Evaluate and compare the implemented algorithm against a single-view stream clustering algorithm and a multi-view stream clustering algorithm without transfer of knowledge.

• Analyze the evaluation results produced by the developed multi-view stream clustering algorithm and interpret the impact of views’ correlations (extracted multi-view patterns) on the clustering results.

1.3.2 Research Questions

The research questions which are to be answered in the thesis are as follows:

RQ1: How can the knowledge (extracted multi-view patterns) discovered by the global model built at the current data chunk be used to improve the local clustering models that are due to be generated on the next data chunk?

Motivation: To present the reason why the proposed algorithm transfers the discovered knowledge (views' correlations) between chunks.

RQ2: How does the newly discovered knowledge improve the performance of the multi-view clustering algorithm in comparison with the algorithm that does not use this knowledge?

Motivation: To show that the proposed algorithm improves the clustering results when the transfer of knowledge between chunks is used.

RQ3: Is there a difference in clustering results for the implemented algorithm and a single-view clustering algorithm?

Motivation: Demonstrate that the multi-view approach improves the understandability of the generated clustering results by discovering multi- view patterns, which are at a higher level of abstraction and cannot be identified by the single-view clustering algorithms.

1.4 Contribution

This thesis has developed a novel MVC algorithm for streaming data based on an MST clustering algorithm, specifically the CC algorithm. It takes the views' correlations of the previous data chunk into consideration when building the next data chunk's local models. The main contributions of this thesis study are listed below:

• A novel MST-based multi-view clustering algorithm that can be applied for continuous pattern mining of streaming data.

• A streaming version of the MST-based (single-view) clustering algorithm that can be applied for analysis of streaming data scenarios.



• A labeling method that uses the multi-view patterns extracted by the novel MVC algorithm for labelling the data points.

1.5 Outline

The remainder of the thesis is divided into four chapters: Related Work in Chapter 2, Method in Chapter 3, Results and Discussion in Chapter 4, and Conclusions and Future Work in Chapter 5. Chapter 2, Related Work, presents works related to the proposed MVC algorithm, covering both multi-view and single-view clustering methods in a streaming-data context. Chapter 3, Method, explains the proposed MVC algorithm in detail. In addition, a computational complexity analysis of the algorithm is presented and the setup for the experiments is described. Chapter 4, Results and Discussion, presents the obtained results, which are discussed, analysed and interpreted. The last chapter, Conclusions and Future Work, concludes the thesis and presents future plans in this field of study.


Chapter 2

Related Work

For easier understanding and structure, this chapter is divided into four sections: Section 2.1 presents state-of-the-art multi-view stream clustering algorithms; Section 2.2 gives a short overview of MVC algorithms; Section 2.3 covers some of the notable single-view stream clustering algorithms; and lastly, Section 2.4 presents two graph-based single-view clustering algorithms.

2.1 Multi-view stream clustering algorithms

As stated in Section 1.2, the research field combining MVC and streaming data is relatively new. Therefore, it is prudent to relate to the research papers that first came up with these ideas.

Shao et al. [45] is seen as one of the first articles that mentioned this idea of sectioning data into different chunks for multi-view clustering. They made use of NMF [31], which under certain constraints can be used for clustering, as described in [13]. Their version of the NMF algorithm was combined with itself to create a new minimization problem into which different views could be fed and from which a consensus matrix describing the clustering structure could be generated. They divided the views into chunks, and for each new chunk, their algorithm generates a new consensus matrix using the information from the consensus matrix of the previous chunk.

Huang et al. [27] have developed a new algorithm in this field of study, entitled MVStream, similar to Shao's but based on the Support Vector Clustering (SVC) [3] algorithm, adjusted to unsupervised data with the Position Regularized Support Vector Domain Description (PSVDD) [50]. It works by combining and transforming all data points from the different views into a common kernel space (global model) and, from this space, finding the common Support Vectors for all views. These Support Vectors are transformed back to each view's space, resulting in the contours of different arbitrary clusters. The algorithm also transfers knowledge between chunks by incorporating the previous chunk's Support Vectors as data points in the current chunk's views, thereby retaining the views' correlations from the previous chunks.

Devagiri et al. [12] propose one of the newer algorithms in this field. They make use of one of their previous clustering algorithms, entitled Split-Merge Clustering [8].

The MVC version of it works by updating the local models in each chunk with the Split-Merge Clustering algorithm, using the previous chunk's updated local models and the current chunk's local models. Then the updated local models are integrated into a global model with Formal Concept Analysis (FCA) [32]. This produces a global model for each chunk, where later chunks' global models incorporate the previous chunks' knowledge of the views' correlations.

2.2 Multi-view clustering algorithms

MVC has been intensively studied and has produced many different kinds of algorithms and concepts, as seen in the survey by Yang and Wang [54]. These are algorithms that divide data into different views but still hold all the data in memory.

Many different kinds of MVC algorithms that are based on the NMF algorithm have been developed [33][37][51][2]. One of these is by Liu et al. [33], who combine the objective functions (minimization problems) of the different views and add a consensus part to them. Their paper proposes a novel way to use the l1 normalization to handle multiple views with NMF by imposing a new constraint on the objective function.

Peng et al. [40] propose a novel MVC algorithm without parameter selection, entitled COMIC. The proposed algorithm projects data points into a space where two properties, geometric consistency and cluster assignment consistency, are satisfied, enforcing a view consensus on a connection graph and thereby clustering the data without any parameters.

Bendechache and Kechadi [4] propose a distributed clustering algorithm with k-means. It clusters each view with the k-means algorithm and merges overlapping clusters of different views. This is done multiple times with the same data until a specified level has been reached. This level is one of the algorithm's parameters, along with the number of clusters for each view used with the k-means algorithm.

2.3 Single-view stream clustering algorithms

As seen in the article by Ghesmoune et al. [20] on clustering data streams, many different algorithms have been developed over the years, some of which are described in this section.

Cao et al. [10] propose an iterative clustering algorithm with NMF, entitled ONMF. It uses the property from [13] that NMF is equivalent to k-means if the orthogonality constraint HH^T = I is included in the minimization problem. Temporally changing data streams are divided into chunks, each chunk is clustered with the orthogonal NMF, and the results are propagated to the next chunk's calculations to retain the knowledge found in the previous chunks. This process continues until all data has been clustered.

Wang et al. [49] proposed an algorithm entitled SVStream, the predecessor of MVStream; both cluster data with SVC. The difference from its successor, apart from the multi-view aspect, is how SVStream merges the spheres created by the current and next data chunks. SVStream updates the kernel space, in which the sphere lies, after each new data chunk has been clustered. It removes old Support Vectors that could affect the clustering, and merges and updates the spheres accordingly, thereby transferring knowledge to the next data chunk.



2.4 Graph-based clustering algorithms

Because MST is used in the novel algorithm described in this thesis, it is prudent to mention some of its previous uses.

As described in Section 1.1.1, Flake et al. [18] have proposed an MST-based algorithm for clustering data. The data is converted into a complete undirected graph, and an artificial node is added to it. When the artificial node is removed after the minimum cut tree is found, the data has been divided into clusters.

Lv et al. [35] proposes another MST-based clustering algorithm, entitled CciMST.

CciMST starts by finding an MST and calculating pairwise Euclidean and geodesic distances between all data points. The graph and distances are then used for finding the cluster centers based on the density and variances of the data points in the graph. The algorithm ends by determining which edges in the graph should be cut, producing and comparing two results and choosing the one where the clusters have a bigger gap between them.


Chapter 3

Method

In this chapter, the proposed algorithm, henceforth referred to as Multi-view Cut-Clustering (MVCC), is described in detail, including the justification of decisions made during its development. In addition, a computational complexity analysis is presented, and the steps taken to answer the research questions stated in Section 1.3 are discussed. All abbreviations in this thesis are presented in Table A.1 in Appendix A.

MVCC is an algorithm designed to be executed sequentially for each new data chunk, as seen in the high-level overview given in Figure 3.1. As seen in the figure, the algorithm takes a data chunk with a pre-specified size from a stream of data and incorporates the knowledge found from the previous chunk to cluster the current chunk. Summary statistics are extracted from each data chunk, and a part of this extracted knowledge is transferred to the next chunk. The core algorithm can be divided into six different stages; they are as follows:

1. Cluster views, i.e. build local clustering models.

2. Evaluate views’ clustering solutions (local models).

3. Create a consensus (integrated) matrix from the approved views.

4. Cluster the consensus matrix, i.e. build a global model.

5. Calculate artificial nodes (knowledge) and forward it to the next chunk.

6. Label the original data.

These stages are executed for each data chunk and are described in more detail later in this section. An overview of the core algorithm is shown in Figure 3.2.

3.1 Proposed MVCC algorithm

In this subsection, all six stages of the proposed MVCC algorithm are described in detail, starting with stage 1, where each view is clustered with the CC algorithm [18].



Figure 3.1: High-level overview of MVCC for the three chunks A, B, and C.

Figure 3.2: Overview of the core MVCC algorithm for one chunk.

3.1.1 Cluster views

Given a data chunk D_t of n data points, divided into V views, namely D_t = {X_i^{(v)}; v = 1, 2, ..., V, i = 1, 2, ..., n}, where t is the index of the current chunk, starting at t = 1. For each view in D_1, add an artificial data point l, and create, from each view, a complete undirected graph G^{(v)}. The weights of the edges between nodes in the graphs are the Euclidean distances between the two nodes. Using the Manhattan distance could potentially improve the algorithm for data in higher dimensions [1], although this is not considered in this thesis; the potential of the Manhattan distance may be studied in future work on this algorithm.

The artificial node l^{(v)} does not have a meaningful coordinate, at least for the first chunk; hence the α value needs to be used as the distance between the artificial node and all nodes connected to it. For each artificial node the mean α value is calculated with Equation 1.1. This is chosen based on the reasonably good results reported in [9]. Note that this is only used in the first chunk; later chunks will not use this parameter, as described later in this section.

To find an MST of the complete graphs, many different algorithms exist. In this thesis, Kruskal's algorithm is used [30]. Applied to the graphs G^{(v)}, it works by sorting all edges and choosing the shortest edge. The two nodes connected by the shortest edge are combined into a tree, and the process repeats by finding a new shortest edge that does not connect nodes in the same tree. These steps are continued until an MST is found.

Pseudo-code for Kruskal’s algorithm is shown in Algorithm 1.

Algorithm 1 Kruskal's algorithm

1: procedure Kruskal(G)
2:   F := ∅
3:   for v ∈ G.V do
4:     MAKE-SET(v)
5:   for (u, v) ∈ sorted(G.E), with weight(u, v) increasing do
6:     if SET(u) ≠ SET(v) then
7:       F := F ∪ {(u, v)}
8:       UNION(SET(u), SET(v))
9:   return F
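In a Python implementation this step does not have to be hand-coded: NetworkX, which is used in this work, provides Kruskal's algorithm directly. The wrapper below is a small illustrative sketch, assuming the graph stores its distances in a "weight" edge attribute.

```python
import networkx as nx


def kruskal_mst(G):
    """Minimum spanning tree of a weighted nx.Graph via Kruskal's algorithm."""
    return nx.minimum_spanning_tree(G, weight="weight", algorithm="kruskal")
```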

As stated in Section 1.1.1, the CC algorithm works by removing the artificial node after an MST has been found. For MVCC this means that each artificial node l^{(v)}, for each view v, is removed from the MST, producing the cluster solutions (views' local models) C_t = {c_g^{(v)}; g = 1, 2, ..., m^{(v)}, v = 1, 2, ..., V}, where t is the chunk number, V is the number of views, and m^{(v)} is the number of clusters in view v. Here c_g^{(v)} is the cluster solution for view v, namely c_g^{(v)} = {X_L^{(v)}; L = 1, 2, ..., h^{(v)}}, where the subscript of X^{(v)} denotes the label of the cluster as seen from view v's perspective, making h^{(v)} the maximum label for view v. For later calculations the labels of view v will be denoted as L^{(v)}, with the same structure as C_t, i.e., L_{g,i}^{(v)} is the label of data point i in cluster g and view v.

For subsequent stages in the algorithm, namely the creation of the consensus matrix, C_t is used for calculating the medoid of each cluster. Let the data points be x_1, x_2, ..., x_n; for each cluster in c_g^{(v)} the medoid is defined as follows:

$$x_{\mathrm{medoid}} = \underset{x_j}{\arg\min} \sum_{i=1}^{n} \mathrm{euclidean}(x_j, x_i), \qquad (3.1)$$

i.e., the cluster member with the smallest total Euclidean distance to all other members. The indices of each data point and medoid in a cluster are also calculated for subsequent stages of the algorithm, structured the same as C_t. Since the Python [46] library NetworkX [23] is used in this implementation of MVCC, the indices are stored instead of the data points, denoted as I_t = {i_g^{(v)}; g = 1, 2, ..., m^{(v)}, v = 1, 2, ..., V}. A simple mapping of the data points to the corresponding indices is therefore used by letting c_g^{(v)} = X^{(v)}[i_g^{(v)}], and J_t denotes the indices of the medoids found when each medoid is calculated.
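A minimal sketch of the medoid computation in Equation 3.1 is shown below; it assumes the cluster is given as an array with one data point per row and uses SciPy for the pairwise Euclidean distances. The function name medoid_index is an illustrative choice.

```python
import numpy as np
from scipy.spatial.distance import cdist


def medoid_index(points):
    """Index of the medoid (Eq. 3.1): the cluster member with the smallest
    total Euclidean distance to all other members."""
    points = np.asarray(points)
    distances = cdist(points, points)          # pairwise Euclidean distances
    return int(distances.sum(axis=1).argmin())
```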

3.1.2 Evaluate views’ clustering results

As can be observed from Figure 3.2, not all views’ results from cluster solution Ct

are forwarded to create the consensus matrix. Each view’s clustering solution is


evaluated in order to be able to discard those views that have not demonstrated meaningful clustering solutions. Therefore, clustering solutions of views with low evaluation scores are not considered in the creation of the consensus matrix.

Each clustered view is evaluated using the Silhouette Score (SC) [43], with a threshold T, where the threshold is the only parameter for the proposed algorithm.

A view's clustering solution passes the evaluation criteria if its SC is higher than T. SC was chosen as the evaluation metric for the local models because it is a commonly used metric for evaluating clustering solutions. Furthermore, SC is an internal validation metric and does not need a ground truth to compare with in order to provide a validation of the consistency of the clustering solutions.

SC is a measure of how close the data points are to the predicted cluster (cohesion) compared with other clusters (separation). The SC of one sample is calculated by a combination of cohesion and the minimum separation as shown in the following equation:

$$\mathrm{cohesion}(i) = \frac{1}{|C_i| - 1}\sum_{j \in C_i,\, j \neq i}\mathrm{euclidean}(i, j),$$

$$\mathrm{separation}(i) = \min_{k \neq i}\;\frac{1}{|C_k|}\sum_{j \in C_k}\mathrm{euclidean}(i, j),$$

$$s(i) = \frac{\mathrm{separation}(i) - \mathrm{cohesion}(i)}{\max\{\mathrm{cohesion}(i), \mathrm{separation}(i)\}}, \quad \text{if } |C_i| > 1, \qquad (3.2)$$

where i and j are data points corresponding to X_i^{(v)} and X_j^{(v)}, C_i is the cluster data point i belongs to, and k ranges over the other predicted clusters. For multiple data points, the mean value of s(i) over all data points i is used. In essence, SC gives a value between -1 and 1, with 1 indicating that all data points belong to their predicted clusters and -1 that all data points belong to their neighboring clusters.

The SC is calculated with the Python library scikit-learn [39] in the implementation of MVCC. The library function uses the labels to know which cluster a data point belongs to. Therefore, to know what C_i should be, a simple mapping of the indices is used to retrieve the labels by letting L_g^{(v)}[i_g^{(v)}] = g.
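The evaluation step of stage 2 could then look roughly as follows. This is a sketch assuming each view's data and predicted labels are available as parallel lists (the names view_data and view_labels are hypothetical); it also guards against the degenerate case of a single cluster, for which the SC is undefined.

```python
from sklearn.metrics import silhouette_score


def approved_views(view_data, view_labels, threshold):
    """Indices of the views whose local clustering passes the SC threshold T."""
    approved = []
    for v, (X, labels) in enumerate(zip(view_data, view_labels)):
        # silhouette_score is only defined when there are at least two clusters
        if len(set(labels)) > 1 and silhouette_score(X, labels) > threshold:
            approved.append(v)
    return approved
```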

3.1.3 Consensus matrix

The consensus matrix is built by medoids of the approved views’ clustering solutions.

A medoid is seen as a summarization of a cluster, allowing for privacy of the data and for distributed clustering. The global model does not need to know what the data in each view are; only the medoid of each cluster is transferred, i.e., privacy is preserved. The proposed algorithm can be interpreted as distributed clustering because it clusters and evaluates several views in parallel.

The consensus matrix is built with data points from the views, where each column in the matrix is a data point (medoid). As seen from Figure 3.3, one medoid takes up one column, and each row is one feature; e.g., if C1 from Figure 3.3 is a medoid in view 1 (the gray area in Figure 3.3), the rest of the column (views 2 and 3) holds that data point's features from the other views. If a medoid is found in one view, the data points corresponding to that medoid in the other views are used to fill the column. Essentially, all features of a medoid (in this case C1) are placed in the same column. As seen in Figure 3.2, the third view does not have any medoids after stage 2 because the view did not pass the evaluation criteria. Knowing this, let V be the consensus matrix used in stage 4, constructed from X^{(v)}, where v ranges over the approved views, combined with the medoids and the indices corresponding to each medoid.

Figure 3.3: One column in the consensus matrix, where one medoid (Column C1) is built up with features from all views that correspond to that data point.
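A sketch of the consensus-matrix construction is given below. It assumes every view stores its data points in the same row order (so a medoid's row index is valid across views) and that medoid_indices[v] lists the row indices of the medoids found in view v; both names and the layout are hypothetical.

```python
import numpy as np


def build_consensus_matrix(view_data, medoid_indices, approved):
    """One column per medoid found in an approved view; each column stacks the
    corresponding data point's features from every approved view."""
    columns = []
    for v in approved:
        for idx in medoid_indices[v]:
            column = np.concatenate([view_data[u][idx] for u in approved])
            columns.append(column)
    return np.column_stack(columns)
```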

3.1.4 Cluster the consensus matrix

The consensus matrix is the combination of different medoids from all approved views, and for this matrix to be useful, information needs to be extracted from it.

One way to extract the needed information is to cluster the matrix, grouping clusters of different views together. In MVCC, this is achieved by using the CC algorithm. One problem arises from this approach, however: what value should α take? Note that the consensus matrix is very small, built from medoids, meaning the more views and features there are, the bigger the matrix. This implies that when the matrix is clustered, each cluster would often have only one medoid in it, representing one cluster from one view. The choice of α, as stated in Section 3.1.1, could be to calculate the mean α value. However, this solution depends more on the medoids themselves than on the results from each view. According to [9], the mean α is the average value α can take, often producing a wrong number of clusters when there is an imbalance of the weights between nodes, since it focuses on the overall average.

Therefore two different alternatives to using the α value are proposed in this thesis, entitled Boundary Nodes (BN) and Longest Edges (LE).

Both BN and LE make use of how the CC algorithm works. Instead of calculating α for an artificial node's edges, they create artificial nodes and calculate the edge weights with the Euclidean distance, as if they were normal nodes in the graph. The difference between BN and LE is how many artificial nodes are created and where.

Algorithm 2 Boundary Nodes

1: procedure BN(centroids)
2:   artificial_nodes := ∅
3:   combinations := all 2-length combinations of the centroids
4:   for u, v ∈ combinations do
5:     artificial_nodes ← Average(u, v)
6:   return artificial_nodes


BN uses the centroids of the clusters to calculate the artificial nodes. An artificial node is placed between each pair of centroids, resulting in artificial nodes between clusters. Pseudo-code of the BN algorithm is shown in Algorithm 2. This method has one small shortcoming: if three clusters' centroids are in a straight line, one of the resulting artificial nodes will be placed on the middle centroid, splitting the central cluster in two and producing sub-optimal results. This shortcoming may be fixed by removing artificial nodes that are too close to a centroid. Note that in practice, the probability of clusters' centroids lying in a straight line is very low, giving very similar results for BN with and without removing close nodes. Therefore, all artificial nodes are considered in the calculations in this thesis.

Note that BN needs centroids for calculating the artificial nodes. Therefore it cannot be used for the first data chunk. The mean α is used instead, similar to the clustering of the first chunk’s views. Subsequent chunks use the centroids calculated from the previous chunk’s global model’s clusters in the calculation of BN.
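A minimal sketch of BN is shown below; it simply places one artificial node at the midpoint of every pair of centroids, mirroring Algorithm 2.

```python
from itertools import combinations

import numpy as np


def boundary_nodes(centroids):
    """Boundary Nodes: one artificial node at the midpoint of every pair of
    cluster centroids (all 2-length combinations)."""
    centroids = np.asarray(centroids)
    return np.array([(u + v) / 2.0 for u, v in combinations(centroids, 2)])
```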

Algorithm 3 Longest Edges

1: procedure LE(V, nr_to_remove)
2:   artificial_nodes := ∅
3:   removed_edges := ∅
4:   G := complete graph of V
5:   MST := Kruskal(G)
6:   for e ∈ sorted(MST.E)[0 : nr_to_remove], with longest edge first do
7:     remove e from MST
8:     removed_edges ← e
9:   for (u, v) ∈ removed_edges do
10:    artificial_nodes ← Average(u, v)
11:  cluster_info := info(MST)   ▷ extract cluster solution of indices and centroids from MST
12:  return artificial_nodes, cluster_info

As the name suggests, LE uses the longest edges found in the MST generated from the consensus matrix. Artificial nodes are created in the middle of the longest edges. The question is then: how many edges should be used? Because each view is already clustered, the number of clusters found in the views can be used to determine this. The average number of clusters over the views is calculated (each medoid in the matrix represents one cluster), and that many edges are used for creating the artificial nodes.

Pseudo-code for LE is shown in Algorithm 3.
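A hedged sketch of LE is given below; it builds the complete graph and the MST with NetworkX, cuts the given number of longest MST edges, and returns the midpoints of the removed edges together with the resulting components. The signature and the graph construction are illustrative choices, not the thesis implementation.

```python
import networkx as nx
import numpy as np


def longest_edges(points, nr_to_remove):
    """Longest Edges: cut the nr_to_remove longest MST edges and place an
    artificial node at the midpoint of each removed edge; the remaining
    MST components are the global clusters."""
    points = np.asarray(points)
    G = nx.Graph()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            G.add_edge(i, j, weight=float(np.linalg.norm(points[i] - points[j])))
    mst = nx.minimum_spanning_tree(G, weight="weight")
    longest = sorted(mst.edges(data="weight"), key=lambda e: e[2], reverse=True)
    artificial_nodes = []
    for u, v, _ in longest[:nr_to_remove]:
        mst.remove_edge(u, v)
        artificial_nodes.append((points[u] + points[v]) / 2.0)
    clusters = [sorted(c) for c in nx.connected_components(mst)]
    return artificial_nodes, clusters
```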

As one can see in Algorithm 2, BN returns the artificial nodes used in the CC algorithm on the consensus matrix V . LE, as a variation of CC, does not need to run the CC algorithm afterward because V is clustered in it, as can be observed in Algorithm 3.

Let G_t be the cluster solution of indices after the clustering, and W_t the centroids of each cluster. Medoids are not used for W_t because the average gives a better representation of a cluster, as it is not forced to be one of the data points.



3.1.5 Transfer of knowledge

When each view is clustered with CC in the first stage, the α value is used only for the first chunk. Subsequent chunks calculate artificial nodes and transfer them to be used as the artificial nodes for CC, in the same way as BN and LE do. This allows knowledge, in the form of the artificial nodes, to affect the clustering of the next chunk's local models, essentially guiding the clustering of the next chunk to be similar to the previous chunk's results, while still allowing changes in the cluster solution because of changes in data points from chunk to chunk.

For both LE and BN, the artificial nodes they return are split into components corresponding to each view; e.g., features belonging to view 1 are extracted from the artificial nodes and sent to view 1 in the next chunk. This allows the CC algorithm to be seeded in each view. An exception for BN is the first chunk, due to how it is constructed: the data is clustered with CC first, and then BN is used on the clustering solution. For consecutive chunks, BN is used first to seed CC, similar to LE.

3.1.6 Labeling original data

There are two ways to label the original data. The first is to use NMF with a fixed X and W matrix, or X and F matrix in CNMF. Let F = W_t and X = concat(D_t), where concat(D_t) is the original dataset, not divided into different views. Using Equation 1.2, the position of the maximum value in the row of H corresponding to a data point gives that data point's label.

Note that the CNMF method is slow and not very accurate. CNMF initializes the two matrices L and H with random values, resulting in different approximations for each execution. To rectify this, the CNMF algorithm is executed multiple times, and the instance which gives the lowest Frobenius norm ||X − F H^T||_F is used. However, this increases the time needed for the labeling. Hence, a new algorithm for labeling the original data is proposed, entitled Pattern-Labeling (PL). PL finds cluster patterns in the global model and maps data points to the patterns they match, using the pattern's index as the label for all data points that match it.

Before the algorithm can be used, G_t needs to be formatted so that the algorithm can be executed correctly. G_t holds indices of the medoids in V, but PL needs the indices of the medoids as seen from each view. As seen in Figure 3.4, G_t is formatted and changed to accommodate the indices of each medoid in the view's data (the original dataset) and the views themselves. This is done because G_t can be seen as patterns found in the global model. Clusters in G_t are built from medoids of the views. By observing which medoids belong to a cluster in the global model and comparing them to the data points in each view, patterns can be found and used.

Therefore, by formatting it to fit the proposed method, data points can be mapped to one of the rows in the lists. Pseudo-code for the changes of G_t is shown in Algorithm 5 in Appendix B, making use of J_t, which holds the indices of each medoid in each cluster for all views.

As can be observed from Figure 3.4b, some views are missing from the formatted Gt. To rectify this, empty lists are placed where views with no medoid from the cluster in the global model could be found. Let Pt be the list of formatted patterns


Gt,0 = {0, 2, 3, 4}

Gt,1 = {1, 5, 6}

Gt,2 = {8, 11, 12}

(a) Original Gt.

Gt,0 = {(12, 0), (123, 0), (64, 2), (0, 2)}

Gt,1 = {(91, 0), (235, 1), (835, 2)}

Gt,2 = {(201, 1), (829, 2), (50, 2)}

(b) Formatted Gt

Figure 3.4: Example of how G_t could look before and after being formatted for PL. (a) Before formatting of G_t. (b) G_t formatted for usage in PL.

in G_t with the addition of empty lists, as shown in Figure 3.5. Note that the decision to have lists in the columns of P_t is for easier labeling. Figure 3.5b illustrates how patterns might look without lists in the columns, for the first cluster in the global model (the first row in P_t).

Pt,0 = [{12, 123}, ∅, {64, 0}]

Pt,1 = [{91}, {235}, {835}]

Pt,2 = [∅, {201}, {829, 50}]

(a) Formatted Gt

Pt,0,0 = [12, ∅, 64]

Pt,0,1 = [12, ∅, 0]

Pt,0,2 = [123, ∅, 64]

Pt,0,3 = [123, ∅, 0]

(b) All patterns in P_{t,0}; data points that match any of these patterns are assigned the label 0.

Figure 3.5: Example of how patterns P_t, created from G_t, could look after empty sets have been added to the formatted G_t. (a) Patterns P_t after empty sets have been added to G_t. (b) The expanded form of the first pattern in 3.5a.

The PL algorithm, whose pseudo-code is shown in Algorithm 4, needs a list of medoids for each data point. The list, entitled points_pattern in the pseudo-code, is created by mapping the medoids in a view to the corresponding data point: for each index in I_t^{(v,c)}, where v is the view and c is the cluster, add J_t^{(v,c)} to the list.

Algorithm 4 Pattern-Labeling algorithm

1: procedure Pattern_Labeling(index, checking, views)
2:   predicted := a list of all data points, each element initialized with −1
3:   for i in data point index do
4:     checking := a range of ordered numbers up to the number of views, starting from 0
5:     predicted[i] := check(i, checking, views)
6:   return predicted

The matching of a data point to a pattern works by comparing element by element, i.e., first compare the first element in points_pattern, corresponding to the first view for a specified data point, with the medoids in the first column of P_t. If it matches, move forward to the next view. This is repeated until all elements have been compared. The empty lists in Figure 3.4b correspond to what is referred to in this thesis as a wild card. Wild cards are skipped automatically, resulting in a match; e.g., for P_t[0], two of the three views are in the pattern. The middle view is skipped; therefore, a data point only needs to match the pattern's first and last view to match the whole pattern. This process of matching with a pattern, the function check in the pseudo-code for PL, is shown in Algorithms 6 and 7 in Appendix B.

If a data point does not match any of the patterns, it is seen as an outlier in the data. The definition of an outlier in this thesis is a data point that is hard to group with a cluster; it might belong to one cluster but is placed closer to another cluster.

One view might assign it to the correct cluster, and another view assigns the data point to another cluster. For this reason, the data point will not match any pattern and is considered an outlier, therefore assigned the label −1.
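A minimal sketch of the matching and labeling logic is given below, with patterns represented as in Figure 3.5a (one set of medoid indices per view, empty sets acting as wild cards) and points_pattern holding, for each data point, its medoid index per view; the function names are illustrative.

```python
def matches_pattern(point_medoids, pattern):
    """True if, for every view, the data point's medoid appears in the
    pattern's set for that view; empty sets act as wild cards."""
    for view_set, medoid in zip(pattern, point_medoids):
        if view_set and medoid not in view_set:
            return False
    return True


def pattern_labeling(points_pattern, patterns):
    """Assign each data point the index of the first pattern it matches;
    points that match no pattern are labelled -1 (outliers)."""
    labels = []
    for point_medoids in points_pattern:
        label = -1
        for p_idx, pattern in enumerate(patterns):
            if matches_pattern(point_medoids, pattern):
                label = p_idx
                break
        labels.append(label)
    return labels
```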

3.2 Computational Complexity analysis

The computational complexity of MVCC is hard to estimate due to the variety in views and the dimensionality of the data. However, going per stage of the core algorithm, the computational time complexity is as follows.

Cluster views: The clustering of the views has a computational time complexity of O(v·|D_t|² + v·|D_t| + v·|D_t|·log|D_t|), where |D_t| is the number of data points in data chunk t and v is the number of views. The first part, O(v·|D_t|²), is the creation of the medoids for all views. The second part, O(v·|D_t|), is the creation of the complete graphs for all views. For the third part, Kruskal's algorithm is O(E log V), with E edges and V nodes, making the calculation of the MSTs for all views O(v·|D_t|·log|D_t|).

Evaluate views' clustering results: If the pre-calculated distances are used effectively, SC has a computational time complexity of O(n²), meaning stage 2 of the algorithm has a computational time complexity of O(v·|D_t|²).

Consensus matrix: Stage 3 is harder to estimate; it depends on how the creation of the matrix is implemented and on the dimensionality of the data. For the implementation used in this thesis, the computational time complexity is O(a · v · Σ_{i=0}^{v} |C_t^{(i)}| · |D_t^{(i)}[0]|), where a is the number of approved views, |C_t^{(i)}| is the number of clusters for view i, and |D_t^{(i)}[0]| is the number of features for view i.

Cluster the consensus matrix: The clustering of the matrix V has a different time complexity depending on which of the two methods, BN or LE, is used.

• If BN, the computational time complexity is O(|W_t|·(|W_t| − 2) + |V_t| + |V_t|·log|V_t|), where |W_t| is the number of centroids found in V_t, and |V_t| is the number of data points in matrix V_t. Of these parts, |W_t|·(|W_t| − 2) is the time complexity of BN, |V_t| the creation of the complete graph, and |V_t|·log|V_t| the calculation of the MST.

• If LE, the computational time complexity is O(|V_t| + |V_t|·log|V_t| + avg_cluster_t), where avg_cluster_t is the average number of clusters found for chunk t.

Transfer of knowledge: The computational time complexity of stage 5 is O(v).

Labeling original data: The computational complexities of the two labeling methods, CNMF and PL, are as follows:

• CNMF: O(b · |D_t| · k · (|D_t|²·p + m·(2·|D_t|²·|W_t| + |D_t|·|W_t|²))), where p is the dimensionality of the original data, m is a rough estimate of the number of iterations until convergence, and b is the number of times the algorithm is run to obtain better results; for the implementation in this thesis, b = 100. The part |D_t|²·p + m·(2·|D_t|²·|W_t| + |D_t|·|W_t|²) is the computational complexity of the update equation for matrix H; multiplying by |D_t|·k is because that equation is for one element of the matrix.

• PL: O(Σ_{i=0}^{v} |I_t^{(i)}| + Σ_{i=0}^{|G_t|} |G_t^{(i)}| + |D_t| · Σ_{i=0}^{|G_t|} |G_t^{(i)}|), where |I_t^{(i)}| is the number of clusters in view i, |G_t| is the number of clusters in the global model, and |G_t^{(i)}| is the number of local clusters in global cluster i. The first part, Σ_{i=0}^{v} |I_t^{(i)}| + Σ_{i=0}^{|G_t|} |G_t^{(i)}|, is the mapping to the real medoids in each view. The second part, |D_t| · Σ_{i=0}^{|G_t|} |G_t^{(i)}|, is the matching of patterns for all data points.

From this, the total computational complexity of the proposed algorithm is estimated as follows for the configuration with the lowest computational time complexity (LE with PL):

LE and PL: O(2λ + v·λ·log λ + λ·v·|D_t| + 2λ·v + a·λ·v·f + 2·v·|D_t|² + v·|D_t|), where λ = Σ_{i=0}^{v} |I_t^{(i)}| and f is the number of features in dataset D.

3.3 Experiment setup

To answer the research questions stated in Section 1.3.2, two experiments are conducted. RQ1 is answered by explaining the MVCC algorithm; this is discussed further in Chapter 4. The two experiments described below answer the other two research questions.

Recall that RQ2 asks whether the knowledge transferred to the next chunk improves the clustering compared to an algorithm that does not use this knowledge. Due to how the MVCC algorithm's transfer-of-knowledge function works, it is possible to disable this feature in order to answer the question: instead of transferring the artificial nodes to the next chunk, an empty list is used, forcing the algorithm to perceive each new chunk as the first one again and to calculate the mean α value when clustering the local models. By doing this, MVCC can be compared against itself, with and without transfer of knowledge.

Because the compared algorithm is otherwise identical, comparing the two approaches isolates the effect of transferring knowledge between chunks. A minimal sketch of this toggle is shown below.
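The helper below only mirrors the behaviour just described; its name and arguments are hypothetical and are not part of the thesis code.

```python
def knowledge_to_transfer(artificial_nodes, transfer_enabled):
    """Hypothetical helper illustrating the toggle used in the RQ2 experiment.

    With transfer enabled, the artificial nodes found in the current chunk are
    handed to the next chunk; with it disabled, an empty list is passed instead,
    so the algorithm treats every chunk as the first one and falls back to the
    mean alpha value when clustering the local models.
    """
    return list(artificial_nodes) if transfer_enabled else []


# The same chunk-processing loop can then be run twice, once with
# transfer_enabled=True and once with False, and the results compared.
```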


The third research question, RQ3, asks whether there is a difference between MVCC and a single-view clustering algorithm. For the same reason as with the previous research question, MVCC is compared to itself, both because of the short time available for finding and preparing a single-view clustering algorithm to experiment with and for the sake of a fair comparison. By using one view in the MVCC algorithm, it emulates a single-view clustering algorithm. When MVCC uses one view, that view is clustered twice; this might be a problem for the comparison and is discussed in more detail in Chapter 4.

The datasets used in the two experiments are one real-world dataset and one synthetic dataset. They are as follows:

Forest Cover-Type dataset. Originating from the UCI Machine Learning Repository [15], this dataset is intended for predicting forest cover types [7]. It is a well-known dataset used in many machine learning papers [27][34][8]. It was created for classification tasks, with 581012 instances and 54 attributes divided into 7 clusters. Cover-Type is an unbalanced dataset, with 283301 instances in the biggest cluster and 2747 instances in the smallest.

Of the total 54 attributes, 44 are binary: 40 of these indicate soil types and the remaining 4 indicate wilderness areas. These 44 binary attributes make the data sparse, since only 2 of the 44 are set for any one instance; all others are 0.

Dim32. A synthetic dataset, Dim32 [19] is built to be easy to cluster, with each cluster well separated even in higher dimensions. It was initially used in [19] with varying numbers of attributes, from 32 at the lowest to 1024 at the highest. Dim32 has 1024 instances with 32 features, divided into 16 different Gaussian clusters.

Following [8], the 40 binary soil types in the Cover-Type dataset are not considered in the experiments due to their sparsity. Additionally, a sample set of 50000 instances of the Cover-Type dataset is used in the experiments and is divided into four views. This sample size was chosen based on the research by Boeva et al. [8], where a sample of 50000 instances was used to reduce the time needed to run the experiments. The first three attributes, Elevation, Aspect, and Slope, are assigned to the first view.

The next three attributes, together with the ninth, horizontal and vertical distance to the nearest surface water features, and horizontal distance to the nearest roadways and wildfire ignition points, are assigned to the second view. The third view includes attributes 6, 7, and 8 in the dataset, corresponding to the hillshade at 9 am, noon, and 3 pm, respectively. The last view is assigned the remaining four binary attributes, indicating the Rawah, Neota, Comanche Peak, and Cache la Poudre wilderness areas. A sketch of this view assignment is shown below.
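The following sketch partitions a Cover-Type sample into the four views with pandas. The column names follow the common UCI naming of the attributes and are an assumption, as is the file path; they are not taken from the thesis code.

```python
import pandas as pd

# 50000-instance sample of Cover-Type with the 40 soil-type columns already dropped.
df = pd.read_csv("covtype_sample.csv")  # illustrative path

views = [
    df[["Elevation", "Aspect", "Slope"]].values,
    df[["Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
        "Horizontal_Distance_To_Roadways", "Horizontal_Distance_To_Fire_Points"]].values,
    df[["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]].values,
    df[["Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3",
        "Wilderness_Area4"]].values,
]
```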

Dim32 has been interpreted as a two-view dataset for the experiments. This decision was based on the fact that the clusters in the dataset are well separated even in higher dimensions; therefore, more views would not have a significant impact on the results. The first 16 attributes are assigned to the first view and the rest to the second view.

A summary of the different datasets used in the experiments is shown in Table 3.1.


Dataset     Type         #Samples   #Attributes   #Clusters
CoverType   Real World   50000      14            7
Dim32       Synthetic    1024       32            16

Table 3.1: Different datasets used in the experiments.

The two methods mentioned in Section 3.1.4, BN and LE, and the two labeling methods mentioned in Section 3.1.6, CNMF and PL, are experimented with mostly on the Dim32 dataset. Only experiments with LE and PL have been conducted on Cover-Type, due to the size of the Cover-Type dataset. The best-performing version of MVCC is then used in the two experiments that answer the research questions with both datasets. In all experiments, ten versions of each of the two datasets are used. They were created by shuffling the rows of the data, thereby generating new versions of the datasets (see the sketch below). This was done to obtain a more accurate picture of how MVCC performs.
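One minimal way to produce such shuffled versions with pandas is sketched below; the file name and seeds are illustrative assumptions, not the thesis setup.

```python
import pandas as pd

df = pd.read_csv("dim32.csv")  # illustrative path
# Ten row-shuffled copies of the dataset, one per experiment run.
versions = [df.sample(frac=1, random_state=seed).reset_index(drop=True)
            for seed in range(10)]
```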

The data was scaled with Z-Score, such that the distribution has a standard deviation of 1 and a mean of 0, with the following equation:

z = (x_i − µ) / σ,          (3.3)

where z is the scaled value, x_i is a data point in the data, µ is the mean of the data, and σ is the standard deviation of the data. This was used on both datasets to normalize the data, leading to better results when clustering.
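One way to apply Eq. (3.3) is with scikit-learn, which the thesis lists for preprocessing; whether StandardScaler is the exact call used in the thesis code is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1024, 32)                   # placeholder for a loaded dataset
X_scaled = StandardScaler().fit_transform(X)   # (x_i - mean) / std, per feature
```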

3.3.1 MVCC implementation

For the experiments, MVCC was implemented in Python [46] and uses libraries for some of its functions, which are as follows:

Scikit-Learn [39]: Preprocessing of data and all metrics calculations.

NetworkX [23]: Kruskal’s algorithm and creation of the complete graph.

Pandas [52]: Reading dataset from disk and manipulations of data.

NumPy [26]: Fast array manipulations and operations.

SciPy [48]: Utility functions for distance calculations.

PyMF [16]: Solving CNMF algorithm, implementation based on [14].

References
