DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

An automated approach to clustering with the framework suggested by Bradley, Fayyad and Reina


Abstract

Clustering with the framework suggested by Bradley, Fayyad and Reina allows for great scalability. However, practical challenges appear when applying the framework. One of the challenges is to define model parameters, including the number of clusters (K). Understanding how parameter values affect the final clustering may be challenging even with insight into the algorithm. Automating the clustering would allow for more widespread use. The research question is thus: How could an automated process for clustering with BFR be defined and what results could such a process yield? A tailored method for parameter optimization is suggested. This method is used with a new and computationally advantageous cluster validity index called population density index. Computing the widely used within set sum of squares error requires an additional pass over the data set; computing population density index does not. The final step of the automated process is to cluster with the parameters generated in the process. The outcome of these clusterings is measured. The results present data collected over 100 identically defined automated processes. These results show that 97 % of the identified K-values fall within the range of the suggested optimal value ±2. The method for optimizing parameters clearly results in parameters that outperform randomized parameters. The suggested population density index has a correlation coefficient of 1.00 with the commonly used within set sum of squares error in a 32-dimensional case. An automated process for clustering with BFR has been defined.

Abstract

The framework suggested by Bradley, Fayyad and Reina enables large-scale clustering. Using the framework does, however, bring practical challenges. One of these challenges is defining the model parameters, including the number of clusters (K). Understanding how the chosen parameter values affect the final clustering result is challenging even with insight into the algorithm. Automating the clustering would enable more people to use the framework. This leads to the research question: How could an automated process for clustering with BFR be defined and what results could such a process yield? A tailored method for parameter optimization is suggested. It is used in combination with a new cluster validity index referred to as population density index. Using this index brings computational advantages. Computing the frequently used within set sum of squares value requires an additional iteration over the data set; computing population density index avoids this extra iteration. The final step of the automated process is to cluster given the parameter values that the process itself defines. The outcome of these clusterings is measured. The results present data collected over 100 individual trials, in each of which the automated process was identically defined. The results show that 97 % of the identified values of the K parameter fall within a range based on the optimal value ±2. Optimizing parameter values with the suggested method gives clearly better values than generating them stochastically. The suggested population density index has a correlation coefficient of 1.00 with the widely used within set sum of squares value in a 32-dimensional case. An automated process for clustering with BFR has been defined.

Acknowledgements

Henrik Boström, for his meticulous feedback and impressive dedication.
Johan Montélius, for being inspiring and colorful in a gray place.
Sarunas Girdzijauskas, for introducing me to BFR.
Betzaida Carrillo, for all the love and support.
My friends at Epidemic, for the great welcoming.
Mikael, Annika and Elin.


I would like to dedicate this work to my grandfather Yngve. He was born in 1925 as the son of a fisherman. Without him this work would not exist.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Objectives
  1.5 Methodology / Methods
  1.6 Delimitations
  1.7 Outline

2 Extended Background
  2.1 K-Means
  2.2 Distance Measures
  2.3 BFR
  2.4 Cluster Validity Indices
  2.5 The Elbow Method
  2.6 Hyperparameter Optimization
  2.7 The Initial Points
  2.8 Pearson's Correlation Coefficient

3 Methodology
  3.1 Implementing BFR
  3.2 Defining Hypotheses
  3.3 Data
  3.4 Subsets and Iterations
  3.5 Experiments

4 Suggested Implementations and Artifacts
  4.1 Implementation of BFR
  4.2 Population Density Index
  4.3 Defining K
  4.4 Picking The Initial Points
  4.5 50/150 Optimization
  4.6 Hypotheses

5 Experimental results
  5.1 Finding K
  5.2 Optimizing Parameters
  5.3 Validity Index Comparison

6 Analysis
  6.1 Finding K
  6.2 Optimizing Parameters
  6.3 Validity Index Comparison

7 Conclusions
  7.1 Returning to the question
  7.2 Future Work

A Appendix A

1 Introduction

"Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another."[1] An illustration of clustering may be found in Fig. 1. The term cluster does not have a widely accepted definition[3]. Finding the most suitable number of clusters depends on how a cluster is defined[1]. Partitional clustering algorithms organize points into a predefined number of clusters by minimizing a defined measure[3]. One of the commonly used partitional clustering algorithms is K-means[1]. Clustering algorithms generally require multiple accesses of each data point. This means that the algorithms need to iterate over the data. Each iteration is costly for big data applications. In 1998 Bradley, Fayyad and Reina suggested a framework for clustering large databases in a single iteration.[2] The framework is based on the procrastination of uncertain decisions and data compression[2]. The framework is often referred to as BFR after its creators[1].

This report suggests extensions to a K-means implementation of the framework suggested by Bradley, Fayyad and Reina. The extensions include a process for defining the number of clusters. The report defines useful parameters and how these may be tuned. The extended framework may possibly be automated.

1.1 Background

Partitional clustering algorithms minimize a measure of error. This measure differs between implementations and algorithms. Minimizing the corresponding error may be seen as the objective of an algorithm. The error is often used to evaluate the quality of a clustering; low error rates are better than high.[3] A famous algorithm for partitional clustering is K-means[1]. The K-means algorithm regards points in a Euclidean space[1]. Clusters are represented by centroids. The centroid of a cluster is, in the ith dimension, located at the mean value of the ith dimension of all points assigned to that cluster.[3] Points are assigned to the cluster with the closest mean[1]. Cluster means are updated by iterating over all points multiple times[3]. Convergence is reached when cluster centers stabilize[1]. K-means minimizes a measure sometimes referred to as the within set sum of squares error[3]. This measure is low when the points of a cluster are close together and increases as the points of a cluster spread.[3]

BFR summarizes clusters as Gaussian distributions. A point is included in a cluster if the distance to its closest cluster is less than some threshold value. A point whose distance to its closest cluster is greater than the threshold value is buffered and assigned when more information is available. The BFR framework yields low error rates without multiple iterations over the data.[2] BFR may minimize different measures of error depending on its implementation[2]. The implementation covered in this report minimizes the within set sum of squares error[2].

Partitional clustering algorithms generally cluster points into a predefined number of clusters[3]. The most suitable number of clusters is often defined by clustering into different numbers of clusters. The error is regarded as a function of the number of clusters. The curve of this function may be used to pick the most suitable number of clusters.[1] This approach requires multiple iterations over a data set and is expensive for big data applications[1][2].

1.2 Problem

Implementing the framework[2] created by Bradley, Fayyad and Reina requires defining and adjusting thresholds[2][1]. These thresholds are used to define nearness. Points and/or clusters are merged if they are near[2]. Commonly used approaches for defining parameters and measuring error do not use the advantages of the framework suggested by Bradley, Fayyad and Reina. Defining the number of clusters requires initializing multiple clustering models[1]. The resulting number of clusters may depend on the current threshold values. Using the BFR framework may be challenging due to its complexity. Automating this process would make the BFR framework easier to use. An automated process would benefit from using the advantages of the BFR framework. How to define such a process may not be obvious. It is also not clear whether automation may yield good results. No previous studies have been found on this matter.

1.3 Purpose

The purpose of this report is to define an automated process for clustering with the BFR framework. The outcome of this automated process will be studied. The steps of the automated process should be tailored to use the advantages of the BFR framework. Using these advantages requires both tailoring existing approaches and defining new ones. The number of clusters may be regarded as one of the parameters. Measuring the error may be regarded as a vital part of the evaluation required for comparison. The knowledge required for using the BFR framework could be lowered with an automated approach. Could an automated approach yield good results? The research question may thus be formulated as: How could an automated process for clustering with BFR be defined and what results could such a process yield?

1.4 Objectives

To answer the research question, it is required to define a process which could be automated. The outcome of the defined process needs to be studied. This will answer which results an automated process could yield. One of the most important parameters to define is the number of clusters. The stability of the process for defining the number of clusters will thus be examined. Statistics will be gathered over multiple runs over the same data. A good method for defining the number of clusters finds similar numbers for different attempts on the same data set. Thresholds will need to be defined as BFR is dependent on thresholds[2]. Values will need to be assigned to these thresholds. A good method for finding threshold values should outperform random guessing; otherwise the method is redundant. A tailored validity index is used for evaluation. Thus it is also required to gather statistics that show that the suggested validity measurement has a strong correlation with the validity index it replaces.

1.4.1 Benefits, Ethics and Sustainability

The author does not receive funding or economic compensation from any party. There is no incentive to present misleading information. The author strives to present a transparent process which allows the reader to draw conclusions and repeat experiments. The reader is encouraged to critically examine any facts, claims or analysis presented in this work. Please contact the author should you find any issues or concerns with the presented information.

"Discrimination is the prejudiced treatment of an individual based on their membership in a certain group or category"[27]. Clustering may be used to classify people based on attributes[1]. This could possibly prevent the UN’s sustainable development goal no.10 (reduced inequalities)[15]. Only synthetic data sets were included in this study. This allows for public access to the data without any privacy concerns.

Every data access could be regarded as an energy consuming operation. Tailoring a process to use the advantages of the BFR framework may lower the amount of data accesses required. This could result in more energy efficient approaches than brute force solutions.

1.5 Methodology / Methods

Observations that can be reduced to numbers are called quantitative[6]. Quantitative observations are based on measurements[6]. Opposed to quantitative observations are qualitative observations[6]. Qualitative observations focus on the perceived experience of an individual. They may be based on, for example, interviews, surveys or questionnaires[6].

This study strives to provide support for the suggested approaches. The support should be as convincing and unbiased as possible. The study focuses on numerical clustering models and not their applications. All data used in this study is thus gathered quantitatively. The analysis of the data consists of two parts. The first part is purely quantitative. The second part of the analysis brings in the author's opinions and is thus qualitative.

1.6 Delimitations

This study only includes one suggested process. The results of this process are studied. These results are not compared to any alternative approaches. The steps of the included process are tailored to use advantages of the BFR framework. To what extent the steps of the included process use the advantages of the BFR framework is not studied.

The framework suggested by Bradley, Fayyad and Reina is not bound to any specific clustering algorithm[2]. This report only implements the framework using the K-means algorithm. The study is based on four synthetic data sets[7] from the School of Computing, University of Eastern Finland. The chosen data sets have different dimensionality and size.

All included data sets are distributed in Gaussian clusters. This is suitable since the framework suggested by Bradley, Fayyad and Reina models clusters as axis-aligned Gaussian distributions[2]. This also means that the results and the following analysis do not cover non-Gaussian clusters. Feature independence is thus also assumed in all experiments.

1.7 Outline

Chapter two introduces the K-means algorithm and the BFR framework. It presents different challenges and methods that are either required or useful for using these. The chapter also introduces a statistical measure required for understanding the results. Chapter three defines three deductive and quantitative experiments. It also introduces the data used in the study. Chapter four explains my implementation of BFR. It explains useful model parameters and approaches. Finally, I define three hypotheses. Chapter five presents the results of the three experiments defined in chapter three. Chapter six analyses the results presented in chapter five based on the hypotheses defined in chapter four. Chapter six suggests how the hypotheses possibly fit into the BFR framework. Chapter six ends by suggesting future studies and research. Chapter seven concludes on the results given the research question.

2 Extended Background

This chapter gives an introduction to partitional clustering in general. It explains K-Means and the BFR framework. Components of the suggested extensions are introduced and explained. Finally, the chapter presents Pearson's correlation coefficient, a statistical measure used in the results and conclusions.

2.1 K-Means

A well-known algorithm for clustering is K-means[1]. K-means represents a cluster by the mean value, in each dimension, of all points assigned to that cluster[1]. The algorithm starts by picking K points at random[8]. These represent the initial clusters. Points get assigned to the cluster with the closest mean[1]. The mean of a cluster is updated for each point that gets assigned to that cluster[8]. Convergence is reached when the centers stabilize[8]. The pseudo code for K-means may be seen in Algorithm 1.

Algorithm 1 K-Means
 1: K ← number of clusters
 2: threshold ← value
 3: points ← input data
 4: clusters ← ∅
 5: while |clusters| < K do
 6:     cluster ← mean(chosen point from points)
 7:     clusters.append(cluster)
 8: while True do
 9:     movement ← 0
10:     for each point in points do
11:         closest ← closest cluster to point
12:         distance ← dist(closest, mean(closest, point))
13:         if distance > movement then
14:             movement ← distance
15:         closest ← mean(closest, point)
16:     if movement < threshold then
17:         return clusters
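For reference, the following is a minimal NumPy sketch of the same procedure. The function name, the random initialization and the convergence test on centroid movement are illustrative choices and not taken from the thesis implementation.

import numpy as np

def kmeans(points, k, threshold=1e-4, rng=None):
    """Minimal K-means sketch: points is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Pick k random points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    while True:
        # Assign every point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        movement = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        # Convergence: the centroids have stabilized.
        if movement < threshold:
            return centroids, labels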

2.2 Distance Measures

The distance between two points or sets may be defined in multiple ways[9]. This section defines distance measures used in this report.

Euclidean distance is arguably one of the most common distance measures. A definition of Euclidean distance may be seen in equation 1[9].

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)

Mahalanobis distance measures the similarity of two vectors from the same distribution[9]. A definition of Mahalanobis distance may be seen in equation 2[9][2], where S is the covariance matrix.

d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}    (2)

S turns into a diagonal matrix if dimensions are independent[1]. Mahalanobis distance may then be simplified to the expression seen in equation 3[1][9].

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2 / \sigma_i^2}    (3)

Mahalanobis distance thus turns into the Euclidean distance normalized by the standard deviation, σ_i, in each dimension[9].
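A small NumPy sketch of equations 1 and 3 may clarify the notation; the function names are illustrative.

import numpy as np

def euclidean(x, y):
    """Equation 1: Euclidean distance between two vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def mahalanobis_diag(x, y, std):
    """Equation 3: Mahalanobis distance when the covariance matrix is diagonal.
    std holds the standard deviation of each dimension."""
    return np.sqrt(np.sum(((x - y) / std) ** 2))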

2.3 BFR

In 1998, Bradley, Fayyad and Reina suggested a framework for clustering large databases[2]. The framework is often referred to as BFR after its creators[1]. The framework is not bound to any specific clustering algorithm. In the original publication, it is exemplified in combination with K-means[2]. BFR allows data to be divided into chunks that fit in main memory[2]. The framework yields competitive results without multiple iterations over a data set[2].

In the framework a cluster is represented by an N-dimensional Gaussian distribution with feature independence[2]. Each cluster has three attributes:

• Sum in each dimension
• Sum of squares in each dimension
• Size

The sum and sum of squares can be used to compute the population standard deviation[10]. The formula may be seen in equation 4, where sum_i and sumsq_i denote the sum and sum of squares in dimension i and P denotes the cluster size.

\sigma_i = \sqrt{sumsq_i / P - (sum_i / P)^2}    (4)
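As a sketch of how these attributes might be kept in code, the class below stores the sum, sum of squares and size of a cluster and derives the mean and the population standard deviation of equation 4. The class and method names are illustrative and do not mirror the bfr package.

import numpy as np

class ClusterSummary:
    """Per-cluster statistics kept by BFR: sum, sum of squares and size."""

    def __init__(self, point):
        self.sums = np.array(point, dtype=float)    # sum in each dimension
        self.sums_sq = self.sums ** 2               # sum of squares in each dimension
        self.size = 1                               # number of points summarized

    def update(self, point):
        # Adding a point only touches the summary statistics; the point itself is discarded.
        self.sums += np.asarray(point, dtype=float)
        self.sums_sq += np.asarray(point, dtype=float) ** 2
        self.size += 1

    def mean(self):
        return self.sums / self.size

    def std(self):
        # Population standard deviation per dimension (equation 4).
        return np.sqrt(np.maximum(self.sums_sq / self.size - self.mean() ** 2, 0.0))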

The framework suggests the use of three sets[2][1]. The discard set contains the main clusters[2][1]. Each point that is close to a cluster in discard updates the sum, sum of squares and size of that cluster and may thus be discarded[2][1]. Points that are far from clusters in discard are buffered in the retain set[2][1]. Points within the retain set with a low relative distance are summarized using the same cluster representation as in discard[2][1]. These summaries get stored in the compress set[2][1]. Points in retain and clusters in compress get assigned to the closest cluster in discard when all points have been considered[2][1]. An interpretation of the BFR framework may be seen in algorithm 2.

Algorithm 2 BFR
 1: K ← number of clusters
 2: points ← input data
 3: thresholds ← values
 4: discard ← ∅
 5: compress ← ∅
 6: retain ← ∅
 7: while |discard| < K do
 8:     cluster ← cluster represented by chosen point from points
 9:     discard.append(cluster)
10: for each point in points do
11:     merge all clusters ∈ compress with relative distance < threshold
12:     closest ← the cluster in discard closest to point
13:     distance ← dist(closest, point)
14:     if distance < threshold then
15:         closest.update(point)
16:         continue
17:     closest ← the cluster in compress closest to point
18:     distance ← dist(closest, point)
19:     if distance < threshold then
20:         closest.update(point)
21:         continue
22:     closest ← the point in retain closest to point
23:     distance ← dist(closest, point)
24:     if distance < threshold then
25:         cluster ← merge(point, closest)
26:         compress.append(cluster)
27:         retain.remove(closest)
28:     else
29:         retain.append(point)
30: Assign all clusters in compress to their closest cluster in discard
31: Assign all points in retain to their closest cluster in discard

2.4 Cluster Validity Indices

There are two types of cluster validity indices[4]. External cluster validity indices require knowledge of the true partitioning of a set[4]. Internal cluster validity indices measure the outcome of a clustering by regarding density, separation or both[4]. The true partitioning of a problem is often unknown[4]. Internal cluster validity indices are thus more commonly used[4]. The only existing cluster validity index used in this report is the within set sum of squares error (or inertia). The measure is an internal cluster validity index and is defined as in equation 5.

WSSSE = \sum_{c \in Clusters} \sum_{X \in c} \sum_{i=1}^{N} (X_i - \bar{X}_{ci})^2    (5)

Here c is a cluster, X_i is the ith dimension of a vector belonging to exactly one cluster and \bar{X}_{ci} is the ith dimension of the mean vector of cluster c. Examples of other cluster validity indices not used in this report are[4]:

• Silhouette index
• Davies-Bouldin index
• Dunn's index
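For reference, equation 5 can be computed with a few lines of NumPy once cluster assignments are known; the function name and arguments are illustrative.

import numpy as np

def wssse(points, labels, centroids):
    """Within set sum of squares error (equation 5).
    points: (n, d) array, labels: (n,) cluster index per point, centroids: (k, d) array."""
    diffs = points - centroids[labels]    # difference to the assigned cluster mean
    return float(np.sum(diffs ** 2))      # sum of squared differences over all dimensions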

2.5 The Elbow Method

Both K-means and BFR cluster the data into a predefined number of clusters[1][2]. It is common that the best number of clusters is unknown[1]. The number of clusters, K, may be defined by creating multiple clusterings for different values of K[1]. The error of a clustering generally decreases for increasing values of K[1]. The best K-value occurs when the error stabilizes[1]. This process is sometimes called the elbow method[16]. An illustration of the process may be seen in Figure 2.

2.6 Hyperparameter Optimization

Learning algorithms typically minimize a loss function[18][19]. For many learning algorithms, this minimization is done given a set of predefined parameters[18]. The predefined parameters are often referred to as hyperparameters[18][19]. The process of finding a set of hyperparameters that minimizes the loss is called hyperparameter optimization[19]. Two common methods for hyperparameter optimization are manual search and grid search[18]. Manual search is based on an expert searching for parameters based on experience[21]. A grid search is performed by regarding combinations of hyperparameters in a predefined grid[21]. Another approach for optimizing hyperparameters is to perform a random search[18]. A random search is performed by initializing a number of learning algorithms with randomized hyperparameters[18]. The hyperparameters that minimize the loss function are considered optimal[18]. A random search is claimed to outperform a grid search for high dimensional parameter spaces[18].
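A random search can be sketched in a few lines; the callables loss and sample_params are assumed placeholders for an error measure and a hyperparameter sampler, and are not part of the thesis implementation.

import random

def random_search(loss, sample_params, n_trials=100):
    """Evaluate n_trials randomly drawn hyperparameter sets and keep the best one."""
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_params()    # draw a random hyperparameter set
        current = loss(params)      # train/evaluate with these hyperparameters
        if current < best_loss:
            best_params, best_loss = params, current
    return best_params, best_loss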


Figure 2: The elbow method

2.7 The Initial Points

The validity of a K-means clustering is highly dependent on the points picked to represent the initial centers[17][1]. The algorithm generally produces a higher validity if the initial points belong to different clusters[1]. Two points with a high relative distance are more likely to belong to different clusters[1]. A possible initialization algorithm is suggested as a part of K-means++[17]. In K-means++, initial points are picked with a probability proportional to the distance to the closest point that has already been picked[17]. Another option is to use the method suggested in Algorithm 3[1].

Algorithm 3 Possible initialization algorithm
1: Pick the first point at random
2: while number of picked points < K do
3:     Pick the point which maximizes the distance to its closest already picked point

2.8 Pearson's Correlation Coefficient

A signed correlation of two random variables is bounded as in equation 6[26].

-1 ≤ corr(X, Y) ≤ 1    (6)

Pearson's correlation coefficient measures the linear relationship between two random variables[26]. The coefficient evaluates to ±1 for two perfectly linearly related variables[26]. A positive sign means that the two random variables are positively related: greater values of X mean greater values of Y. A negative sign means that the two random variables are negatively related: greater values of X mean smaller values of Y. Total absence of linear relationship is defined as corr(X, Y) = 0[26]. Pearson's correlation coefficient is defined as in equation 7.

corr(X, Y) = cov(X, Y) / (σ_X σ_Y)    (7)
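For reference, the coefficient is available directly in NumPy; the sample vectors below are made up purely for illustration.

import numpy as np

# Pearson's correlation coefficient (equation 7) for two samples x and y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
r = np.corrcoef(x, y)[0, 1]    # off-diagonal entry of the 2x2 correlation matrix
print(round(r, 2))             # close to 1.0 for this nearly linear example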

3 Methodology

This study consists of two parts:

• The first part is to define an automated process. This requires implementing the BFR framework.
• The second part is a quantitative study of the results of the automated process.

3.1 Implementing BFR

Bradley, Fayyad and Reina suggest approaches for defining thresholds[2]. These approaches are general and not specific enough to implement the framework. Specifying these definitions was needed to get a working implementation. These specifications should preferably be defined by considering applicable previous work. A deductive approach was used when no applicable previous work was found.

3.2 Defining Hypotheses

Plenty of time was spent on implementing the framework suggested by Bradley, Fayyad and Reina before defining the hypotheses. The time spent on implementation together with manual trial and error yielded knowledge of the logical flow of the implementation. This knowledge, together with more trial and error, affected the hypotheses. The hypotheses were put through simple, manual experiments and iteratively adjusted. The process for defining the hypotheses was thus performed in a partly inductive manner based on observing results. The study evaluates the following hypotheses:

• The elbow method may be defined in such a way that it finds acceptable values for K on all four data sets
• Optimizing parameters using 50/150 optimization in a logical sequence yields lower error rates than randomly chosen parameters
• There exists a strong correlation between the within set sum of squares error and the population density index of clusterings

3.3 Data

The experiments in this study are performed on multiple data sets. All data sets are synthetic and publicly available[7]. The study could have included non-synthetic data sets. Four synthetic and Gaussian distributed data sets were used to provide the framework suggested by Bradley, Fayyad and Reina with optimal conditions. Studies on a wider variety of data sets are certainly relevant but were excluded due to limited resources. The data sets used in the study are:

• S1: 2D. 5000 points divided into 15 clusters[23]
• S2: 2D. 5000 points divided into 15 clusters[23]
• Dim032: 32D. 1024 points divided into 16 clusters[22]
• Dim15: 15D. 10126 points divided into 9 clusters[24]

All data sets consist of Gaussian clusters with different levels of overlap[7].

3.4 Subsets and Iterations

Each data set was divided into as many fully sized subsets of size 1000 as possible. A point was not allowed to be a member of more than one subset. Each subset was clustered R times to minimize stochasticity. R was defined as 1 + ⌊20 / (number of subsets)⌋. This evaluated to:

• S1: 5 subsets, R = 5
• S2: 5 subsets, R = 5
• Dim032: 1 subset, R = 21
• Dim15: 10 subsets, R = 3

The number of subsets and R-values were consistent for all experiments.

3.5 Experiments

The study consists of three experiments. Each experiment evaluates a hypothesis. Each hypothesis is part of the suggested extension to the BFR framework and concerns defining K, parameter optimization or efficient validity representation. The experiments were performed repeatedly and the results were based on multiple attempts in order to reduce variation and stochasticity in the results.

3.5.1 Finding K

The experiment was designed to evaluate the consistency of the process suggested for finding K. This process is described in detail in section 4.3. 100 unique trials were made for each data set. The data of each trial was randomly shuffled to minimize bias from ordering. The error of a value K was represented by the average error of all subsets clustered into K clusters R times. Each of the trials resulted in an identified value Ki which was documented.


3.5.2 Optimizing Parameters

The experiment was designed to evaluate the efficiency of the suggested process for parameter optimization. One unique trial was made for each data set. The data of each trial was randomly shuffled to minimize bias from ordering. One set of parameters was defined using the suggested optimization approach. The set of optimized parameters was compared with 100 randomly generated sets of parameters. Parameters were generated uniformly at random within an upper and lower bound. New parameters were generated for each trial. The error of all sets of parameters was evaluated on the same permutation of the data. This permutation was the same as during the tuning process. The error of a set of parameters was represented by the average error of 100 unique clusterings on the same permutation of the data.

3.5.3 Alternative Validity Index

The experiment was designed to evaluate the relation between an existing and a suggested cluster validity index. Two different validity index measures were computed for all clusterings on all sets of parameters in the experiment of section 3.5.2. Both these measures were documented.

4 Suggested Implementations and Artifacts

This degree project has been carried out in collaboration with Epidemic Sound. A Python implementation of BFR was written using pure NumPy as a part of this collaboration. The source code is available at:

https://github.com/jeppeb91/bfr

The research question was born as challenges and ideas arose during the implementation of BFR. This section contains algorithms and artifacts that are deductively derived using experience gained during the implementation process.

4.1 Implementation of BFR

Implementing the BFR framework[2] requires defining thresholds and distance measures. Bradley, Fayyad and Reina suggest the use of Mahalanobis distance to determine whether a point is considered close to a cluster[2]. A point is considered close to a cluster if it has a Mahalanobis distance less than a threshold. The definition of nearness between a point and a cluster may be seen in equation 8.

mahalanobis distance(point, cluster) < τ_m × √d    (8)

This definition considers as near all points within a distance of τ_m standard deviations from a normally distributed cluster in d dimensions[14]. Mahalanobis distance may be computed for any σ ≠ 0. When a cluster is represented by a single point (or possibly a few points), it may have σ = 0 in one or more dimensions. Thus another threshold was introduced. The nearness between a point and a variance free entity (such as another point or a variance free cluster) is defined as in equation 9.

euclidean distance(point, entity) < τ_e    (9)

τ_e is a decimal value. Euclidean distance and its corresponding threshold are frequently used in the early stages of a clustering. This is defined as the initialization phase. The initialization phase ends when all clusters in the discard set have a variance > 0 in all dimensions.

BFR suggests that clusters in the compress set should be merged if they are considered near[2]. The nearness of two clusters in compress is defined as in equation 10.

∀ d ∈ dimensions: σ_(A∪B)d < τ_c × (σ_Ad + σ_Bd)    (10)

Two clusters in compress, A and B, are hence considered close if the standard deviation of the merged cluster, σ_(A∪B), is lower than the sum of the unmerged standard deviations, scaled by a threshold value, in all dimensions.

There is a chance that clusters in the compress set have σ_i = 0 in one or more dimensions. The Mahalanobis distance was therefore adjusted as in equation 11.

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2 / \sigma_i^2}, where the term (x_i - y_i)^2 / \sigma_i^2 is defined as 0 for every dimension i with σ_i = 0    (11)

A cluster in the compress set has σ_i = 0 only if all points belonging to that cluster have exactly the same value in one or more dimensions. The adjustment of Mahalanobis distance is only done to prevent division by zero.

Euclidean distance was used to determine which cluster was closest to a point.
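A sketch of how the two nearness checks of equations 8 and 9 might look in code follows; the function and argument names are illustrative, and zero-variance dimensions are skipped as in equation 11.

import numpy as np

def close_to_cluster(point, cluster_mean, cluster_std, tau_m):
    """Equation 8: a point is near a cluster if its Mahalanobis distance is below
    tau_m * sqrt(d). Zero-variance dimensions are ignored as in equation 11."""
    d = len(point)
    diff = point - cluster_mean
    nonzero = cluster_std > 0
    mahalanobis = np.sqrt(np.sum((diff[nonzero] / cluster_std[nonzero]) ** 2))
    return mahalanobis < tau_m * np.sqrt(d)

def close_to_point(point, other, tau_e):
    """Equation 9: nearness between a point and a variance free entity."""
    return np.linalg.norm(point - other) < tau_e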

4.2 Population Density Index

BFR aggregates metadata of clusters in one pass over a data set[2]. Computing the within set sum of squares error would require another pass. Thus a new cluster validity index is suggested. A definition of the index may be seen in equation 12.

population density index = \sum_{c \in Clusters} (P_c / P_{tot}) \cdot (1/N) \sum_{i=1}^{N} \sigma_{ci}    (12)

Here c is a cluster and N is the number of dimensions. P_c is the population size of the cluster c, P_tot is the total population size and σ_ci is the standard deviation of cluster c in the ith dimension. The population density index is hence the sum over all clusters of the dimension average standard deviation of each cluster, scaled by the proportion of the total population size belonging to that cluster.
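A sketch of equation 12 in code, assuming cluster summaries that expose a size attribute and a std() method (as in the earlier ClusterSummary sketch); no extra pass over the data is required.

import numpy as np

def population_density_index(clusters):
    """Equation 12, computed from the summaries BFR already keeps."""
    total = sum(c.size for c in clusters)
    return sum((c.size / total) * float(np.mean(c.std())) for c in clusters)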

4.3 Defining K

Defining K may be done by using the elbow method combined with population density index. A suggested implementation of the process may be seen in Algorithm 4.

Algorithm 4 The Elbow Method
1: K ← 2
2: threshold ← value
3: prev_error ← error(K)
4: while True do
5:     K ← K + 1
6:     error ← error(K)
7:     if error > prev_error × threshold then
8:         return K - 1
9:     prev_error ← error

4.4 Picking The Initial Points

A suggested approach for defining the initial points is to iteratively pick the point that maximizes the distance to the closest point that has already been picked[1]. This is computationally expensive for large data sets and thus Algorithm 5 is suggested instead.

Algorithm 5 The Initial Points
1: S ← points
2: K ← value
3: N ← number of candidates
4: initial_points ← random point
5: while number of initial_points < K do
6:     candidates ← N random points from S
7:     initial_points.append(best candidate)

For each initial point a number of candidates is picked. This allows a level of optimization to be defined. The best candidate is defined by finding the candidate which maximizes the distance to the closest point in initial_points.
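A NumPy sketch of Algorithm 5 follows; the function name and the use of sampling with replacement for the candidates are illustrative assumptions.

import numpy as np

def pick_initial_points(points, k, n_candidates, rng=None):
    """For each new initial point, draw n_candidates random points and keep the one
    furthest from its closest already picked point."""
    rng = np.random.default_rng(rng)
    initial = [points[rng.integers(len(points))]]
    while len(initial) < k:
        candidates = points[rng.choice(len(points), size=n_candidates)]
        # Distance from every candidate to its closest already picked point.
        dists = np.min(
            np.linalg.norm(candidates[:, None, :] - np.array(initial)[None, :, :], axis=2),
            axis=1,
        )
        initial.append(candidates[np.argmax(dists)])
    return np.array(initial)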

4.5 50/150 Optimization

Clustering the same set of data multiple times yields different solutions[2]. To evaluate the efficiency of a threshold value it is required to gather statistics over multiple attempts. If we define the estimated error of a threshold, τ, as the average error of N trials, we get equation 13.

error(τ) = (1/N) \sum_{i=1}^{N} error_i    (13)

The law of large numbers gives us a limiting procedure in which the estimated error of a threshold approaches the true value when N goes to infinity[10]. This may be seen in equation 14.

\lim_{N \to \infty} (1/N) \sum_{i=1}^{N} error_i = error_{truth}    (14)

Algorithm 6 is suggested for the tuning of a single parameter. The algorithm starts by making a high initial guess of τ. The algorithm iteratively halves τ until the error rate is no longer decreased. If halving τ does not improve the error rate, τ is instead multiplied by 1.5. Convergence is declared when neither an increase nor a decrease improves the error rate. A threshold is used to define the significance of a decrease in error rate.

Algorithm 6 50/150 Optimization
 1: τ ← high initial guess
 2: previous_error ← ∞
 3: while True do
 4:     previous_error ← error(τ)
 5:     τ ← τ / 2
 6:     error ← error(τ)
 7:     if error < previous_error × threshold then
 8:         continue
 9:     τ ← 2 × τ × 1.5
10:     error ← error(τ)
11:     if error < previous_error × threshold then
12:         continue
13:     return τ
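A compact Python sketch of 50/150 optimization follows. The callable error is assumed to estimate the average error for a given threshold value as in equation 13, and the default significance factor is illustrative; unlike Algorithm 6, the sketch returns the unchanged value once neither direction gives a significant improvement.

def tune_threshold(error, tau, significance=0.95):
    """Tune a single parameter by repeatedly trying a 50 % decrease or a 50 % increase."""
    while True:
        previous_error = error(tau)
        # Try halving the threshold (the "50" step).
        if error(tau / 2) < previous_error * significance:
            tau = tau / 2
            continue
        # Try increasing it by 50 % instead (the "150" step).
        if error(tau * 1.5) < previous_error * significance:
            tau = tau * 1.5
            continue
        # Neither direction gave a significant improvement: converged.
        return tau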

4.6 Hypotheses

It is challenging to define good values for the thresholds and K. The BFR framework was implemented using three different thresholds. Three hypotheses were defined and combined into an automated process. The hypotheses are:

• The elbow method may be defined in such a way that it finds acceptable values for K on all four data sets
• Optimizing parameters using 50/150 optimization in a logical sequence yields lower error rates than randomly chosen parameters
• There exists a strong correlation between the within set sum of squares error and the population density index of clusterings

The first step of the extended framework is to obtain a randomly chosen subset S which fits in main memory. S is then split into N smaller subsets. The error of a parameter, τ, is represented by the average of N unique clusterings initiated on each of the subsets. The accuracy of the average may be increased by creating R clusterings for each subset. Estimating the error by a combination of weak learners is a process sometimes referred to as bagging.

The first parameter to define is the number of clusters, K. K is initially defined by using the elbow method with all thresholds set to infinity and initial points chosen entirely at random.

The easiest threshold to isolate is the threshold for merging two clusters in compress, τ_c. The impact of τ_c is maximized when many points are outside nearness from clusters in discard but close enough to be summarized as clusters in compress. Points get summarized as clusters in compress if their relative Euclidean distance is less than the corresponding Euclidean threshold, τ_e. Points are considered far from clusters in discard if their Mahalanobis distance is greater than the corresponding Mahalanobis threshold, τ_m. For tuning τ_c it is thus appropriate to set τ_m low and τ_e infinitely high.

The next threshold to tune is the Euclidean threshold, τ_e. τ_e is used in the initialization phase (before all clusters have a variance in each dimension) and to determine if two points in the retain set are close enough to form a cluster in compress. τ_m is kept low while tuning τ_e. This maximizes the number of points that are far from clusters in discard, thus forcing them to the retain set. τ_c is kept at the value which was found during its tuning.

The Mahalanobis threshold, τ_m, is never tuned. Instead it is defined as τ_m = 3. This means that a point is included in a cluster if it is within a confidence interval of approximately 99 % of the cluster[10][14].

K is redefined after tuning the thresholds. The number of candidates used in the algorithm for finding the initial points is preferably tuned after the initial guess of K. The exact implementation of the process used in all experiments may be seen in Algorithm 7.

Algorithm 7 The training
 1: S ← data
 2: S_i ← the ith equally sized subset of S
 3: τ_c ← ∞
 4: τ_e ← ∞
 5: τ_m ← ∞
 6: init_candidates ← 1
 7: K ← initial guess from elbow method on all S_i
 8: init_candidates ← iteratively increased some steps
 9: τ_m ← 2.0
10: τ_c ← value from 50/150 tuning on all S_i
11: τ_e ← value from 50/150 tuning on all S_i
12: τ_m ← 3.0
13: K ← updated value from elbow method on all S_i

5 Experimental results

This section presents the results of the experiments defined in section 3.5.

• Section 5.1 presents the distribution of identified K-values. This experiment is described in section 3.5.1.
• Section 5.2 compares the average error of a set of tuned parameters with the average error of 100 randomly generated parameters. This experiment is described in section 3.5.2.
• Section 5.3 presents data on the relationship between within set sum of squares error and population density index. This experiment is described in section 3.5.3.

5.1 Finding K

This section includes four figures. Each figure shows the frequency of identified K-values for a given data set. The process used to define K is described in Algorithm 7 and section 3.5.1. The horizontal axes display K-values as categorical elements. The vertical axes represent the number of occurrences of each value K.

5.1.1 S1

Figure 3: Distribution of K-values for S1

100 trials identified K as 14, 15 or 16. The S1 data set is divided into 15 Gaussian clusters[7]. The distribution of identified K-values may be seen in figure 3.

5.1.2 S2

100 trials identified K as either 14, 15, 16 or 17. The S2 data set is divided into 15 Gaussian clusters[7]. The distribution of identified K-values may be seen in figure 4.

5.1.3 Dim032

Figure 5: Distribution of K-values for Dim032

100 trials identified K as 16, 17, 18 or 19. The Dim032 data set is divided into 16 Gaussian clusters[7]. The distribution of identified K-values may be seen in figure 5.

5.1.4 Dim15

100 trials identified K as 9, 10, 11, 12 or 13. The Dim15 data set is divided into 9 Gaussian clusters[7]. The distribution of identified K-values may be seen in figure 6.

5.2 Optimizing Parameters

Both the K-value and the optimized set of parameters were retrieved using Algorithm 7. The significance threshold used in Algorithm 7 was 0.95 for all data sets except Dim032, for which it was 0.99. All randomized sets of parameters used the same K-value as the optimized set. In Table 1 each data set is treated individually. 100 randomly generated sets of parameters are evaluated per data set. The statistics are computed using the result of these 100 parameter sets. The result of a parameter set is represented by the average population density index of 100 unique clusterings initiated with those parameters. Table 2 shows the average population density index of one set of optimized parameters. The average was computed by initiating 100 unique clusterings with the optimized parameter values. In Table 3 each data set is treated individually. 100 randomly generated sets of parameters are evaluated per data set. The statistics are computed using the result of these 100 parameter sets. The result of a parameter set is represented by the average within set sum of squares error of 100 unique clusterings initiated with those parameters. Table 4 shows the average within set sum of squares error of one set of optimized parameters. The average was computed over 100 unique clusterings given the optimized parameter values.

Table 1: Statistics of the average population density index of 100 random parameter sets for all four data sets

                     S1         S2         Dim032   Dim15
Average              35585.02   44733.09   4.46     17518.84
Standard Deviation   5736.96    5767.17    2.64     5412.63
Best                 29700.77   38926.39   2.66     11831.89
Worst                53942.54   67755.68   16.54    39024.26

Table 2: Average population density index of tuned parameters for all four data sets

S1         S2         Dim032   Dim15
29754.14   39105.67   2.71     11831.89

Table 3: Statistics of the average within set sum of squares error of 100 random parameter sets for all four data sets

                     S1         S2         Dim032    Dim15
Average              1.13E+13   1.78E+13   2.54E+6   9.57E+13
Standard Deviation   3.26E+12   2.17E+12   3.60E+6   1.20E+14
Best                 9.03E+12   1.56E+13   2.33E+5   2.13E+13
Worst                2.68E+13   2.69E+13   1.92E+7   6.56E+14

Table 4: Average within set sum of squares error of tuned parameters for all four data sets

S1         S2         Dim032    Dim15
9.41E+12   1.62E+13   3.10E+5   2.13E+13

5.2.1 S1

Figure 7: Average population density index of all parameters on S1

Figure 7 compares the average population density index of 100 sets of random parameters (blue bars) with the average population density index of the optimized set of parameters (red line). Figure 8 presents statistics of the average population density index of both optimized and random sets of parameters.

Figure 8: Statistics of parameters on S1. The average does not include tuned parameters.

5.2.2 S2

Figure 9: Average population density index of all parameters on S2

Figure 9 compares the average population density index of 100 sets of random parameters (blue bars) with the average population density index of the optimized set of parameters (red line). Figure 10 presents statistics of the average population density index of both optimized and random sets of parameters.

Figure 10: Statistics of parameters on S2. The average does not include tuned parameters.

5.2.3 Dim032

Figure 11: Average population density index of all parameters on Dim032

Figure 11 compares the average population density index of 100 sets of random parameters (blue bars) with the average population density index of the optimized set of parameters (red line). Figure 12 presents statistics of the average population density index of both optimized and random sets of parameters.

Figure 12: Statistics of parameters on Dim032. The average does not include tuned parameters.

5.2.4 Dim15

Figure 13: Average population density index of all parameters on Dim15

Figure 13 compares the average population density index of 100 sets of random parameters (blue bars) with the average population density index of the optimized set of parameters (red line). Figure 14 presents statistics of the average population density index of both optimized and random sets of parameters.

Figure 14: Statistics of parameters on Dim15. The average does not include tuned parameters.

5.3 Validity Index Comparison

This section presents the average within set sum of squares error together with the average population density index evaluated over the same clusterings. Both the K-value and the optimized set of parameters were retrieved using Algorithm 7. The K-value of the random sets of parameters was the same as for the optimized set. Table 5 displays the covariance and Pearson's correlation coefficient of observations of population density index and within set sum of squares error. Each observation shows the population density index and within set sum of squares error averaged over 100 unique clusterings given a set of parameter values.

Table 5: Covariance and correlation coefficient of within set sum of squares error and population density index

                         S1         S2         Dim032    Dim15
Pearson's corr. coeff.   0.58       0.43       1.00      0.89
Covariance               1.07E+16   5.28E+15   9.28E+6   5.71E+17

5.3.1 S1

Figure 15 displays the average within set sum of squares error and population density index of the tuned and random sets of parameters.

5.3.2 S2

Figure 16: Error comparison on S2

Figure 16 displays the average within set sum of squares error and population density index of the tuned and random sets of parameters.

5.3.3 Dim032

Figure 17 displays the average within set sum of squares error and population density index of the tuned and random sets of parameters.


5.3.4 Dim15

Figure 18 displays the average within set sum of squares error and population density index of the tuned and random sets of parameters.

6 Analysis

This section contains analyses of the three experiments of section 5.

6.1 Finding K

The correct K-value of a clustering does not have a definitive answer. The results of section 3.5.1 will thus be evaluated for different acceptance criteria, using the suggested true partitioning defined by the School of Computing, University of Eastern Finland, as K_true[7]. Let us define four levels of acceptance criteria. The first may be seen in equation 15 and the second in equation 16. The third and fourth may be seen in equations 17 and 18.

P(K = K_true)    (15)

P(K = K_true ± 1)    (16)

P(K = K_true ± 2)    (17)

P(K = K_true ± 3)    (18)

Table 6: Evaluation of the K distribution

                     S1     S2     Dim032   Dim15   All
P(K = K_true)        0.52   0.41   0.04     0.23    0.30
P(K = K_true ± 1)    1.0    0.95   0.68     0.69    0.83
P(K = K_true ± 2)    1.0    1.0    0.99     0.87    0.97
P(K = K_true ± 3)    1.0    1.0    1.0      1.0     1.0

6.2 Optimizing Parameters

Comparing the average population density index of the tuned set of parameters with each of the randomized sets yields interesting numbers. Let us define a pairwise comparison between the tuned set of parameters and each of the randomized sets as a basis for evaluation. Since 100 randomized sets of parameters were generated for each of the data sets there are 400 observations. The tuned parameters had a lower average population density index in 393 of the 400 observations. If we perform a binomial test of significance with H0: p = 0.5 and Ha: p > 0.5 we get equation 19.

If X ∼ Binom(N = 400, p = 0.5), then P(X ≥ 393) ≈ 2.43 × 10^{-106}    (19)

Thus p > 0.5 with very strong certainty. Hence it is reasonable to conclude that a tuned set of parameters performs better than a randomized one.

6.3 Validity Index Comparison

Pearson's correlation coefficient reveals an almost perfect (1.00 after rounding to two decimals) linear relationship between the within set sum of squares error and population density index for the Dim032 data set. A strong linear relationship (0.89) is found for the Dim15 set. The linearity of the relationship is significantly lower on S1 (0.58) and S2 (0.43). The population density index is based on the dimension average standard deviation. This models clusters in a symmetric way where the standard deviations of all dimensions are equal. This is a better representation for hyperspherical clusters with high dimensionality, since the standard deviation of each individual dimension affects the result less. The impact of the standard deviation of each dimension is higher for clusters with low dimensionality. The relationship between within set sum of squares error and population density index thus seems to depend on how much the shapes of clusters deviate from perfect dimensional symmetry.

7 Conclusions

Three distance measures are suggested to apply the BFR framework: Euclidean distance, Mahalanobis distance and a measure which compares the standard deviation of two merged clusters with the respective standard deviations of the unmerged clusters. Computing the within set sum of squares error requires an additional pass over a data set. The suggested population density index may be computed by regarding the current state of the clusters. Using population density index thus reduces the number of data accesses needed during the optimization. The optimization process may use the suggested population density index to compare parameter sets. Optimizing the parameters using population density index requires fewer data accesses than within set sum of squares error, assuming that the same number of error measurements are made.

The algorithm used for defining the initial points is based on picking candidates and keeping the candidate that maximizes the distance to the closest of the already chosen points. A bigger number of candidates increases the probability of a good spread of the initial points. The suggested 50/150 algorithm starts with a parameter value which is set too high. The algorithm attempts to either halve a parameter value or increase it by 50 %. Convergence is reached when almost no improvement is being made. This yields good parameter values and avoids iterating for minimal improvements. Algorithm 7 uses subsets of the data for optimizing parameters. This minimizes the number of data accesses required. Each of the training iterations thus becomes significantly cheaper.

7.1 Returning to the question

The results of all three experiments are promising. The results of the experiment described in section 3.5.1 show that the automated process does not always find the most suitable value for K. The process does, however, yield K-values that are relatively stable for different attempts on the same data set. Given the conditions of the experiments it is thus possible to automate the definition of K.

The automated process yields threshold values that are better than randomized guessing. The optimized parameters are only outperformed in 7 out of 400 cases. The conclusion is thus that the automated process finds good parameter values which are not necessarily optimal.

The comparison of the validity indices reveals that population density index is a good substitute for within set sum of squares error, in particular for problems of higher dimensionality. The index may also be used in low-dimensional problems, where it may give a less accurate representation of the within set sum of squares error. Computing population density index is of complexity O(clusters × dimensions). Computing the within set sum of squares error is O(points × clusters × dimensions). Thus an automated process may benefit from using population density index as its error representation.

The research question is: How could an automated process for clustering with BFR be defined and what results could such a process yield? An automated process is defined in Algorithm 7. This process uses the suggested implementation of BFR. Section 5 shows that this automated process results in reasonable parameter values for both K and the thresholds on all four data sets. Thus both parts of the research question are answered. An automated process could be defined as in Algorithm 7. This process yields good results on the data sets included in this study.

All experiments performed in this study were carried out on data sets that fit in main memory. The process could possibly be defined as in Algorithm 8 for data that does not fit in main memory.

Algorithm 8 The extended BFR framework
1: while True do
2:     S ← randomly chosen, not yet picked subset from data
3:     if model is not trained then
4:         train model using S
5:     update model using S
6:     if all data has been used then
7:         Assign all clusters in compress to their closest cluster in discard
8:         Assign all points in retain to their closest cluster in discard
9:         return model

7.2 Future Work

The results of this study support that an automated process may produce good results. However, a much bigger study should be performed. Alternative methods should be included, and comparisons of both the number of data accesses and the level of optimization should be made. Further studies are required to conclude on the exact level of optimization. The process presented in Algorithm 7 should be tested on a large number of data sets. Algorithm 8 should be studied using data sets that do not fit in main memory. Please contact the author if you want to pick up the work or if you have any questions regarding the study.

A Appendix A

This section includes figures of clusterings on S1 and S2. It also includes figures of clusterings on data sets which have not been included in the study. All K-values and thresholds have been defined using Algorithm 7 without changing a single value or threshold. The edges of the shaded ellipses are placed where the Mahalanobis distance is equal to 3 × √2. Note that BFR by its definition only supports axis-aligned Gaussian clusters. This becomes obvious in some of the figures of this section and also visually explains why the study is limited to Gaussian clusters. All data sets are available from the School of Computing, University of Eastern Finland[7].

Figures of clusterings are included for the following data sets: S1, S2, Aggregation (misc.), D31 (misc.), Jain (misc.), R15 (misc.), S3 (misc.), S4 (misc.) and T4.8K (misc.).

B Appendix B

The most vital functions of the experiments. Implemented using Python 3.6.3. Test K def test_k(): results = [] for i in range(100): print(i) set = get_data() slice = len(set) // 1 numpy.random.shuffle(set) subset = set[:slice] results.append(tune_params(subset)) for idx, result in enumerate(results):

print(idx, "\t", result.nof_clusters) The Training

def tune_params(points):

nof_points, dimensions = numpy.shape(points) samples, rounds = get_samples(points)

kwargs = {"mahalanobis_factor": 30000.0, "euclidean_threshold": 13371337.1337, "merge_threshold": 20000.0, "dimensions": dimensions,

"init_rounds": 1, "nof_clusters": 2} max_vals = numpy.max(points, axis=0)

min_vals = numpy.min(points, axis=0)

max_dist = bfr.ptlib.euclidean(max_vals, min_vals) initial_k = find_k(rounds, samples, **kwargs) kwargs["nof_clusters"] = initial_k

kwargs, _ = tune_param(rounds, samples, "init_rounds", 3, **kwargs) kwargs, _ = tune_param(rounds, samples, "init_rounds", 2, **kwargs) kwargs, _ = tune_param(rounds, samples, "init_rounds", 2, **kwargs) kwargs, _ = tune_param(rounds, samples, "init_rounds", 2, **kwargs) kwargs["merge_threshold"] = 1.0

kwargs["mahalanobis_factor"] = 2.0 imp, other = True, True

kwargs["euclidean_threshold"] = 13371337.0 while imp or other:

(55)

if imp: continue

kwargs, other = tune_param(rounds, samples, "merge_threshold", 1.5, **kwargs) imp, other = True, True

kwargs["euclidean_threshold"] = max_dist / 4 while imp or other:

kwargs, imp = tune_param(rounds, samples, "euclidean_threshold", 0.5, **kwargs) if imp:

continue

kwargs, other = tune_param(rounds, samples, "euclidean_threshold", 1.5, **kwargs) kwargs["mahalanobis_factor"] = 3.0

updated_k = find_k(rounds, samples, **kwargs) kwargs["nof_clusters"] = updated_k

return bfr.Model(**kwargs)

def tune_param(rounds, samples, keyword, factor, **kwargs): improved = False

initial = kwargs[keyword] half = initial * factor

larger_kw_err = avg_error(rounds, samples, **kwargs) kwargs[keyword] = half

small_kw_err = avg_error(rounds, samples, **kwargs) if small_kw_err / larger_kw_err < 0.95:

kwargs[keyword] = half improved = True

(56)

Average Error

def avg_error(rounds, samples, **kwargs):
    errors = numpy.zeros(len(samples) * rounds)
    for jdx in range(rounds):
        for idx in range(len(samples)):
            model = cluster(samples[idx], **kwargs)
            errors[idx + jdx * len(samples)] = model.error()
    return numpy.average(errors)

def cluster(points, **kwargs):
    model = bfr.Model(**kwargs)
    model.fit(points)
    model.finalize()
    return model

Elbow Method

def find_k(rounds, samples, **kwargs):
    nof_clusters = 2
    prev_error = numpy.inf
    while True:
        kwargs["nof_clusters"] = nof_clusters
        error = avg_error(rounds, samples, **kwargs)
        if prev_error == numpy.inf:
            prev_error = error
            nof_clusters += 1
            continue
        # Stop when one more cluster no longer reduces the average error by at least 5 %
        if error > prev_error * 0.95:
            return nof_clusters - 1
        nof_clusters += 1
        prev_error = error

Random Parameter Set

def random_parameters(vectors, nof_clusters):
    nof_points, dimensions = numpy.shape(vectors)
    max_vals = numpy.max(vectors, axis=0)
    min_vals = numpy.min(vectors, axis=0)
    max_dist = bfr.ptlib.euclidean(max_vals, min_vals)
    init_rounds = random.randint(1, 24)
    mahalanobis_factor = random.uniform(2.0, 3.0)
    euclidean_threshold = random.uniform(max_dist / 10, max_dist / 2)
    merge_threshold = random.uniform(0.0, 1.0)
    kwargs = {'mahalanobis_factor': mahalanobis_factor,
              'euclidean_threshold': euclidean_threshold,
              'merge_threshold': merge_threshold,
              'dimensions': dimensions,
              'init_rounds': init_rounds,
              'nof_clusters': nof_clusters}
    return kwargs

Test Tuning

def test_tuning(points):
    res = []
    for i in range(100):
        numpy.random.shuffle(points)
        # Expects a variant of tune_params that also returns the tuned error and inertia
        tuned, t_err, t_inertia = tune_params(points)
        errors = []
        randoms = []
        for j in range(100):
            rand = random_parameters(points, tuned.nof_clusters)
            randoms.append(rand)
            err = avg_error(rounds=100, samples=[points], **rand)
            inertia = avg_inertia(rounds=100, samples=[points], **rand)
            errors.append((err, inertia))
        errors.append((t_err, t_inertia))
        #res.append(find_smallest(errors))
        for idx, error in enumerate(errors):
            print(idx, "\t", error)  # presumably prints each result, as in test_k
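As a usage sketch, the tuned model returned by tune_params could be applied to a full data set as follows. The loading step and file name are illustrative assumptions, and the helper functions referenced above (for example get_samples) are assumed to be available; only bfr.Model, fit, finalize, error and nof_clusters are taken from the appendix code.

points = numpy.loadtxt("s1.txt")   # illustrative; any (n, d) numpy array works
model = tune_params(points)        # bfr.Model with parameters chosen by the process
model.fit(points)
model.finalize()
print(model.nof_clusters, model.error())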

References

[1] J. Leskovec, A. Rajaraman and J. D. Ullman, "Clustering," in Mining of Massive Datasets, Cambridge University Press, Cambridge, 2014, pp. 241-280.

[2] P. S. Bradley, U. M. Fayyad and C. A. Reina, "Scaling Clustering Algorithms to Large Databases," Knowledge Discovery and Data Mining, vol. 2, 1998.

[3] R. Xu, D. C. Wunsch and IEEE Computational Intelligence Society, Clustering. IEEE Press/Wiley, Hoboken, NJ, 2009.

[4] S. Bandyopadhyay and S. Saha, Unsupervised Classification: Similarity Measures, Classical and Metaheuristic Approaches, and Applications. Springer Berlin Heidelberg, Heidelberg, 2013.

[5] D. R. Aronson, Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals. Hoboken, NJ, 2007.

[6] J. D. Atkinson, "Qualitative Methods," in Journey into Social Activism: Qualitative Approaches, Fordham University Press, New York, 2017.

[7] P. Fränti, "Clustering datasets," 2015. [Online]. Available: http://cs.uef.fi/sipu/datasets/. [Accessed: 11-May-2018].

[8] D. T. Larose and D. Chantal, "Hierarchical and k-Means Clustering," in Wiley Series on Methods and Applications in Data Mining, Hoboken, NJ, 2014, pp. 209-227.

[9] E. Szmidt, Distances and Similarities in Intuitionistic Fuzzy Sets. Springer International Publishing, Cham, 2014.

[10] D. Forsyth, Probability and Statistics for Computer Science. Springer International Publishing, Cham, 2018.

[11] G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. Springer, New York, 2013.

[12] C. Zhang and Y. Ma, Ensemble Machine Learning. Springer US, Boston, 2012.

[13] L. Breiman, "Pasting Small Votes for Classification in Large Databases and On-Line," Machine Learning, vol. 36, no. 1, pp. 85-103, 1999.

[14] J. Leskovec, A. Rajaraman and J. D. Ullman, "Clustering," 2014. [Online]. Available: http://www.mmds.org/mmds/v2.1/ch07-clustering.pdf. [Accessed: 14-May-2018].

[15] UN General Assembly, "Transforming our world: the 2030 Agenda for Sustainable Development," 2015. [Online]. Available: http://www.un.org/ga/search/view_doc.asp?symbol=A/RES/70/1&Lang=E. [Accessed: 20-May-2018].

[16] M. A. Syakur, B. K. Khotimah, E. M. S. Rochman and B. D. Satoto, "Integration K-Means Clustering Method and Elbow Method For Identification of The Best Customer Profile Cluster," IOP Conference Series: Materials Science and Engineering, vol. 336, p. 012017, 2018.

[17] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035, 2007.

[18] J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[19] J. Bergstra, R. Bardenet, Y. Bengio and B. Kégl, "Algorithms for Hyper-Parameter Optimization," 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), 2011.

[20] J. Leskovec, A. Rajaraman and J. D. Ullman, "Frequent Itemsets," in Mining of Massive Datasets, Cambridge University Press, Cambridge, 2014, pp. 201-238.

[21] M. Claesen and B. De Moor, "Hyperparameter Search in Machine Learning," The XI Metaheuristics International Conference, Agadir, Morocco, 2015.

[22] P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1875-1881, 2006.

[23] P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems," Pattern Recognition, vol. 39, pp. 761-765, 2006.

[24] I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering," Pattern Recognition, vol. 40, no. 3, pp. 784-795, 2007.

[25] A. Gionis, H. Mannila and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, pp. 1-30, 2007.

[26] G. L. Shevlyakov and H. Oja, Robust Correlation: Theory and Applications. Wiley, Chichester, 2016.

[27] B. Custers, T. Calders and B. Schermer, Discrimination and Privacy in the Information Society: Data Mining and Profiling in Large Databases. Springer Berlin Heidelberg, Heidelberg, 2013.
