
Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

Using Semi-supervised Clustering for Neurons Classification

by

Ali Fakhraee Seyedabad

LIU-IDA/LITH-EX-A--13/019--SE

2013-04-26


Linköpings universitet Institutionen för datavetenskap

Final thesis

Using Semi-supervised Clustering for Neurons Classification

by

Ali Fakhraee Seyedabad

LIU-IDA/LITH-EX-A--13/019--SE

2013-04-26

Supervisor: Jose M. Pena

Examiner: Patrick Lambrix


Abstract

We wish to understand the brain and to discover its sophisticated ways of computation in order to invent improved computational methods. To decipher any complex system, its components should first be understood. The brain comprises neurons.

Neurobiologists use morphologic properties like “somatic perimeter”, “axonal length”, and “number of dendrites” to classify neurons. They have discerned two types of neurons, “interneurons” and “pyramidal cells”, and have a consensus about five classes of interneurons: PV, 2/3, Martinotti, Chandelier, and NPY. They still need a more refined classification of interneurons because they suppose its known classes may contain subclasses or new classes may arise. This is a difficult process because of the great number and diversity of interneurons and the lack of objective indices to classify them.

Machine learning, automatic learning from data, can overcome the mentioned difficulties, but it needs a data set to learn from. To meet this demand, neurobiologists compiled a data set by measuring 67 morphologic properties of 220 interneurons of mouse brains; they also labeled some of the samples, i.e. added their opinion about the samples' classes.

This project aimed to use machine learning to determine the true number of classes within the data set, the classes of the unlabeled samples, and the accuracy of the available class labels. We used K-means, seeded K-means, constrained K-means, and clustering validity techniques to achieve our objectives. Our results indicate that the data set contains seven classes; seeded K-means outperforms K-means and constrained K-means; and Chandelier and 2/3 are the most consistent classes, whereas PV and Martinotti are the least consistent ones.


Contents

1 Introduction
2 Literature Review
  2.1 Machine Learning
  2.2 Supervised Learning
  2.3 Unsupervised Learning
    2.3.1 K-means
      2.3.1.1 Distance Metrics
      2.3.1.2 Drawbacks of K-means
  2.4 Semi-supervised Learning
    2.4.1 Semi-supervised Classification
    2.4.2 Semi-supervised Clustering
      2.4.2.1 Seeded K-means
      2.4.2.2 Constrained K-means
  2.5 Standardization
  2.6 Clustering Validity
    2.6.1 Selecting the Number of Clusters
3 Methodology
  3.1 Data Set Description
  3.2 Methods and Algorithms
4 Results and Discussion
  4.1 Selecting the Number of Clusters
  4.2 Clustering the Data Set
  4.3 Examining the Accuracy of Class Labels
5 Summary
  5.1 Summary
  5.2 Future Work


List of Figures

1.1 Neurons in the neocortex can be divided into two classes, interneurons and pyramidal cells. The left figure illustrates an interneuron from a mouse neocortex, and the one on the right a pyramidal cell [22].
2.1 The classification of machine learning.
2.2 Diversity of clusters. The seven clusters in (a) (denoted by seven different colors in (b)) differ in shape, size, and density. Although these clusters are apparent to a data analyst, none of the available clustering algorithms can detect all these clusters [12].
2.3 (left) Euclidean distances, (right) Mahalanobis distances.
2.4 Different types of data sets and learning problems. Dots correspond to points without any labels. Points with labels are denoted by plus signs, asterisks, and crosses. (a) and (b) illustrate supervised and unsupervised learning respectively. In (d), the must-link and cannot-link constraints are denoted by solid and dashed lines, respectively. (c) and (d) exemplify semi-supervised learning tasks [13].
2.5 Example of supervised and semi-supervised learning in binary classification problems. (a) A data set with two classes that needs to be classified using labeled data. (b) The same data is now captured along with unlabeled data represented as green dots. (c) A supervised learning technique may choose an incorrect decision boundary. (d) Given the extra unlabeled data points, the classification plane can be placed in a region of sparse population, leading to better classification [20].
2.6 (a) A data set with 3 natural clusters; (b) the results of K-means with 4 clusters [9].
2.7 Plot of residual sum of squares (RSS) and Hubert Γ statistic vs. number of clusters (K). The elbow, which can be seen at K=7, suggests that the true number of clusters is 7.
4.1 Residual sum of squares (RSS) vs. number of clusters (K) when Euclidean distance was used.
4.2 Residual sum of squares (RSS) vs. number of clusters (K) when Mahalanobis distance was used.
4.3 Hubert Γ statistic (Γ) vs. number of clusters (K) when Euclidean distance was used.
4.4 Averaged similarity, calculated by Rand Index (RI), between the clusterings produced by Euclidean K-means, seeded K-means and constrained K-means.


List of Tables

3.1 Summary of the data set.
3.2 Data set's variables [15].
4.1 Results of selecting the number of clusters.
4.2 Results, Mean (Standard Deviation), of clustering the whole data set.
4.3 Results, Mean (Standard Deviation), of clustering the labeled part of the data set.
4.4 The number of each class's samples that end in the same cluster when the whole data set was clustered.
4.5 The number of each class's samples that end in the same cluster when the labeled part of the data set was clustered.
4.6 Summary of examining the classes' uniformities.


Chapter 1

Introduction

Understanding the brain will answer interesting questions, such as how a human brain, like that of Garry Kasparov, can be competitive with a computer such as Deep Blue. It will also guide us to invent improved computational methods. In their quest to decipher the brain, scientists discovered that a special part of it, the “neocortex”, is responsible for sophisticated activities such as learning, mathematical calculations, and planning; the neocortex is assumed to be the root of human intelligence [7, 19, 24].

Modern computers use just a few processing units (CPUs), whereas the neocortex benefits from the cooperation of a hundred billion (10^11 [18]) tiny processing units: neurons. To understand any system, first its components should be understood (then how they connect and collaborate with each other), so to understand the neocortex, first neurons should be understood.

Neurobiologists use morphologic¹ properties like somatic perimeter, axonal length, and number of dendrites to classify neurons. They have discerned two classes of neurons, interneurons and pyramidal cells (Figure 1.1 illustrates an interneuron and a pyramidal cell of a mouse brain), and have a consensus about five classes of interneurons: NPY, 2/3, Chandelier, Martinotti, and PV. They suppose these classes might contain subclasses or new classes may exist, so they still need a more refined classification of interneurons [22].

Most of the non-automatic efforts to morphologically classify interneurons have failed because of the great number and diversity of interneurons and the lack of objective indices [22]. Advanced technologies simplify compiling precise morphologic data of more neurons, and enable experts to use machine learning (automatic learning from data), which is more sophisticated and objective than their opinion, to classify neurons.

¹Morphology: the branch of biology that deals with the form of living organisms, and with relationships between their structures.


A group of neurobiologists at Columbia University [22] compiled a data set from measuring 67 morphologic features of 220 interneurons of the mouse neocortex; they also labeled some of the samples—i.e. added their opinion about the samples’ classes. This project aimed to use machine learning techniques to determine the true number of classes within this data set, the classes of the unlabeled samples, and the accuracy of the available class labels.

Chapter 2 reviews the literature of machine learning with focus on clustering and semi-supervised clustering; Chapter 3 describes the data set and the methods used to analyze it; Chapter 4 contains the results and discussions; and Chapter 5 summarizes the project and future work.

Figure 1.1: Neurons in the neocortex can be divided into two classes, interneurons and pyramidal cells. The left figure illustrates an interneuron from a mouse neocortex, and the one on the right a pyramidal cell [22].


Chapter 2

Literature Review

This chapter introduces machine learning and its categories, namely supervised, unsupervised, and semi-supervised learning; the subcategories of semi-supervised learning, namely semi-supervised classification and semi-supervised clustering; the subcategories of semi-supervised clustering, namely semi-supervised clustering using class labels and semi-supervised clustering using pairwise constraints (Figure 2.1 illustrates this classification of machine learning); standardization; and clustering validity.

Figure 2.1: The classification of machine learning.

First, I explain some terms to ease understanding of the rest of the report.

Data Set: A set of objects (samples), each described by measuring a set of features; an object is a vector of measurements. For example, a data set of students comprises samples each described by features like name, birth date, and number of courses taken. The objects of a data set can be called samples, observations, instances, data points, points, etc.; similarly, features can be called attributes, variables, dimensions, etc.

2.1 Machine Learning

Machine learning is the process of automatic learning from a data set. Assume a data set of texts for which we measured the frequencies of the words in each text in the set. We also labeled the texts with categories like historical, political, and entertaining. The process of automatically learning a classifier from the data set to predict the labels of unlabeled texts exemplifies a machine learning task. Another example is to find different groups in a data set of a college's students.

Machine learning can be classified, according to the data set that it uses, into supervised, unsupervised, and semi-supervised learning.

2.2 Supervised Learning

Supervised learning, also called classification, uses background knowledge (class labels known before learning) to supervise the learning process. The class label, which is a feature of a data set, assigns objects to predefined categories. Humans must label the objects (like written works, neurons, or images) of a data set.

Classification is a two-phase task: in the first phase, training, labeled examples train a classifier; in the second phase, testing, unlabeled examples test the generalization power of the classifier. During testing, the classifier tries to predict the labels of objects that were hidden from it during training. The performance of a classifier is the number of objects it classifies correctly. If it performs well enough, it can reliably classify future samples.

Classification needs two sets, one for training and one for testing: the train set and the test set. Dividing the data set into two sets wastes some information, i.e. the samples of the test set, during training. k-fold cross validation avoids this waste: it divides the data set into k sets and repeats the training and testing k times; each time it separates one set for testing and uses the rest for training. So, unlike the previous method, which divides the data set into two sets, k-fold cross validation uses each sample k−1 times for training and once for testing; it produces k classifiers and selects the one with the best performance.
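The split-and-evaluate loop described above is easy to express in code. The following is a minimal sketch of k-fold cross validation, where the generic `fit` and `predict` callables stand in for whatever classifier is being evaluated (these names are illustrative, not from the thesis).

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    # split the sample indices into k roughly equal, shuffled folds
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def k_fold_accuracy(X, y, fit, predict, k=10):
    # each sample is used k-1 times for training and once for testing
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))
```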

2.3 Unsupervised Learning

Unlike supervised learning, which knows class labels, unsupervised learning lacks class labels, for different reasons; sometimes we simply lack the class labels, and sometimes we lack objective indices or agreement about the classes, as in the case of the interneurons (Chapter 1).


Unsupervised learning, also called clustering or cluster analysis, lacks supervision in the form of class labels. Clustering detects the groups (clusters) of a data set's objects. Clusters have different shapes, sizes, and densities (see Figure 2.2), but no single algorithm can detect all kinds of clusters simultaneously [12].

The next section describes K-means, the most prevalent clustering algorithm ([12] reviews clustering).

Figure 2.2: Diversity of clusters. The seven clusters in (a) (denoted by seven different colors in (b)) differ in shape, size, and density. Although these clusters are apparent to a data analyst, none of the available clustering algorithms can detect all these clusters [12].

2.3.1 K-means

K-means receives the number of clusters, K, as an input and categorizes the objects of a data set into K clusters: the objects within a cluster are closest together and farthest from those of the other clusters. Cluster centers, the means of the objects within each cluster, $\vec{\mu}_k = \frac{1}{|c_k|} \sum_{\vec{x}_i \in c_k} \vec{x}_i$, represent the clusters.

K-means improves a random first partition iteratively: at each iteration it assigns objects to the closest clusters and calculates the new cluster centers; the process repeats until no more changes happen (see Algorithm 1). K-means minimizes the residual sum of squares, the sum of squared distances between all the objects and their cluster centers:

$$\sum_{k=1}^{K} \sum_{\vec{x}_i \in c_k} [d(\vec{x}_i, \vec{\mu}_k)]^2. \qquad (2.1)$$

The frequent use of terms like close and far in this section implies the necessity of distance metrics, i.e. measures to calculate distance.


Algorithm 1 K-means algorithm
Input: Unlabeled data set X; number of clusters K.
  select the K first cluster centers randomly
  repeat
    for all objects x do
      assign x to the closest cluster
    end for
    calculate the new cluster centers
  until no more changes happen
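As a concrete illustration of Algorithm 1, here is a minimal sketch in Python/NumPy (illustrative only, not the implementation used in the thesis); it returns the cluster labels, the centers, and the residual sum of squares of Equation 2.1.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means (Algorithm 1): random initial centers, then alternate
    between assigning objects and recomputing centers until nothing changes."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # assign every object to the closest cluster center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # recompute each center as the mean of its cluster (skip empty clusters)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    rss = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, rss
```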

2.3.1.1 Distance Metrics

Assume $\vec{x}$, $\vec{y}$ and $\vec{z}$ are objects; then d is a distance metric if:

• $d(\vec{x}, \vec{y}) \geq 0$;
• $d(\vec{x}, \vec{y}) = d(\vec{y}, \vec{x})$;
• $d(\vec{x}, \vec{y}) = 0 \Leftrightarrow \vec{x} = \vec{y}$;
• $d(\vec{x}, \vec{z}) \leq d(\vec{x}, \vec{y}) + d(\vec{y}, \vec{z})$.

K-means often uses the Euclidean distance, which calculates the distance between two objects as:

$$d_{euclidean}(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^T (\vec{x} - \vec{y})}. \qquad (2.2)$$

The Euclidean distance gives the same weight, equal to 1, to all the features, so it supposes that the clusters within a data set are spherical. The Euclidean distance misleads us when the clusters are elliptical rather than spherical. For example, in Figure 2.3 (left) the Euclidean distance assesses the distance of object 1 from the origin to be the same as the distance of object 2 from the origin, whereas object 1 is farther from the bulk of the cloud. A more precise distance metric should evaluate the distance of object 1 to the origin as larger than the distance of object 2 to the origin. Figure 2.3 (right) shows Mahalanobis distances. The Mahalanobis distance of an object $\vec{x}$ from a cluster of objects with mean $\vec{\mu}$ and covariance matrix S is:

$$d_{mahalanobis}(\vec{x}, \vec{\mu}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1} (\vec{x} - \vec{\mu})}. \qquad (2.3)$$

The Mahalanobis distance with the identity matrix, instead of the covariance matrix of the cluster, is the Euclidean distance. (For an overview of distance metrics and the Mahalanobis distance see [4, 6, 16].)
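A small sketch (illustrative, not from the thesis) makes the difference concrete: the two test points below are equally far from the cluster mean in Euclidean terms, but the Mahalanobis distance, which accounts for the cluster's elongated covariance, judges the point lying off the cloud's main axis to be much farther.

```python
import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt((x - y) @ (x - y)))

def mahalanobis(x, mu, S):
    # distance of x from a cluster with mean mu and covariance matrix S;
    # with S equal to the identity matrix this reduces to the Euclidean distance
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

# a toy cluster stretched along the first axis
cluster = np.array([[4.0, 0.2], [-4.0, -0.2], [2.0, -0.1], [-2.0, 0.1]])
mu, S = cluster.mean(axis=0), np.cov(cluster, rowvar=False)
print(euclidean([0, 1], mu), mahalanobis([0, 1], mu, S))  # same Euclidean distance...
print(euclidean([1, 0], mu), mahalanobis([1, 0], mu, S))  # ...but very different Mahalanobis
```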


Figure 2.3: (left) Euclidean distances, (right) Mahalanobis distances.

2.3.1.2 Drawbacks of K-means

K-means is simple, so it is the most popular clustering algorithm. But it suffers from two main drawbacks: it finds suboptimal solutions, and the quality of the result is highly dependent on the quality of the first partition, so a poor initial partition leads to a poor one at the end. A method that partly overcomes these problems runs K-means many times and selects the best solution, i.e. the partition with the least residual sum of squares.
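The restart strategy is a few lines around the `kmeans` sketch from Section 2.3.1 (again an illustrative sketch, not the thesis code):

```python
def kmeans_best_of(X, k, n_runs=100):
    # run K-means from many random initial partitions and keep the
    # clustering with the least residual sum of squares
    best = None
    for seed in range(n_runs):
        labels, centers, rss = kmeans(X, k, seed=seed)  # kmeans as sketched in Section 2.3.1
        if best is None or rss < best[2]:
            best = (labels, centers, rss)
    return best
```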

2.4 Semi-supervised Learning

Semi-supervised machine learning partially knows a data set, so it lies between classification and clustering. Partial knowledge means a partially labeled data set that consists of both labeled and unlabeled objects, or an unlabeled data set with some extra information about it.

Usually a partially labeled data set consists of many unlabeled objects (because they are cheap) and few labeled ones (because they are expensive). Some semi-supervised tasks use unlabeled data sets with extra must-link constraints (which pairs of objects should be in the same cluster) and cannot-link constraints (which pairs of objects should be in different clusters). Figure 2.4 illustrates the different types of data sets and machine learning problems.

A semi-supervised task can use unlabeled objects to improve a classification (semi-supervised classification), or labeled objects to guide a clustering (semi-supervised clustering).


Figure 2.4: Different types of data sets and learning problems. Dots correspond to points without any labels. Points with labels are denoted by plus signs, asterisks, and crosses. (a) and (b) illustrate supervised and unsupervised learning respectively. In (d), the must-link and cannot-link constraints are denoted by solid and dashed lines, respectively. (c) and (d) exemplify semi-supervised learning tasks [13].


2.4.1 Semi-supervised Classification

Classification needs to know a data set fully, so a semi-supervised task needs labeled objects for all classes to improve a classification. This kind of semi-supervised task is actually a classification combined with unlabeled objects, which provide extra information about the distributions of the classes, to increase the accuracy of the classifier (Figure 2.5). (For more information about semi-supervised classification see [3].)

Figure 2.5: Example of supervised and semi-supervised learning in binary classification problems. (a) A data set with two classes that needs to be classified using labeled data. (b) The same data is now captured along with unlabeled data represented as green dots. (c) A supervised learning technique may choose an incorrect decision boundary. (d) Given the extra unlabeled data points, the classification plane can be placed in a region of sparse population, leading to better classification [20].

2.4.2 Semi-supervised Clustering

A lack of labeled objects for some classes, or extra information only in the form of pairwise constraints, prevents semi-supervised classification. In these cases, semi-supervised clustering guides clustering processes with labeled objects or pairwise constraints to find better clusters [1, 2].

Labeled objects and pairwise constraints (must-link and cannot-link) are equivalent: objects with the same class label produce must-link constraints and objects with different class labels produce cannot-link constraints. n labeled objects produce n(n−1)/2 pairwise constraints, which is a large number (e.g. our data set has 115 labeled objects, which produce 6555 constraints), so we need techniques to extract meaningful constraints or algorithms to directly cluster partially labeled data sets. Semi-supervised clustering consists of two groups of algorithms: algorithms that use pairwise constraints [2, 5, 23] and algorithms that use labeled objects [1].

The next two sections describe the algorithms from the second group that the project used.

2.4.2.1 Seeded K-means

K-means suffers from the random selection of the first cluster centers, i.e. poor selections result in poor partitions (see Section 2.3.1.2). Seeded K-means, a semi-supervised K-means, clusters partially labeled data sets and, instead of blind random selection, uses the labeled objects to estimate the first cluster centers [1]. It selects the first cluster centers as the means of the labeled objects. Algorithm 2 shows the seeded K-means algorithm.

Algorithm 2 Seeded K-means
Input: Partially labeled data set X = {S ∪ U | S: set of the labeled objects, U: set of the unlabeled objects}; number of clusters K.
  for all clusters c_l do
    if there are objects ∈ S labeled with l then
      select the center of c_l as the mean of the objects in S labeled with l
    else
      select the center of c_l randomly
    end if
  end for
  repeat
    for all objects x do
      assign x to the closest cluster
    end for
    calculate the new cluster centers
  until no more changes happen
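The seeding step of Algorithm 2 can be sketched as follows (illustrative Python/NumPy, not the thesis code; `seed_labels` is assumed to hold a class index for labeled objects and -1 for unlabeled ones):

```python
import numpy as np

def seeded_centers(X, seed_labels, k, seed=0):
    """Initial centers for seeded (and constrained) K-means: the center of
    cluster l is the mean of the objects labeled l; a cluster without any
    seeds starts at a randomly chosen object instead."""
    X = np.asarray(X, dtype=float)
    seed_labels = np.asarray(seed_labels)
    rng = np.random.default_rng(seed)
    centers = np.empty((k, X.shape[1]))
    for l in range(k):
        seeds = X[seed_labels == l]
        centers[l] = seeds.mean(axis=0) if len(seeds) else X[rng.integers(len(X))]
    return centers
```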

2.4.2.2 Constrained K-means

Seeded K-means uses the class labels only in the first step; it may change them in the following steps when it assigns objects to the closest clusters. Sometimes the class labels are precise and trustworthy, and it is better to keep them unchanged until the end of the clustering. Constrained K-means uses the class labels to estimate the first cluster centers (similar to seeded K-means) and keeps them unchanged throughout the clustering [1]. Algorithm 3 shows constrained K-means.

Algorithm 3 Constrained K-means
Input: Partially labeled data set X = {S ∪ U | S: set of the labeled objects, U: set of the unlabeled objects}; number of clusters K.
  for all clusters c_l do
    if there are objects ∈ S labeled with l then
      select the center of c_l as the mean of the objects in S labeled with l
    else
      select the center of c_l randomly
    end if
  end for
  repeat
    for all objects x ∉ S do
      assign x to the closest cluster
    end for
    calculate the new cluster centers
  until no more changes happen

Constrained K-means is better when class labels are trustworthy and seeded K-means is better when class labels are inaccurate.
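A compact sketch of Algorithm 3 (again illustrative, reusing the `seeded_centers` helper sketched in Section 2.4.2.1): labeled objects keep their class as their cluster for the whole run, while unlabeled objects are reassigned at every iteration and all objects contribute to the center updates.

```python
import numpy as np

def constrained_kmeans(X, seed_labels, k, n_iter=100):
    """Constrained K-means sketch: labeled objects (seed_labels >= 0) keep
    their class as their cluster; only unlabeled objects (-1) are reassigned."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(seed_labels).copy()
    centers = seeded_centers(X, labels, k)  # helper sketched in Section 2.4.2.1
    unlabeled = labels < 0
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = labels.copy()
        new_labels[unlabeled] = dists[unlabeled].argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):  # centers are recomputed over all objects
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```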

2.5 Standardization

Features of a data set with ranges much larger than the others dominate the Euclidean distance; standardization (also called normalization) rescales the features so that, in effect, they are weighted in inverse proportion to their spread.

One method, the z-score transformation, centers the features and divides them by their standard deviations; another divides them by their ranges. Experiments show that the second method helps the clustering algorithms to do better than the first one [17].
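Both transformations are one-liners in NumPy; the sketch below (illustrative, not the thesis code) guards against constant features to avoid division by zero.

```python
import numpy as np

def normalize_by_range(X):
    # divide each feature by its range (max - min); constant features are
    # left unchanged to avoid division by zero
    X = np.asarray(X, dtype=float)
    ranges = X.max(axis=0) - X.min(axis=0)
    ranges[ranges == 0] = 1.0
    return X / ranges

def z_score(X):
    # center each feature and divide it by its standard deviation
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std
```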

2.6 Clustering Validity

External validity [9]: the true partition P can validate a clustering C of a data set. For each pair of objects one of the following statements holds:

• SS: if both objects are in the same cluster of C and in the same group of P.

• SD: if the objects are in the same cluster of C and in different groups of P.


• DS: if the objects are in different clusters of C and in the same group of P.

• DD: if the objects are in different clusters of C and in different groups of P.

Assume a, b, c and d are the numbers of occurrences of SS, SD, DS and DD; then the Rand Index = (a + d)/(a + b + c + d) and the Jaccard Coefficient = a/(a + b + c) quantify the similarity between the partition P and the clustering C. They vary between 0 and 1, and the more similar C and P are, the larger they are [9].
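Counting the four pair types directly gives both indices; the following sketch (illustrative, not the thesis code) follows the definitions above.

```python
from itertools import combinations

def pair_counts(clustering, partition):
    # count SS, SD, DS, DD over all pairs of objects
    a = b = c = d = 0
    for i, j in combinations(range(len(clustering)), 2):
        same_c = clustering[i] == clustering[j]
        same_p = partition[i] == partition[j]
        if same_c and same_p:
            a += 1   # SS
        elif same_c:
            b += 1   # SD
        elif same_p:
            c += 1   # DS
        else:
            d += 1   # DD
    return a, b, c, d

def rand_index(clustering, partition):
    a, b, c, d = pair_counts(clustering, partition)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(clustering, partition):
    a, b, c, _ = pair_counts(clustering, partition)
    return a / (a + b + c)
```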

Relative validity [10]: true partitions are often missing, so the residual sum of squares (Section 2.3.1) or other indices such as the Hubert Γ statistic can rate the different clusterings of a data set, which result from different algorithms or from the same algorithm with different parameters such as the number of clusters or the distance metric. The Hubert Γ statistic calculates the clusters' compactness as:

$$\Gamma = \sum_{i=1}^{n} \sum_{j=i+1}^{n} d(\vec{x}_i, \vec{x}_j) \times d(\vec{\mu}_{c_i}, \vec{\mu}_{c_j}); \qquad (2.4)$$

n is the number of the data set's objects, $d(\vec{x}_i, \vec{x}_j)$ the distance between $\vec{x}_i$ and $\vec{x}_j$, and $d(\vec{\mu}_{c_i}, \vec{\mu}_{c_j})$ the distance between the cluster centers of $\vec{x}_i$ and $\vec{x}_j$. Similarity between $d(\vec{x}_i, \vec{x}_j)$ and $d(\vec{\mu}_{c_i}, \vec{\mu}_{c_j})$ implies the compactness of the clusters $c_i$ and $c_j$.

Both the residual sum of squares and the Hubert Γ statistic evaluate the clusters' compactness. The more compact and separate the clusters are, the smaller the residual sum of squares and the larger the Hubert Γ statistic.
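Equation 2.4 translates directly into code; the sketch below (illustrative) takes the labels and centers produced by any of the K-means variants.

```python
import numpy as np
from itertools import combinations

def hubert_gamma(X, labels, centers):
    # Hubert Γ statistic (Equation 2.4): for every pair of objects, multiply
    # their distance by the distance between their cluster centers and sum up
    X, centers = np.asarray(X, dtype=float), np.asarray(centers, dtype=float)
    gamma = 0.0
    for i, j in combinations(range(len(X)), 2):
        gamma += np.linalg.norm(X[i] - X[j]) * np.linalg.norm(centers[labels[i]] - centers[labels[j]])
    return gamma
```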

2.6.1 Selecting the Number of Clusters

K-means needs K, the number of clusters, as an input. The number of clusters plays an important role in the quality of the resultant partition. Figure 2.6 illustrates the importance of the number of clusters in the results’ quality.

Partitions with a larger number of clusters appear better because their clusters are more compact, i.e. they have a smaller residual sum of squares (or a larger Hubert Γ statistic). Steadily increasing the number of clusters decreases the residual sum of squares; at the optimal number of clusters the decrease slows down. Figure 2.7 illustrates the plot of the residual sum of squares (left) and the Hubert Γ statistic (right) against the number of clusters. In these graphs, the optimal number of clusters is 7. (They also illustrate that the residual sum of squares and the Hubert Γ statistic give similar results, i.e. they both measure the compactness of clusters, but in different ways.) This method of selecting the number of clusters is known as the elbow method.

Figure 2.6: (a) A data set with 3 natural clusters; (b) The results of k-means with 4 clusters [9].

Figure 2.7: Plot of residual sum of squares (RSS) and Hubert Γ statistic vs. number of clusters (K). The elbow, which can be seen at k=7, suggests that true number of clusters is 7.


For the optimal number of clusters, different algorithms should produce more similar clusterings than for sub-optimal numbers of clusters. Another method uses this heuristic to select as the optimal number of clusters the one at which the averaged similarity between the clusterings produced by different algorithms is highest. The Rand Index can calculate the similarity between the clusterings. (This is a heuristic that I designed and used in the project, and have not seen elsewhere.)
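The heuristic only needs the pairwise Rand Index between clusterings; a sketch (illustrative, reusing the `rand_index` function sketched earlier in this section, with hypothetical variable names):

```python
from itertools import combinations

def averaged_similarity(clusterings):
    # average Rand Index over all pairs of clusterings of the same data set,
    # e.g. from K-means, seeded K-means, and constrained K-means
    pairs = list(combinations(clusterings, 2))
    return sum(rand_index(c1, c2) for c1, c2 in pairs) / len(pairs)

# hypothetical usage: results_per_k maps each candidate K to the label vectors
# produced by the different algorithms; pick the K whose clusterings agree most
# best_k = max(results_per_k, key=lambda k: averaged_similarity(results_per_k[k]))
```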


Chapter 3

Methodology

This chapter describes the project's data set and the methods used to analyze it.

3.1 Data Set Description

The data set contains the information of 220 interneurons described by 67 morphologic variables. 115 of its samples are labeled with one of the 5 well-known classes of interneurons (NPY, PV, 2/3, Chandelier, and Martinotti), and the others are unlabeled. The 5 known classes may have subclasses or new classes may arise, so the number of classes within the data set is unknown. The unlabeled objects can belong to the known classes or to unknown ones. Table 3.1 summarizes the data set.

Table 3.1: Summary of the data set.

Class        Label   Number of samples
Unknown      0       105
NPY          1       14
PV           2       19
2/3          3       25
Chandelier   4       24
Martinotti   5       33
Total                220

All the variables of the data set are numeric. Table 3.2 contains the variables’ names and descriptions.


Table 3.2: Data set’s variables [15].

Variable Description

Variables describing the soma

Somatic perimeter (µm) Perimeter of the soma

Somatic area (µm2) Area of the soma

Somatic aspect ratio Max diameter of soma/min diameter of soma

Somatic compactness [((4/π) × Area)^(1/2)] / max diameter

Somatic form factor (4π × Area) / (Perimeter^2)

Somatic roundness (4 × Area) / (π × max diameter^2)

Variables describing axon

Axonal node total Total number of axonal nodes (branching points)

Total axonal length (µm) Sum of lengths of all axon segments, measured along tracing (not straight line distance)

Total surface area of axon (µm2) πr^2 + 2πrh, SA calculated by modeling the axon as a cylinder with diameter defined by the thickness of the segment in the reconstruction

Ratio of axonal length to surface area (1/µm) Total axonal length/total surface area of axon

Highest order axon segment Maximum number obtained after each segment is numbered by how many nodes it is removed from the initial segment

Axonal torsion ratio Total axonal length/total axonal length of fan in diagram, where the fan in diagram is a 2-D projection of the neuron constructed by compiling traces swept around a vertical axis. Torsion ratio = 1 corresponds to no loss of length; values larger than 1 correspond to the factor by which the processes have decreased in the fan in diagram

K-dim of axon Fractal dimension of the axon calculated using linear regression and the nested cubes method

Axonal polar angle average Average of polar angles of all axonal nodes. The polar angle is the angle between the 2 lines passing through the node and the endpoints of the next segments

Axonal polar angle standard deviation Standard deviation of axonal polar angles

Axonal local angle average Average of local angles of all axonal nodes. The local angle is the angle between the 2 lines passing through the node and points adjacent to the node on the two following segments

Axonal local angle standard deviation Standard deviation of axonal local angles


Axonal spline angle average Average of spline angles of all axonal nodes. The spline angle is the angle between the 2 lines passing through the node and smoothed points adjacent to the node when the following two segments are approximated by a cubic spline

Axonal spline angle standard deviation Standard deviation of axonal spline angles

Average tortuosity of axonal segments Average of tortuosities measured for each axonal segment. Segment tortuosity = distance along segment/straight line distance between segment endpoints

Standard deviation of tortuosity of axonal segments Standard deviation of tortuosities of all axonal segments

Axonal segment length average (µm) Total axonal length/number of segments

Axonal segment length standard deviation (µm) Standard deviation of axonal segment length

Average tortuosity of axonal nodes Average of tortuosities measured for each axonal node. Node tortuosity = distance along process from origin of process to node/straight line distance from origin of process to node

Standard deviation of tortuosity of axonal nodes Standard deviation of tortuosities of all axonal nodes

Number of axonal sholl sections Number of sholl sections (concentric spheres centered at the soma with radii at 100 µm intervals) containing axonal processes

Axonal sholl length at 100 µm Total length of axonal segments contained in first sholl section/total axonal length

Axonal sholl length at 200 µm Total length of axonal segments contained in second sholl section/total axonal length

Axonal sholl length at 300 µm Total length of axonal segments contained in third sholl section/total axonal length

Axonal sholl length density (µm) Total axonal length/number of axonal sholl sections

Axonal sholl node density Axonal node total/number of axonal sholl sections

Convex hull axon area (µm2) Area of the 2-D convex polygon created by connecting the distal axon segment endpoints of the 2-D projection of the neuron

Convex hull axon perimeter (µm) Perimeter of the 2-D convex polygon created by connecting the distal axon segment endpoints of the 2-D projection of the neuron

Convex hull axon volume (µm3) Volume of the 3-D convex polygon created by connecting the distal axon segment endpoints

Convex hull axon surface area (µm2) Surface area of the 3-D convex polygon created by connecting the distal axon segment endpoints


Axon node density (1/µm) Axonal node total/total axonal length

Variables describing dendrites

Number of dendrites Total number of dendrites

Dendritic node total Total number of dendritic nodes (branching points)

Total dendritic length (µm) Sum of lengths of all dendrite segments, measured along tracing (not straight line distance)

Average length of dendrites (µm) Total dendritic length/number of dendrites

Total surface area of dendrites (µm2) See total surface area of axon

Ratio of dendritic length to surface area (1/µm) See ratio of axonal length to surface area

Highest order dendrite segment See highest order axonal segment

Dendritic torsion ratio See axonal torsion ratio

K-dim dendrites See K-dim axon

Dendritic polar angle average See axonal polar angle average

Dendritic polar angle standard deviation See axonal polar angle standard deviation

Dendritic local angle average See axonal local angle average

Dendritic local angle standard deviation See axonal local angle standard deviation

Dendritic spline angle average See axonal spline angle average

Dendritic spline angle standard deviation See axonal spline angle standard deviation

Average tortuosity of dendritic segments See average tortuosity of axonal segments

Standard deviation of tortuosity of dendritic segments See standard deviation of tortuosity of axonal segments

Dendritic segment length average (µm) See axonal segment length average

Dendritic segment length standard deviation (µm) See axonal segment length standard deviation

Average tortuosity of dendritic nodes See average tortuosity of axonal nodes

Standard deviation of tortuosity of dendritic nodes See standard deviation of tortuosity of axonal nodes

Number of dendritic sholl sections Number of sholl sections (concentric spheres centered at the soma with radii at 50 µm intervals) containing dendritic processes

Dendritic sholl length at 50 µm Total length of dendritic segments contained in first sholl section/total dendritic length

Dendritic sholl length at 100 µm Total length of dendritic segments contained in second sholl section/total dendritic length

Dendritic sholl length at 150 µm Total length of dendritic segments contained in third sholl section/total dendritic length

Convex hull dendrite area (µm2) See convex hull axon area

Convex hull dendrite perimeter (µm) See convex hull axon perimeter

Convex hull dendrite volume (µm3) See convex hull axon volume

Convex hull dendrite surface area (µm2) See convex hull axon surface area

Dendrite node density (1/µm) See axon node density


Variables describing location

Relative distance to pia Distance from soma centroid to pia/distance between pia and white matter

3.2 Methods and Algorithms

Clustering and semi-supervised clustering algorithms can cluster this kind of data set, with partial knowledge about the classes (Chapter 2). To learn from the data set, K-means, seeded K-means, and constrained K-means were used.

The data set consists of variables with very different ranges, so to normalize it we divided the variables by their ranges.

The algorithms find sub-optimal clusterings (Section 2.3.1.2), so to improve the results they were run many times, up to 10^2 and 10^4 times when the Mahalanobis and Euclidean distances were used respectively, and the clustering with the least residual sum of squares was chosen. The Mahalanobis distance needs the clusters' covariance matrices (Section 2.3.1.1), so it is computationally expensive and the number of runs was limited when it was used.

Both heuristics of Section 2.6.1 were used to select the optimal number of clusters. The residual sum of squares and the Hubert Γ statistic were used in the elbow method. The Rand Index was used to calculate the averaged similarity between the clusterings produced by the different algorithms.

The clusterings were compared with each other with respect to the residual sum of squares.

Comparing the clusterings with respect to the class labels was impossible because it needs a fully labeled data set, so to do this for our partially labeled data set I defined a new index, the number of correctly clustered labeled objects. To calculate this index, the clusters first receive labels based on the distribution of the labeled objects: a cluster is labeled with the class that has the most objects within it. The number of correctly clustered labeled objects is then the number of labeled objects that were put in a cluster with the same label.
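A sketch of this index (illustrative, not the thesis code; class label -1 is assumed to mean unlabeled):

```python
import numpy as np

def ncclo(cluster_labels, class_labels):
    """Number of correctly clustered labeled objects: each cluster is labeled
    with the most frequent class among its labeled members, and labeled
    objects whose cluster carries their own class count as correct."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = class_labels[(cluster_labels == c) & (class_labels >= 0)]
        if len(members) == 0:
            continue
        classes, counts = np.unique(members, return_counts=True)
        majority = classes[counts.argmax()]
        correct += int((members == majority).sum())
    return correct
```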

Clustering only the labeled part of the data set was another way to compare the clusterings with respect to the class labels, using the Rand Index (or the Jaccard Coefficient). To test whether the available labels are better than random labels, the number of correctly clustered labeled objects and the Rand Index were also calculated after random labels had been given to the 115 labeled samples.

To examine the accuracy of the class labels, the number of each class's samples that were clustered together was used; this index compares the experts' and the algorithms' classifications.


Chapter 4

Results and Discussion

This chapter contains the results of selecting the number of clusters, clustering the data set, and examining the accuracy of the class labels.

4.1 Selecting the Number of Clusters

Figures 4.1, 4.2, and 4.3 present the graphs for selecting the number of clusters, K, using the elbow method. The calculations of the Mahalanobis distance and the Hubert Γ statistic consume a large amount of time, so I skipped the graphs for the Hubert Γ statistic in the case of the Mahalanobis distance. The residual sum of squares and the Hubert Γ statistic measure the same quality, the compactness of clusters, in different ways (Section 2.6.1), so we gain nothing more from having the missing graphs.

The elbow in the plots of residual sum of squares (and Hubert Γ statistic) against number of clusters suggests the optimal number of clusters (Section 2.6.1). Table 4.1 summarizes the elbows of the graphs.

Table 4.1: Results of selecting the number of clusters.

                      Optimal number of clusters
                      Residual sum of squares          Hubert Γ statistic
                      Euclidean      Mahalanobis       Euclidean
K-means               7              7                 7
Seeded K-means        7              6                 7
Constrained K-means   7              7                 7

Except for the graph of Mahalanobis seeded K-means (Figure 4.2), which has its elbow at K = 6, the others have elbows at K = 7, so we concluded that the optimal number of clusters in the data set is 7.

Figure 4.1: Residual sum of squares (RSS) vs. number of clusters (K) when Euclidean distance was used.

Figure 4.2: Residual sum of squares (RSS) vs. number of clusters (K) when Mahalanobis distance was used.


Figure 4.3: Hubert Γ statistic (Γ) vs. number of clusters (K) when Euclidean distance was used.


Figure 4.4 shows the averaged similarity, calculated by the Rand Index (RI), between the clusterings produced by K-means, seeded K-means and constrained K-means for a range of Ks (from 5 to 10). The optimal number of clusters, 7 in this graph, is the one with the highest averaged similarity (Section 2.6.1).

The results of selecting the number of clusters from the two heuristics confirm each other. We expected the number of clusters to be more than 5, because some of the 5 well-known classes of interneurons may consist of subclasses or new classes may arise (Chapter 1). The results of examining the accuracy of the classes also confirm this hypothesis (Section 4.3).

Graphs 4.1, 4.2, and 4.3 also evaluate the algorithms with respect to the residual sum of squares (and the Hubert Γ statistic): seeded K-means outperforms K-means and constrained K-means. It benefits from the class labels to select the first cluster centers, so it outperforms K-means, which selects cluster centers randomly. The available class labels are inaccurate (Chapter 1), so seeded K-means, which may change the labels throughout the clustering process, outperforms constrained K-means, which guarantees to keep the labels unchanged. However, the gain is marginal, which means the class labels are only partly unreliable; the results of the last experiment (Section 4.3) confirm this (two classes are reliable).

4.2 Clustering the Data Set

Table 4.2 presents the results of clustering the whole data set. It comprises the residual sum of squares (rss), the number of correctly clustered labeled objects (ncclo), Section 3.2, and the same number when we used random labels (ncclo*).


Figure 4.4: Averaged similarity, calculated by Rand Index (RI), between the clusterings produced by Euclidean K-means, seeded K-means and constrained K-means.

Constrained K-means keeps the labels unchanged, i.e. its number of correctly clustered labeled objects (ncclo) is always the number of labeled objects (115), so we omitted its results.

Table 4.2: Results, Mean (Standard Deviation), of clustering the whole data set.

         Euclidean                             Mahalanobis
         K-means          Seeded K-means       K-means                   Seeded K-means
rss      294.33 (8.06)    291.10 (7.65)        2.96E+14 (7.67E+12)       2.92E+14 (7.31E+12)
ncclo    82.84 (8.18)     85.64 (5.00)         84.05 (7.66)              85.05 (5.41)
ncclo*   47.40 (6.14)     46.96 (6.45)         43.10 (5.99)              44.40 (7.35)

Seeded K-means outperforms K-means with respect to the residual sum of squares and the number of correctly clustered labeled objects (Table 4.2). Euclidean seeded K-means has the best result with respect to both indices (Table 4.2). The number of correctly clustered labeled objects when we use the available class labels provided by the neurobiologists (ncclo) is higher than when we use random class labels (ncclo*), Table 4.2, so the available class labels outperform random labels.

Table 4.3 presents the results of clustering the labeled part of the data set. It comprises, besides the previous indices, the similarity between the clusterings and the class labels using the Rand Index (RI), and the same number when we used random class labels (RI*). Constrained K-means keeps the labels unchanged, i.e. its similarity to the class labels using the Rand Index (RI) is always 1, so we omitted its results.

Table 4.3: Results, Mean (Standard Deviation), of clustering the labeled part of the data set.

         Euclidean                             Mahalanobis
         K-means          Seeded K-means       K-means                   Seeded K-means
rss      209.43 (8.41)    197.69               2.10E+14 (8.52E+12)       1.98E+14
RI       0.73 (0.08)      0.86                 0.73 (0.08)               0.86
RI*      0.59 (0.05)      0.58                 0.59 (0.04)               0.58
ncclo    92.67 (7.59)     92                   93.11 (7.53)              92
ncclo*   53.19 (7.32)     57                   52.40 (6.38)              57

Seeded K-means uses the labeled objects to select the first cluster centers, so it always clusters a labeled data set in the same way, regardless of the distance metric and the number of runs (Table 4.3); i.e. it starts its search from the same first cluster centers, so it always arrives at the same clustering.

Seeded K-means outperforms K-means with respect to the residual sum of squares and the Rand Index (Table 4.3).

The similarity between the clusterings and the available class labels (RI) is higher than the similarity between the clusterings and the random labels (RI*), so the available class labels outperform random labels.

The number of correctly clustered labeled objects is higher when we cluster only the labeled part of the data set, Table 4.3, compared with when we cluster the whole data set, Table 4.2.

4.3 Examining the Accuracy of Class Labels

The number of each class's samples that end in the same cluster (e.g. 11.43 of the 14 NPY samples end in the same cluster in the case of Euclidean K-means, Table 4.4) examines the accuracy of the available class labels. Tables 4.4 and 4.5 present the results of examining the accuracy of the classes when the whole data set and its labeled part were clustered. Seeded K-means always clusters the labeled part of the data set in the same way (Table 4.5).

Table 4.6 summarizes Tables 4.4 and 4.5. Mean% shows the averaged percentage over all results, and Best% shows the percentage of the best (largest) result.

2/3 and Chandelier are the most consistent classes, and PV and Martinotti are the least consistent ones (Table 4.6). The results of clustering the labeled part of the data set are better than the results of clustering the whole data set (Table 4.6).


Table 4.4: The number of each class's samples that end in the same cluster when the whole data set was clustered.

                         Euclidean                          Mahalanobis
Class        # Samples   K-means         Seeded K-means     K-means         Seeded K-means
NPY          14          11.43 (1.59)    10.27 (0.81)       11.42 (1.68)    10.25 (0.76)
PV           19          11.83 (2.45)    11.06 (2.42)       11.85 (2.48)    11.14 (2.45)
2/3          25          21.25 (4.33)    24.08 (1.62)       21.93 (3.77)    23.69 (2.18)
Chandelier   24          19.25 (1.45)    20.34 (1.93)       19.29 (1.27)    20.24 (1.83)
Martinotti   33          19.07 (4.42)    19.88 (3.86)       19.56 (4.24)    19.73 (4.12)
Values are Mean (Standard Deviation).

Table 4.5: The number of each class's samples that end in the same cluster when the labeled part of the data set was clustered.

                         Euclidean                          Mahalanobis
Class        # Samples   K-means         Seeded K-means     K-means         Seeded K-means
NPY          14          11.61 (1.56)    10                 11.65 (1.58)    10
PV           19          12.25 (2.42)    12                 12.39 (2.34)    12
2/3          25          24.47 (2.24)    25                 24.51 (2.15)    25
Chandelier   24          21.86 (2.40)    24                 21.81 (2.46)    24
Martinotti   33          22.47 (3.43)    21                 22.75 (3.26)    21
Values are Mean (Standard Deviation).

Table 4.6: Summary of examining the classes’ uniformities.

             Results of clustering the        Results of clustering the labeled
             whole data set                   part of the data set
Class        Mean%         Best%              Mean%         Best%
2/3          91            96                 99            100
Chandelier   82            85                 94            100
NPY          77            82                 79            83
PV           60            62                 64            65


Chapter 5

Summary

This chapter summarizes the project and future work.

5.1 Summary

This project aimed to use machine learning to classify a data set of mouse interneurons. The data set is partially labeled and the number of its classes is unknown, so the task is semi-supervised clustering. To classify the data set I used clustering and semi-supervised clustering algorithms.

Analyzing the data set requires a wide spectrum of decisions: choosing the techniques to normalize the data set, to cluster it, to validate the results, etc. Clustering the data set was the main part of the process. To do it I selected K-means, seeded K-means, and constrained K-means. These are partitional clustering algorithms, i.e. they divide a data set into K mutually exclusive clusters. The algorithms need the number of clusters, K. Selecting the optimal number of clusters when it is unknown is a hard task. I used two heuristics to select the optimal number of clusters: the elbow method and the averaged similarity between the clusterings.

To validate the clusterings I compared them with each other using the residual sum of squares (the objective function of K-means), and with the class labels using the Rand Index. Comparing the clusterings with the class labels needs a fully labeled data set, but our data set is partially labeled. To compare the clusterings of the whole data set with the class labels I defined a new index, the number of correctly clustered labeled objects. In addition, I also clustered only the labeled part of the data set.

The results indicate that seeded K-means outperforms K-means and constrained K-means, because seeded K-means benefits from the class labels in the first seeding, whereas K-means does it randomly; it also changes the noisy labels throughout the clustering process, whereas constrained K-means keeps them unchanged.

The results also indicate that the data set contains seven classes; this agrees with the neurobiologists' supposition that some of the 5 known classes may have subclasses or that new classes may arise.

I used the number of each class's samples that end in the same cluster of the resultant clusterings to assess the accuracy of the class labels. The results show that 2/3 and Chandelier are the most consistent classes, so the experts can trust them, whereas PV and Martinotti are the least consistent classes, so the experts may want to reconsider their samples.

5.2 Future Work

A clustering process consists of several decisions: selecting the standardization method, the algorithms, the distance metrics, the validity techniques, etc. Apart from the methods that we used during the project, other alternatives are available. This section introduces some of the other possibilities.

Feature subset selection can be used to select a subset of a data set’s features that are most relevant to the clustering [8].

Density-based clustering algorithms can be used to cluster the data set; clustering ensembles, which combine the results of several algorithms, can be used to improve the quality of the clusterings [12].

Gap Statistic [21] or Stability Measure [14] can be used to select the optimal number of clusters.

These methods are more advanced than the ones I used in the project, but I am not sure whether the results would be different.


Bibliography

[1] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised Clustering by Seeding. In International Conference on Machine Learning, pages 27–34, 2002.

[2] S. Basu, M. Bilenko, and R. Mooney. Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering. In International Conference on Machine Learning (ICML) 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 42–49, 2003.

[3] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[4] R. De Maesschalck, D. Jouan-Rimbaud, and D. Massart. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, 2000.

[5] A. Demiriz, K. Bennett, and M. Embrechts. Semi-supervised clustering using genetic algorithms. Artificial neural networks in engineering (ANNIE-99), pages 809–814, 1999.

[6] M. M. Deza and E. Deza. Encyclopedia of distances. Springer, 2009.

[7] R. Douglas and K. Martin. Neuronal circuits of the neocortex. Annual Review of Neuroscience, 27:419–451, 2004.

[8] L. Guerra. Semi-supervised subspace clustering and applications to neuroscience. PhD thesis, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, 2012.

[9] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107–145, 2001.

[10] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering validity checking methods: part ii. ACM Sigmod Record, 31(3):19–27, 2002.


[11] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Elsevier Science & Technology, 2011.

[12] A. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[13] T. Lange, M. Law, A. Jain, and J. Buhmann. Learning with constrained and unlabeled data. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 731–738. IEEE, 2005.

[14] T. Lange, V. Roth, M. Braun, and J. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.

[15] L. M. McGarry, A. M. Packer, E. Fino, V. Nikolenko, T. Sippy, and R. Yuste. Quantitative classification of somatostatin-positive neocortical interneurons identifies three interneuron subtypes. Frontiers in Neural Circuits, 4, 2010.

[16] G. McLachlan. Mahalanobis distance. Resonance, 4(6):20–26, 1999.

[17] G. Milligan and M. Cooper. A study of standardization of variables in cluster analysis. Journal of Classification, 5(2):181–204, 1988.

[18] S. Russell, P. Norvig, J. Canny, J. Malik, and D. Edwards. Artificial intelligence: a modern approach, volume 2. Prentice hall Englewood Cliffs, NJ, 1995.

[19] Serendip. Neocortex. Accessed Jan 2, 2012.

[20] A. R. Shah, C. S. Oehmen, and B.-J. Webb-Robertson. Svm-hustle - an iterative semi-supervised machine learning approach for pairwise protein remote homology detection. Bioinformatics, 24:783–790, March 2008.

[21] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.

[22] A. Tsiola, F. Hamzei-Sichani, Z. Peterlin, and R. Yuste. Quantitative morphologic classification of layer 5 neurons from mouse primary visual cortex. The Journal of Comparative Neurology, 461(4):415–428, 2003.

[23] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained K-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001.

[24] R. Wells. Cortical neurons and circuits: a tutorial introduction. Unpublished paper, www.mrc.uidaho.edu, 2005.



The publishers will keep this document online on the Internet, or its possible replacement, for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/
