DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020
A comparative study on the
unsupervised classification of rat neurons by their morphology
SABRINA CHOWDHURY ADDED KINA
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
En jämförelsestudie av
oövervakad klassificering av råttneuroners morfologi
SABRINA CHOWDHURY ADDED KINA
Degree Project in Computer Science, DD142X
Date: June 8, 2020
Supervisor: Alexander Kozlov
Examiner: Pawel Herman
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Abstract
An ongoing problem regarding the automatic classification of neurons by their
morphology is the lack of consensus between experts on neuron types. Un-
supervised clustering using persistent homology as a descriptor for the mor-
phology of neurons helps tackle the problem of bias in feature selection and
has the potential of aiding neuroscience research in developing a framework
for automatic neuron classification. This thesis investigates how two different
unsupervised machine learning algorithms would cluster persistence images
of already labeled neurons and how similar their clusterings would be. The
results showed that the clusterings done by both methods were highly simi-
lar and that there was a large variation within the neuronal types defined by
experts.
Sammanfattning
An ongoing problem concerning the automatic classification of neurons by their morphology is the lack of consensus among experts regarding neuron types. Unsupervised cluster analysis with persistent homology as a descriptor of neuronal morphology helps address the problem of bias in feature selection and could potentially benefit neuroscience in the development of a framework for the automatic classification of neurons. This thesis aimed to investigate how two different unsupervised machine learning algorithms classify persistence images of previously classified neurons, and the degree of agreement between the two methods. The results of the study showed that the results of both methods agreed to a high degree, and also showed a large variation within the classes of neurons already defined by experts.
Acknowledgements
Many thanks to our supervisor for supporting us throughout the project. We
would also like to thank everyone that helped us finish this study.
Contents

1 Introduction
  1.1 Purpose
  1.2 Problem statement
  1.3 Scope
2 Background
  2.1 The neuron
  2.2 The digital reconstruction of neurons
  2.3 NeuroMorpho.Org (NMO)
  2.4 Machine learning
    2.4.1 Affinity propagation
    2.4.2 Ward's method
    2.4.3 Clustering assessment metrics
    2.4.4 Curse of dimensionality
    2.4.5 Dimensionality reduction
  2.5 Obstacles to the automatic classification of neurons
  2.6 Topological data analysis (TDA)
    2.6.1 Persistent homology
    2.6.2 The construction of persistence barcodes and persistence diagrams
  2.7 The topological morphology descriptor (TMD)
    2.7.1 The TMD algorithm
    2.7.2 TMD classification
    2.7.3 The validity and objectiveness of TMD
  2.8 Related work
3 Method
  3.1 Data collection
  3.2 Data formatting
  3.3 Dimensionality reduction
  3.4 Unsupervised learning
  3.5 Clustering assessment
  3.6 Software
4 Results
5 Discussion
  5.1 Discussion of results
  5.2 Discussion of method
  5.3 Future improvements
6 Conclusions
Bibliography
A Source code
B Silhouette score parameter tests
Chapter 1 Introduction
The use of machine learning has led to successes in numerous fields [1, 2], one of them being neuroscience. An ongoing problem in neuroscience today is to classify neurons, or nerve cells, by their morphology, that is, by analyzing their shape and form. It is, however, not obvious how to characterize neurons, as there is no clear consensus among neuroscientists on which features should be used when defining neuron types [3]. Although a vast amount of data is available on the morphology of neurons [4], this disagreement on a framework for neuron classification raises concerns about the validity of the data, since reconstructions of neurons are labeled on subjective grounds [5]. A recent addition to the field of automatic neuronal classification that aims to reduce this subjective factor is the Topological Morphology Descriptor (TMD), which describes the overall shape of a neuron. This description can be represented as a so-called persistence image, which in turn can be used as input to different machine learning algorithms [6]. TMD has previously been shown to be able to confirm or disconfirm expert labeling of pyramidal neurons in the rat cortex [6], and it would benefit neuroscience to further investigate whether expert labeling of other neuron types holds up to scrutiny when examined by an unsupervised learning model using the TMD algorithm. The information gained from such an inquiry would also showcase the comparative performance of certain machine learning methods in the objective classification of neurons.
1.1 Purpose
The purpose of this study is to investigate to which extent two unsupervised learning methods trained on the persistence images of neurons generated by
the TMD algorithm would agree with the labeling done on those same neurons by experts. The results of this investigation would further help highlight how machine learning can be used in neuroscience with respect to the morphology of neurons.
1.2 Problem statement
The problem statement this study aims to answer is as follows:
How similar are the classifications of neurons into different morphological types made by two unsupervised learning algorithms trained on the persistence images of neurons generated by the TMD algorithm, and how similar are their classifications compared to classifications made by experts on those same neurons?
1.3 Scope
There are many possible neuron classifications to test in this study. Testing all of them would be infeasible due to time constraints, and therefore only reconstructions of rat neurons were chosen. Rat cells were chosen partly because rats were, at the time of this study, one of the most well-documented species on NeuroMorpho.Org, and partly because they had previously been shown to work with the TMD algorithm without difficulties [6].
The following six types of rat neurons were chosen: pyramidal, fast spiking, basket, medium spiny, glutamatergic and granule.¹
The unsupervised learning algorithms used in this study are affinity propagation and Ward's method.
¹ The glutamatergic cell type is a non-specific, generic cell type for excitatory neurons. Depending on which brain area they are active in, they might display different functions in the local circuits and have different morphologies. For example, if found in layers 2-4 of the neocortex they are most probably pyramidal cells. Due to this, they add some amount of noise to the classification of the pyramidal cells. A similar argument goes for the granule cell type, which can be inhibitory (GABAergic) or excitatory (glutamatergic) depending on which brain area they are located in.
Chapter 2 Background
2.1 The neuron
Neurons, or nerve cells, are "the signaling units of the nervous system" [7].
Four important morphological regions of a typical neuron are the cell body (soma), the dendrites, the axon, and the presynaptic terminals. The soma contains the nucleus, the dendrites are tree-like branches responsible for receiving signals from other neurons, and the axon is a long extension of the soma that is responsible for sending signals to other neurons [7]. A visual representation of the neuron can be seen in Figure 2.1.
Figure 2.1: An illustration of a neuron [7].
2.2 The digital reconstruction of neurons
Understanding the morphology of neurons is key to understanding the processing of information in the nervous system [8]. Digitally reconstructed neurons are often used in this regard to more closely study the morphology of a given neuron for different research tasks, and can today be made from any species, brain region and neuron type [8].
The computer was first incorporated in the process of tracing, archiving and analyzing neuronal morphology in the 1960s [9], when an analog computer was used to store point coordinates of a neuron under a microscope, given manually by a human operator [10]. In the following decades, many attempts were made to reduce the amount of manual labor in this process of digitally reconstructing neurons [10]. Although there have been significant improvements in computational power and computer vision, which have given way to many successful commercial and academic tools, most neuroanatomists struggle with the general applicability of the tools available today [10]. Therefore, digital reconstructions of neurons are often still made manually by human experts [10].
An example of a digital reconstruction of a neuron can be seen in Figure 2.2.
Figure 2.2: An example of a digital reconstruction of a neuron.¹
¹ 2020. URL: http://neuromorpho.org/neuron_info.jsp?neuron_name=int7_1_2 (visited on 05/06/2020)
2.3 NeuroMorpho.Org (NMO)
NeuroMorpho.Org (NMO) is an archive of digitally reconstructed neurons from peer-reviewed publications, accessible online [11]. The inventory is updated each month and has contributions from over 500 laboratories around the world.²
2.4 Machine learning
Machine learning uses the theory of statistics to program computers to opti- mally perform a task using previous data [12]. Two common types of machine learning are supervised and unsupervised learning. The goal in supervised learning is to learn what output will be caused by a certain input and to train the machine learning model using correct values from a supervisor. The goal in unsupervised learning is to learn how the input maps to an output without answers from a supervisor, but instead by finding patterns in the input [12].
2.4.1 Affinity propagation
Affinity propagation is an unsupervised learning algorithm which, given an input consisting of the similarities between pairs of data points, outputs clusters. Representing the data as a network with each data point as a node, the data points recursively send messages to each other to determine their affinity. The affinity quantifies how likely one data point is to see another as its exemplar, i.e. the center of a set of data points [13].
The real-valued similarities between data points are contained in the matrix s, where s(i, k) indicates how well the data point with index k is suited to be the exemplar for data point i. The diagonal values s(k, k), called preferences, influence how likely each point k is to be chosen as an exemplar [13].
The similarity between two points is the negative squared Euclidean distance between them when the goal is to minimize squared error. For two points x_i and x_k, the similarity is s(i, k) = −‖x_i − x_k‖² [13].
Once the similarities are computed, affinity propagation tries to cluster the data points iteratively by sending messages between data points. At each stage or iteration the algorithm decides which points are exemplars and which other points belong to those exemplars [13].
² http://neuromorpho.org/
The responsibility matrix r determines the best exemplar for each point.
r(i, k) is to be interpreted as how reasonable it is for point k to be the exemplar for point i, whilst considering all other possible exemplars for i. The message is sent from i to k [13].
The messages between points concerning the availability matrix a are sent in the opposite direction of r: a(i, k) is sent from candidate exemplar k to point i and indicates how reasonable it would be for i to choose k as an exemplar whilst considering the other points that should have k as their exemplar [13].
All availabilities are initialized to zero, a(i, k) = 0, and the responsibilities are subsequently computed using the rule

r(i, k) ← s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) }.

Then, the availabilities and self-availabilities are updated as follows:

a(i, k) ← min( 0, r(k, k) + Σ_{i′ ∉ {i, k}} max(0, r(i′, k)) )   for i ≠ k

a(k, k) ← Σ_{i′ ≠ k} max(0, r(i′, k))

[13].
2.4.2 Ward’s method
Ward's method is an unsupervised learning method which initially considers each point a cluster of its own and then iteratively groups the points together such that an objective loss function is minimized. As the number of groups is reduced in this way, Ward's method is a type of hierarchical clustering algorithm. Ward's method compares the similarity between points by calculating the sum of squares between each pair of points [14].
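As a concrete illustration, Ward's linkage is available in scikit-learn (the library used later in this thesis) through AgglomerativeClustering; the toy points below are made up for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two obvious groups of points; each point starts as its own cluster and
# Ward's method repeatedly merges the pair of clusters that gives the
# smallest increase in within-cluster sum of squares
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
```

Stopping the merging at n_clusters=2 recovers the two groups.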
2.4.3 Clustering assessment metrics
The clustering received from unsupervised learning methods must be assessed somehow in order to increase the belief that the constructed clusters are accurate. There are different metrics to determine how accurate a clustering is, or how similar two clusterings are. The silhouette score can achieve the former and the adjusted Rand index (ARI) the latter.
Silhouette Score
The silhouette score determines, for each point in a cluster, to what extent it belongs in that cluster. The silhouette score ranges from -1 to 1, and a score close to 1 is a good indication that the clustering is well-defined.
There are two intermediary computed values for each point i, a(i) and b(i), that are necessary to compute the silhouette score s(i). If i has been determined to belong to cluster A, then a(i) is the average distance of i to each other point in A and b(i) is the average distance of i to all the points in the cluster that is closest to A, in other words the neighbour of A. The silhouette score for the point i is computed with regards to a(i) and b(i) with the following formula:
s(i) = (b(i) − a(i)) / max(a(i), b(i))   (2.1)

[15].
Thus the average s(i) for each cluster determines how well-defined the points in that cluster are and the average s(i) of all clusters determines how well the data set as a whole has been clustered.
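The definitions above can be sketched numerically: a(i) and b(i) are computed by hand for one point according to equation (2.1), and scikit-learn's silhouette_score gives the mean s(i) over all points. The toy points and labelings are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# s(i) for point 0 by hand: a(0) = mean distance to its own cluster {1, 2},
# b(0) = mean distance to the neighbouring cluster {3, 4, 5}
d = np.linalg.norm(X - X[0], axis=1)
a0 = d[[1, 2]].mean()
b0 = d[[3, 4, 5]].mean()
s0 = (b0 - a0) / max(a0, b0)

# the mean s(i) over all points, for a good and a bad clustering
good = silhouette_score(X, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(X, [0, 1, 0, 1, 0, 1])
```

The well-separated labeling scores close to 1, while the mixed labeling scores much lower.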
Adjusted Rand index (ARI)
The Rand index, or Rand score, determines the similarity between two clustering results.
This is done by examining each pair of points in the data set and checking whether they have been placed in the same cluster by the two different clustering methods. For example, if two points a and b have been clustered together by both the first and the second clustering method, the two clusterings are more similar than if a and b were in the same cluster in the first method but not in the second.
If the two clusterings are Y and Y′, and n_ij denotes the number of points that are both in the i-th cluster of Y and the j-th cluster of Y′, the Rand index c(Y, Y′) for the two clusterings is computed as

c(Y, Y′) = ( C(N, 2) − [ (1/2){ Σ_i (Σ_j n_ij)² + Σ_j (Σ_i n_ij)² } − Σ_i Σ_j n_ij² ] ) / C(N, 2)   (2.2)

where N is the total number of points and C(N, 2) = N(N − 1)/2 is the number of point pairs [16].
The ARI is similar to the Rand index, but it also takes into consideration that data points could have been grouped together by chance. An ARI value of
1 indicates that the clusters are identical and a value of 0 indicates that all clusters in the clusterings have been randomly assigned in relation to the other clustering [17].
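These properties can be illustrated with scikit-learn's adjusted_rand_score; the label vectors are made up for the example.

```python
from sklearn.metrics import adjusted_rand_score

# identical partitions score exactly 1.0 even though the cluster
# labels themselves differ (ARI is invariant to label permutations)
perfect = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])

# a partition unrelated to the reference scores around 0 (it can
# even be negative for partitions worse than chance)
unrelated = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])
```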
2.4.4 Curse of dimensionality
The curse of dimensionality refers to different problems that can occur when performing various experiments in high-dimensional space that would not occur in low-dimensional space.
In clustering problems, the result is very much dependent on having a meaningful distance metric that can tell two data points apart. The effects of the curse of dimensionality have been studied extensively for an array of problems, and research has shown that distance metrics may be less meaningful in high-dimensional spaces [18]. The quality of similarity measures tends to decrease as the dimensionality of the data increases [19]. This is a consequence of the distance between two points growing significantly as the number of dimensions increases. This in turn results in the data becoming very sparse, which introduces large technical challenges as traditional mathematical approaches are not always applicable [20].
2.4.5 Dimensionality reduction
Dimensionality reduction aims at reducing the number of variables in data and thereby at tackling the curse of dimensionality [21].
Principal component analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique which increases the interpretability of data while minimizing information loss [22]. PCA effectively reduces the number of features describing a data set by constructing a new representation of the data consisting of new, uncorrelated features. In an optimal setting, a small subset of these features captures most of the variability in the data [22].
PCA is essentially a coordinate transformation where the original data has an axis for each feature [22]. PCA rotates these axes so that one axis is transformed to lie in the direction of maximum variance, another axis lies in the direction of second-most variance, and so on [22]. This new set of axes is called the principal components [22], and for machine learning tasks a set of principal components that explains most of the variation in the data can be chosen as input for e.g. clustering tasks.
T-distributed stochastic neighbor embedding (t-SNE)
T-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique that is particularly applicable for the visualization of high-dimensional data sets [23]. t-SNE works by giving each data point a location in a two- or three-dimensional map. The algorithm takes into account similarities between pairs of instances in both the high-dimensional space and the low-dimensional space, and a cost function over the two types of similarity measures is then minimized [23].
2.5 Obstacles to the automatic classification of neurons
There have previously been different approaches to classifying neurons automatically through various machine learning methods [6, 3]. These methods have usually relied on feature extraction, selecting a limited number of morphometrics, as this is more computationally feasible [24]. Only selecting a subset of features, however, results in information loss and is not optimal for categorizing neurons [24]. Feature selection is also subjective due to the human element of choosing the features that are deemed to be the best predictors for the classification task [5]. Therefore, neither feature selection nor using all features results in satisfactory classification results.
Another factor adding to the difficulty of classifying neurons is the fact that different experts label differently [6]. There is still no consensus in the research community on the number of morphologically different neuron types [6]. Due to this lack of a framework for the categorization of neurons into different types, different experts can label the same cell differently. This poses a problem, especially in supervised neuronal classification, as the labels are subjective [5]. This suggests the need for an objective approach to the classification of neurons.
2.6 Topological data analysis (TDA)
TDA is a recent field and approach to data analysis that uses algebraic topology (a subfield of mathematics focused on the study of shape) and computational geometry to study the shape of input data. TDA aims at analyzing the complex topological and underlying geometric structures in data with well-founded mathematical methods [25].
2.6.1 Persistent homology
Persistent homology is a theory in TDA that is used to reconstruct the shape of some underlying data. Applying persistent homology to data allows for the identification and extraction of interesting topological features, namely those that are stable to noise in the input data. It also allows a shape to be represented as a persistence barcode or, equivalently, as a persistence diagram, effectively describing the shape numerically [25].
2.6.2 The construction of persistence barcodes and persistence diagrams
The construction of persistence barcodes and persistence diagrams starts with a set of points that describe a given shape. Figure 2.3 illustrates a set of points that represent a random shape. Each point in the figure has a growing disk around it, and each disk grows at the same speed. In this example (Figure 2.3) all disks are "born" at the same time, and this is recorded in the persistence barcode as the birth of these points. When two disks collide, the younger of the two components "dies" and is absorbed by the older one (when all points are born at the same time, as in this example, one of the two colliding components arbitrarily survives). Each collision is recorded in the persistence barcode as the death of a certain point. Each point will thus have a time of birth and death, which can be represented in the persistence diagram, where each point represents a (birth, death) value [26].
Figure 2.3: An example of translating a set of points into a persistence diagram. At a disk radius of 0.525 some collisions have started to occur, all of which are recorded on the persistence diagram.³
In Figure 2.4 one can see that two clusters of points have formed after increasing the radius of the disks. Points that are close to each other have collided early on, and this is represented on the persistence diagram as death values close to 0.
Figure 2.4: At a disk radius of 0.782 two clusters of points have formed.⁴
In Figure 2.5 the increased radius has resulted in a collision between the two clusters seen in Figure 2.4. This collision is again recorded on the persistence diagram. The two clusters of points seen in Figure 2.4 collide relatively late, at a significantly larger radius, as their points are quite far from each other. This collision point is significantly farther from 0 than the other collision points. A gap between points in the persistence diagram, such as in this example (Figure 2.5), is therefore an indication of the extent to which the data is clustered relative to its noise.
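The birth/death bookkeeping described above can be sketched for this zero-dimensional case: when all points are born at time 0, the death times are exactly the distances at which connected components merge, which single-linkage clustering records as its merge heights. This is an illustrative sketch using SciPy with made-up points, not code from the thesis; note that in the disk picture two disks of radius r touch when their centers are 2r apart, whereas the code below works with the center distances directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# two tight pairs of points, far apart from each other
pts = np.array([[0.0, 0.0], [0.3, 0.0],
                [5.0, 5.0], [5.3, 5.0]])

# single-linkage merge heights = death times of 0-dimensional components,
# reported one merge per row in increasing order of distance
Z = linkage(pts, method="single")
deaths = Z[:, 2]

# persistence diagram points: (birth, death) with all births at 0
diagram = [(0.0, d) for d in deaths]
```

The two early deaths at distance 0.3 are the within-pair collisions, and the single late death is the collision between the two clusters, mirroring the gap visible in Figure 2.5.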
³ Gary Koplik. 0d Persistent Homology Example. 2019. URL: https://gjkoplik.github.io/pers-hom-examples/0d_pers_2d_data_widget.html (visited on 03/29/2020)
⁴ See footnote 3.
Figure 2.5: At a disk radius of 3.494 the two previous clusters have collided into one. This collision occurs quite late, illustrated by a significantly large gap between the largest death value in the persistence diagram and the rest of the death values.⁵
2.7 The topological morphology descriptor (TMD)
The TMD is a freely available tool to extract a topological representation of the branching pattern of a neuronal tree. Developed by a team of researchers at the Blue Brain Project,⁶ the TMD has been shown to be a powerful tool in the automatic classification of neurons by their morphology [24, 6]. The tool provides an alternative representation of neuronal morphologies based on persistent homology and is a way of quantifying the branching structure of neurons by encoding their overall shape in persistence barcodes [6].
2.7.1 The TMD algorithm
The TMD algorithm takes as input a rooted neuronal tree (Figure 2.6) as well as a set of nodes containing all leaves (nodes without children) and bifurcations (inner nodes with children - seen as branching in Figure 2.6) in the neuron [24].
⁵ See footnote 3.
⁶ https://www.epfl.ch/research/domains/bluebrain/
Figure 2.6: An example of a neuronal tree and its persistent homology repre- sented in a persistence barcode.
The TMD algorithm computes the persistence barcode for this input by defining a function that computes the radial distance (as illustrated by the disks in Figures 2.4 and 2.5) between a node and the root, or soma, as well as another function that is used to order the age of sibling nodes (two nodes are siblings if they have the same parent node) [24].
From each leaf, there is a path to the root - the algorithm iteratively moves through all the paths, from each leaf to the root, and upon detecting a new node on the path kills all nodes but the oldest amongst the siblings it encounters [24].
For each killed component (a component of the tree is a sequence of consecutive edges between a leaf and an internal node), one birth-death interval is added to the persistence barcode, as can be seen in Figure 2.6. Thus, each interval in the persistence barcode encodes the lifetime of a connected component in the tree, identifying when a branch is first detected (birth) and when it connects to a larger subtree (death) [24].
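The traversal described above can be sketched as follows. This is a simplified reimplementation based only on the description in this section, not the Blue Brain TMD library; the tree encoding (a dict mapping each node to its parent and coordinates), the example tree, and the choice of the radial distance to the root as the function f are all assumptions made for the example.

```python
import numpy as np

def tmd_barcode(nodes, root):
    """nodes: {node_id: (parent_id or None, (x, y))}. Returns (birth, death) bars."""
    children = {}
    for node, (parent, _) in nodes.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)
    root_xy = np.asarray(nodes[root][1], float)
    f = {n: float(np.linalg.norm(np.asarray(xy, float) - root_xy))
         for n, (_, xy) in nodes.items()}

    bars = []

    def climb(n):
        # return the largest f over the leaves below n (the "oldest" branch)
        kids = children.get(n, [])
        if not kids:
            return f[n]
        vals = sorted(climb(c) for c in kids)
        for v in vals[:-1]:
            bars.append((v, f[n]))  # younger siblings die at this bifurcation
        return vals[-1]

    bars.append((climb(root), f[root]))  # the surviving branch dies at the soma
    return bars

# a tiny Y-shaped tree: soma -> a, which bifurcates into leaves b and c
tree = {"soma": (None, (0.0, 0.0)),
        "a": ("soma", (0.0, 1.0)),
        "b": ("a", (0.0, 3.0)),
        "c": ("a", (1.0, 1.0))}
bars = tmd_barcode(tree, "soma")
```

For this tree the short branch to c dies at the bifurcation a, while the long branch to b survives all the way to the soma, giving one short and one long bar.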
2.7.2 TMD classification
The TMD persistence diagrams and barcodes can be converted into unweighted persistence images for the classification task [24]. A persistence image representing a neuron summarizes the density of the different components of a neuronal tree at different radial distances from the soma [6].
The persistence images are unweighted because weighted images fail to capture short components, which have been shown to be important in classification [6]. The method for creating a persistence image describing the TMD profile of a neuron is based on the discretization of a sum of Gaussian
kernels, which generates a matrix of pixel values, effectively encoding the persistence image as a vector. Being able to describe neurons in this way, as vectors, enables the classification of neurons through various machine learning methods [24]. An example of a persistence image of a neuron can be seen in Figure 2.7.
Figure 2.7: An example of a neuronal tree and its persistent homology repre- sented in a persistence image.
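The discretization of a sum of Gaussian kernels described above can be sketched as follows; the resolution, kernel width, and grid bounds are hypothetical parameters chosen for the example, not those used by the TMD library.

```python
import numpy as np

def persistence_image(bars, resolution=100, sigma=1.0, bounds=(0.0, 6.0)):
    """Sum one Gaussian kernel per (birth, death) point on a pixel grid."""
    xs = np.linspace(bounds[0], bounds[1], resolution)
    X, Y = np.meshgrid(xs, xs)  # X ~ birth axis, Y ~ death axis
    img = np.zeros_like(X)
    for b, d in bars:
        # unweighted: every kernel contributes equally, so short
        # components are not suppressed
        img += np.exp(-((X - b) ** 2 + (Y - d) ** 2) / (2 * sigma ** 2))
    return img

img = persistence_image([(3.0, 1.0)])
vec = img.ravel()  # a 100 x 100 image flattens to a 10000-dimensional vector
```

Flattening the pixel matrix into a vector is what later allows the images to be fed to clustering algorithms.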
2.7.3 The validity and objectiveness of TMD
The TMD algorithm is effective because its topological representations display less representation loss than the usual morphometrics used in feature selection [6]. Neurons with different functional purposes exhibit unique branching patterns [6], and because TMD is based on the branching structure of neurons, it can be used to distinguish between different neuronal morphologies. The cell types proposed by TMD are also unbiased relative to the classification done by experts, as they are based on a mathematical descriptor of the branching structure of neurons rather than visual inspection of neurons under a microscope [6].
A previous study [6] on the objective supervised classification of pyramidal neurons in the rat cortex using TMD showed that the majority of expert labels could be supported by the TMD labels. This illustrates that the TMD is capable of seeing the same patterns as experts when labeling neurons. A further extension of the investigation into this tool would be to analyze its performance on other types of neurons in an unsupervised learning scenario.
2.8 Related work
The study Objective morphological classification of neocortical pyramidal cells [6] is highly relevant, as it remarks that numerous expert labels of morphological neuron types are subjective, and it demonstrates that an objective and stable classification of rat cortical pyramidal cells, without the need for expert input, is possible. The study tried to categorize different types of pyramidal cells, and the researchers managed to objectively identify 17 types of pyramidal cells in the rat somatosensory cortex. This identification was done using both supervised and semi-supervised learning. The supervised learning classifier was first trained and tested on neurons with expert labels, and the procedure was then repeated for cells with random labels. The expert classification of a neuron was kept if its accuracy was significantly higher than that of the randomized classification. Otherwise, the neuron was reclassified with semi-supervised learning [6].
A comparison of machine learning algorithms for automatic classification of neurons by their morphology [27] was a study that compared the performance of different supervised learning algorithms when trying to classify different neuron morphologies among mouse neurons. The researchers acquired neuron data from NMO. In contrast to the method used in this thesis, they used morphometrics available at NMO instead of the TMD. The researchers noted that NMO does not keep all the features of the cell and advised the use of programs like L-measure that can extract more features [27].
Chapter 3 Method
3.1 Data collection
The rat neurons used were gathered from NMO as SWC files, which describe each neuron's geometry and positioning. Rat neurons were chosen partly because it has been demonstrated previously that the TMD algorithm works on rat neurons [6], and partly because of the vast number of digital rat neuron reconstructions on NMO that could be converted into TMD profiles.
In choosing which species to focus the experiment on, rat, mouse and human neurons were tested to see which species contained the most cells eligible for the TMD algorithm. TMD did not work on some cells, as they were incomplete, meaning that they contained either no soma or no dendritic domains. These reconstructions were therefore screened out from the final set of data points.
After this species screening, six rat neuron types were chosen for the final data set, as they had displayed a high success rate in the number of neurons that worked with TMD in the previous screening.
The final rat neuron types chosen for the experiment were: pyramidal, fast spiking, basket, medium spiny, glutamatergic and granule.
Table 3.1 displays how many neurons of each cell type that were used in the final data set.
Glutamatergic   Granule   Medium spiny   Basket   Fast spiking   Pyramidal
          109       452            856      459             57        1876
Table 3.1: The number of neurons by cell type that passed all screenings
3.2 Data formatting
The TMD algorithm was applied to each SWC file so that the corresponding persistence image could be retrieved and stored in a separate file. A small number of neurons, less than 1% of the total number of neurons that passed the first filtering, could not have their persistence images generated by the TMD algorithm and were therefore discarded from the data set of persistence images.
3.3 Dimensionality reduction
The persistence images are all 100-by-100 pixel images and therefore have a total of 10000 dimensions. To reduce the number of dimensions while retaining as much information as possible, PCA was used to extract 50 new dimensions (principal components) capturing the maximum amount of variation in the original persistence images. These top 50 principal components retained 99.9% of the variation in the original data set, while the top 2 principal components retained 67%. Due to this, another dimensionality reduction technique, t-SNE, was used to embed the top 50 principal components generated by PCA in a two-dimensional setting where the data would be more amenable to the application of various distance metrics.
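The two-stage reduction can be sketched with scikit-learn; the random matrix below stands in for the real persistence-image data set, and the sample count and t-SNE settings are illustrative assumptions (on random data PCA will of course not retain 99.9% of the variance as it did on the real images).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
images = rng.random((200, 10000))  # stand-in for 200 flattened 100x100 images

# stage 1: PCA compresses 10000 pixel dimensions into 50 components
X50 = PCA(n_components=50).fit_transform(images)

# stage 2: t-SNE embeds the 50 components into 2 dimensions
X2 = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X50)
```

Running t-SNE on the PCA output rather than the raw pixels keeps the pairwise-similarity computations tractable.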
3.4 Unsupervised learning
Two unsupervised machine learning algorithms were applied on the final data set after dimensionality reduction - affinity propagation and Ward’s method.
These two methods were chosen as they have previously been shown to provide slightly better results than other unsupervised learning methods when classifying neurons according to their morphology [3]. However, rather than clustering the neurons by various morphometrics, this study used the two algorithms to cluster the neurons by their persistence images generated by the TMD algorithm.
Affinity propagation was first applied on the data set to determine the number of clusters that would cluster the persistence images into the most well-defined clusters. Affinity propagation was run with different parameters, the most important ones being the damping value, preference value, number of iterations and distance function.
The damping value is a value between 0.5 and 1, and a higher damping value leads to less drastic changes in the clustering in each iteration. The preference value guides the algorithm to prefer some points in the data over others as exemplars. The two important iteration parameters are the maximum iterations and the convergence iterations. The maximum iteration value stops the algorithm from executing if the total number of iterations exceeds it, and the convergence iteration value determines for how many iterations the clustering must remain unchanged before the algorithm is considered converged. The distance function calculates the distance between points; for affinity propagation, the negative squared Euclidean distance was used. The optimal parameter values were chosen using the mean silhouette score. The parameter values that generated the highest mean silhouette score for the whole clustering done by affinity propagation on the data set were a damping value of 0.70, a preference value of -0.5, and a maximum of 5000 iterations, stopping early if the clustering remained unchanged for 200 iterations.
The optimal number of clusters was determined to be 684 by running affinity propagation with the parameters that yielded the highest silhouette score. Ward's method was then run on the data set with 684 clusters as input.
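The procedure of letting affinity propagation fix the number of clusters for Ward's method can be sketched with scikit-learn on synthetic data; the blob data and parameter values here are illustrative assumptions, not the thesis's actual data set or tuned parameters (damping 0.70, preference -0.5).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# three well-separated 2-D blobs of 40 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])

# step 1: affinity propagation decides how many clusters there are
ap = AffinityPropagation(damping=0.7, max_iter=5000,
                         convergence_iter=200, random_state=0).fit(X)
n_clusters = len(ap.cluster_centers_indices_)

# step 2: Ward's method is run with that number of clusters as input
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
ward_labels = ward.fit_predict(X)

# step 3: assess how well-defined the clusters are (silhouette) and
# how similar the two clusterings are (ARI)
sil = silhouette_score(X, ap.labels_)
ari = adjusted_rand_score(ap.labels_, ward_labels)
```

On data this clearly separated, both algorithms recover the same partition and the ARI is high.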
After the best set of parameters had been determined for both affinity propagation and Ward's method, they were run with these parameters and their clusterings of the neurons were assessed.
3.5 Clustering assessment
The assessment of the clusters made by both unsupervised learning methods was done by examining the silhouette scores and the ARI. The silhouette score helped determine how well-defined the clusters were, and the ARI helped determine how similar the clusters made by affinity propagation were to those made by Ward's method.
3.6 Software
The programs used for clustering and for the analysis of the clusterings were written in Python 3. All essential functions for creating the data set and clustering came from scikit-learn (sklearn) and the TMD repository by Blue Brain.¹
1