DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020
A comparative study on the
unsupervised classification of rat neurons by their morphology
SABRINA CHOWDHURY ADDED KINA
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
En jämförelsestudie av
oövervakad klassificering av råttneuroners morfologi
SABRINA CHOWDHURY ADDED KINA
Degree Project in Computer Science, DD142X
Date: June 8, 2020
Supervisor: Alexander Kozlov
Examiner: Pawel Herman
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Abstract
An ongoing problem regarding the automatic classification of neurons by their
morphology is the lack of consensus between experts on neuron types. Un-
supervised clustering using persistent homology as a descriptor for the mor-
phology of neurons helps tackle the problem of bias in feature selection and
has the potential of aiding neuroscience research in developing a framework
for automatic neuron classification. This thesis investigates how two different
unsupervised machine learning algorithms would cluster persistence images
of already labeled neurons and how similar their clusterings would be. The
results showed that the clusterings done by both methods were highly simi-
lar and that there was a large variation within the neuronal types defined by
experts.
Sammanfattning
An ongoing problem concerning the automatic classification of neurons by their morphology is the lack of consensus among experts regarding neuron types. Unsupervised cluster analysis with persistent homology as a descriptor of neuronal morphology helps address the problem of bias in feature selection and could potentially benefit neuroscience in the development of a framework for the automatic classification of neurons. This thesis aimed to investigate how two different unsupervised machine learning algorithms classify persistence images of previously classified neurons, and the degree of agreement between the two methods. The results of the study showed that the results of both methods agreed to a high degree, and also showed a large variation within the classes of neurons already defined by experts.
Acknowledgements
Many thanks to our supervisor for supporting us throughout the project. We
would also like to thank everyone that helped us finish this study.
Contents

1 Introduction
  1.1 Purpose
  1.2 Problem statement
  1.3 Scope
2 Background
  2.1 The neuron
  2.2 The digital reconstruction of neurons
  2.3 NeuroMorpho.Org (NMO)
  2.4 Machine learning
    2.4.1 Affinity propagation
    2.4.2 Ward's method
    2.4.3 Clustering assessment metrics
    2.4.4 Curse of dimensionality
    2.4.5 Dimensionality reduction
  2.5 Obstacles to the automatic classification of neurons
  2.6 Topological data analysis (TDA)
    2.6.1 Persistent homology
    2.6.2 The construction of persistence barcodes and persistence diagrams
  2.7 The topological morphology descriptor (TMD)
    2.7.1 The TMD algorithm
    2.7.2 TMD classification
    2.7.3 The validity and objectiveness of TMD
  2.8 Related work
3 Method
  3.1 Data collection
  3.2 Data formatting
  3.3 Dimensionality reduction
  3.4 Unsupervised learning
  3.5 Clustering assessment
  3.6 Software
4 Results
5 Discussion
  5.1 Discussion of results
  5.2 Discussion of method
  5.3 Future improvements
6 Conclusions
Bibliography
A Source code
B Silhouette score parameter tests
Chapter 1 Introduction
The use of machine learning has led to successes in numerous fields [1, 2], one of them being neuroscience. An ongoing problem in neuroscience today is to classify neurons, or nerve cells, by their morphology, that is, by analyzing their shape and form. It is, however, not obvious how to characterize neurons, as there is no clear consensus among neuroscientists on which features should be used when defining neuron types [3]. Although a vast amount of data is available on the morphology of neurons [4], this disagreement on a framework for neuron classification raises concerns about the validity of the data, since reconstructions of neurons are labeled on subjective grounds [5]. A recent addition to the field of automatic neuronal classification that aims to reduce this subjective factor is the Topological Morphology Descriptor (TMD), which describes the overall shape of a neuron. This description can be represented as a so-called persistence image, which in turn can be used as input to different machine learning algorithms [6]. TMD has previously been shown to be able to confirm or disconfirm expert labeling of pyramidal neurons in the rat cortex [6], and it would benefit neuroscience to further investigate whether expert labeling of other neuron types holds up to scrutiny when examined by an unsupervised learning model using the TMD algorithm. The information gained from such an inquiry would also showcase the comparative performance of certain machine learning methods in the objective classification of neurons.
1.1 Purpose
The purpose of this study is to investigate to which extent two unsupervised learning methods trained on the persistence images of neurons generated by
the TMD algorithm would agree with the labeling done on those same neurons by experts. The results of this investigation would further help highlight how machine learning can be used in neuroscience with respect to the morphology of neurons.
1.2 Problem statement
The problem statement this study aims to answer is as follows:
How similar are the classifications of neurons into different morphological types made by two unsupervised learning algorithms trained on the persistence images of neurons generated by the TMD algorithm, and how similar are their classifications compared to classifications made by experts on those same neurons?
1.3 Scope
There are many possible neuron classifications to test in this study. Testing all of them would be infeasible due to time constraints, and therefore only reconstructions of rat neurons were chosen. Rat cells were chosen partly because rats were, at the time of this study, one of the most well-documented species on NeuroMorpho.Org, and partly because they had previously been shown to work with the TMD algorithm without difficulties [6].
The following six types of rat neurons were chosen: pyramidal, fast spiking, basket, medium spiny, glutamatergic and granule.¹
The unsupervised learning algorithms used in this study are affinity propagation and Ward's method.
¹ The glutamatergic cell type is a non-specific, generic cell type for excitatory neurons. Depending on which brain area they are active in, they might display different functions in the local circuits and have different morphologies. For example, if found in layers 2-4 of the neocortex they are most probably pyramidal cells. Due to this, they add some amount of noise to the classification of the pyramidal cells. A similar argument goes for the granule cell type, which can be inhibitory (GABAergic) or excitatory (glutamatergic) depending on which brain area they are located in.
Chapter 2 Background
2.1 The neuron
Neurons, or nerve cells, are "the signaling units of the nervous system" [7].
Four important morphological regions of a typical neuron are the cell body (soma), the dendrites, the axon, and the presynaptic terminals. The soma contains the nucleus, the dendrites are tree-like branches responsible for receiving signals from other neurons, and the axon is a long extension of the soma that is responsible for sending signals to other neurons [7]. A visual representation of the neuron can be seen in Figure 2.1.
Figure 2.1: An illustration of a neuron [7].
2.2 The digital reconstruction of neurons
Understanding the morphology of neurons is key to understanding the processing of information in the nervous system [8]. Digitally reconstructed neurons are often used in this regard to more closely study the morphology of a given neuron for different research tasks, and can today be made from any species, brain region and neuron type [8].
The computer was first incorporated in the process of tracing, archiving and analyzing neuronal morphology in the 1960s [9], when an analog computer was used to store point coordinates of a neuron under a microscope, given manually by a human operator [10]. In the following decades, many attempts were made to reduce the amount of manual labor in this process of digitally reconstructing neurons [10]. Although there have been significant improvements in computational power and computer vision, which have given way to many successful commercial and academic tools, most neuroanatomists struggle with the general applicability of the tools available today [10]. Therefore, digital reconstructions of neurons are often still made manually by human experts [10].
An example of a digital reconstruction of a neuron can be seen in Figure 2.2.
Figure 2.2: An example of a digital reconstruction of a neuron.¹
¹ 2020. URL: http://neuromorpho.org/neuron_info.jsp?neuron_name=int7_1_2 (visited on 05/06/2020)
2.3 NeuroMorpho.Org (NMO)
NeuroMorpho.Org (NMO) is an archive of digitally reconstructed neurons from peer-reviewed publications, accessible online [11]. The inventory is updated each month and has contributions from over 500 laboratories around the world.²
2.4 Machine learning
Machine learning uses the theory of statistics to program computers to opti- mally perform a task using previous data [12]. Two common types of machine learning are supervised and unsupervised learning. The goal in supervised learning is to learn what output will be caused by a certain input and to train the machine learning model using correct values from a supervisor. The goal in unsupervised learning is to learn how the input maps to an output without answers from a supervisor, but instead by finding patterns in the input [12].
2.4.1 Affinity propagation
Affinity propagation is an unsupervised learning algorithm which, given an input consisting of the similarities between pairs of data points, outputs clusters. Representing the data as a network with each data point as a node, the data points recursively send messages to each other to determine their affinity. The affinity quantifies how likely one data point is to see another as its exemplar, i.e. the center of a set of data points [13].
The real-valued similarities between data points are contained in the matrix s, where s(i, k) indicates how well the data point with index k is suited to be the exemplar for data point i. The diagonal values s(k, k), called preferences, influence how likely each point k is to be chosen as an exemplar [13].
The similarity between two points is the negative squared Euclidean distance between them when the goal is to minimize squared error. For two points x_i and x_k, the similarity is s(i, k) = −‖x_i − x_k‖² [13].
Once the similarities are computed, affinity propagation tries to cluster the data points iteratively by sending messages between data points. At each stage or iteration the algorithm decides which points are exemplars and which other points belong to those exemplars [13].
² http://neuromorpho.org/
The responsibility matrix r determines the best exemplar for each point.
r(i, k) is to be interpreted as how reasonable it is for point k to be the exemplar for point i, whilst considering all other possible exemplars for i. The message is sent from i to k [13].
The messages between points concerning the availability matrix a are sent in the opposite direction of r: a(i, k) is sent from candidate exemplar k to point i and indicates how reasonable it would be for i to choose k as an exemplar whilst considering the other points that should have k as their exemplar [13].
All availabilities are initialized to zero, a(i, k) = 0, and the responsibilities are subsequently computed using the rule

r(i, k) ← s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) }.

Then, the availabilities and self-availabilities are updated as follows:

a(i, k) ← min( 0, r(k, k) + Σ_{i′ ∉ {i, k}} max(0, r(i′, k)) )   for i ≠ k

a(k, k) ← Σ_{i′ ≠ k} max(0, r(i′, k))

[13].
2.4.2 Ward’s method
Ward's method is an unsupervised learning method which initially considers each point a cluster of its own and then iteratively groups the points together such that an objective loss function is minimized. As the number of groups is reduced in this way, Ward's method is a type of hierarchical clustering algorithm. Ward's method compares the similarity between points by calculating the sum of squares between each pair of points [14].
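As a concrete illustration, Ward's linkage is available in scikit-learn (the library used later in this thesis) through AgglomerativeClustering; the toy points below are made up for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two obvious groups of points; each point starts as its own cluster and
# Ward's method repeatedly merges the pair of clusters that gives the
# smallest increase in within-cluster sum of squares
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
```

Stopping the merging at n_clusters=2 recovers the two groups.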
2.4.3 Clustering assessment metrics
The clustering received from unsupervised learning methods must be assessed somehow in order to increase the belief that the constructed clusters are accurate. There are different metrics to determine how accurate a clustering is, or how similar two clusterings are. The silhouette score can achieve the former and the adjusted Rand index (ARI) the latter.
Silhouette Score
The silhouette score determines, for each point in a cluster, to what extent it belongs in that cluster. The silhouette score ranges from -1 to 1, and a score close to 1 is a good indication that the clustering is well-defined.
There are two intermediary computed values for each point i, a(i) and b(i), that are necessary to compute the silhouette score s(i). If i has been determined to belong to cluster A, then a(i) is the average distance of i to each other point in A and b(i) is the average distance of i to all the points in the cluster that is closest to A, in other words the neighbour of A. The silhouette score for the point i is computed with regards to a(i) and b(i) with the following formula:
s(i) = (b(i) − a(i)) / max(a(i), b(i))   (2.1)

[15].
Thus the average s(i) for each cluster determines how well-defined the points in that cluster are and the average s(i) of all clusters determines how well the data set as a whole has been clustered.
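The definitions above can be sketched numerically: a(i) and b(i) are computed by hand for one point according to equation (2.1), and scikit-learn's silhouette_score gives the mean s(i) over all points. The toy points and labelings are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# s(i) for point 0 by hand: a(0) = mean distance to its own cluster {1, 2},
# b(0) = mean distance to the neighbouring cluster {3, 4, 5}
d = np.linalg.norm(X - X[0], axis=1)
a0 = d[[1, 2]].mean()
b0 = d[[3, 4, 5]].mean()
s0 = (b0 - a0) / max(a0, b0)

# the mean s(i) over all points, for a good and a bad clustering
good = silhouette_score(X, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(X, [0, 1, 0, 1, 0, 1])
```

The well-separated labeling scores close to 1, while the mixed labeling scores much lower.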
Adjusted Rand index (ARI)
The Rand index, or Rand score, determines the similarity between two clustering results.
This is done by examining each pair of points in the data set and checking whether they have been placed in the same cluster by the two different clustering methods. For example, if two points a and b have been clustered together by both the first and the second clustering method, the two clusterings are more similar than if a and b were in the same cluster in the first method but not in the second.
If the two clusterings are Y and Y′, and n_ij denotes the number of points that are both in the i-th cluster of Y and the j-th cluster of Y′, the Rand index c(Y, Y′) for the two clusterings is computed as

c(Y, Y′) = ( C(N, 2) − [ (1/2){ Σ_i (Σ_j n_ij)² + Σ_j (Σ_i n_ij)² } − Σ_i Σ_j n_ij² ] ) / C(N, 2)   (2.2)

where N is the total number of points and C(N, 2) = N(N − 1)/2 is the number of point pairs [16].
The ARI is similar to the Rand index, but it also takes into consideration that data points could have been grouped together by chance. An ARI value of
1 indicates that the clusters are identical and a value of 0 indicates that all clusters in the clusterings have been randomly assigned in relation to the other clustering [17].
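These properties can be illustrated with scikit-learn's adjusted_rand_score; the label vectors are made up for the example.

```python
from sklearn.metrics import adjusted_rand_score

# identical partitions score exactly 1.0 even though the cluster
# labels themselves differ (ARI is invariant to label permutations)
perfect = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])

# a partition unrelated to the reference scores around 0 (it can
# even be negative for partitions worse than chance)
unrelated = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])
```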
2.4.4 Curse of dimensionality
The curse of dimensionality refers to different problems that can occur when performing various experiments in high-dimensional space that would not occur in low-dimensional space.
In clustering problems, the result is very much dependent on having a meaningful distance metric that can tell two data points apart. The effects of the curse of dimensionality have been studied extensively for an array of problems, and research has shown that distance metrics may be less meaningful in high-dimensional spaces [18]. The quality of similarity measures tends to decrease as the dimensionality of the data increases [19]. This is a consequence of the distance between two points growing significantly as the number of dimensions increases. This in turn results in the data becoming very sparse, which introduces large technical challenges as traditional mathematical approaches are not always applicable [20].
2.4.5 Dimensionality reduction
Dimensionality reduction aims at reducing the number of variables in data and thereby at tackling the curse of dimensionality [21].
Principal component analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique which increases the interpretability of data while minimizing information loss [22]. PCA effectively reduces the number of features describing a data set by constructing a new representation of the data consisting of new, uncorrelated features. In an optimal setting, a small subset of these features captures most of the variability in the data [22].
PCA is essentially a coordinate transformation where the original data has an axis for each feature [22]. PCA rotates these axes so that one axis is transformed to lie in the direction of maximum variance, another axis lies in the direction of second-most variance, and so on [22]. This new set of axes is called the principal components [22], and for machine learning tasks a set of principal components that explains most of the variation in the data can be chosen as input for e.g. clustering tasks.
T-distributed stochastic neighbor embedding (t-SNE)
T-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique that is particularly applicable for the visualization of high-dimensional data sets [23]. t-SNE works by giving each data point a location in a two- or three-dimensional map. The algorithm takes into account similarities between pairs of instances in both the high-dimensional space and the low-dimensional space, and a cost function over the two types of similarity measures is then minimized [23].
2.5 Obstacles to the automatic classification of neurons
There have previously been different approaches to classifying neurons automatically through various machine learning methods [6, 3]. These methods have usually relied on feature extraction, selecting a limited number of morphometrics, as this is more computationally feasible [24]. Only selecting a subset of features, however, results in information loss and is not optimal for categorizing neurons [24]. Feature selection is also subjective due to the human element of choosing the features that are deemed to be the best predictors for the classification task [5]. Therefore, neither feature selection nor using all features results in satisfactory classification results.
Another factor adding to the difficulty of classifying neurons is the fact that different experts label differently [6]. There is still no consensus in the research community on the number of morphologically different neuron types [6]. Due to this lack of a framework for the categorization of neurons into different types, different experts can label the same cell differently. This poses a problem, especially in supervised neuronal classification, as the labels are subjective [5]. This suggests the need for an objective approach to the classification of neurons.
2.6 Topological data analysis (TDA)
TDA is a recent field and approach to data analysis that uses algebraic topology (a subfield of mathematics focused on the study of shape) and computational geometry to study the shape of input data. TDA aims at analyzing the complex topological and underlying geometric structures in data with well-founded mathematical methods [25].
2.6.1 Persistent homology
Persistent homology is a theory in TDA that is used to reconstruct the shape of some underlying data. Applying persistent homology to data allows for the identification and extraction of interesting topological features, namely those that are stable to noise in the input data. It also allows a shape to be represented as a persistence barcode or, equivalently, as a persistence diagram, effectively describing the shape numerically [25].
2.6.2 The construction of persistence barcodes and persistence diagrams
The construction of persistence barcodes and persistence diagrams starts with a set of points that describe a given shape. Figure 2.3 illustrates a set of points that represent a random shape. Each point in the figure has a growing disk around it, and each disk grows at the same speed. In this example (Figure 2.3) all disks are "born" at the same time, and this is recorded in the persistence barcode as the birth of these points. When two disks collide, the younger of the two components "dies" and is absorbed by the older one (when all points are born at the same time, as in this example, one of the two colliding components arbitrarily survives). Each collision is recorded in the persistence barcode as the death of a certain point. Each point will thus have a time of birth and death, which can be represented in the persistence diagram, where each point represents a (birth, death) value [26].
Figure 2.3: An example of translating a set of points into a persistence diagram. At a disk radius of 0.525 some collisions have started to occur, all of which are recorded on the persistence diagram.³
In Figure 2.4 one can see that two clusters of points have formed after increasing the radius of the disks. Points that are close to each other have collided early on, and this is represented on the persistence diagram as death values close to 0.
Figure 2.4: At a disk radius of 0.782 two clusters of points have formed.⁴
In Figure 2.5 the increased radius has resulted in a collision between the two clusters seen in Figure 2.4. This collision is again recorded on the persistence diagram. The two clusters of points seen in Figure 2.4 collide relatively late, at a significantly larger radius, as their points are quite far from each other. This collision point is significantly farther from 0 than the other collision points. A gap between points in the persistence diagram, such as in this example (Figure 2.5), is therefore an indication of the extent to which the data is clustered relative to its noise.
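The birth/death bookkeeping described above can be sketched for this zero-dimensional case: when all points are born at time 0, the death times are exactly the distances at which connected components merge, which single-linkage clustering records as its merge heights. This is an illustrative sketch using SciPy with made-up points, not code from the thesis; note that in the disk picture two disks of radius r touch when their centers are 2r apart, whereas the code below works with the center distances directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# two tight pairs of points, far apart from each other
pts = np.array([[0.0, 0.0], [0.3, 0.0],
                [5.0, 5.0], [5.3, 5.0]])

# single-linkage merge heights = death times of 0-dimensional components,
# reported one merge per row in increasing order of distance
Z = linkage(pts, method="single")
deaths = Z[:, 2]

# persistence diagram points: (birth, death) with all births at 0
diagram = [(0.0, d) for d in deaths]
```

The two early deaths at distance 0.3 are the within-pair collisions, and the single late death is the collision between the two clusters, mirroring the gap visible in Figure 2.5.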
³ Gary Koplik. 0d Persistent Homology Example. 2019. URL: https://gjkoplik.github.io/pers-hom-examples/0d_pers_2d_data_widget.html (visited on 03/29/2020)
⁴ See footnote 3.
Figure 2.5: At a disk radius of 3.494 the two previous clusters have collided into one. This collision occurs quite late, illustrated by a significantly large gap between the largest death value in the persistence diagram and the rest of the death values.⁵
2.7 The topological morphology descriptor (TMD)
The TMD is a freely available tool to extract a topological representation of the branching pattern of a neuronal tree. Developed by a team of researchers at the Blue Brain Project,⁶ the TMD has been shown to be a powerful tool in the automatic classification of neurons by their morphology [24, 6]. The tool provides an alternative representation of neuronal morphologies based on persistent homology and is a way of quantifying the branching structure of neurons by encoding their overall shape in persistence barcodes [6].
2.7.1 The TMD algorithm
The TMD algorithm takes as input a rooted neuronal tree (Figure 2.6) as well as a set of nodes containing all leaves (nodes without children) and bifurcations (inner nodes with children - seen as branching in Figure 2.6) in the neuron [24].
⁵ See footnote 3.
⁶ https://www.epfl.ch/research/domains/bluebrain/
Figure 2.6: An example of a neuronal tree and its persistent homology repre- sented in a persistence barcode.
The TMD algorithm computes the persistence barcode for this input by defining a function that computes the radial distance (as illustrated by the disks in Figures 2.4 and 2.5) between a node and the root, or soma, as well as another function that is used to order the age of sibling nodes (two nodes are siblings if they have the same parent node) [24].
From each leaf, there is a path to the root - the algorithm iteratively moves through all the paths, from each leaf to the root, and upon detecting a new node on the path kills all nodes but the oldest amongst the siblings it encounters [24].
For each killed component (a component of the tree is a sequence of consecutive edges between a leaf and an internal node), one birth-death interval is added to the persistence barcode, as can be seen in Figure 2.6. Thus, each interval in the persistence barcode encodes the lifetime of a connected component in the tree, identifying when a branch is first detected (birth) and when it connects to a larger subtree (death) [24].
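The traversal described above can be sketched as follows. This is a simplified reimplementation based only on the description in this section, not the Blue Brain TMD library; the tree encoding (a dict mapping each node to its parent and coordinates), the example tree, and the choice of the radial distance to the root as the function f are all assumptions made for the example.

```python
import numpy as np

def tmd_barcode(nodes, root):
    """nodes: {node_id: (parent_id or None, (x, y))}. Returns (birth, death) bars."""
    children = {}
    for node, (parent, _) in nodes.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)
    root_xy = np.asarray(nodes[root][1], float)
    f = {n: float(np.linalg.norm(np.asarray(xy, float) - root_xy))
         for n, (_, xy) in nodes.items()}

    bars = []

    def climb(n):
        # return the largest f over the leaves below n (the "oldest" branch)
        kids = children.get(n, [])
        if not kids:
            return f[n]
        vals = sorted(climb(c) for c in kids)
        for v in vals[:-1]:
            bars.append((v, f[n]))  # younger siblings die at this bifurcation
        return vals[-1]

    bars.append((climb(root), f[root]))  # the surviving branch dies at the soma
    return bars

# a tiny Y-shaped tree: soma -> a, which bifurcates into leaves b and c
tree = {"soma": (None, (0.0, 0.0)),
        "a": ("soma", (0.0, 1.0)),
        "b": ("a", (0.0, 3.0)),
        "c": ("a", (1.0, 1.0))}
bars = tmd_barcode(tree, "soma")
```

For this tree the short branch to c dies at the bifurcation a, while the long branch to b survives all the way to the soma, giving one short and one long bar.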
2.7.2 TMD classification
The TMD persistence diagrams and barcodes can be converted into unweighted persistence images for the classification task [24]. A persistence image representing a neuron summarizes the density of the different components of a neuronal tree at different radial distances from the soma [6].
The persistence images are unweighted because weighted images fail to capture short components, which have been shown to be important in classification [6]. The method for creating a persistence image describing the TMD profile of a neuron is based on the discretization of a sum of Gaussian
kernels, which generates a matrix of pixel values, effectively encoding the persistence image as a vector. Being able to describe neurons in this way, as vectors, enables the classification of neurons through various machine learning methods [24]. An example of a persistence image of a neuron can be seen in Figure 2.7.
Figure 2.7: An example of a neuronal tree and its persistent homology repre- sented in a persistence image.
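The discretization of a sum of Gaussian kernels described above can be sketched as follows; the resolution, kernel width, and grid bounds are hypothetical parameters chosen for the example, not those used by the TMD library.

```python
import numpy as np

def persistence_image(bars, resolution=100, sigma=1.0, bounds=(0.0, 6.0)):
    """Sum one Gaussian kernel per (birth, death) point on a pixel grid."""
    xs = np.linspace(bounds[0], bounds[1], resolution)
    X, Y = np.meshgrid(xs, xs)  # X ~ birth axis, Y ~ death axis
    img = np.zeros_like(X)
    for b, d in bars:
        # unweighted: every kernel contributes equally, so short
        # components are not suppressed
        img += np.exp(-((X - b) ** 2 + (Y - d) ** 2) / (2 * sigma ** 2))
    return img

img = persistence_image([(3.0, 1.0)])
vec = img.ravel()  # a 100 x 100 image flattens to a 10000-dimensional vector
```

Flattening the pixel matrix into a vector is what later allows the images to be fed to clustering algorithms.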
2.7.3 The validity and objectiveness of TMD
The TMD algorithm is effective because its topological representations display less representation loss than the usual morphometrics used in feature selection [6]. Neurons with different functional purposes exhibit unique branching patterns [6], and because TMD is based on the branching structure of neurons, it can be used to distinguish between different neuronal morphologies. The cell types proposed by TMD are also unbiased relative to the classification done by experts, as they are based on a mathematical descriptor of the branching structure of neurons rather than visual inspection of neurons under a microscope [6].
A previous study [6] on the objective supervised classification of pyramidal neurons in the rat cortex using TMD showed that the majority of expert labels could be supported by the TMD labels. This illustrates that the TMD is capable of seeing the same patterns as experts when labeling neurons. A further extension of the investigation into this tool would be to analyze its performance on other types of neurons in an unsupervised learning scenario.
2.8 Related work
The study Objective morphological classification of neocortical pyramidal cells [6] is highly relevant, as it remarks that numerous expert labels of morphological neuron types are subjective, and it demonstrates that an objective and stable classification of rat cortical pyramidal cells, without the need for expert input, is possible. The study tried to categorize different types of pyramidal cells, and the researchers managed to objectively identify 17 types of pyramidal cells in the rat somatosensory cortex. This identification was done using both supervised and semi-supervised learning. The supervised learning classifier was first trained and tested on neurons with expert labels, and the procedure was then repeated for cells with random labels. The expert classification of a neuron was kept if its accuracy was significantly higher than that of the randomized classification. Otherwise, the neuron was reclassified with semi-supervised learning [6].
A comparison of machine learning algorithms for automatic classification of neurons by their morphology [27] was a study that compared the performance of different supervised learning algorithms when trying to classify different neuron morphologies among mouse neurons. The researchers acquired neuron data from NMO. In contrast to the method used in this thesis, they used morphometrics available at NMO instead of the TMD. The researchers noted that NMO does not keep all the features of the cell and advised the use of programs like L-measure that can extract more features [27].
Chapter 3 Method
3.1 Data collection
The rat neurons used were gathered from NMO as SWC files, which describe each neuron's geometry and positioning. Rat neurons were chosen partly because it has been demonstrated previously that the TMD algorithm works on rat neurons [6], and partly because of the vast number of digital rat neuron reconstructions on NMO that could be converted into TMD profiles.
In choosing which species to focus the experiment on, rat, mouse and human neurons were tested to see which species contained the most cells eligible for the TMD algorithm. TMD did not work on some cells, as they were incomplete, meaning that they contained either no soma or no dendritic domains. These reconstructions were therefore screened out from the final set of data points.
After this species screening, six rat neuron types were chosen for the final data set, as they had displayed a high success rate in the number of neurons that worked with TMD in the previous screening.
The final rat neuron types chosen for the experiment were: pyramidal, fast spiking, basket, medium spiny, glutamatergic and granule.
Table 3.1 displays how many neurons of each cell type that were used in the final data set.
Glutamatergic   Granule   Medium spiny   Basket   Fast spiking   Pyramidal
          109       452            856      459             57        1876
Table 3.1: The number of neurons by cell type that passed all screenings
3.2 Data formatting
The TMD algorithm was applied to each SWC file so that the corresponding persistence image could be retrieved and stored in a separate file. A small number of neurons, less than 1% of the total number of neurons that passed the first filtering, could not have their persistence images generated by the TMD algorithm and were therefore discarded from the data set of persistence images.
3.3 Dimensionality reduction
The persistence images are all 100-by-100 pixel images and therefore have a total of 10000 dimensions. To reduce the number of dimensions while retaining as much information as possible, PCA was used to extract 50 new dimensions (principal components) capturing the maximum amount of variation in the original persistence images. These top 50 principal components retained 99.9% of the variation in the original data set, while the top 2 principal components retained 67%. Due to this, another dimensionality reduction technique, t-SNE, was used to embed the top 50 principal components generated by PCA in a two-dimensional setting where the data would be more amenable to the application of various distance metrics.
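The two-stage reduction can be sketched with scikit-learn; the random matrix below stands in for the real persistence-image data set, and the sample count and t-SNE settings are illustrative assumptions (on random data PCA will of course not retain 99.9% of the variance as it did on the real images).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
images = rng.random((200, 10000))  # stand-in for 200 flattened 100x100 images

# stage 1: PCA compresses 10000 pixel dimensions into 50 components
X50 = PCA(n_components=50).fit_transform(images)

# stage 2: t-SNE embeds the 50 components into 2 dimensions
X2 = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X50)
```

Running t-SNE on the PCA output rather than the raw pixels keeps the pairwise-similarity computations tractable.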
3.4 Unsupervised learning
Two unsupervised machine learning algorithms were applied on the final data set after dimensionality reduction - affinity propagation and Ward’s method.
These two methods were chosen as they have previously been shown to provide slightly better results than other unsupervised learning methods when classifying neurons according to their morphology [3]. However, rather than clustering the neurons by various morphometrics, this study used the two algorithms to cluster the neurons by their persistence images generated by the TMD algorithm.
Affinity propagation was first applied on the data set to determine the number of clusters that would cluster the persistence images into the most well-defined clusters. Affinity propagation was run with different parameters, the most important ones being the damping value, preference value, number of iterations and distance function.
The damping value is a value between 0.5 and 1, and a higher damping value leads to less drastic changes in the clustering in each iteration. The preference value guides the algorithm to prefer some points in the data over others as exemplars. The two important iteration parameters are the maximum iterations and the convergence iterations. The maximum iteration value stops the algorithm from executing if the total number of iterations exceeds it, and the convergence iteration value determines for how many iterations the clustering must remain unchanged before the algorithm is considered converged. The distance function calculates the distance between points; for affinity propagation, the negative squared Euclidean distance was used. The optimal parameter values were chosen using the mean silhouette score. The parameter values that generated the highest mean silhouette score for the whole clustering done by affinity propagation on the data set were a damping value of 0.70, a preference value of -0.5, and a maximum of 5000 iterations, stopping early if the clustering remained unchanged for 200 iterations.
The optimal number of clusters was determined to be 684 by running affinity propagation with the parameters that yielded the highest silhouette score. Ward's method was then run on the data set with 684 clusters as input.
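The procedure of letting affinity propagation fix the number of clusters for Ward's method can be sketched with scikit-learn on synthetic data; the blob data and parameter values here are illustrative assumptions, not the thesis's actual data set or tuned parameters (damping 0.70, preference -0.5).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# three well-separated 2-D blobs of 40 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])

# step 1: affinity propagation decides how many clusters there are
ap = AffinityPropagation(damping=0.7, max_iter=5000,
                         convergence_iter=200, random_state=0).fit(X)
n_clusters = len(ap.cluster_centers_indices_)

# step 2: Ward's method is run with that number of clusters as input
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
ward_labels = ward.fit_predict(X)

# step 3: assess how well-defined the clusters are (silhouette) and
# how similar the two clusterings are (ARI)
sil = silhouette_score(X, ap.labels_)
ari = adjusted_rand_score(ap.labels_, ward_labels)
```

On data this clearly separated, both algorithms recover the same partition and the ARI is high.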
After the best set of parameters had been determined for both affinity propagation and Ward's method, they were run with these parameters and their clusterings of the neurons were assessed.
3.5 Clustering assessment
The assessment of the clusters made by both unsupervised learning methods was done by examining the silhouette scores and the ARI. The silhouette score helped determine how well-defined the clusters were, and the ARI helped determine how similar the clusters made by affinity propagation were to those made by Ward's method.
3.6 Software
The programs used for clustering and for the analysis of the clusterings were written in Python 3. All essential functions for creating the data set and clustering came from scikit-learn (sklearn) and the TMD repository by Blue Brain.¹
1