
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

VAE-clustering of neural signals and their association to cytokines

ARAM ESKANDARI


VAE-clustering of neural signals and their association to cytokines

ARAM ESKANDARI

Degree Projects in Mathematical Statistics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics
KTH Royal Institute of Technology, year 2020

Supervisor at KTH: Henrik Hult

Examiner at KTH: Henrik Hult


TRITA-SCI-GRU 2020:094 MAT-E 2020:056

Royal Institute of Technology

School of Engineering Sciences KTH SCI

SE-100 44 Stockholm, Sweden

URL: www.kth.se/sci


Abstract

In this thesis we start by reproducing previous experiments by Zanos et al. [5], where they have shown that it is possible to associate neural signals with specific cytokines. One future aim of this project is to send synthetic neural signals through the efferent arc of the vagus nerve and observe reactions without the corresponding catalyst of the symptoms.

We use a variational autoencoder (VAE) in our experiment to create a model able to generate new neural signals, and we introduce a novel clustering technique called VAE-clustering, which will be used to cluster neural signals with their associated cytokines. The focus of this paper is the implementation of this method and its application to the neural signals.

Running VAE-clustering on the MNIST dataset shows it to be viable for finding detailed properties of a dataset. We also find that using a VAE as a generative model for neural signals is a good way to recreate detailed waveforms.


VAE-clustering of neural signals and their association to cytokines (Swedish title: VAE-klustring av nervsignaler och dess associationer till cytokiner)

Author: Aram Eskandari
Supervisor: Prof. Henrik Hult

May 19, 2020

Abstract (translated from Swedish)

In this degree project we start by reproducing previous experiments by Zanos et al. [5], where they showed that it is possible to associate neural signals with specific cytokines. One future aim of this project is to send synthetic neural signals to the body in order to observe reactions without the corresponding catalyst of the symptoms. We use a variational autoencoder (VAE) in our experiments to create a model capable of generating new neural signals, and we introduce a novel clustering technique called VAE-clustering, which will be used to cluster neural signals with their associated cytokines. The focus of this work lies in the implementation of this method and its application to neural signals. After running VAE-clustering on the MNIST dataset we found that it is useful for finding detailed properties of a dataset. We have also found that using a VAE as a generative model for neural signals is a good way to recreate detailed waveforms.


Contents

1 Introduction
1.1 Bioelectronic medicine
1.2 Motivation of this research

2 Background
2.1 Inflammatory reflex
2.2 Reproducing experiment

3 Method
3.1 Variational autoencoder
3.1.1 Preliminary
3.1.2 Deriving the objective function
3.1.3 Reparameterization trick
3.1.4 Analytical solution of ELBO
3.2 VAE-clustering
3.2.1 VAE-clustering with noise

4 Experiments and results
4.1 VAE-clustering experiment on MNIST
4.2 VAE-clustering experiment on neural recordings
4.2.1 Neural recording data preparation
4.2.2 Training the model
4.2.3 VAE-clustering on waveforms

5 Discussion
5.1 Comparing the methods
5.2 Future work

Appendices
A Kullback-Leibler Divergence
A.1 Definition
A.2 Proof that D_KL(q(x)||p(x)) ≥ 0
A.3 Analytic solution of −D_KL(q_{Z|X}(z|x)||p_Z(z))
B Likelihood of data (VAE-approach)
C Theory of t-SNE, DBSCAN, k-means
C.1 t-distributed stochastic neighbor embedding (t-SNE)
C.2 Density-based spatial clustering of applications with noise (DBSCAN)
C.3 k-means clustering
D VAE-code (python)
D.1 MNIST VAE
D.2 Waveforms VAE

1 Introduction

1.1 Bioelectronic medicine

Bioelectronic medicine [1] is an emerging discipline that combines molecular medicine, neuroscience, engineering and computing to develop devices to diagnose and treat diseases. The pharmaceutical drugs used today are not inherently designed to adapt to individual treatment, so both under- and overdosing are common, resulting in therapy failure or adverse events.

Advances in bioelectronic medicine hold promise to address some of these challenges by providing personalized and localized treatment of disease, e.g. by regulating neural signals.

1.2 Motivation of this research

It has been shown from experimental disease models of arthritis, colitis, sepsis, hemorrhagic shock and congestive heart failure, that electrical stimulation of the vagus nerve can prevent or reverse these diseases [3].

In this thesis we will conduct experiments with the aim of clustering neural signals with their associated cytokines using unsupervised learning and a novel clustering method. The long-term goal of this research is to enable the regulation of our inflammatory reflex, which can be very beneficial, e.g. in combating autoimmune diseases. See section 2.1 for more information regarding the inflammatory reflex and the part played by cytokines.

We will design an accurate generative model for the neural signals, and for this we will use a variational autoencoder. The reason we prefer this over a generative adversarial network (GAN) is that we will utilize the information in the latent space to find clusters in our dataset; this is much harder to do with a GAN, since we have little information about its latent space. See section 3.1.1 for a closer look at the variational autoencoder, and section 3.2 for the method called VAE-clustering, developed at KTH by Henrik Hult and Adam Lindh.

2 Background

2.1 Inflammatory reflex

From Wikipedia, on the inflammatory reflex:

“The inflammatory reflex is a neural circuit that regulates the immune response to injury and invasion.”

Reflex circuits regulate cellular function, metabolism and blood flow in the visceral organs, thereby continuously working to keep our bodies in homeostasis. Recent advances in neuroscience and immunology reveal previously unrecognized mechanisms for the reflex control of inflammation [2].

The inflammatory response attracts immune cells which in turn release a type of protein called cytokines, as well as other inflammatory mediators, which have been implicated in activating sensory (afferent) neurons.

This activation lets the sensory neurons transmit action potentials to the spinal cord and brainstem, and is the first step in modulating the outflow of motor neurons in the efferent arc of reflex circuits that control homeostasis.

Increased signaling in the efferent arc inhibits inflammation and prevents organ damage.

The vagus nerve is the largest cranial nerve and comprises 80,000-100,000 fibers in humans [4]. It is the leading sensory conduit for transmitting sensory signals from organs to the brain, which makes it an ideal place for us to conduct experiments.

2.2 Reproducing experiment

The isolation and identification of cytokine-specific neural signals encoded in the murine vagus nerve has proved to be viable [5]. Our project will begin by reproducing the results from this paper, where they develop a framework to isolate and decode the neural activity recorded on the surface of the vagus nerve in mice to identify groups of neurons firing in response to specific cytokines.

Figure 1: (a) CorTec Micro Cuff Sling bipolar electrodes connected to the cervical vagus nerve in anesthetized mice [5]. (b) Recording workflow [5].

These neuron impulses are called compound action potentials (CAPs), in agreement with previous research [11]. Figure (1a) shows the experimental setup, with bipolar electrodes connected to the cervical vagus nerve of mice.

Figure (1b) presents the experimental workflow, where IL-1β and TNF represent two different cytokines.


All recordings are saved by Zanos et al. at the link https://public.feinsteininstitute.org/cbem/PNAS%20Submission/.

After the recording is done, the data is analyzed in matlab. The file we have chosen to work with represents a wild-type mouse, where IL-1β was injected first, and then TNF.¹ Many of the files were very noisy, which made it difficult to reproduce multiple working examples; however, the file we chose exhibited very clear signals, which is why we will run our own experiments on the same file. See figure (2) for a quick presentation of the mentioned file.

Figure 2: Visualizing the neural recording. (a) Channel 1; note the spikes close to injection times. (b) Zoom of (a) where it is most thin; the beats per minute correspond to the respiratory signal. (c) Channel 9; the data is very flat. (d) Zoom of (c); the beats per minute correspond to the cardiac signal.

We have two different channels in the datafile, and it can be seen in figure (2a) that channel 1 has clear impulse spikes around the injection times. According to figure (2b), this channel also includes the respiratory signal, which underlies the whole recording. Figure (2c) presents channel 9, which is clearly very flat. When zoomed in, as in figure (2d), we can see that it records the cardiac signal.

In this project they used an adaptive threshold and an amplitude threshold. We visualize the adaptive threshold on channel 1 in figure (3b); as we can see, it does a good job of thresholding the data in a dynamic way. For this experiment we disregard the amplitude threshold and only use the displayed adaptive thresholding.

¹ The filename at the specified link is R10_6.28.16_BALBC_IL1B(35ngperkg)_TNF(0.5ug)_03.plx


After this is done, the impulses are transformed into waveforms using matlab's continuous 1-D wavelet transform.

Figure 3: (a) Overview of both channels. (b) Channel 1 with adaptive threshold.

After transforming the data to a wavelet representation, we want to find clusters which ideally represent each of the cytokines injected. But since each wavelet contains multiple values (141 coefficients), we want to reduce its dimensionality first. For this, Zanos et al. used t-distributed stochastic neighbor embedding (t-SNE), and the final unsupervised classification was performed with the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which determined the number of CAP groups.

Figure 4: (a) t-SNE on the full dataset, perplexity 100. (b) DBSCAN with ε = 5.5, minPts = 450.

In appendix C we take a closer look at the theory behind both of these methods. To visualize our clustering, we look at figure (4). After running t-SNE, we can see a 2D-representation of our waveforms in figure (4a), and figure (4b) is the same plot with labeled points which represent a corresponding CAP cluster.
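For readers who want to retrace this step outside matlab, the sketch below chains the two methods with scikit-learn, using the perplexity from figure (4a) and the ε/minPts values from figure (4b). The array wavelets is a hypothetical stand-in for the thresholded, wavelet-transformed recording (one 141-coefficient wavelet per CAP); Zanos et al. worked in matlab, so this is only an illustrative equivalent of their pipeline, not their code.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# `wavelets` is a placeholder with the right shape, (n_caps, 141);
# random data is used here only so that the sketch runs on its own.
wavelets = np.random.rand(2000, 141)

# 2-D embedding with the perplexity used in figure (4a)
embedding = TSNE(n_components=2, perplexity=100).fit_transform(wavelets)

# density-based clustering with the parameters from figure (4b);
# the label -1 marks noise points (outliers)
cap_labels = DBSCAN(eps=5.5, min_samples=450).fit_predict(embedding)
print(np.unique(cap_labels))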


Figure 5: (a) Event-rates of each CAP cluster. (b) Wavelets associated with CAP clusters.

After the clustering and labeling of our data is finished, we can create a plot with the event rates of each cluster, see figure (5a). The x-axis is given in time, and the y-axis is given in CAPs/sec, meaning we are looking at the number of occurrences over time of waveforms from a specific cluster. Note the spike of CAP Cluster 2 when the first cytokine is injected (30 min), and the spike of CAP Cluster 3 when the second cytokine is injected (60 min).

The different waveforms for each specific cluster are shown in figure (5b); in this paper we call these waveforms the prototype waveforms to emphasize that we are trying to find the underlying waveform of each CAP cluster.

This short summary of the experiment conducted by Zanos et al. [5] shows that it is possible to locate and isolate neural signals with specific cytokines injected at different times. It will be recreated with our own methods explained in section 3, with its results presented in section 4, and finally compared to the results from this experiment in section 5.

3 Method

See python code for the variational autoencoder in appendix D.

3.1 Variational autoencoder

3.1.1 Preliminary

A regular autoencoder [7] is an unsupervised artificial neural network that is often used to learn a representation (encoding) for a given dataset, typically for dimensionality reduction, by noise-reducing the data. A problem with the regular autoencoder is that its latent space $p_Z(z)$ isn't continuous, meaning that a random draw from the latent space of a trained model will not produce a sample that is representative of the real latent space distribution.


This problem was solved in 2013, when Diederik Kingma and Max Welling presented a paper titled Auto-Encoding Variational Bayes [8]. A variational autoencoder (VAE) is a deep generative model; the main practical difference between it and an autoencoder is that the loss function is regularized during training in order to ensure that a random draw from the latent space allows us to generate new data with good properties.

Figure 6: Variational autoencoder example. The encoder maps the input $x_1, \dots, x_5$ through a hidden layer to the outputs $\vec{\mu}$ and $\vec{\sigma}$, a latent sample $z \sim p_{Z|X}$ is drawn (decoder input), and the decoder maps it through a hidden layer to the reconstruction $y_1, \dots, y_5$.

Figure (6) attempts to describe the pipeline of a VAE by presenting a simple example. During the forward-pass, we sample from the latent space distribution (decoder input in figure (6)), and use what is called the reparameterization trick (see section 3.1.3) when backpropagating. Table (1) gives a quick introduction to the mathematical symbols and how to interpret them.

Symbol            Explanation
x                 Input data variable
z                 Latent space variable
$p_X(x)$          Evidence probability, distribution of the input space
$p_Z(z)$          Prior probability, distribution of the latent space
$p_{X|Z}(x|z)$    Likelihood probability, decoder part of the VAE
$p_{Z|X}(z|x)$    Posterior probability, encoder part of the VAE

Table 1: Glossary
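The four distributions in table (1) are tied together by Bayes' theorem, which is also what is used in equation (13) further down. Written out in this notation, and marginalizing over the latent variable, it reads

$$p_{Z|X}(z|x) = \frac{p_{X|Z}(x|z)\,p_Z(z)}{p_X(x)}, \qquad p_X(x) = \int p_{X|Z}(x|z)\,p_Z(z)\,dz.$$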


3.1.2 Deriving the objective function

The objective of a VAE is to model the data, hence we want to find

$$p_X(x) = \int p_{X|Z}(x|z)\, p_Z(z)\, dz \qquad (1)$$

Solving equation (1) directly is intractable, since the likelihood is a non-linear neural network. In a VAE we infer $p_Z(z)$ using $p_{Z|X}(z|x)$, since we want our latent sample to be likely given our data. Variational inference [10] lets us model the true intractable posterior $p_{Z|X}(z|x)$ with an approximation $q_{Z|X}(z|x)$.

In practice we will maximize the log-evidence $\log p(x_1, \dots, x_n) = \sum_{i=1}^{n} \log p(x_i)$, meaning we need an analytical expression for $\log p(x_i)$. Appendix B presents a more detailed derivation of the function; here we simply restate the result

$$\log p(x_i) = D_{KL}(q_{Z|X}(z|x_i)\,||\,p_{Z|X}(z|x_i)) + \mathcal{L}(x_i) \qquad (2)$$

We have shown in appendix A.2 that $D_{KL}(\cdot||\cdot) \geq 0$, implying that maximizing the log-evidence is equal to maximizing what we call the evidence lower bound (ELBO), which is introduced in appendix B as

$$\mathcal{L}(x_i) = -D_{KL}(q_{Z|X}(z|x_i)\,||\,p(z)) + \mathbb{E}_{q_{Z|X}}\!\left[\log p(x_i|z)\right] \qquad (3)$$

Essentially, a VAE aims to locate the optimal lower bound of $\log p_X(x)$, as the exact distribution is intractable. The second term in equation (3), $\mathbb{E}_{q_{Z|X}}[\log p(x_i|z)]$, also called the reconstruction loss, measures how likely the sampled $z$ is to reconstruct the data. It pulls the model in the direction of explaining the data as well as it can.

The first term in equation (3), $D_{KL}(q_{Z|X}(z|x_i)\,||\,p(z))$, the regularization term, measures the difference between the prior and the approximate posterior. If we assume the prior to be a standard Gaussian, $p_Z(z) = \mathcal{N}(0, I)$, we restrict our posterior to stay close to a Gaussian distribution, not letting our model overfit to the data. This will also make random draws from the distribution give sensible samples.

3.1.3 Reparameterization trick

We have introduced a problem when stating that the latent space variable $z$ is drawn from a probability distribution: the equations of backpropagation do not hold for random variables. Therefore, we will re-parameterize $z$ using

$$z = g(\varepsilon, x_i), \quad \text{where } \varepsilon \sim p(\varepsilon) \qquad (4)$$

In equation (4), we let $p(\varepsilon)$ be the standard Gaussian distribution $\mathcal{N}(0, 1)$. If we also assume that our approximate posterior is Gaussian with a diagonal covariance, $q_{Z|X}(z|x) = \mathcal{N}(z; \mu, \sigma^2 I)$, we can use the expression below for $z = g(\varepsilon, x_i)$, where $\odot$ is the Hadamard (element-wise) product.

$$g(\varepsilon, x_i) = \mu + \sigma \odot \varepsilon \qquad (5)$$

By introducing the parameter $\varepsilon$, we have reparameterized $z$ in a way that allows backpropagation to work properly. Note that $\mu$ and $\sigma$ are outputs of the encoding neural network.
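As a concrete illustration of equation (5), the short TensorFlow sketch below performs the element-wise reparameterization; it mirrors the reparameterize function in appendix D.1, where the encoder outputs the log-variance $\log\sigma^2$ rather than $\sigma$ itself. The shapes are only an example.

import tensorflow as tf

def reparameterize(mean, log_var):
    # eq. (5): z = mu + sigma * eps with eps ~ N(0, I),
    # where sigma = exp(0.5 * log_var) since the encoder outputs log(sigma^2)
    eps = tf.random.normal(shape=tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

# example: a batch of 4 encodings in a 3-dimensional latent space
mu = tf.zeros((4, 3))
log_var = tf.zeros((4, 3))       # corresponds to sigma = 1
z = reparameterize(mu, log_var)  # differentiable with respect to mu and log_var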

3.1.4 Analytical solution of ELBO

In appendix A.3 we have shown that the regularization term of equation (3) can be evaluated analytically when both the posterior and the prior are Gaussian distributions. Since we have already made these assumptions in previous sections, we can simply restate the result

$$-D_{KL}(q_{Z|X}(z|x)\,||\,p_Z(z)) = \frac{1}{2}\sum_{k=1}^{K}\left(1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2\right) \qquad (6)$$

Without an analytical solution of the posterior, determining the reconstruction term of equation (3) will require estimation by sampling. For each datapoint $x_i$, we take the mean of $L$ samples from the posterior as explained in section 3.1.3 through $z_{i,l} \sim q_{Z|X}(z|x)$, using $z_{i,l} = g(\varepsilon_l, x_i) = \mu_i + \sigma_i \odot \varepsilon_l$, where $\varepsilon_l \sim \mathcal{N}(0, 1)$ and $\mu_i, \sigma_i$ are outputs of the encoder.

$$\mathbb{E}_{q_{Z|X}}\!\left[\log p_{X|Z}(x_i|z)\right] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_{X|Z}(x_i|z_{i,l}) \qquad (7)$$

Finally, inserting equations (6, 7) in equation (3) gives us our resulting objective function

$$\mathcal{L}(x_i) = \frac{1}{2}\sum_{k=1}^{K}\left(1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2\right) + \frac{1}{L}\sum_{l=1}^{L} \log p_{X|Z}(x_i|z_{i,l}), \qquad (8)$$

where $x_i \in \mathbb{R}^K$, $\mu \in \mathbb{R}^D$, $\sigma^2 \in \mathbb{R}^{D \times D}$, $K \geq D$. We have set $L = 1$ in all our experiments. Note that when using tensorflow in practice, we will take the negative of equation (8), since the network wants a loss function it can minimize, and we want to maximize the ELBO.
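Written out for $L = 1$ and a Gaussian likelihood with identity covariance (the assumption implicit in the squared-difference reconstruction loss of the code in appendix D), the quantity actually minimized during training is, up to an additive constant,

$$-\mathcal{L}(x_i) = \underbrace{\tfrac{1}{2}\lVert x_i - \hat{x}(z_{i,1})\rVert^2}_{\text{reconstruction loss}} + \underbrace{\tfrac{1}{2}\sum_{k=1}^{K}\left(\mu_k^2 + \sigma_k^2 - 1 - \log\sigma_k^2\right)}_{\text{regularization term}},$$

where $\hat{x}(z)$ denotes the decoder output.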

3.2 VAE-clustering

Gradient descent will be used in our clustering method to locate minima of the information content [9] given in equation (9). The expression for $I(x)$, also known as the surprisal of an observation, is

$$I(x) = -\log p_X(x) \qquad (9)$$

The lower the information content of an observation is, the higher its probability. For example, if you tell me that the sun will go up tomorrow you are not providing me with any valuable information (low surprise), meaning the probability is high. Intuitively we want to minimize the level of surprise when we are looking for clusters in the data, because it is equivalent to finding the observations with highest probability. To locate minima, we take small steps in space, $x_{n+1} = x_n - \nabla I(x)$, meaning we want an analytical expression for the gradient, which is defined as

$$\nabla I(x_i) = -\frac{1}{p_X(x_i)}\,\nabla p_X(x_i) = -\frac{1}{p_X(x_i)}\int \nabla p_{X|Z}(x_i|z)\, p_Z(z)\, dz \qquad (10)$$

Note that the integral in equation (10) can be viewed as the expectation value $\mathbb{E}_{p_Z(z)}[\nabla p_{X|Z}(x_i|z)]$. However, taking a draw from the prior is not informative enough. Using importance sampling and rewriting the integral as

$$\int \nabla p_{X|Z}(x_i|z)\, p_Z(z)\, dz = \int \nabla p_{X|Z}(x_i|z)\, \frac{p_Z(z)}{p_{Z|X}(z|x_i)}\, p_{Z|X}(z|x_i)\, dz, \qquad (11)$$

we can now make draws from the posterior $p_{Z|X}(z|x_i)$, since the expression is equal to the expectation value $\mathbb{E}_{p_{Z|X}(z|x_i)}\!\left[\nabla p_{X|Z}(x_i|z)\,\frac{p_Z(z)}{p_{Z|X}(z|x_i)}\right]$. We will now use equation (11) in our gradient equation (10), together with a Monte Carlo approximation, to get

$$\nabla I(x_i) = -\frac{1}{p_X(x_i)}\,\frac{1}{m}\sum_{j=1}^{m} \nabla p_{X|Z}(x_i|z_j)\, \frac{p_Z(z_j)}{p_{Z|X}(z_j|x_i)} \qquad (12)$$

If we look at a single pair of observations $x_i, z_j$, and given the fact that the gradient is only enforced on the likelihood $p_{X|Z}$, which we assume follows a Gaussian distribution, we can rewrite equation (12) as

$$-\frac{p_Z(z_j)}{p_X(x_i)\, p_{Z|X}(z_j|x_i)}\,\nabla p_{X|Z}(x_i|z_j) = -\frac{p_Z(z_j)}{p_Z(z_j)\, p_{X|Z}(x_i|z_j)}\,\nabla p_{X|Z}(x_i|z_j) = -\frac{\nabla p_{X|Z}(x_i|z_j)}{p_{X|Z}(x_i|z_j)} = \frac{-\nabla_{x}\!\left[\frac{1}{\sqrt{2\pi\sigma(z_j)^2}}\exp\!\left(-\frac{(x_i-\mu(z_j))^2}{2\sigma(z_j)^2}\right)\right]}{\frac{1}{\sqrt{2\pi\sigma(z_j)^2}}\exp\!\left(-\frac{(x_i-\mu(z_j))^2}{2\sigma(z_j)^2}\right)} \qquad (13)$$

$$= \frac{x_i - \mu(z_j)}{\sigma(z_j)^2} \qquad (14)$$

We used Bayes' theorem in equation (13); note that the likelihood $p_{X|Z}$ is the decoder part of the variational autoencoder. We can now conclude that taking a step towards a local minimum of $I(x)$ is equivalent to moving according to

$$x_{i+1} = x_i - \frac{1}{m}\sum_{j=1}^{m}\frac{x_i - \mu(z_j)}{\sigma(z_j)^2} \qquad (15)$$


When we have stabilized at a local minimum, we view this as the raw underlying signal of the specific cluster we have found. See figure (7) for a representation of how a hypothetical input-space could look, where we are using equation (15) to locate the minima.

Figure 7: Visualization of input-space [6] (not representative of our dataset)

3.2.1 VAE-clustering with noise

If our input-space, visualized in figure (7), has a large number of optima (peaks/valleys), VAE-clustering may result in a large number of clusters. To fix this problem we want to smooth our input-space; this is done by adding a noise term to equation (15), which can now be written as

$$\hat{x}_i = x_i + \eta, \quad \eta \sim \mathcal{N}(0, 1)$$
$$x_{i+1} = \hat{x}_i - \frac{1}{m}\sum_{j=1}^{m}\frac{x_i - \mu(z_j)}{\sigma(z_j)^2} \qquad (16)$$
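To make the roles of $\eta$, $m$, $\mu(z_j)$ and $\sigma(z_j)^2$ in equation (16) concrete, here is a minimal NumPy sketch of one update step. The helpers posterior_sample, decoder_mean and decoder_var are hypothetical stand-ins for the encoder and the Gaussian likelihood of a trained VAE; the thesis's own TensorFlow implementation, which adds a step size γ, is shown in section 4.1.

import numpy as np

def vae_cluster_step(x, posterior_sample, decoder_mean, decoder_var, eta=0.1, m=10):
    # one update of equation (16):
    #   posterior_sample(x) -> a latent sample z ~ q_{Z|X}(z|x)   (hypothetical helper)
    #   decoder_mean(z), decoder_var(z) -> mu(z), sigma(z)^2      (hypothetical helpers)
    x_hat = x + eta * np.random.normal(size=x.shape)   # smooth the input space with noise
    grad = np.zeros_like(x)
    for _ in range(m):                                 # Monte Carlo average over the posterior
        z = posterior_sample(x_hat)
        grad += (x - decoder_mean(z)) / decoder_var(z)
    return x_hat - grad / m                            # step towards a local minimum of I(x)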


4 Experiments and results

The following experiments were conducted using Python 3.6.10, with the packages TensorFlow 2.0.0, NumPy 1.18.1 and Matplotlib 3.1.3. All algorithms were run on Google Cloud, using their implemented version of Jupyter Notebook with 16 vCPU cores and no GPU.

4.1 VAE-clustering experiment on MNIST

We have chosen to work with a deep CNN-model for both the encoder and decoder parts, since CNN architectures are particularly good for image recognition. The encoder has a total of 111,494 parameters, and the decoder has a total of 108,449, which gives us a total of 219,943 parameters.

The code for the VAE-class is imported from a file called vae.py, and can be found in appendix D.1. In this example I will assume access to the VAE-code, and go through an example where we cluster the MNIST dataset using VAE-clustering. First, we import some packages and the dataset; note that we only work with the numbers 0, 1, 2 for efficiency reasons.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from vae import VAE
from tensorflow.keras.datasets import mnist
from mpl_toolkits.mplot3d import Axes3D

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# keep only the numbers 0, 1 and 2
trainind = np.where(np.any([y_train == 0, y_train == 1, y_train == 2], axis=0))
testind = np.where(np.any([y_test == 0, y_test == 1, y_test == 2], axis=0))

x_train = (x_train.reshape(x_train.shape + (1,)).astype('float32') / 255.)[trainind]
x_test = (x_test.reshape(x_test.shape + (1,)).astype('float32') / 255.)[testind]
y_train = y_train[trainind]  # keep the labels aligned with the filtered images
y_test = y_test[testind]

Next, we train the VAE with 3500 epochs and 3 latent dimensions. The number of epochs was not chosen randomly: we trained the model for 500 epochs at a time until we found that the ELBO had converged sufficiently, which happened at ELBO ≈ −15.5. Another reason for training over 3500 epochs is that we had a very small learning rate of $10^{-6}$. The model is very deep with multiple layers and, as mentioned before, it has 219,943 parameters, meaning we need many epochs to get a precise model.

Training the model was a time-consuming process, since we are working with a high-dimensional dataset with an input shape of (28, 28) and many training examples. It took approximately one minute to run each epoch, and the training took ≈ 60 h to complete.

batch_size = 100  # must match the batch size given to the VAE below
train_dataset = tf.data.Dataset.from_tensor_slices(x_train).shuffle(x_train.shape[0]).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices(x_test).shuffle(x_test.shape[0]).batch(batch_size)

epochs = 3500
vae = VAE(latent_dim=3, batch_size=100)
vae.train(epochs, train_dataset, test_dataset)

After the training is finished, we can visualize our latent space with labels in figure (8); the code for producing this figure is given below. The figure shows that the trained VAE-model has divided the space into three regions which represent the different numbers from the MNIST dataset. However, there are multiple overlaps where points with a certain label lie in the region of another label. The x-, y- and z-axes represent the three dimensions of the mean of the Gaussian posterior $p_{Z|X}$.

z = vae.transform(x_test)[0].numpy()  # latent means of the test set

fig = plt.figure()
ax = Axes3D(fig)
for i in np.unique(y_test):
    ax.scatter(z[np.where(y_test==i),0], z[np.where(y_test==i),1], z[np.where(y_test==i),2], label=i)
ax.set(title='VAE trained with 3500 epochs on MNIST', xlabel='$z_1$', ylabel='$z_2$', zlabel='$z_3$')
plt.legend()


As a way of visualizing our data, we will plot a random draw from our trained model, and then plot the same number from the MNIST dataset; the code for producing these pictures is given below. Figure (9) shows that a random draw from our model can reproduce a realistic number.

plt.figure(0)
plt.imshow(vae.sample_image(1).reshape((28,28)))
plt.title('Random draw after 3500 epochs')
plt.figure(1)
plt.imshow(x_train[10].reshape((28,28)))
plt.title('Example image from MNIST dataset')
plt.show()

Figure 9: (a) Random draw from a trained VAE-model. (b) Example picture from the MNIST dataset.

The VAE-clustering algorithm is presented through the function below, which is part of the file vae.py mentioned previously in this section. M represents the number of steps we take in the gradient descent, γ (gamma) represents the step size, and η (eta) is the noise term introduced in section 3.2.1.

def cluster(self, x, eta, gamma, m):
    for i in range(m):
        # add noise to smooth the input space, cf. eq. (16)
        x_hat = x + eta*tf.random.normal(shape=x.shape)
        mu, sigma = self.transform(x_hat)
        z = VAE.reparameterize(mu, sigma)
        x_rec = self.generate(z)
        # gradient-descent step towards a local minimum of the information content
        x = x - gamma*(x_hat - x_rec)
    return x

We ran this clustering algorithm multiple times for the same VAE-model, with different values for M, γ, η. A high value of η will flatten our high-dimensional space, making the number of clusters decrease since many optima will combine. We also found that a step size γ which was too high would result in a single cluster containing all points. The size of M had a big impact on how long the algorithm took to run, see table (2).

M (steps)   Time (s)
100         ≈ 16
1000        ≈ 1800
10000       ≈ 18500

Table 2: Running time (using 16 vCPU-cores on Google Cloud)

Below we present the code needed to cluster and produce a scatter plot showing our clusters with given labels. We used x_test to run the clustering algorithm for time-efficiency; the times shown in table (2) are all from runs with the same test set. Figure (10) presents the results with parameters set to M = 10,000, γ = 0.01, η = 0.1.

x = vae.cluster_h(x_test, 0.1, 0.01, 10000)
z = vae.transform(x)[0].numpy()

fig = plt.figure()
ax = Axes3D(fig)
for i in np.unique(y_test):
    ax.scatter(z[np.where(y_test==i),0], z[np.where(y_test==i),1], z[np.where(y_test==i),2], label=i)
# m, g, e hold the parameters used above and t the measured running time
ax.set(title='M={}, gamma={}, eta={}, time: {} seconds'.format(m,g,e,t), xlabel='$z_1$', ylabel='$z_2$', zlabel='$z_3$')
plt.legend()


Figure (10) shows the latent space after running the clustering algorithm, where each cluster should reasonably represent the three numbers we have used from MNIST when training our model (0, 1, 2). We make draws from the clusters to visualize what the points look like in figure (11). Figure (11a) is drawn from $(z_1, z_2, z_3) = (-0.45, 0.50, -0.80)$, figure (11b) is drawn from $(z_1, z_2, z_3) = (0.40, 0.70, 0.80)$, and figure (11c) is drawn from $(z_1, z_2, z_3) = (-1.00, 0.00, 0.30)$.

Figure 11: Draws from each cluster

The left-most cluster in figure (10), which mostly includes 2's, has a thin layer of 1's, and the bottom cluster, which mostly includes 0's, has a thin layer of 2's. To get a better understanding of what is happening, we make draws from above these clusters and find an interesting result in figure (12). Figure (12a) is drawn from $(z_1, z_2, z_3) = (-1.40, -0.30, 0.60)$, and figure (12b) is drawn from $(z_1, z_2, z_3) = (-0.45, 0.50, -0.25)$.

Figure 12: Draws from above the clusters in figure (10)

When we make multiple draws from the right-most cluster (1’s) in figure (10), we only get 1’s with a right inclination, just as figure (11b). Taking a draw from above the left-most cluster (2’s) however, only gives straight 1’s. In comparison we only get curly 2’s when drawing from above the bottom cluster (0’s), and more solid 2’s (figure 11c) when making a draw from the left-most cluster (2’s).

This suggests that VAE-clustering not only succeeds in producing highly probable draws of the numbers 0, 1 and 2, but it can also isolate different ways to draw these numbers. Even though we made multiple draws from the latent space, we have not found any 1's with left-inclination. This might be because the test set that we ran our clustering algorithm on did not include enough 1's that were drawn in this particular way.

4.2 VAE-clustering experiment on neural recordings

4.2.1 Neural recording data preparation

In this section we will showcase VAE-clustering on the same datafile which we used to reproduce the previous experiments in section 2.2. The adaptive threshold shown in figure (3b) was applied, and the data was transformed to its wavelet representation using matlab's continuous 1-D wavelet transform. The data was then exported to a .mat-file and imported in python to train our model.

When training our model, we noticed that the large amplitudes of the dataset resulted in the ELBO diverging very quickly, which is why we decided to normalize the data before going any further. The transformation we chose is shown in equation (17) and figure (13). It later became obvious that this transformation was not optimal; this will be discussed further in section 5.

$$f(x) = \mathrm{sign}(x)\log(1 + |x|) \qquad (17)$$

Figure 13: Normalizing the waveforms

After normalizing the data using equation (17), we obtain the dataset used in the rest of this experiment.


4.2.2 Training the model

The VAE-model we chose to work with in this experiment consists of three hidden (dense) layers for both the encoder and the decoder, with 100 cells each. Python code for this model can be found in appendix D.2. The encoder has a total of 34,804 parameters and the decoder has a total of 34,741 parameters, which gives us a total of 69,545 parameters. Below we show the python code used to import the correct packages and prepare the dataset.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Dense, Input, Flatten, Reshape
from tensorflow.keras.models import Model
from scipy.io import loadmat
from time import time

def transform_waveforms(waveforms):
    # normalization from eq. (17)
    return np.sign(waveforms)*np.log(1+np.abs(waveforms))

file_name = 'waveforms_IL1B-TNF-03_adaptive-thresh-4.mat'
waveforms = loadmat(file_name)['waveforms']
waveforms = waveforms.reshape(waveforms.shape + (1,)).astype('float32')
waveforms = transform_waveforms(waveforms)

no_training_samples = int(waveforms.shape[0]*0.8)
x_train = waveforms[:no_training_samples]
x_test = waveforms[no_training_samples:]

We trained the model for 1000 epochs with a 2-dimensional latent space. Each epoch took around 20 seconds to run, which was a bit quicker than training on the MNIST dataset. This is because our wavelets have lower dimensionality (141 dimensions instead of 28 × 28), but also because we use a shallower VAE, with fewer and less complex layers.

batch_size = 64  # must match the batch size given to the VAE below
train_dataset = tf.data.Dataset.from_tensor_slices(x_train).shuffle(x_train.shape[0]).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices(x_test).shuffle(x_test.shape[0]).batch(batch_size)

epochs = 1000
vae = VAE(latent_dim=2, batch_size=64)
vae.train(epochs, train_dataset, test_dataset)

The latent space of our trained model can now be visualized without labels (since we have none in this experiment) by feeding the trained model our test set, see figure (14a). We will also showcase the convergence rate of our model through figure (14b). The code for producing both figures is given below.

z = vae.transform(x_test)[0].numpy()
plt.figure(0)
plt.scatter(z[:,0], z[:,1])
plt.title('VAE latent space representation of waveforms')
plt.xlabel('$z_1$')
plt.ylabel('$z_2$')
plt.figure(1)
plt.plot(vae.all_elbos)
plt.title('Training VAE on waveforms')
plt.ylabel('ELBO')
plt.xlabel('Epoch')
plt.show()

Figure 14: Trained VAE-model on waveforms. (a) Latent space representation of the test set. (b) ELBO during training.

Our model is now trained and ready for VAE-clustering, which has been explained in section 3.2. The function in python is a part of the VAE-class and has been presented in the previous section; however, we restate it here for clarity. Note that M is the number of steps in the gradient descent, γ represents the step size and η is the noise term.

def cluster(self, x, eta, gamma, m):
    for i in range(m):
        x_hat = x + eta*tf.random.normal(shape=x.shape)
        mu, sigma = self.transform(x_hat)
        z = VAE.reparameterize(mu, sigma)
        x_rec = self.generate(z)
        x = x - gamma*(x_hat - x_rec)
    return x

4.2.3 VAE-clustering on waveforms

The clustering algorithm was run multiple times with different values of γ, η; however, we only compared results for runs with M = 50,000. We recorded how long the different runs took; the times are written down in table (3). Note that this clustering took longer than the MNIST clustering, because we only used the MNIST test set in the previous experiment and the complete waveforms dataset for this experiment.

M (steps)   Time (s)
100         ≈ 32
1000        ≈ 318
10000       ≈ 3190
50000       ≈ 15800

Table 3: Running time (using 16 vCPU-cores on Google Cloud)

Figure (15) presents two different results from VAE-clustering, one run with γ = 0.5, η = 0.01, and the other with γ = 0.99, η = 0.99. Note that VAE-clustering is done in the high-dimensional space; the figures merely show the latent space representation and do not hold exactly the same information.

In the latent space representation, we can see 13 clusters for figure (15a), and 7 clusters for figure (15b). Below we show code for one of the cases.

x = vae.cluster_h(waveforms, 0.01, 0.5, 50000)
z = vae.transform(x)[0].numpy()

plt.scatter(z[:,0], z[:,1])
plt.title('VAE trained with 1000 epochs')
plt.xlabel('$z_1$')
plt.ylabel('$z_2$')

Figure 15: Latent space representation after clustering. (a) M = 50,000, γ = 0.5, η = 0.01. (b) M = 50,000, γ = 0.99, η = 0.99.


To label the datapoints in each cluster, we use a python implementation of k-means clustering [15]; see appendix C.3 for a closer look at the method. We give the function our data and the number of clusters that we have; figure (16) presents the resulting labels on the same latent space representation seen in figure (15). Note that some clusters share a color, e.g. clusters 0 and 10 in figure (16a); these are handled as two different clusters in python. In the code below, z is given by the VAE-clustering showcased in the previous bit of code.

from sklearn.cluster import KMeans

# use k-means to label the points
no_clusters = 13
km = KMeans(n_clusters=no_clusters).fit(z)
labels = km.labels_

fig, ax = plt.subplots()
for i in np.unique(labels):
    ax.scatter(z[np.where(labels==i),0], z[np.where(labels==i),1], label=i)
ax.set(title='VAE trained with 1000 epochs', xlabel='$z_1$', ylabel='$z_2$')
plt.legend()

Figure 16: Latent space representation after k-means labeling. (a) Labeled datapoints with 13 clusters. (b) Labeled datapoints with 7 clusters.

Each cluster represents a CAP cluster which consists of a specific waveform; we previously called this the prototype waveform. By making a draw from the centroid of each cluster, we can visualize the prototype waveform of each CAP cluster.


centers = km.cluster_centers_
prototype_waves = []
for z_mean in centers:
    # decode the cluster centroid to get the prototype waveform
    prototype_waves.append(vae.generate(z_mean.reshape((1,2))).numpy())

fig, ax = plt.subplots()
for i in range(0, no_clusters):
    ax.plot(prototype_waves[i].squeeze(), label=i)
ax.set(title='Waveforms')
ax.legend()

Figure 17: (a) Prototype waveforms for 13 clusters. (b) Prototype waveforms for 7 clusters.

Finally, we want to reproduce the original event-rates plot in figure (5a), but with our prototype waveforms. To do this, we will reuse the code written by Zanos et al. [5] in matlab. Before being able to reproduce this plot, we need to import our own latent space representation of the waveforms with their respective labels, representing which CAP Cluster each waveform belongs to. We use scipy to export the required data as a .mat-file, and the code below in matlab to create the event-rates.

% Load the data exported from python
data = load('13clusters_0.5g-0.01e.mat');
nr.waveforms.Y = data.z;
% labels + 1 because the labels should not start from 0 in matlab
nr.waveforms.labels = data.labels'+1;
nr.waveforms.event_rates(false, false);

% Visualization
N = 50;
real_clusters = 1:max(nr.waveforms.labels);
nr.waveforms.plot_event_rates(real_clusters, N);


nr in the code above represents the neural recording object which was used when we reproduced the experiment by Zanos et al.; the code for this object and the waveforms object in matlab can be found at the previously mentioned link. Finally, figure (18) shows the event-rates for each cluster over the full recording.

Figure 18: (a) Event-rates for 13 clusters. (b) Event-rates for 7 clusters.

5 Discussion

As we mentioned in section 4.2.1, the transformation method we used was later shown to be a bad choice. When we plotted the waveforms in figure (17), we had not run an inverse transformation before plotting them. As soon as this was noticed, we ran the same prototype waveforms through the inverse transform, which includes the term $e^{x}$, but since many of the waveforms have absolute values above 15 (some of them reach ∼ 60), this term diverges and becomes unreasonably big.
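To make the problem concrete, the sketch below shows the transform from equation (17) next to its mathematical inverse; the inverse is not printed in the thesis code, so its exact form here is reconstructed from equation (17) itself. The exponential term is what becomes unreasonably large.

import numpy as np

def transform_waveforms(w):
    # eq. (17): f(x) = sign(x) * log(1 + |x|)
    return np.sign(w) * np.log(1 + np.abs(w))

def inverse_transform_waveforms(y):
    # inverse of eq. (17): f^{-1}(y) = sign(y) * (exp(|y|) - 1)
    return np.sign(y) * (np.exp(np.abs(y)) - 1)

# a value of 15 already maps back to roughly 3.3 million,
# far beyond any realistic waveform amplitude
print(inverse_transform_waveforms(np.array([1.0, 15.0])))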

Another problem with our experiment is that we trained on the first 80% of the data and tested on the last 20%, which may have introduced problems in our trained model, for example over-fitting it to the first part of the recording, before the second cytokine was injected.

If this experiment is reproduced, a new normalization method should be used and the indices should be chosen randomly.

5.1 Comparing the methods

The clustering method should theoretically help us find the prototype waveform of each CAP cluster; however, it is difficult to know how many CAP clusters there should reasonably be among the thousands of neuron signals that have been recorded. Looking at figure (17), our prototype waveforms are wavelets with more detail. Since this long-term project aims at being able to generate new neural signals, it is a very interesting find that our prototype waveforms have this refined detail.

We also see that we have multiple prototype waveforms that look approximately the same; it could be that these waveforms are really the same type of neural signal. Figure (5b) from the original experiment contains three CAP Clusters, one with high amplitude, another with medium amplitude and the last one with small amplitude, but the phases of the waves look similar. Comparing this with figure (17b), we can see that phase-wise the waves look similar; however, CAP Cluster 2 has a bigger amplitude, CAP Clusters 1, 3, 4 and 6 have medium amplitudes and CAP Clusters 0 and 5 have a small amplitude. It might be that this could be condensed into three clusters, but we were not able to find any parameters for VAE-clustering that resulted in three clusters.

Regarding the event-rates, figure (5a) shows that it is viable to associate a cytokine injection with firing neuron signals categorized as CAP clusters. Looking at figure (18a), we see that CAP Cluster 3 has a clear spike around the first injection time (30 min); this is also seen for CAP Cluster 2 in figure (18b). For the second injection time (60 min), we see a clear spike in CAP Cluster 10 in figure (18a). In figure (18b), there is an increase for CAP Cluster 2 around minute 60, but it is not clear that this is due to the injection, since this line is very noisy throughout the whole recording. In the original experiment, the spike at the second injection time is much clearer, see CAP Cluster 3 in figure (5a).

5.2 Future work

It was time-consuming to run the algorithms with many steps; we used M = 50,000 for each waveform experiment, which made it difficult to run as many experiments as we wanted. Using Google Cloud Notebooks was very helpful since our algorithms could run during night-time, but we were not able to recreate the original experiment with 3 CAP Clusters. It would be interesting to see the results from a few more examples, for example by working with the different neural recording files found at https://public.feinsteininstitute.org/cbem/PNAS%20Submission/.

The files could also be prepared in different ways. We only used the adaptive threshold for our experiments; the amplitude threshold could also be applied to different files before running VAE-clustering. As mentioned before, a new normalizing transformation is also needed before reproducing our experiments.

When more tests are conducted, the behavior of the prototype waveforms could be further investigated. For example, an interesting experiment would be to see whether the same prototype waveforms or CAP clusters are recurrent for different mice, since each recording is done on a separate mouse.


Our trained VAE-model constantly had a β-value of 1, which represents how much the regularization term should affect the model; this could be tried with different values. It is also possible to try different ways of running the VAE-clustering, for example a dynamic step size γ could be used instead of the constant one that we are using, which might make the algorithm more efficient and/or precise. Another parameter that could be increased is L from section 3.1.4, which is set to 1 in all our experiments.

We used 2 latent dimensions in our VAE-experiment on the waveforms; one reason for this is that the function that creates the event-rates graph in matlab only works if Z has 2 dimensions. It would be interesting to see the resulting event-rates graph with a new implementation of this function that can take any number of latent dimensions as input.

Appendices

A Kullback-Leibler Divergence

A.1 Definition

The Kullback-Leibler divergence is a commonly used similarity measure between two distributions $p(x), q(x)$; two distributions that are exactly the same will have a zero-valued measurement. It is calculated as the expected difference in information content between the two distributions, where the difference is given by $\Delta I(x) = I_p - I_q = -\log p(x) + \log q(x) = \log\!\left(\frac{q(x)}{p(x)}\right)$.

$$D_{KL}(q(x)\,||\,p(x)) := \mathbb{E}_q[\Delta I(x)] = \int q(x)\log\!\left(\frac{q(x)}{p(x)}\right) dx \qquad (18)$$

Notice that the measure is not symmetric:

$$D_{KL}(q(x)\,||\,p(x)) \neq D_{KL}(p(x)\,||\,q(x)) \qquad (19)$$
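A small numeric check of the asymmetry in equation (19), using two made-up discrete distributions (my own illustrative example, not taken from the thesis):

import numpy as np

q = np.array([0.8, 0.1, 0.1])
p = np.array([0.4, 0.3, 0.3])

def kl(a, b):
    # D_KL(a || b) = sum_x a(x) * log(a(x) / b(x)), cf. eq. (18)
    return np.sum(a * np.log(a / b))

print(kl(q, p))  # approximately 0.335
print(kl(p, q))  # approximately 0.382, so the two directions differ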

A.2 Proof that $D_{KL}(q(x)\,||\,p(x)) \geq 0$

Using the fact that $\log(t) \leq t - 1$, we see that

$$-D_{KL}(q(x)\,||\,p(x)) = \int q(x)\log\!\left(\frac{p(x)}{q(x)}\right) dx \leq \int q(x)\left(\frac{p(x)}{q(x)} - 1\right) dx = \int p(x)\,dx - \int q(x)\,dx \qquad (20)$$

Since both integrals equal 1, the right-hand side is zero, and hence $D_{KL}(q(x)\,||\,p(x)) \geq 0$.


A.3 Analytic solution of $-D_{KL}(q_{Z|X}(z|x)\,||\,p_Z(z))$

If we assume that both the prior and the approximate posterior are Gaussian, we can analytically solve for the Kullback-Leibler divergence.

$$p_Z(z) = \mathcal{N}(0, I), \qquad q_{Z|X}(z|x) = \mathcal{N}(z; \mu, \sigma^2) \qquad (23)$$

Let $\mu \in \mathbb{R}^K$ and $\sigma^2 \in \mathbb{R}^{K \times K}$, with $k = 1, \dots, K$. The integral is then evaluated as

$$\begin{aligned}
-D_{KL}(q_{Z|X}(z|x)\,||\,p_Z(z)) &= \int q_{Z|X}(z|x)\left[\log p_Z(z) - \log q_{Z|X}(z|x)\right] dz \\
&= \int \mathcal{N}(z; \mu, \sigma^2)\left[\log \mathcal{N}(z; 0, I) - \log \mathcal{N}(z; \mu, \sigma^2)\right] dz \\
&= \int \mathcal{N}(z; \mu, \sigma^2)\log \mathcal{N}(z; 0, I)\,dz - \int \mathcal{N}(z; \mu, \sigma^2)\log \mathcal{N}(z; \mu, \sigma^2)\,dz \\
&= \left(-\frac{K}{2}\log(2\pi) - \frac{1}{2}\sum_{k=1}^{K}(\mu_k^2 + \sigma_k^2)\right) - \left(-\frac{K}{2}\log(2\pi) - \frac{1}{2}\sum_{k=1}^{K}(1 + \log\sigma_k^2)\right) \\
&= \frac{1}{2}\sum_{k=1}^{K}\left(1 + \log\sigma_k^2 - \mu_k^2 - \sigma_k^2\right) \qquad (24)
\end{aligned}$$

B Likelihood of data (VAE-approach)

Starting at the Kullback-Leibler divergence between our approximated posterior and our real posterior, we have

$$\begin{aligned}
D_{KL}(q_{Z|X}(z|x_i)\,||\,p_{Z|X}(z|x_i)) &= \mathbb{E}_{q_{Z|X}}\!\left[\log q_{Z|X}(z|x_i) - \log p_{Z|X}(z|x_i)\right] \\
&= \mathbb{E}_{q_{Z|X}}\!\left[\log q_{Z|X}(z|x_i) - \log\!\left(\frac{p_{X|Z}(x_i|z)\,p_Z(z)}{p_X(x_i)}\right)\right] \\
&= \mathbb{E}_{q_{Z|X}}\!\left[\log q_{Z|X}(z|x_i) + \log p_X(x_i) - \log p_{X|Z}(x_i|z) - \log p_Z(z)\right] \qquad (25)
\end{aligned}$$

From here we see that since the expectation is over the latent variable, we can move $\log p_X(x_i)$ to the LHS.

$$D_{KL}(q_{Z|X}(z|x_i)\,||\,p_{Z|X}(z|x_i)) - \log p_X(x_i) = \mathbb{E}_{q_{Z|X}}\!\left[\log q_{Z|X}(z|x_i) - \log p_{X|Z}(x_i|z) - \log p_Z(z)\right] \qquad (26)$$

Looking carefully at the RHS of equation (26), we see that it is possible to rewrite this with another Kullback-Leibler divergence, using $D_{KL}(q_{Z|X}(z|x_i)\,||\,p_Z(z)) = \mathbb{E}_{q_{Z|X}}[\log q_{Z|X}(z|x_i) - \log p_Z(z)]$.

$$\log p_X(x_i) = D_{KL}(q_{Z|X}(z|x_i)\,||\,p_{Z|X}(z|x_i)) + \mathbb{E}_{q_{Z|X}}[\log p_{X|Z}(x_i|z)] - D_{KL}(q_{Z|X}(z|x_i)\,||\,p_Z(z)) \qquad (27)$$

Finally we introduce $\mathcal{L}(x_i) = \mathbb{E}_{q_{Z|X}}[\log p_{X|Z}(x_i|z)] - D_{KL}(q_{Z|X}(z|x_i)\,||\,p_Z(z))$, and as such we have an expression for the log-likelihood.

$$\log p_X(x_i) = D_{KL}(q_{Z|X}(z|x_i)\,||\,p_{Z|X}(z|x_i)) + \mathcal{L}(x_i) \qquad (28)$$

C Theory of t-SNE, DBSCAN, k-means

C.1 t-distributed stochastic neighbor embedding (t-SNE)

t-SNE [12] is a nonlinear dimensionality reduction technique, used for example to visualize data with high dimensionality.

We are given some input data $X = x_1, \dots, x_N$, where each $x_i$ is high-dimensional. t-SNE starts by converting euclidean distances in high dimensions between our datapoints into conditional probabilities, representing similarity measures between the datapoints. From the original paper:

"The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$."

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i}\exp(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2)} \qquad (29)$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \qquad (30)$$

The reason we do not set $p_{ij}$ as in equation (29) is that it would cause problems when a high-dimensional datapoint $x_i$ is an outlier. Instead, we define the joint probabilities $p_{ij}$ to be symmetrized conditional probabilities according to equation (30). The algorithm performs a binary search for the $\sigma_i$ that produces a conditional distribution with a fixed perplexity specified by the user. Note that the bandwidth $\sigma_i$ is therefore smaller in denser parts of the input space.

After running this for each point, we will get an $N \times N$ matrix with high probabilities for nearby datapoints, and low probabilities for widely separated datapoints. The pairwise similarities in the low-dimensional map, $q_{ij}$, are given by employing a Student t-distribution with one degree of freedom,

$$q_{ij} = \frac{(1 + \lVert y_i - y_j\rVert^2)^{-1}}{\sum_{k \neq i}(1 + \lVert y_i - y_k\rVert^2)^{-1}} \qquad (31)$$

Note that just as in the high-dimensional case, we set $q_{ii} = 0\ \forall i$. The locations of the $y_i$ in the mapping are determined by minimizing the Kullback-Leibler divergence (see appendix A) of the distribution P from the distribution Q, that is

$$KL(P\,||\,Q) = \sum_{i \neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}} \qquad (32)$$

This minimization is performed using gradient descent, and the result is a map that reflects the similarities between the high-dimensional inputs well.

C.2 Density-based spatial clustering of applications with noise (DBSCAN)

DBSCAN [13] is a clustering algorithm that can also mark outliers which lie in low-density regions. It takes two parameters, minPts and ε, and in return clusters our datapoints with regard to these two parameters.

Let N be the number of points in some space, and let ε specify the radius around some point. In the DBSCAN algorithm, each point can be classified as follows:

• p is a core point if at least minPts points are within distance ε of it (including p).

• q is directly reachable from p if q is within distance ε from a core point p. Points are only said to be directly reachable from core points.

• q is reachable from p if there is a path $p_1, \dots, p_N$ with $p_1 = p$ and $p_N = q$, where each $p_{i+1}$ is directly reachable from $p_i$. This implies that all points on the path must be core points, with the exception of q.

• Two points p, q are density-connected if there is a point o such that both p and q are reachable from o. This is a symmetric relation.

• All points not reachable from any other point are outliers, also called noise points.

We can now define a cluster. If p is a core point, then it forms a cluster together with all points that are reachable from it. Each cluster must therefore contain at least one core point. Non-core points are also part of the cluster, but since they cannot be used to reach further points, they form the edge of the cluster. All clusters will now satisfy two properties:

1. All points in a cluster are mutually density-connected.

2. If a point q is reachable from some other point of a cluster, then q is part of the same cluster.

C.3 k-means clustering

k-means clustering is a method that aims to partition $n$ observations $(x_1, \dots, x_n)$ into the sets $S = \{S_1, \dots, S_k\}$, $k \leq n$, by minimizing the within-cluster variance (sum of squares), given as

$$\underset{S}{\operatorname{argmin}} \sum_{i=1}^{k}\sum_{x \in S_i} \lVert x - \mu_i\rVert^2 \qquad (33)$$
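Equation (33) is usually minimized approximately with Lloyd's algorithm, which alternates between assigning every point to its nearest centroid and recomputing each centroid as the mean of its assigned points; this is essentially what scikit-learn's KMeans, used in section 4.2.3, runs. A minimal NumPy sketch (my own illustration, not the thesis implementation):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # minimal Lloyd's algorithm for eq. (33); X has shape (n, d)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids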

D VAE-code (python)

D.1 MNIST VAE

import tensorflow as tf
from time import time
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Input, Conv2DTranspose, Flatten, Reshape
from tensorflow.keras.models import Model

class VAE(tf.keras.Model):

    def reparameterize(mean, log_var):
        # z = mean + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * log_var), cf. eq. (5)
        eps = tf.random.normal(shape=mean.shape)
        return eps*tf.exp(log_var * 0.5) + mean

    @tf.function
    def calc_loss(self, x):
        # negative ELBO: reconstruction loss + beta * KL regularization, cf. eq. (8)
        mean, logvar = self.transform(x)
        z = VAE.reparameterize(mean, logvar)
        x_rec = self.generate(z)
        rec_loss = 0.5*tf.reduce_sum(tf.math.squared_difference(x_rec, x), axis=[1, 2, 3])
        kl_loss = 0.5*tf.reduce_sum(tf.square(mean) + tf.exp(logvar) - logvar - 1, axis=1)
        return tf.reduce_mean(rec_loss + self.beta*kl_loss)

    def apply_gradient(self, x):
        # one gradient step on a batch
        with tf.GradientTape() as tape:
            loss = self.calc_loss(x)
        grad = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grad, self.trainable_variables))

    def __init__(self, latent_dim=2, beta=1, batch_size=32):
        super(VAE, self).__init__()
        self.latent_dim = latent_dim
        self.beta = beta
        self.batch_size = batch_size
        self.image_size = (28, 28, 1)
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-8)
        self.encoder = self.create_encoder_CNN_pool()
        self.decoder = self.create_decoder_CNN_pool()

    def create_encoder_CNN_pool(self):
        # CNN encoder: image -> (mean, log-variance) of the posterior
        inp = Input(shape=self.image_size)
        conv_1 = Conv2D(filters=32, kernel_size=3, strides=(1, 1), padding='same', activation='relu')(inp)
        conv_2 = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same', activation='relu')(conv_1)
        mp1 = MaxPool2D(pool_size=(2, 2), padding='valid')(conv_2)
        conv_3 = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same', activation='relu')(mp1)
        conv_4 = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same', activation='relu')(conv_3)
        mp2 = MaxPool2D(pool_size=(2, 2), padding='valid')(conv_4)
        fl = Flatten()(mp2)
        mean_out = Dense(self.latent_dim)(fl)
        log_var_out = Dense(self.latent_dim)(fl)
        enc_model = Model(inp, [mean_out, log_var_out])
        return enc_model

    def create_decoder_CNN_pool(self):
        # transposed-CNN decoder: latent vector -> reconstructed image
        inp_c = Input(shape=(self.latent_dim,))
        den = Dense(units=7*7*32, activation='relu')(inp_c)
        rs = Reshape(target_shape=(7, 7, 32))(den)
        convt_1 = Conv2DTranspose(filters=32, kernel_size=3, strides=(2, 2), padding='same', activation='relu')(rs)
        convt_2 = Conv2DTranspose(filters=64, kernel_size=3, strides=(1, 1), padding='same', activation='relu')(convt_1)
        convt_3 = Conv2DTranspose(filters=64, kernel_size=3, strides=(2, 2), padding='same', activation='relu')(convt_2)
        convt_4 = Conv2DTranspose(filters=64, kernel_size=3, strides=(1, 1), padding='same', activation='relu')(convt_3)
        image_out = Conv2DTranspose(filters=1, kernel_size=3, strides=(1, 1), padding='same', activation='sigmoid')(convt_4)
        dec_model = Model(inp_c, image_out)
        return dec_model

    def generate(self, z):
        return self.decoder(z)

    def transform(self, images):
        return self.encoder(images)

    def sample_image(self, n):
        z_sample = tf.random.normal(shape=(n, self.latent_dim))
        return self.generate(z_sample).numpy()

    def train(self, epochs, x_dataset, x_test):
        for ep in range(1, epochs+1):
            start = time()
            for x_batch in x_dataset:
                self.apply_gradient(x_batch)
            loss = tf.keras.metrics.Mean()
            for test_x in x_test:
                loss(self.calc_loss(test_x))
            elbo = -loss.result()
            print('Epoch: {}, Test set ELBO: {}, Time: {}'.format(ep, elbo, time()-start))

    def save_model(self, name):
        string_enc = name + '_{}d_encoder.h5'.format(self.latent_dim)
        string_dec = name + '_{}d_decoder.h5'.format(self.latent_dim)
        self.encoder.save(string_enc)
        self.decoder.save(string_dec)

References
