• No results found

Towards Understanding Capsule Networks

N/A
N/A
Protected

Academic year: 2021

Share "Towards Understanding Capsule Networks"

Copied!
69
0
0

Loading.... (view fulltext now)

Full text

(1)

Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2020

Towards Understanding

Capsule Networks

(2)

Master of Science Thesis in Electrical Engineering

Towards Understanding Capsule Networks:

Johan Edstedt LiTH-ISY-EX--20/5309--SE Supervisor: M.Sc Gustav Häger

ISY, Linköpings universitet

Examiner: Prof. Michael Felsberg

ISY, Linköpings universitet

Computer Vision Laboratory Department of Electrical Engineering

Linköping University SE-581 83 Linköping, Sweden Copyright © 2020 Johan Edstedt

(3)

Abstract

In this thesis capsule networks are investigated, both theoretically and empiri-cally. The properties of the dynamic routing [42] algorithm proposed for capsule networks, as well as a routing algorithm in a follow-up paper by Wang et al. [50] are thoroughly investigated. It is conjectured that there are three key attributes that are needed for a good routing algorithm, and these attributes are then related to previous algorithms. A novel routing algorithmEntMin is proposed based on the observations from the investigation of previous algorithms.

A thorough evaluation of the performance of different aspects of capsule net-works is conducted, and it is shown that EntMin outperforms both dynamic routing and Wang routing. Finally, a capsule network using EntMin routing is compared to a very deep Convolutional Neural Network and it is shown that it achieves comparable performance.

(4)
(5)

Acknowledgments

Thanks to my supervisor Gustav Häger for answering my many questions on slack and general discussions, as well as reviews of my drafts. Thanks to my examiner Michael Felsberg for inspiration and mathematical guidance as well as reviews of my drafts, and for the original topic of the thesis. Thanks to the Computer Vision Laboratory at LiU in general for being super nice to me and giving me a nice desk and also providing the GPUs necessary. Thanks to my opponent Ida Bjerwe for the final review. Thanks to my parents for helping me move during this period. Who knew moving everything in an apartment could be difficult? Special thanks to Felix Pahl (joriki) for helping out with a proof. No thanks to Corona who forces me to sit at home and write this instead of having a good time at CVL.

Linköping, May 2020 Johan Edstedt

(6)
(7)

Contents

Notation xi

1 Introduction 1

1.1 Motivation . . . 1

1.1.1 What is a Capsule Network? . . . 2

1.1.2 What is a Routing Algorithm? . . . 3

1.2 Aim . . . 3 1.3 Problem Statement . . . 3 1.4 Delimitations . . . 3 1.5 A Wider Perspective . . . 4 1.6 Thesis Outline . . . 4 2 Related Work 5 2.1 Invariance vs Equivariance . . . 5 2.2 Capsule Networks . . . 6 2.3 Grouping Neurons . . . 7

2.4 Unbiased Decoding and Connections to Population Codes . . . 8

3 Theoretical Analysis 9 3.1 The Specifications . . . 9

3.2 Introduction to Routing by Agreement . . . 10

3.3 Baseline . . . 11 3.3.1 Analysis . . . 11 3.3.2 Issues . . . 11 3.4 Dynamic Routing . . . 12 3.4.1 The algorithm . . . 12 3.4.2 Prediction Matrices . . . 13

3.4.3 Log Coupling Update . . . 15

3.4.4 Iterative Norm Increase . . . 17

3.4.5 The Squash Function . . . 20

3.4.6 Connections to Denève’s Scheme and Bias . . . 21

3.4.7 Suggested Weight Initialization and Normalization Schemes 23 3.4.8 Conclusions . . . 24

(8)

viii Contents

3.5 Wang Routing . . . 24

3.5.1 The Algorithm . . . 24

3.5.2 Prediction Matrices . . . 25

3.5.3 Log Coupling Update . . . 26

3.5.4 The α Parameter . . . . 27

3.5.5 TheRelative Squash Function . . . . 29

3.5.6 Conclusions . . . 29 3.6 Time-complexity Issues . . . 29 3.7 Proposed Routing . . . 30 3.7.1 Motivation . . . 30 3.7.2 EntMin . . . 30 3.7.3 The Algorithm . . . 31 3.7.4 Analysis . . . 31 3.7.5 Conclusion . . . 32 3.8 Capsule Pose . . . 33

3.8.1 Vector vs Matrix Capsules . . . 33

4 Experiments 37 4.1 Experimental Setting . . . 37

4.1.1 Dataset . . . 37

4.1.2 Ensuring Statistical Significance . . . 39

4.1.3 Data Augmentation . . . 39

4.1.4 Metric . . . 39

4.1.5 Architecture . . . 39

4.1.6 Optimizer . . . 41

4.1.7 Loss . . . 42

4.2 Comparison of Norm Based Routing Algorithms . . . 42

4.2.1 Routing . . . 42

4.2.2 Results . . . 43

4.2.3 Discussion . . . 43

4.3 Vector vs Matrix Pose . . . 44

4.3.1 Setup . . . 44

4.3.2 Results . . . 44

4.3.3 Discussion . . . 44

4.4 Optimal α for Entropy Minimization Routing . . . . 45

4.4.1 Setup . . . 45 4.4.2 Results . . . 45 4.4.3 Discussion . . . 45 4.5 Going Deeper . . . 46 4.5.1 Results . . . 46 4.5.2 Discussion . . . 48 5 Discussion of Methods 49 5.1 Method . . . 49 6 Conclusions 51

(9)

Contents ix

6.1 Future Work . . . 51

(10)
(11)

Notation

Capsule Terminology Notation Meaning

C The set of all child capsules in the lower layer |C| The total number of child capsules

Ci A specific capsule in the child layer

P The set of all parent capsules in the upper layer |P| The total number of parent capsules

Pj A specific capsule in the parent layer

ˆ

Pji The prediction of Pjby Ci

Mji The transformation(prediction) matrix between Ci

Pj

D The dimensionality of a capsule

Neural Network Terminology Abbreviation Meaning

NN Neural Network

FCN Fully Connected Neural Network CNN Convolutional Neural Network

(12)

xii Notation

Mathematical Notation Notation Meaning

N The natural numbers

R The real numbers

RN The N-dimensional euclidean space

Mm The m-simplex

N The Gaussian Distribution χ2 The chi-squared Distribution h· , · i Inner product between two vectors

||· ||p The p-norm of a vector (usually the 2-norm)

||· || The 2-norm of a vector or the spectral norm of a matrix ||· ||F The Frobenius norm of a matrix

E Expectation of a random variable L A loss function

Mathematical Terminology Abbreviation Meaning

(13)

1

Introduction

In this chapter the concept of capsule networks will be introduced and the aim and scope of this thesis will be presented.

1.1

Motivation

Humans see the world with the help of millions of light sensitive receptors in our eyes, but each of these only see a tiny fraction of the world at any given time, how can this information be converted to a deep understanding of what is going on around us? It turns out that our brains are incredible pattern matching machines of visual input. The electrical signals coming in through the optic nerve to the back of the brain get transformed in an hierarchical fashion [29, 36, 45, 46] through a network of neurons. This transformation makes us able to understand complicated geometry, read, estimate distance, recognize faces etc. all from just the tiny amount of light that reaches our retinas.

A highly interesting question is if this understanding can be emulated in ma-chines. For a long time the answer was thought to be no. Recently however, artificial neural networks have for the first time been able to compete with hu-man understanding on a number of different tasks involving visual understand-ing [4, 9, 41]. Arguably, the biggest breakthrough was usunderstand-ing convolutional ar-tificial neural networks with a large amount of layers. Such networks are of-ten called convolutional neural networks (CNNs) (the word artificial is ofof-ten dropped). These networks only have connections between neurons in local neigh-bourhoods1and the connections are also strictly between consecutive layers and

no intra-layer connections exist, i.e. the connections between two consecutive

1Neighbourhoods can be interpreted here as tightly connected spatially in the Euclidean sense.

(14)

2 1 Introduction

layers constitute a bipartite graph. Computing a layer from a previous layer can in fact be thought of as correlating a set of (usually) small filters with the previous layer. An important property of CNNs is that they are a1-pass type of network, i.e. only the current input has an effect on the output. This is quite different from humans, who usually work in some kind of temporal context.

One of the great successes of deep CNNs come from the application of max-pooling over the layers of the network [44]. This gives empirically good results, but is in a theoretical perspective quite troublesome. The max-pooling operation simply outputs the largest elements in a local neighbourhood for all neighbour-hoods in the layer, which means that a large amount of information is lost in this step.

An alternate framework to CNNs was in parallel suggested by Hinton et al. [11], called a capsule network. These networks do not use max-pooling and work on different principles than CNNs, but still share many similarities.

In this thesis an investigation of capsule networks on the task of image classifi-cation will be conducted, which is the task of answering which type of object is present in an image.

1.1.1

What is a Capsule Network?

Most neural networks transform input vectors to output vectors with some com-bination of matrix multiplications and non-linear functions2, where the non-linearities are almost alwayspoint-wise i.e. the non-linearity performs the same operation independently on each element of the vector. Usually the elements of the intermediate representation will be referred to as the activation of a neuron, which comes from the neurological equivalent in the brain.

Capsule networks are different from regular neural networks in some important ways. In capsule networks a predefinedcollection of neurons in an arbitrary layer will be referred to as a capsule, importantly there should be more than one cap-sule, i.e. the collection should be smaller than the entire set of neurons in the layer. The combined activations of the neurons in a capsule will be referred to as pose. This notation will be used as it is common in the capsule network literature. While pose is most often used to describe the orientation of an object in 3d space (e.g. rotation), in this thesis it will be used to denote a more generalpose-space where pose refers to an arbitrary set of pose coordinates. For further motivation for the term see e.g. [42]. Further, child capsules and parent capsules will be re-ferred to, which simply means the capsules in layer ` and capsules in the layer ` + 1 respectively, where ` ∈ N is arbitrary.

Now assume that the pose of a capsule in some way reflects the underlying varia-tion in some kind of object. A common example is a triangle object, with the pose representing all possible in-plane rotations, but the pose can in fact represent arbitrary variations as mentioned previously. The main assumption of a capsule

(15)

1.2 Aim 3

network is that only one such object can be present in each spatial location in the image. This assumption was motivated by thecrowding phenomenon [37, 42]. A capsule network can be used in several different settings such as classification, generative modelling, and unsupervised learning. In this thesis the investigation will be restricted to the classification setting. In the classification setting a feed-forward network of capsules is used where child capsules infer the pose of the parent capsules sequentially. This inference process is called arouting algorithm. For classification it is then imposed that the final layer capsules represent the desired output classes.

1.1.2

What is a Routing Algorithm?

For capsule networks to work, some way to infer the pose of the parent capsules is necessary, given the pose of the child capsules. One way to do this is to let each child capsule predict the pose of each parent capsule, and then compare the predictions. If all predictions for a certain parent capsule are similar then it is assumed that such a capsule is likely. Note here that the pose is givenfor free simply by finding the best agreement of predictions.

1.2

Aim

While there is a large amount of literature on capsule networks with numerous, see e.g. [11, 12, 39, 40, 42, 48–50], proposed routing algorithms, less focus has been put on understanding what distinguishes these algorithms.

An important direction of research therefore seems to be to rigorously investigate the properties of routing algorithms to gain a better understanding of their inner workings. The hope is that by further understanding the behaviour of routing algorithms, a grounded approach to routing can be discovered.

1.3

Problem Statement

With the previous discussion in mind, the goal of this thesis is to:

1. Investigate theoretical and empirical properties of some current routing al-gorithms.

2. Propose ways of stabilizing/improving current algorithms.

3. Propose a routing algorithm which reconciles issues with previous routing algorithms.

1.4

Delimitations

The analysis will be constrained to deal with the class of norm-based routing algorithms, specifically dynamic routing [42] and Wang routing [50].

(16)

4 1 Introduction

One reason for this is that it is not feasible to analyze the properties of the many different forms of routing that has been proposed, so some kind of restriction on the scope of the analysis need to be formed.

Norm-based routing algorithms was chosen in particular, partly because it is the basis for dynamic routing [42] which has had a large influence on capsule net-work literature, and partly because it was found that these methods to exhibit interesting properties and easily lend themselves to analysis.

In the experimental section investigation is restricted to performance on the well known image classification dataset CIFAR-10 [21]. One might also consider other datasets and other visual understanding tasks. CIFAR-10 was chosen partly be-cause it is very common in the capsule network literature, which makes this work easier to compare to previous methods, and partly because it is a dataset with a reasonable computational cost for training and evaluation. The latter part is im-portant in cases of constrained computational resources3.

1.5

A Wider Perspective

The narrow field of capsule networks may seem irrelevant to society in general. However, the quest to improve learning in machines is one that has a huge impact on the world. Machines understanding language is what makes google translate possible. CNNs are vital for self driving cars. Facial recognition may allow robots to interact with humans, criminals to be caught, and governments to develop sophisticated surveillance systems to control their population.

While this work is connected to no specific such application, works like these in general enable further development of all these technologies, with all the benefits and all the drawbacks that they entail.

Any research in a field with such potential should be conducted in an ethical way. It is the responsibility of the researcher to make sure that the research they conduct is not used for evil.

1.6

Thesis Outline

In Chapter 2, a short summary of related work in the field will be presented to-gether with some theory. This is followed by Chapter 3, in which a theoretical analysis is conducted, and a novel routing algorithm is proposed. In Chapter 4, empirical experiments are conducted, related to the theoretical analysis con-ducted in Chapter 3. In Chapter 5, a discussion of the methodology and a wider context of the thesis are given. Finally, in Chapter 6, conclusions are drawn, and possible future work is discussed.

3Larger computer vision datasets such as ImageNet [15] may take multiple days on multiple GPUs to train a single model

(17)

2

Related Work

In this chapter important background context along with previous research in the field will be discussed.

2.1

Invariance vs Equivariance

Formal mathematical definitions of invariance and equivariance will be avoided, and instead an intuitive explanation of the concepts will be given. An object is in-variant [5] under some transformation if none of its properties change under the transformation, e.g. rotating a ball gives exactly the same ball. An object is equiv-ariant [5] under some transformation if its properties change in a predictable way under that transformation. Imagine painting a picture upside down, for a person standing upright they could manipulate the painting to the correct orientation by simply rotating the paper by 180 degrees.

Arguably the success of CNNs is due to the almost-equivariance to translation of the input, this benefit comes in multiple forms. The computational cost of CNNs is orders of magnitude cheaper than an equivalent FCN, and the equivariance provides a stronginductive bias [31] which has empirically been shown to be very important for model performance, resulting in neural networks surpassing hu-man level recognition on some tasks [8].

Now, while the assumptions underlying CNNs are nowadays noncontroversial, equivariance under other transformations than translation is a more contentious topic. Consider an input which has been rotated by 30 degrees. How should the representation change in response to this?

(18)

6 2 Related Work

There are currently two views1 on how the representation should adapt. One view is that the representation should be invariant to certain transformations such as scale and rotation. This is exemplified in some hand-engineered features e.g. SIFT [28]. More recent examples include max-pooling in CNNs [44], and spa-tial transformer networks [14] whichinverse transform the input to some canon-ical frame. The motivation for invariance is that the transform is of no use for recognition and only makes learning more difficult. In this view, the rotation of the input should have no effect on the output of the network.

The other view is that true recognition demands preserving information about the transformation, and instead propose that the network should beequivariant to certain transformations. This is exemplified in group equivariant CNNs [1], where the network is explicitly constructed to be equivariant to a set (group) of transformations. In this view rotating the input should in some sense be the same as rotating the output.

Equivariance is one of the motivations behind capsule networks, where low level capsules code for things such as rotation and distribute the information to higher level capsules [42]. In more recent capsule literature, such as [24] and [49], equiv-ariance is enforced more explicitly by using provably equivariant transforma-tions.

2.2

Capsule Networks

Capsule networks were first suggested by Hinton et al. (2011) [11]. They pro-posed a distributed network architecture of self-contained units which they coined capsules. The task they were investigating was a special type of auto-encoder of which the input data was to be transformed by some given transform. This task is quite easy for simple transformations such as in-plane rotations and translations (there exists a simple linear transformation of the data), but for 3d-rotations there is no such formula. Their solution was to by backpropagation learn recognition capsules andgeneration capsules, which they related via a given transform. From training on a small subset of the given transformation group, the system could then generalize to the entire transformation group, e.g. if trained on 30 degree rotations, the system could extrapolate to 45 degree rotations. Routing in their case was conducted as a weighted sum operation using activation probabilities as weights, where the activation probabilities where determined by the recognition capsules [11].

Training with such explicit transformations in some sense implicitly forces the capsules to be equivariant under the transformation, howevertrue equivariance in the mathematical sense is not explicitly guaranteed [24].

The next major step in capsule networks was the work by Sabour et al. (2017) [42], where the transformation between capsules was not explicitly given, but rather also learned during training. This relaxation led to capsules being able to

(19)

2.3 Grouping Neurons 7

learn much more complicated relationships and gave state of the art results on MNIST [23], however the connection to transformation matrices was more diffuse and it is not clear if there is any equivariance (in the strict sense) in the network. They also changed the way of representing the activation probability by using the norm of the capsules.

The work of Sabour et al. [42] unleashed a torrent of papers expanding on the idea. Wang et al. [50] reframed routing as a more formal optimization prob-lem. Hinton et al. [12] re-introduced explicit activation probability parameters as well as EM-routing and also employed convolutional routing for the first time. Expanding on those ideas Ribeiro et al. [40] introduced priors on the parameters of the parent capsules and used variational bayes to infer the parent poses. In parallel, equivariant capsule routing was developed by Lenssen et al. [24], with further development by [49]. Recently Kosiorek et al. [20] produced an unsu-pervised version of capsule networks resulting in state of the art unsuunsu-pervised classification on MNIST. Li et al. [25] achieved state of the art capsule perfor-mance on CIFAR-10 [21] and CIFAR-100 [21] by using a much deeper network combined with a master/aid scheme and optimal transport regularization. A justified question is how to form the first capsules. Typically the first capsule layer is not made from simple pixel-data, although there is nothing inherently preventing this. In most cases the first capsule layer is constructed using some kind of feature extraction, usually some kind of shallow CNN back bone [42], although some have used deeper networks [35, 48].

Predicting a parent capsule can be done in multiple ways, in dynamic routing [42] a transformation matrix is trained by backpropagation for each child parent pair, and most other works [12, 40, 50] follow a similar approach. Some authors have however used shallow NNs [39, 49], and others equivariant transformations [24].

Usually there is some explicit way of representing the probability of a capsule, one exception is [48]. In [42] this is done by measuring the norm of the vector, in the rest of this thesis it will be assumed that the norm is linearly proportional to the probability that the capsule is active i.e. ||Pj|| ∝p(Pj). In other works, such as

[11],[12] and [8], the probability is explicitly represented as a separate feature.

2.3

Grouping Neurons

Besides capsule networks, there has also been a more general interest lately of grouping neurons together.

Insteerable CNNs [2], certain subsets of the filter banks are referred to as capsules, in that case referring to certain irreducible subspaces with regard to the dihedral group D4.

Another approach is proposed by He et al. [52], who propose to normalizegroups of neurons. They argue that small numbers of neurons may represent

(20)

semanti-8 2 Related Work

cally similar features, and therefore can be normalized together. They show em-pirically that group normalization outperforms batch normalization [7] on small batches.

2.4

Unbiased Decoding and Connections to

Population Codes

A surprising connection which will be shown in detail in Chapter 3, is that be-tween capsule networks andpopulation codes [6, 30].

In neuroscience, a population code is defined as the joint activation of a set of neurons with (usually) overlappingtuning curves with the peak of the curve indi-cating the preferred input of the neuron.

Estimating parameters from the population code by means of decoding can be done using e.g. maximum likelihood estimation if the shape of the tuning curves is known [3]. However, it is biologically implausible that the brain would be con-ducting ML estimation, and therefore Denève et al [3] suggested a biologically plausible neural approximation of the ML-estimate. They showed that their de-coding process performed closely to the ML-estimate for Gaussian tuning curves. However, it was later shown by Felsberg et al. [4], that this decoding process is heavily biased no matter how the parameters are set towardschannel centers, i.e. the peak of the tuning curve.

In Chapter 3 we will show that dynamic routing [42] has deep similarities to the Denève scheme, pertaining to the update equations in the iterative scheme, with the implication being that this routing process may also be biased towards channel centers. This would mean that the routing process is biased towards giving a certain set of solutions, which may be detrimental to learning.

(21)

3

Theoretical Analysis

In this chapter a number of specifications that a routing algorithm should fol-low will be set up. There will be a discussion of each one and how it relates to previous work. Then the theoretical and empirical properties of norm based routing algorithms will be investigated. A novel routing algorithm will be pre-sented and investigated. Finally an investigation of the capsule pose space will be undertaken.

In all but the last section, the capsule pose space will also be assumed to be vector valued. In the last section, an alternative pose space based on matrices will also be investigated.

A lot of the terminology introduced in the notation chapter will be used. To make the notation less heavy, writing e.g. Pjwill always refer to Pjj unless otherwise

specified.

3.1

The Specifications

In this section I will introduce some specifications which it seems sensible that a routing algorithm should follow. Note that these specifications are based on the subjective opinions of the author, however each of them will be properly moti-vated.

I The routing should emphasize parent capsules which are more likely to be in the image, while suppressing the others.

II The routing should be robust to noise, small changes in the input space should not lead to drastically different outcomes.

(22)

10 3 Theoretical Analysis

III The routing should be able to account for uncertainty in the estimate. The first specification is arguably the fundamental purpose of routing in a Cap-sule Network, see e.g. [42]. Child capCap-sules predict the parent capCap-sule poses, and through routing find which parent is likely to be in the image and its correspond-ing pose.

The second specification is also intuitively important. For example, a small change in the prediction of a single child capsule should not dramatically change the re-sulting route. If this was the case the routing would be extremely unstable and learning difficult.

The third specification is important in cases where there is noground truth par-ent capsule. If it is not obvious which parpar-ent capsule should be activated, the resulting route should somehow reflect this. This is strongly related to the sec-ond which will later be shown.

The first specification is fundamentally at odds with the second and third. This is not immediately obvious. Later in this chapter it will be shown that the same mechanism that emphasizes the correct capsules may also increase noise uncon-trollably and completely disregard uncertainty in the data.

The list of specification is admittedly in some sense subjective. There is no way to exhaustively list all wanted properties that a routing should exhibit. With that said, all listed specifications should be somewhat non-controversial and neces-sary for any well performing routing algorithm.

3.2

Introduction to Routing by Agreement

Both Dynamic routing and Wang routing that will be discussed in this chapter, are based on the notion ofrouting by agreement. It is worth to first get an under-standing of whatagreement means, for an easier understanding of the discussion that is to follow.

The basic premise of routing by agreement is to increase connection strength between parent and child capsules that in some sense agree. The issue is of course that the pose of the parent capsules is not known, and should be inferred by the child capsules. As previously discussed, this is typically done by simply giving an initial prediction of parent poses as the unweighted sum of child capsules multiplied with their respective prediction matrices.

The role of the routing is then to iteratively update the belief in each parent capsule by comparing the inner product of each child capsule prediction with each parent capsule. Importantly the routing is biased in such a way that each child prefers to be connected to only 1 parent (in each spatial location). Then the updated parent-child couplings can be used to alter the parent pose as a weighted sum of predictions. This process is quite similar to the well known expectation maximization type algorithms.

(23)

3.3 Baseline 11

The idea is that by adjusting routing between children and parents, a better es-timate of the pose of the parent capsules can be produced. Later the validity of this assumption will be discussed.

The discussion and analysis will be limited to norm based routing algorithms, as previously discussed in the introduction chapter.

3.3

Baseline

In Algorithm 1, a very simple routing algorithm is presented, it similar to an algo-rithm presented in [11] however here the probability is represented as the norm instead of as a separate value. It is based on the assumption that the sum of pre-dictions will be large when there is high agreement. Observe that the connection strength is uniform, in contrast with dynamic and Wang routing.

Algorithm 1Unweighted sum routing Input: C, the set of child capsules Output: P, the set of parent capsules

1: function Unweighted Sum Routing(C)

2: Pˆji = MjiCi

3: Pj =Piji

4: return P

5: end function

3.3.1

Analysis

Assume all predictions are N (0,D1). Assuming that for a parent capsule j the pre-dictions are i.i.d., then the expected norm is simply ||Pj|| =

√ |C| |C| = 1 √ |C|, whereas

for a parent capsule with identical predictions, the expected norm of ||Pj|| = 1.

Hence a factor of √

|C|difference in resulting norm is received.

3.3.2

Issues

There is an obvious issue: If linear routing is used, with no non-linearities, several capsule layers are equivalent to some implicit linear transform. In fact, if the capsule layer is fully connected the matrix can be formed explicitly in a familiar form:           P1 .. . P|P|           =           M11 . . . M1|C| .. . . .. ... M|P|1 . . . M|P||C|                     C1 .. . C|C|           (3.1)

(24)

12 3 Theoretical Analysis

Which leads to the conclusion that this routing is nothing else than a special case of an affine layer, with 0 bias.

In the context of capsule networks the 0 bias can be interpreted as having the capsule pose space be 0 centered for all capsules.

3.4

Dynamic Routing

In this section the dynamic routing procedure proposed by Sabour et al. [42] will be investigated. The main difference between the previously proposed linear routing algorithm is that the previously static affine transform between child and parent capsules will now beadaptive, depending on the pose of the child capsules. We begin by presenting the dynamic routing algorithm.

3.4.1

The algorithm

In Algorithm 2, the dynamic routing procedure is presented. Algorithm 2Dynamic Routing [42]

Input: C, The set of child capsules Output: P, The set of parent capsules

1: procedure Dynamic Routing(C)

2: Pˆji = MjiCi . Compute predictions

3: Bji = 0 . Initiate log coupling to 0

4: fork iterations do

5: Rji =

exp(Bji)

P

jexp(Bji) . Compute coupling weights

6: Pˆj =P

iRjiPˆji . Predict all parent poses

7: Pj = squash( ˆPj)

8: Bji= Bji+ hPj, ˆPjii . Update log coupling

9: end for

10: Pj= squash( ˆPj)

11: return P

12: end procedure

The squash function is given by squash( ˆPj) =

||Pˆj||ˆPj 1+|| ˆPj||2.

The dynamic routing procedure is based on the premise that agreement between parents and children can be measured using their inner product. This causes chil-dren with positive inner products to parents to reinforce their connections, since the updated pose of the parents will use a weighted sum based on the connection strength between the child-parent pairs.

The squash function is theoretically unmotivated, but intuitively it further re-duces the norm of parent capsules that have a smaller norm. One way to view

(25)

3.4 Dynamic Routing 13

the squashing function is that it increases the convergence speed of the routing al-gorithm, however as will be shown there are a lot of issues with this non-linearity which may not be obvious.

In the following sections 3 crucial parts of the algorithm will be discussed. First the size (norms) of the prediction matrices (Mji), and how it affects the algorithm

will be discussed. Then the actual log coupling coefficient update, and behaviour of the algorithm when it is iterated will be discussed. Then an analysis of the squash function will be conducted and a discussion if it fulfills its purpose will follow. Then parallels between dynamic routing and the Denève scheme will be drawn. Finally, practical adjustments of the algorithm to stabilize the routing, motivated by the previous section will be suggested. The specifications intro-duced previously will be kept in mind.

3.4.2

Prediction Matrices

First the effect of the norm of the prediction matrices on the log coupling coef-ficient update will be shown, and hence the update of the coupling coefcoef-ficients. This is an interesting issue, because the scale (which will be denoted by λ) of the log couplings is proportional to the confidence in the coupling coefficients in the following sense: Rji = exp(λBji) P jexp(λBji) (3.2) The size of α therefore dictates how strongly children are assigned to parents. Assume layers C, P with child and parent capsules, where Ci and Pjcorresponds

to individual capsules in the child and parent layers respectively. Further, assume that Rji = |1P|. Let ˆPji be the prediction from Ci to Pj. Denote the individual

prediction matrices as Mji, and assume that their respective elements are i.i.d.

Gaussian with a variance α2. Now assume that the individual child capsules are Ci ∼ N0, I

D



, which gives an expected squared norm of Eh||Ci||2i= 1

1.

Now each ˆPji ∼ N 0,

MjiIMTji D

!

which in expectation over prediction matrices give N0, α2Isince any affine transformation of a multivariate Gaussian is another Gaussian. At the last step the individual elements of Mji has been assumed to be

i.i.d. zero mean Gaussian, which makes the outer product diagonal in the sense of expectation with diagonal elements α2 where α is the standard deviation of the elements of the matrix.

Using that the sum of multivariate Gaussian variables is a Gaussian with the sum of the covariances as its covariance it is clear that:

ˆ Pj = 1 |P| X i ˆ Pji ∼ N 0,α 2|C| |P|2 I ! := N0, σ2I (3.3)

(26)

14 3 Theoretical Analysis

Figure 3.1:Log10 plot of the expected norm of the squashed vector depend-ing on sigma.

Now this vector is squashed, as described in Algorithm 2. The expected norm of the resulting squashed vector can be estimated by noticing that the distribution of || ˆPj||22is a scaled Chi-square distribution. Hence the expected value of || ˆPj||22=

σ2D. Since 1+qq , q > 0 is a concave function, Jensen’s inequality can be used to give an upper bound:

E[||Pj||] ≤ E[|| ˆPj||2] 1 + E[|| ˆPj||2] = σ 2D 1 + σ2D (3.4)

For a lower bound a lower estimate of the squash function can be formulated by:

||Pˆj||2 1 + || ˆPj||2 ≥        0, if || ˆPj||2≤q0 q0 1+q0, otherwise (3.5)

By again using the properties of the chi-square distribution and the above equa-tion, it can be concluded that:

E[||Pj||] ≥ sup q0 q0 1 + q0 χ2(x ≥ q0 σ2) = supq 0 q0 1 + q0      1 − γ(D/2, q0 2) (D/2 − 1)!       (3.6)

Where γ(N , x) is the incomplete gamma function. In Figure 3.1 the upper and lower bound is plotted for different values of sigma to give an intuitive sense for how the norm behaves. As the lower bound and upper bound are quite tight

(27)

3.4 Dynamic Routing 15

Pj will for the rest of this section approximated by the upper bound, i.e. Pj

N(0, 2D)2

(1+σ2D)2DI)

Finally Bji = hPj, ˆPjii. First, it can be seen that Pj is dependent on ˆPji by

con-struction. Decomposing Pj to Pij+ P

¬i

j , where by ¬i means all indices except for

i. The Bji = hPij+ P

¬i

j , ˆPjii, where it is easy to see that the expression is a sum of a

scaled χ2+ some other distribution. Since ˆPji, P¬jiare independent, the variance

is finite, and assuming that D >> 1 the Central Limit Theorem can be applied1, henceforth simply CLT, to approximate the other distribution as a Gaussian. The sum of these distributions in a strict sense is neither Gaussian nor χ2, how-ever it will be approximated by the Gaussian distribution shifted by the expected mean of the χ2distribution.

Since Pj i is isotropic Gaussian, the expected mean of the scaled χ2 distribution

is given by the standard deviation of Pij and ˆPjimultiplied by the capsule

dimen-sionality. Cov(Pji) ≈(1+σ22D)D)22|C|DI, Cov( ˆPji) = α 2I E h hPi j, ˆPjii i = (1+σ22D)α D) √ |C|D · D = 2D)αD (1+σ2D)√|C|

Using CLT for D >> 1, the resulting Gaussian distributions covariance is given by the covariance of P¬jiPjand ˆPjimultiplied by the capsule dimensionality. N  0,(1+σ22D)D)22D· α2· D  = N  0,(σ(1+σ2D)22D)α22  Bji ∼ N 2D)αD (1 + σ2D)√|C|, 2D)2α2 (1 + σ2D)2 ! (3.7) To show that equation 3.7 is quite accurate the distribution for some different choices of α was simulated, and the results are shown in Figure 3.2.

It can clearly be seen that α linearly affects both the standard deviation and the mean of the estimate. It can also be seen that almost all variables in some way affect the resulting distribution, in a very intricate manner.

Until now only a single iteration of the update has been analyzed. An important question to answer is what happens with continued iteration of the algorithm.

3.4.3

Log Coupling Update

A full theoretical analysis of the the convergence of routing is quite difficult. All factors depend on previous iterations in a complicated non-linear way. In this

1In the strict sense D → ∞ is required for the distribution to convergence to a Gaussian distribu-tion almost surely, but in practice the distribudistribu-tion will usually converge to a Gaussian very quickly

(28)

16 3 Theoretical Analysis

Figure 3.2: Simulated routing iterations for different α, for low α there is a slight overestimate in the variance because of the upper bound approxima-tion. Note the scale change of Bji

(29)

3.4 Dynamic Routing 17

thesis a simpler route was chosen and instead some properties of the process with some toy examples was illustrated.

Assume that ˆPji ∼ N(µ, σ2I) i.i.d. This assumption may not hold true for the

predictions made by the child capsules, but it serves to illustrate the properties with the update. Two cases will be considered, one case where the underlying distribution is zero-centered, and another where it is not. Both these cases exhibit inherent uncertainty. In the first case there is no agreement in any parent capsule, and in the second case there is a high agreement in all capsules.

It can be argued that it is undesirable that a routing algorithm should discrim-inate between different parent capsules in either of these cases, since there is inherent uncertainty over which parent capsule is actually present in the image. Since the norms will differ quite a bit in the former case, it could be argued that it is less detrimental if discrimination occurs in that case.

In Figure 3.4 the results of applying the Dynamic Routing algorithm for a fixed number of iterations is presented. For µ = 0 an interesting phenomenon can be observed, each cluster eventually gets a certain amount of childrenhard assigned2 , which causes the sum to approach an integer.

Additionally, the per-child parent coupling distribution after the final iteration is plotted in Figure 3.3, to further show the issue of coupling polarization. Here it is easy to see that for µ = 0, all children end up with hard assignments (Rji = 0/1).

A much faster convergence for the zero centered distribution can also be observed, which intuitively makes sense since the high agreement in the low variance exam-ple will cause all log-priors to grow in a similar way, making the discrimination slower.

The conclusion is that the algorithm amplifies the noise inherent in the data to force out uncertainty. This may be detrimental to learning, since high routing iterations in early learning will give arbitrary weights to certain capsules, without any underlying cause.

In [50], it is shown that the update of the coupling strength is regularized by the Kullback Leibler Divergence, henceforth KL-divergence, between the previ-ous couplings and the new ones. From the empirical analysis a clear connection can be seen. The coupling coefficients between iterations is always similar. How-ever this also makes the algorithm inherently unstable. Constantly shifting the regularizing distribution makes the algorithm overconfident, and as can be em-pirically observed force hard assignments to children. In this regard [35] come to similar conclusions as in this thesis, where they highlight the issue with finding non-polarized solutions.

3.4.4

Iterative Norm Increase

A final issue is the issue of norms increasing as a direct result of the polarization of the coupling coefficients. To begin, a short illustrative example in the case of

(30)

18 3 Theoretical Analysis

Figure 3.3:Child parent coupling coefficients after 20 iterations of Dynamic Routing. Top: Coefficients for µ = 0, Bottom: Coefficients for µ = 1.

(31)

3.4 Dynamic Routing 19

Figure 3.4:In both plots the per-parent sum of coupling coefficients for Dy-namic Routing is plotted by iteration. The number of children was chosen as 64, the number of parents as 16, and the dimensionality as 16. The Top plot shows the dynamics of an uncertain estimate where all child predictions are sampled from a distribution with a 0 mean and a high variance, the Bottom plot shows the dynamics when the predictions come from a non-zero mean distribution with low variance.

complete polarization of coefficients will be presented, later good normalization values for a specific set of parameters will be found. Deriving a full analytic formula for the general case is left for future work.

Consider the case where the capsule dimension is 1. The inner product is then positive if the sign of the parent is the same as the child. If completely polar solutions are assumed, the parents can be modelled as a sum of random variables with the same sign. Further it is assumed that the sign of any parent matches the sign of any prediction completely at random. Hence each child is expected to choose between n = |P|/2 parents for its polarized connection, and that the weight connected to that parent is the largest out of the possible weights giving the correct sign, which is approximated by an upper bound of the expectation maxj∈S|Mji| ≤αp2 log(2n) (where S is the set of weights with appropriate sign).

(32)

20 3 Theoretical Analysis

Proof: Let Y = maxi|Xi|, where Xi ∼ N(0, α2), then:

exp(tE[Y ]) ≤ E[exp(tY )] ≤X

i

E[exp(t|Xi|)] (3.8) Here Jensen’s inequality was used, as well as the union bound. Now, the defini-tion of the moment generating funcdefini-tion of |Xi|will be used.

X i E[exp(t|Xi|)] = 2n exp(t 2α2 2 )Φ(αt) ⇐⇒ E[Y ] ≤ log(2n) t + log(Φ(αt)) t + t2α2 2t (3.9) Here Φ is the cumulative Gaussian distribution. Since the log of a cumulative distribution must be negative it can be safely ignored. Setting t =

2 log(2n)

α gives

the desired result3.

The sum of the variables which will be assumed to be 1/|P| of all children will add up to |2|C|

P| √

following usual assumptions of normal distribution with variance 1.

The expected norm of each predicted parent is hence:

E[| ˆPj|] ≈ α p 2 log(|P|) 2|C| |P| (3.10)

Since the original form had an expected absolute value of α

√ |C| |P| this implies a factor of q log(|P|)|C|

π in norm increase. From simulation, see figure 3.5, it was

observed that this estimate is an overestimate. This makes sense since the upper bound was used as approximation.

The special case of D = 1 is extreme in the sense that the norm increases much more than for larger D, but it clearly illustrates that the routing process increases the norm and justifies further investigation.

3.4.5

The Squash Function

The usage of the squash function inside of the routing procedure was previously discussed, there it reduced the size of variance of the log coupling coefficients. Now the squash function will be analyzed in the context of emphasizing the cor-rect parents, and suppressing the others.

It is intuitive to want the squash function to reduce the norm of incorrect cap-sules to around 0, while keeping the norm of the correct capsule to 1. However, there are two main issues with this view. Firstly, if the norms of all capsules are quite high, squashing ends up setting each to the exact same norm, as discussed previously. At the opposite end, having low norms of the vectors results in all of

(33)

3.4 Dynamic Routing 21

Figure 3.5: Predicted norm increase vs 512 Monte Carlo simulations. The filled region represent the unbiased estimate of 2 standard deviations in the simulation. The proportion of children to parents have been kept constant.

them being squashed close to 0. The difference however, is that the capsule with the largest norm will still beproportionally much larger than the others.

Note that the squashing function never takes into account the norm of the other capsules, which is one of the issues that causes these problems. The conclusion is therefore that under conditions when many capsules are likely, an almost uni-form prediction is produced. When all norms are already small, the norms are all reduced to almost zero, but the largest one will be quadratically larger than the rest. For the following capsule layers this means that the result is almost uni-form coupling coefficients, where the child capsule with the highest norm will unilaterally decide the output.

3.4.6

Connections to Denève’s Scheme and Bias

As discussed in chapter 2, there are some deep connections between dynamic routing in capsule networks and decoding of population codes. To begin, a brief description of Denève’s scheme [3] will be given, the actual number of parameters to estimate is irrelevant in the context of this Thesis since only thesmoothing of the activations is of interest. The algorithm is presented in Algorithm 3.

For S = 1, µ = 1 the form is similar to dynamic routing. There are however some differences. The weight matrix is applied multiple times inside the loop, which can be interpreted as that the weight matrix is a correlation matrix between the

(34)

22 3 Theoretical Analysis

Algorithm 3Denève’s Scheme [3]

Input: o, The activations of the neurons in the population

Output: o, The smoothed activations of the neurons in the population

1: procedure Denève’s Scheme(o)

2: fork iterations do 3: u= Wo 4: oj = u2j S+µP ku2k 5: end for 6: return o 7: end procedure

neurons4. In fact, the weight matrix can be thought of as dual to the overlap of the tuning curves in the sense that neurons with tuning curves with high overlap will have large association weights in the weight matrix. The denominator uses the sum of squares over all neurons, while dynamic routing uses no sum5. The pose-space itself is also only implicitly defined by the tuning curves of the neurons and requires further decoding steps to explicitly represent, this stands in contrast to dynamic routing where the pose of each capsule is explicitly represented and adjusted during the iterations.

As discussed by Felsberg et al. [4] Denève’s scheme is inherently biased towards the tuning curve centers. In dynamic routing this problem is in some sense fur-ther amplified because the weight matrix is replaced by the identity matrix, how-ever the squaring in Denève’s scheme stacks because the squared output of one iteration is used as input in the next iteration, while dynamic routing only implic-itly uses this squaring by way of coupling coefficients.

That there is no weight matrix interaction between the capsules in dynamic rout-ing can be interpreted as an orthogonality assumption, i.e. it is assumed that there exist no correlation between the the different capsules. However, this does not mean that the process is not biased. In fact it can be interpreted as an induc-tivesparsity bias, i.e. the network prefers the representation to contain a few very probable capsules even when this may not be the case.

The parallels between capsule networks, dynamic routing and population codes is probably deeper than what is explored in this thesis, and it is likely that further understanding may come from further investigation of this topic.

4This interpretation is well supported by the form that is chosen in the original paper for the weight matrix [3]

5However the method proposed by Wang et al. [50] which will later be discussed uses a sum in the denominator in a way similar to Denève

(35)

3.4 Dynamic Routing 23

3.4.7

Suggested Weight Initialization and Normalization

Schemes

In the previous sections, several intricacies with dynamic routing have been pre-sented. Now some theoretically motivated adjustments to stabilize the routing algorithm will be presented.

Firstly the prediction matrices. α is set so as to get the mean of ˆPjto be

approxi-mately 1. From the previous analysis it can be seen that this implies α = √|P|

D|C|.

Secondly the increase in norm occurring from repeated routing iterations has to be accounted for. This is more difficult to produce an exact form for, so instead Monte Carlo simulations on fixed parameters were run to gain an approximate value, this however limits the general applicability of this method.

The increase in norm for |P| = 16, D = 16, |C| = 144, α = √|P|

D|C| is simulated, and

results are presented in Figure 3.6.

A close to linear process which eventually converges to a maximal value can be observed. Empirically it was found that if ˆPj at each iteration is divided with

k+1

2 , where k is the current iteration, the norm is stabilized. The verification can

also be seen in Figure 3.6. It is clear that the norm remains close to 1 until ≈ 10 iterations where the dynamics change as the non-adjusted curve flattens out.

Figure 3.6: 512 runs of Monte Carlo simulation of the proposed stabilized norm vs the original formulation over routing iterations. The filled area rep-resent unbiased estimates of 2 standard deviations of the simulation.

(36)

24 3 Theoretical Analysis

norm of 1. One way of ensuring this is a normalization similar to batch normal-ization where the input is divided by the average norm of all capsules over the entire batch6.

3.4.8

Conclusions

Dynamic routing has several interesting underlying ideas. The first is of course the idea to iteratively infer the pose of parent capsules by finding the best agree-ment amount child votes. And the other idea seems to be that the quadratic squash function should quicklyforce out parent capsules which are unlikely, and emphasize the likely ones.

There are however some issues. The algorithm is quite sensitive to the norm of the weight matrices. Setting the norm too high will saturate the squashing functions, resulting in uniform predictions, while setting the norm too low has other issues with one of them being that the routing iterations becomes pointless as the log coupling coefficients variance collapses to 0, and the other being that the squashing reduces the norm of the output which causes issues in later layers. The routing itself is also quite unstable. As has been shown, even almost com-pletely uniform predictions can be vastly different after a few routing iterations. The connections between dynamic routing and population codes have also been investigated, and it was found that there exists some deep similarities between previously proposed decoding methods and dynamic routing, which could be further investigated.

Finally a few tricks to stabilize the routing and make sure that the squash func-tions do not get saturated were presented, as well as methods preventing the norm from increasing by the iterative process itself.

If the algorithm is assessed with the specifications that were set out at the start of the chapter it can observed that only the first specification is fulfilled, dis-crimination is strong. However, the instability and inability for representation of uncertainty makes the algorithm fickle and unstable.

It can therefore be concluded that while there are several good ideas, there is a definite need for improvement of the algorithm.

3.5

Wang Routing

An idea for improving and formalizing dynamic routing came from Wang et al. [50], who addressed some of the issues that have been discussed previously.

3.5.1

The Algorithm

In algortihm 4 the Wang routing algorithm is presented.

(37)

3.5 Wang Routing 25

Algorithm 4Wang Routing [50] Input: C, The set of child capsules Output: P, The set of parent capsules

1: procedure Wang Routing(C)

2: Pˆji = MjiCi ||Mji||F 3: Bji = 0 4: fork iterations do 5: Rji = exp(Bji) P jexp(Bji) 6: Pˆj=P iRjiji 7: Pj= ||ˆPj j||2 8: Bji = αhPj, ˆPjii 9: end for 10: Pj = squash( ˆPj) 11: return P 12: end procedure

The squash function is given by squash( ˆPj) =

ˆ Pj

1+maxj||Pˆj||

α is a hyper-parameter which intuitively indicates how much the log coupling coefficients are allowed to deviate from 0. Here the same symbol as the scaling of the weight matrix in dynamic routing is used, because they share some clear similarities.

In Wang Routing, the squashing function is only used at the end, the main moti-vation behind this seems to be that formulating a proper optimization objective requires it [50].

3.5.2

Prediction Matrices

In contrast to dynamic routing [42], Wang et al. propose to normalize the predic-tion matrices using the frobenius norm. They also normalize the resulting parent capsules instead of using a squashing function. With the same assumptions as for dynamic routing it can be shown that: ˆPji ∼ N 0,

E[MjiIMT ji] D||Mji||2F ! = N0,D12I  . It is immediately clear that the expected resulting norm will be√1

D.

Since Pjis normalized, the resulting inner-product Bji = αhPj, ˆPjiis:

Bji ∼ N √ α D √ |C|, α2 D2 ! (3.11) As can clearly be seen, increasing the dimensionality of the capsules will decrease

(38)

26 3 Theoretical Analysis

Figure 3.7:Simulated Wang routing iterations for different D. Note the scale change of Bji

the variance of the log coupling coefficients. In practice this means that higher di-mensionality require a higher α to have the same amount of variance as for lower dimensions. As for dynamic routing experiments which confirms the validity of 3.11 are provided. The results can be seen in Figure 3.7.

3.5.3

Log Coupling Update

It could be argued that one of the main motivations of Wang’s procedure is the independence of the regularization to previous iterations of the algorithm. For dy-namic routing strong evidence that the algorithm is unable to maintain diversity in the output estimate and degrades to polarized solutions was given previously. In contrast, the update procedure in Wang et al. instead uses a coupling strength entropy regularizer, this causes the update to be maintain a diverse output, as is shown empirically in Figure 3.8. The same method as for dynamic routing was used. Note that the mean and variance is unrealistic in this setting since Wang routing uses a normalized weight matrix which will realistically produce smaller values.

From the figure it is immediately obvious that the routing converges much quicker than dynamic routing. It can also been seen in Figure 3.9 that the coupling

(39)

co-3.5 Wang Routing 27

Figure 3.8: In both plots the per-parent sum of coupling coefficients for Wang Routing is plotted by iteration. The number of children was chosen as 64, the number of parents as 16, and the dimensionality as 16. The Top plot shows the dynamics of an uncertain estimate where all child predictions are sampled from a distribution with a 0 mean and a high variance, the Bot-tomplot shows the dynamics when the predictions come from a non-zero mean distribution with low variance.

efficients are much more homogeneous, especially in the low variance case, and fewer hard assignments are made.

3.5.4 The α Parameter

In the analysis a fixed α has been used. However, as [50] point out, this hyperparameter can be scaled over the course of training. They suggest using a low α early during learning and gradually increasing it. From an optimization/statistical point of view this simply means that there is a stronger prior belief in uniform coupling coefficients early during learning.

A reasonable interpretation is that the network should explore more early during learning, while exploiting in the later stages, using the familiar terminology from e.g. [47].


Figure 3.9: Child parent coupling coefficients after 20 iterations of Wang routing. Top: Coefficients for µ = 0, Bottom: Coefficients for µ = 1.


Further, if α is trainable, then this scheme is quite similar to Weight Normalization introduced by Salimans et al. [43], where the weights are normalized and a separate parameter is trained for the scale. The difference, however, is that α only affects the coupling coefficients.
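As a concrete illustration of such a schedule, the sketch below ramps α linearly from a small value (stronger prior towards uniform couplings, more exploration) to a larger one (sharper couplings, more exploitation). The endpoints and ramp length are purely hypothetical and are not values taken from [50].

# Illustrative alpha schedule; alpha_min, alpha_max and the linear ramp are
# hypothetical choices, not taken from [50].
def alpha_schedule(step, total_steps, alpha_min=0.1, alpha_max=2.0):
    t = min(step / total_steps, 1.0)
    return alpha_min + t * (alpha_max - alpha_min)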

3.5.5 The Relative Squash Function

Wang et al. use an interesting relative form of the squash function. Instead of squashing the parent capsules independently, the joint maximum over the parents is used in the denominator to regularize all parents. This keeps the norms of the parent capsules from collapsing when no agreement is reached. There are however some unprincipled components.

The "1" in the denominator of the relative squash function is not motivated in [50]. For a single iteration, using the same assumptions as usual, \(\hat{P}_j \sim \mathcal{N}\!\left(0, \frac{|C|}{|P|^2 D^2} I\right)\) in expectation, so the norm is \(\frac{\sqrt{|C|}}{|P|\sqrt{D}}\) in expectation. When \(|C| \gg |P|\) the output norms {w_j} can cover roughly the full range [0, 1], but otherwise w_j << 1 is often the case.
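To make the scale of the problem concrete, the following snippet (with configurations chosen arbitrarily for illustration, and assuming all parents have roughly the same norm so that the maximum in the denominator equals the expected norm) prints the expected pre-squash norm √|C|/(|P|√D) next to the resulting relative-squash output norm.

# Illustration with arbitrary example configurations: expected pre-squash norm
# and the resulting relative-squash output norm.
import math

for num_children, num_parents, D in [(1152, 10, 16), (64, 16, 16), (32, 32, 8)]:
    norm = math.sqrt(num_children) / (num_parents * math.sqrt(D))
    w = norm / (1.0 + norm)   # output norm if all parents share this norm
    print(f"|C|={num_children:4d} |P|={num_parents:2d} D={D:2d}  norm={norm:.3f}  w={w:.3f}")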

3.5.6 Conclusions

It can be observed that detaching the weight scale and squashing function from the routing iterations has some nice properties. Weight scaling makes the routing process more predictable, while putting the squash function outside the loop has the main benefit of framing the problem as maximizing a proper optimization objective. The change of regularization from an unstable KL-divergence term to a proper entropy regularizer also helps stabilize the routing.

However, there are also some issues: normalizing the weights with the Frobenius norm will generally cause the output to shrink in norm. This problem is exacerbated by the relative squash function, which for small outputs will further reduce the output norms.

With respect to the specifications introduced in the beginning of the chapter, it can be concluded that the second and third specifications are better fulfilled by Wang routing, with some sacrifices being made with regard to the discrimination between capsules.

In conclusion, while Wang routing proposes several interesting solutions to the issues of dynamic routing, it introduces others and it seems there is still plenty of room for improvement.

3.6 Time-complexity Issues

An important issue that has not been discussed is that the routing iterations are computationally expensive. Running more than ≈ 3 routing iterations is in practice not feasible for most applications, since the computational graph scales linearly with the number of iterations.


It is therefore of interest to find an algorithm which produces results similar to routing, but uses less computational resources.

3.7 Proposed Routing

In the previous sections, an analysis of some previously proposed routing algorithms under uncertainty was conducted. In this section a novel routing algorithm will be proposed.

3.7.1 Motivation

Both dynamic and Wang routing exhibit both positive and negative properties. Dynamic routing has a possibility for fast discrimination between capsules, using the squashing function in the loop, while Wang routing is more theoretically grounded. Both however suffer from issues with norm shrinking, and dynamic routing especially suffers from this because of the usage of the squash function. Dynamic routing is also extremely sensitive to initial conditions. Small changes may give very different results, which goes against the specifications that were postulated at the beginning of the chapter. In comparison, linear routing is quick and more stable than the other algorithms. The obvious issue is that multiple such layers are equivalent to a single layer because of their linear nature. It is therefore of great interest to find a routing algorithm which is stable and fulfills the previously proposed specifications, while still being non-linear in a meaningful way.

Another issue is that both Wang and dynamic routing use an iterative routing procedure. This gives a linear increase in runtime, which is computationally undesirable. This motivates that the individual couplings should be discarded, and that the unweighted sum of predictions should instead be used as sufficient statistics for the routing process. Working with only the sums greatly simplifies the routing; however, the discriminative power should be increased compared to linear routing.

3.7.2 EntMin

Based on the above observations, a novel capsule-wise entropy minimizing non-linearity, EntMin, will now be proposed. The non-linearity is given in Equation (3.12).

\[
\mathrm{EntMin}(\hat{P}_j) = \frac{||\hat{P}_j||^{\alpha}\, \hat{P}_j\, \mathbb{E}[||\hat{P}_j||]}{\mathbb{E}[||\hat{P}_j||^{\alpha+1}]} \qquad (3.12)
\]
Here α > 0 is a hyper-parameter which intuitively represents how much the entropy of the distribution should be reduced (in fact, setting α < 0 increases the entropy!).
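A minimal NumPy sketch of the non-linearity in (3.12) is given below, where the expectation is taken as the mean over the parent capsules of the layer; the (num_parents, D) shape convention is an assumption made for illustration. The final lines check numerically that the sum of norms is preserved, a property discussed further below.

# Minimal sketch of EntMin (3.12); the expectation is taken as the mean over
# the parent capsules, and the (num_parents, D) layout is an assumption.
import numpy as np

def entmin(P_hat, alpha):
    norms = np.linalg.norm(P_hat, axis=-1, keepdims=True)    # ||P_hat_j||
    scale = norms.mean() / (norms ** (alpha + 1)).mean()     # E[||.||] / E[||.||^(alpha+1)]
    return (norms ** alpha) * P_hat * scale

# The sum of the output norms equals the sum of the input norms:
P_hat = np.random.default_rng(0).normal(size=(16, 8))
out = entmin(P_hat, alpha=2.0)
print(np.linalg.norm(P_hat, axis=-1).sum(), np.linalg.norm(out, axis=-1).sum())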


The result is normalized to ensure that the sum of norms remains the same, where the expectation is taken over all parent capsules. Hence the capsules are not squashed; rather, the entropy of an implicit distribution is reduced. In fact, this operation always reduces entropy except for uniform and one-hot distributions; a general proof is given below.

Proof: The following proof is an adaptation of the proof in [34]. Let p_j be some distribution and let a_j ≥ 0. Consider the transformation
\[
p_j \mapsto \frac{p_j a_j^x}{\sum_k p_k a_k^x} := q_j(x), \quad x \in [0, 1].
\]
We want to show under which conditions the derivative of the entropy with respect to x at x = 0 is negative.
\[
\begin{aligned}
\left.\frac{\partial H[q(x)]}{\partial x}\right|_{x=0}
&= -\sum_j \left.\frac{\partial}{\partial x}\, q_j(x) \log(q_j(x))\right|_{x=0} \\
&= -\sum_j \left(1 + \log(q_j(x))\right) \left.\frac{\partial q_j(x)}{\partial x}\right|_{x=0} \\
&= -\sum_j \left(1 + \log(q_j(x))\right) \frac{p_j a_j^x}{\sum_k p_k a_k^x} \left(\log(a_j) - \frac{\sum_k p_k a_k^x \log(a_k)}{\sum_k p_k a_k^x}\right) \Bigg|_{x=0} \\
&= -\sum_j p_j \left(1 + \log(p_j)\right) \left(\log(a_j) - \sum_k p_k \log(a_k)\right) \\
&= -\mathbb{E}[\log(p_j)\log(a_j)] + \mathbb{E}[\log(p_j)]\,\mathbb{E}[\log(a_j)] = -\mathrm{Cov}(\log(p_j), \log(a_j))
\end{aligned}
\qquad (3.13)
\]

It can be seen that if the covariance between log(a_j) and log(p_j) is > 0, the entropy will be reduced. Consider now the special case a_j = p_j^α where α > 0:
\[
\mathrm{Cov}(\log(p_j), \log(p_j^{\alpha})) = \alpha\, \mathrm{Cov}(\log(p_j), \log(p_j)) \ge 0 \qquad (3.14)
\]
There is equality in the cases where there does not exist any p_j < 1 such that p_j > p_k, k ≠ j. These distributions are characterized by having a set of 0 or more zeros mixed with a uniform distribution over the rest of the indices.
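A small numerical check of this statement (not part of the proof) is given below: sharpening a generic distribution with p → p^{1+α}/Z strictly lowers its entropy, while a uniform or one-hot distribution is left unchanged.

# Numerical check that p -> p^(1+alpha) / Z does not increase entropy, with
# equality for uniform and one-hot distributions.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def sharpen(p, alpha):
    q = p ** (1.0 + alpha)
    return q / q.sum()

alpha = 1.5
for p in (np.array([0.5, 0.3, 0.2]),      # generic: entropy strictly decreases
          np.full(4, 0.25),               # uniform: entropy unchanged
          np.array([1.0, 0.0, 0.0])):     # one-hot: entropy unchanged (already 0)
    print(entropy(p), entropy(sharpen(p, alpha)))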

3.7.3 The Algorithm

It can be noticed that the hyper-parameter α plays a role similar to the routing iteration parameter in routing by agreement. A non-iterative version of routing is combined with the proposed new non-linearity to obtain the proposed algorithm, which is presented in Algorithm 5.
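As an illustration of how such a layer could look, the sketch below forms the unweighted (here averaged) sum of child predictions per parent and then applies EntMin. This is only one possible reading of the description above; the exact normalization choices of Algorithm 5 are not reproduced here, so the details are assumptions.

# Sketch of the proposed non-iterative routing as read from the description
# above: unweighted (averaged) sum of child predictions, followed by EntMin.
# Averaging and weight scaling are assumptions and may differ from Algorithm 5.
import numpy as np

def entmin(P_hat, alpha):
    norms = np.linalg.norm(P_hat, axis=-1, keepdims=True)
    return (norms ** alpha) * P_hat * norms.mean() / (norms ** (alpha + 1)).mean()

def entmin_routing(C, M, alpha):
    """C: (num_children, D_in), M: (num_parents, num_children, D_out, D_in)."""
    P_hat_ji = np.einsum('jiod,id->jio', M, C)   # child predictions
    P_hat_j = P_hat_ji.mean(axis=1)              # unweighted (averaged) sum over children
    return entmin(P_hat_j, alpha)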

3.7.4 Analysis

In a similar fashion to Wang and dynamic routing, the effect of routing on random data is investigated. Previously the focus was mainly on the coupling coefficients, but now the actual output norms will be investigated, since
