
Activation Regions as a Proxy for Understanding Neural Networks

ADRIAN CHMIELEWSKI-ANDERS

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Activation Regions as a Proxy for Understanding Neural Networks

ADRIAN CHMIELEWSKI-ANDERS

Master in Machine Learning
Date: July 3, 2020
Supervisor: Prof. Josephine Sullivan
Examiner: Prof. Hedvig Kjellström
School of Electrical Engineering and Computer Science
Swedish title: En inblick i neurala nätverk via analys av aktiveringsregioner


Abstract

Despite the empirical success and widespread adoption of deep neural networks, methods for systematically reasoning about architectures based on measurable properties of neural networks are comparatively underdeveloped. As measured by upper bounds on the theoretical maximum number of activation regions, deeper networks are more expressive than their shallow counterparts. However, recent work has shown that in practice, for a constant number of neurons, the average number of activation regions achieved is independent of architecture and lower than the theoretical maximums. This thesis further examines the number of activation regions achieved during and after training and the consequences of label noise on the number of activation regions, largely confirming previous results and not indicating any significant difference across architectures. Additionally, this work investigates the density of activation regions around data points throughout training and examines whether this metric varies with depth or correlates at all with generalization. Ultimately, experiments show that this density is not a strong predictor of generalization, though it does differ with the choice of architecture.


Sammanfattning

Despite the empirical success and widespread adoption of deep neural networks, methods for systematic reasoning about architectures based on measurable properties of neural networks are relatively lacking. According to theoretical upper bounds on the maximum number of activation regions, deeper networks are more expressive than their shallower counterparts. However, recent research has shown that in practice, for a constant number of neurons, the average number of activation regions achieved is independent of architecture and lower than the theoretical upper bounds. This thesis continues in this direction and examines the number of activation regions achieved during and after training and how label noise affects the number of activation regions. Our findings largely agree with previous results, which do not indicate any significant difference between architectures. In addition, this work investigates the density of activation regions around data points during training and examines whether this measure varies with depth or correlates at all with generalization. In the end, experiments show that this measure is not a strong predictor of generalization, though it does differ across architectures.


Contents

1 Introduction
  1.1 General research area
  1.2 Research questions and contributions
  1.3 Literature review
  1.4 Thesis structure

2 Background
  2.1 Notation, linear regions and activation regions
  2.2 Bounding regions

3 Methods
  3.1 Motivation
  3.2 Exact counting
  3.3 Line counting
  3.4 A simpler counting strategy?
  3.5 Data
  3.6 Line types

4 Results
  4.1 Fixed neuron budget
    4.1.1 Spiral dataset
    4.1.2 MNIST dataset
  4.2 Adding layers
    4.2.1 Spiral dataset
    4.2.2 MNIST dataset

5 Discussion
  5.1 Bounds
  5.2 Density of activation regions as a predictor of generalization
  5.3 Initial behavior of activation regions

6 Conclusions
  6.1 Summary
  6.2 Future work
  6.3 Societal and ethical aspects

Bibliography

A Plots for all spiral runs

B Plots for all MNIST runs


Acknowledgements

I would like to sincerely thank my supervisor Prof. Josephine Sullivan for her helpful guidance, experience and insights while I was working on my thesis.

I am grateful to Prof. Josephine Sullivan and Prof. Hossein Azizpour for granting me access to their servers, on which I conducted many experiments for this thesis. Additionally, many thanks for allowing me to take part in weekly meetings and reading group sessions with Prof. Josephine Sullivan, Prof. Hossein Azizpour, and Prof. Stefan Carlsson's research group within the Robotics, Perception and Learning (RPL) group at KTH. I would like to extend my thanks to all its members who provided me with feedback along the way. Especially helpful to me during this time was Matteo Gamba, who took the time to meet with me, help me understand relevant literature, and answer any questions I had.


Chapter 1 Introduction

This chapter describes in broad strokes the existing research on neural network expressiveness and its relevance. Next, it presents this thesis' research questions and contributions, as well as a review of the literature and related work. It introduces the approach to expressiveness via activation regions and fits this view into the existing body of work. Lastly, it details the organization of the remainder of the thesis.

1.1 General research area

Neural networks have proven to have high efficacy on a multitude of problems in academic and commercial settings over the past decade (e.g., [1, 2, 3, 4]), leading to a resurgence and widespread adoption. Though neural networks as a model have existed for some time, it was not until recently, with the advent of more powerful computers and larger datasets, that interest in deeper model architectures has arisen. In this context, depth refers to the number of hidden layers, whereas width refers to the number of neurons at a hidden layer. Empirically, deeper networks achieve higher performance than their shallow counterparts for image recognition tasks and competitions such as ILSVRC (see [5, 6]). Research on theoretical properties of neural networks, on the other hand, has largely fallen behind empirically driven results. Knowing the underlying effect of network architecture on expressivity would be helpful, as it could eliminate guesswork in research.

One avenue of research trying to explain the observed superiority of deeper neural networks is the investigation of the expressivity of networks. The weights of networks that use piecewise linear activation functions (e.g. [7, 8]) partition the input space into different regions on which different linear functions are computed; Figure 1.1 shows a small example on a two-dimensional input space. A natural measure of expressivity is how many regions a network splits the input space into, since with a higher density of regions the computed function can appear smoother.

[Figure 1.1: two panels plotting activation regions over (x1, x2); left panel titled "Regions at initialization: 1644", right panel titled "Regions at end of training: 1858".]

Figure 1.1: Activation regions, along with their counts, of a three-layer neural network with input dimension n_0 = 2 which uses the ReLU activation function, at initialization (left) and after the network has been trained on the synthetic spiral dataset (right). The pattern of regions at initialization is due to the choice of initialization scheme, where all biases are sampled from a normal distribution with zero mean and small variance (10^{-6}), forcing lines to pass near the origin. Chapter 2 explains activation regions and Chapter 3 details the spiral dataset and how regions are counted.

A long-established result proves that even a single-layer neural network, albeit potentially requiring very large width, is a universal approximator, hinting that maximum expressivity alone does not seem a likely candidate to explain the success of deeper networks [9]. Moreover, recent work on network distillation (sometimes called model compression) shows that smaller (i.e. fewer total neurons) or wider (i.e. fewer layers but more neurons per layer) networks can be trained to perform well if they are trained in an altered regime, wherein the smaller network is trained to emulate a larger one [10, 11, 12]. This may suggest that the empirically observed benefit of depth lies in the process of optimization rather than in limits of expressiveness.

Still, investigation of the construction of regions in input space may provide insight into other properties of neural networks, such as amenability to certain training regimes and more. Research in this area could show that certain architectures are favorable, or better suited to different training regimes. Ultimately, though, the study of such regions provides a geometric view of the weights of neural networks.

1.2 Research questions and contributions

This thesis seeks to investigate activation regions of fully connected neural networks with the ReLU activation function primarily by empirical means.

Specifically, this thesis will investigate the following:

• How does the number of activation regions evolve over training?

• Are networks trained in practice biased toward having more or fewer activation regions? How is this affected by factors such as architecture and label noise?

• Are activation regions more dense around the data manifold (i.e. where the data is)? Is this a good predictor of generalization?

Plainly, the research question can be summarized as: what insight can activation regions provide about neural networks?

This thesis' results generally provide supplementary evidence corroborating the results of Hanin and Rolnick [13] and Novak et al. [14], while aiming to analyze any edge cases. Concretely, this work contributes the following:

• Corroborates [13] by showing that the number of regions depends only on the total number of neurons.

• Along paths generated by an autoencoder between training (or validation) points, the number of activation regions is less dense than along a similar linear path, and deeper networks are less dense in this regard for a fixed number of neurons.

• Label noise does induce networks to have slightly more activation regions, which are also more dense. The behavior when overfitting with label noise loosely ties together the density of activation regions and the generalization gap.

It also describes in detail a method to precisely count activation regions in Chapter 3. Some limitations of this thesis are that results are given only for fully connected ReLU networks, trained with stochastic gradient descent (SGD), on small datasets. Moreover, this thesis does not explore datasets outside the realm of images (e.g. audio data).

1.3 Literature review

There exists a wealth of literature concerning the approximation power of neural networks. A classic result by Cybenko [9] is that even single-layer networks are universal approximators. In response to the explosion of empirical results linking depth to better expressiveness, more work emerged connecting architecture choices and expressivity, confirming that depth does indeed have certain benefits with regard to expressivity. For example, a result from Telgarsky [15] shows that there exist networks with Θ(k^3) layers which cannot be approximated by networks with O(k) layers unless the width is exponential. In addition, Rolnick and Tegmark [16] show that polynomials are represented much more easily by deeper networks compared with shallower networks.

In other metrics of representational power, Raghu et al. [17] show that networks increase exponentially in complexity with increasing depth. Specifically, the expected arc length of the image of a trajectory in input space grows exponentially with network depth.

Another measure of representational power for networks is the number of linear regions (or activation regions; see Chapter 2 for the differences) that a network defines on its input space. For this measure there exist a myriad of different bounds. They usually fall into one of four categories: lower bounds on the maximal number of regions, upper bounds on the maximal number of regions, empirical bounds, or bounds in expectation.

A pioneering study on bounding the maximal number of regions is detailed by Pascanu, Montufar, and Bengio [18] and is later improved upon and expanded by Montufar et al. [19]. Both show that the number of linear regions computed by fully connected neural networks grows exponentially with the depth but only polynomially with the width, for fixed-width networks. Their results provide a lower bound on the maximal number of regions computed. Both also establish that the per-parameter expressivity, measured as a ratio of linear regions to parameter count, favors deeper networks. In fact, they show that deeper networks are exponentially more efficient in this regard.

Raghu et al. [17] provide an upper bound on the maximal number of regions, namely O(n^{n_0 k}) for a k-layer network with fixed width n and input dimension n_0. This bound is tight and is exponential in depth. For networks with non-constant widths, Serra, Tjandraatmadja, and Ramalingam [20] provide an upper bound on the maximal number of regions which is equal to Raghu et al.'s bound in the fixed-width case. One property of the bound is that removing a neuron from an earlier layer decreases the bound at least as much as removing a neuron from the next layer, if both layers have the same width. An additional property is that for high input dimension, the bound is larger for shallow (single-layer) networks than for deeper networks with a similar total number of neurons. The same work presents a tighter lower bound on the maximal number of regions. The authors also provide empirical counts which are well below the existing upper bounds for the maximal region count. Their technique involves counting all linear regions exactly by formulating the problem as a mixed-integer linear program (MILP) over a variety of trained networks.

Hanin and Rolnick [13, 21] meticulously investigate bounding the average number of activation regions, with their central result being that, in expectation over the weights, the number of activation regions defined by a k-layer network (with ReLU activation) for any cube C in input space is vol(C) (∑_{i=1}^{k} n_i)^{n_0} / n_0!, where n_i is the number of neurons at layer i. They prove this bound holds at initialization under mild and reasonable assumptions, and empirically show it is valid after training. Surprisingly, the bound is independent of architecture and depends only on the total number of neurons, in stark contrast to previous work. In [21], they also show that the average distance from a point to the nearest region boundary is bounded by a constant over the total number of neurons. Hanin and Rolnick's work ultimately calls into question the value of activation region counts as a measure of network expressivity.

All previously mentioned studies examine fully connected networks. In theory, a CNN can be converted into an equivalent fully connected network. The resulting converted network has shared weights, and many weights which are zero. This leads to different properties of the activation regions, and some bounds which are applicable to fully connected networks do not hold for CNNs (for example Hanin and Rolnick's bound [13]). The study of activation regions in convolutional neural networks (CNNs) and hyperplane arrangements is a younger and active area of research. One current avenue of research is to try to reason about observed symmetries and/or any bias in CNN activation region and hyperplane arrangements [22].

The problem of generalization in neural networks is not well explained by classical statistics. Specifically, traditional statistical learning theory and existing notions of complexity, such as VC-dimension, do not translate in a straightforward manner to neural networks, since larger networks can have lower generalization error and can memorize data with random labels [23, 24, 25]. In this vein, Novak et al. [14] empirically study correlations between the generalization capabilities of a very wide range of networks and the number of activation regions. They conclude that there is some, but not much, correlation between the two. They provide some evidence that activation regions are qualitatively different around training points. That is, activation regions differ in structure in different areas of input space.

Many questions regarding generalization in neural networks remain unanswered. There is furthermore an abundance of literature and areas of research which are tangential to the study of activation regions, for example the loss surfaces of neural networks [26, 27] and the characterization of optimization problems [28, 29], among others.

1.4 Thesis structure

This chapter has provided a brief introduction to the general research area, a review of relevant literature, and an outline of the research questions and contributions. The remainder of the thesis is structured as follows:

• Chapter 2 methodically presents mathematical preliminaries for use in later chapters.

• Chapter 3 details methods for counting activation regions.

• Chapter 4 presents conducted experiments and their results.

• Chapter 5 analyzes and contextualizes results from Chapter 4.

• Chapter 6 expands on the significance of the results and discusses possible future research directions.


Chapter 2 Background

This chapter briefly introduces the preliminary definitions and concepts needed in the following chapters. It also reviews in more detail some of the aforementioned results from the literature and sets up basic notation.

2.1 Notation, linear regions and activation regions

Definition 2.1 (Fully connected neural network). A k-layer fully connected neural network f : R^{n_0} → R^{n_out} with activation function r : R^n → R^n is a function of the form

    f = f_out ∘ f_k ∘ r ∘ f_{k−1} ∘ ⋯ ∘ r ∘ f_1                        (2.1)

Each f_i : R^{n_{i−1}} → R^{n_i} is parametrized by a weight matrix W_i ∈ R^{n_i × n_{i−1}} and a bias b_i ∈ R^{n_i} and has the form f_i(x) = W_i x + b_i. This implies f has n_i neurons (equivalently, units, or hidden units) at each layer which is not the input or output layer. The total number of neurons is N ≜ ∑_{i=1}^{k} n_i. The network f is then parameterized by θ ≜ ⋃_{i=1}^{k} {W_i, b_i}. The activation function r acts elementwise on its input. The output from a given layer l is a function denoted by f^{(l)} = f_l ∘ r ∘ f_{l−1} ∘ ⋯ ∘ r ∘ f_1. The output from layer l at the m-th neuron is an element of f^{(l)}, denoted f_m^{(l)}.

This work will consider only fully connected neural networks with the rectified linear unit (ReLU) activation function r(x) = max{0, x}, henceforth referred to as ReLU nets. Since each r ∘ f_i is piecewise linear, so is f, because it is the composition of piecewise linear functions. Naturally, the input space of f is then divided up into regions on which a linear function is computed. These are the linear regions of f. The ReLU activation function has two behaviors: for inputs less than or equal to zero it is zero, and it is the identity otherwise. The boundary between these two behaviors in input space is non-differentiable and forms a hyperplane. In a single-layer network (f = r ∘ f_1) each neuron in the hidden layer produces a hyperplane. Therefore, f defines a collection of n_1 hyperplanes in the input space of dimension n_0. In a k-layer network (that is, k hidden layers not counting the input and output layers, in accordance with Definition 2.1) where k > 1, the first layer defines a collection of n_1 hyperplanes and following layers define subsequent hyperplanes as follows.

In layer j of f, each neuron may define a single hyperplane in each of the regions formed by layers 1 . . . j − 1. Thus, a neuron in layer j may define the same hyperplane in two adjacent regions, or a different one (effectively bending the hyperplane, creating a cusp), or none at all. An equivalent and more formal view of linear regions is to define the regions as connected components of the input space with the bent hyperplanes removed, as is done in [13]; this is formally restated in Definition 2.2.

Definition 2.2 (Linear regions as connected components). The linear regions of f are the connected components of the input space with the points of discontinuity in the gradient of f removed:

    S = {x ∈ R^{n_0} | ∇_θ f(x) is undefined}                          (2.2)

    linear regions(f) = Z(f) = connected components(R^{n_0} \ S)       (2.3)

The number of linear regions of a network f is denoted |Z(f)|.

Lemma 2.1. If f_out is linear, then the output dimension of f, n_out, does not affect the maximal number of linear regions of f.

Proof. See [18] for a proof. □

Closely related to linear regions are activation regions, which are the regions in input space that share the same collection of activations at each neuron, called an activation pattern.

Definition 2.3 (Activation pattern). An activation pattern of a network f at a layer l, denoted a_l, is a vector in {0, 1}^{n_l}. It has zeros at entries where neurons have post-activation value zero, and ones at entries where neurons have post-activation values strictly larger than zero. An activation pattern A of a k-layer network f is a collection of layer-wise activation patterns for all hidden layers of f:

    A = {a_l | l ∈ 1 . . . k}                                           (2.4)

Definition 2.4 (Activation region). For an activation pattern A, the activation region R of a k-layer network f is the set of points in input space which, when fed through the network, give an activation pattern of A:

    R(f, A) = {x ∈ R^{n_0} | 1_{R_+}(r(f^{(l)}(x))) = a_l  ∀ l ∈ 1 . . . k, a_l ∈ A}    (2.5)

The number of activation regions of a network f is denoted |R(f)|.
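To make Definitions 2.3 and 2.4 concrete, the following is a minimal Python/numpy sketch (illustrative only, not code from the thesis) that computes the layer-wise activation pattern of a point for a ReLU net given as lists of weight matrices and bias vectors; two points lie in the same activation region exactly when their patterns agree at every layer.

    import numpy as np

    def activation_pattern(Ws, bs, x):
        """Layer-wise activation pattern {a_l} of input x (Definition 2.3).

        Ws, bs: lists with the weight matrix W_i (n_i x n_{i-1}) and bias b_i (n_i,)
        of each hidden layer; the linear output layer f_out plays no role here.
        """
        pattern, h = [], x
        for W, b in zip(Ws, bs):
            z = W @ h + b                               # pre-activations at this layer
            pattern.append((z > 0).astype(np.uint8))    # a_l in {0, 1}^{n_l}
            h = np.maximum(z, 0.0)                      # ReLU output feeds the next layer
        return pattern

    def same_activation_region(Ws, bs, x0, x1):
        """True iff x0 and x1 share every layer-wise pattern (Definition 2.4)."""
        p0, p1 = activation_pattern(Ws, bs, x0), activation_pattern(Ws, bs, x1)
        return all(np.array_equal(a, b) for a, b in zip(p0, p1))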

Figure 2.1 visualizes the activation regions of two small networks. A single linear region may correspond to a single activation region; however, this is not necessarily true. In fact, the number of activation regions of a network bounds the number of linear regions from above.

[Figure 2.1: two panels plotting activation regions over (x1, x2) on [−1, 1]^2.]

Figure 2.1: The activation regions of two networks in the region [−1, 1]^2. Left: a single-layer network with 20 hidden units. Right: a two-layer network with 10 hidden units at each layer. The single-layer network's activation regions form an arrangement of hyperplanes, whereas the two-layer network defines hyperplanes which may bend at the junction of two adjacent activation regions. In this figure, weights and biases were initialized by sampling from a uniform distribution centered around 0 with variance 2/fan-in, where fan-in refers to the number of neurons in the previous layer. Coloring is arbitrary and denotes different activation regions.

Lemma 2.2. The number of linear regions of a network f with the ReLU activation function is bounded above by the number of activation regions:

    |Z(f)| ≤ |R(f)|                                                     (2.6)

Proof. This lemma is proved rigorously in [13]. □

Intuitively, Lemma 2.2 is true since the same linear function may be computed on different activation regions. Another difference is that activation regions are convex for networks with any piecewise linear activation function, whereas linear regions are not necessarily convex [13]. The convexity of activation regions, on which linear functions are computed, is remarked to be favorable in optimization problems [13].

Lemma 2.3. The activation regions R(f, A) of a k-layered network f with ReLU activation function r are convex.

Proof. The following is a restatement of the proof in [17]. Consider the base case of a single-layer neural network. The i-th neuron at the first hidden layer defines a single hyperplane W_1^{(i)} x + b_1^{(i)} = 0 (given by the i-th row of W_1 and the i-th entry of b_1), which splits each existing activation region it crosses into two. The resulting two half spaces are convex. Proceeding by induction, assume that the input space has been partitioned into activation regions by layers 1 . . . l − 1 < k, all of which are convex. For a fixed activation pattern A and corresponding activation region R(f, A), the neurons at layer l each define a hyperplane which may or may not split the region. These hyperplanes have effective parameters corresponding to the rows of W*_l:

    f^{(1)}(x) = W_1 x + b_1                                            (2.7)
    f^{(2)}(x) = W_2 r(W_1 x + b_1) + b_2                               (2.8)
              = W_2 S_1 W_1 x + W_2 S_1 b_1 + b_2                       (2.9)
              = W*_2 x + b*_2                                           (2.10)
          ⋮
    f^{(l)}(x) = W*_l x + b*_l                                          (2.11)

Here, S_l is a diagonal matrix of the activations (0 or 1) of each neuron at layer l, and W*_l, b*_l denote the resulting effective parameters. If the region is not cut by the i-th neuron at layer l then it is unchanged and still convex by the inductive hypothesis. If it is cut by a hyperplane, each of the two resulting pieces is the intersection of a convex set with a half space and is therefore convex. Hence all activation regions are convex. □

Lemma 2.4. The linear regions Z(f) of a k-layered network f with ReLU activation function r are not necessarily convex.

Intuitively, Lemma 2.4 is correct since there can be two or more adjacent activation regions on which the computed linear function is the same; the function computed over these activation regions then has no points of undefined gradient, and thus the activation regions collapse into a single linear region. The resulting linear region may not be convex.

This thesis focuses exclusively on the activation regions of ReLU nets, and not linear regions, because activation regions are easier to count and bound the number of linear regions from above.

2.2 Bounding regions

The question of how many regions are formed by m hyperplanes in R^n is a well-studied problem in combinatorics. Zaslavsky's Theorem [19, 30] provides an upper bound for this problem.

Theorem 2.5 (Zaslavsky's Theorem). The maximal number of linear regions formed by m hyperplanes in R^n is

    ∑_{i=0}^{n} C(m, i)                                                 (2.12)

where C(m, i) denotes the binomial coefficient, provided that each hyperplane is in general position. That is, any hyperplane may be perturbed by an arbitrarily small amount without changing the total number of regions.

Zaslavsky's Theorem implies that single-layer networks with n_1 hidden units and input dimension n_0 have a maximal number of linear regions given by equation 2.12. For k-layer networks, a very simple upper bound on the number of linear regions is 2^N. Montufar et al. [19] present a construction of weights for this. An alternative line of reasoning is that, combinatorially, ReLU nets satisfy |R(f)| ≤ 2^N, and Lemma 2.2 then ensures |Z(f)| ≤ 2^N. This stems from the possible activation pattern encodings: ReLU nets have two possible codes for each neuron (0 or 1). This upper bound is vacuous, since it is not hard to see that it is much greater than equation 2.12 in the single-layer case. Theorem 2.6 restates the tight upper bound presented in [17].

Theorem 2.6. For a k-layer ReLU net f with n units per layer and input dimension n_0, the maximal number of activation regions is bounded from above by O(n^{k n_0}).

Proof. A proof appears in [17]. □

Serra, Tjandraatmadja, and Ramalingam [20] state and prove a generalization of Theorem 2.6 for when the network width is not fixed, which is tight for single-layer networks (i.e. it reduces to Zaslavsky's theorem). This work will focus strictly on networks of constant width, for simplicity and to avoid having to deal with any bottleneck effects where widths are variable.

It is not hard to see that these bounds grow very quickly. Yet empirical observations show that networks do not achieve these bounds in practice; this is apparent even in [20]. Hanin and Rolnick examine the average case scrupulously in [13, 21]. Of their many results, two (theorem 5 in [13] and theorem 1 in [21]) are relevant and are briefly restated here as Theorems 2.7 and 2.8.

Theorem 2.7 (Restatement of theorem 5 in [13]). For a ReLU net f with N ≥ n_0 and any cube C,

    E_θ[|R(f ; C)|] / vol(C) ≤ (T N)^{n_0} / n_0!                       (2.13)

provided that (i) the weights have a density, (ii) the conditional distribution of a collection of biases has a density, (iii) moments of gradient norms of pre-activations of f are bounded, and (iv) the conditional distribution of biases is bounded. T is a constant which depends on the bounds of the gradients and the bias distribution. R(f ; C) is a slight abuse of notation and refers to the activation regions inside of C.

Proof. See [13] for details of the conditions and a proof. □

Theorem 2.8 (Restatement of theorem 1 in [21]). For a ReLU net f and an interval I ⊂ R,

    E_θ[R(f ; I)] / |I| ≈ N                                             (2.14)

provided that for each neuron in f the mean pre-activation gradient is bounded.

Theorem 2.8 is a consequence of theorem 3 in [21], where a proof is also provided.
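To see how differently these quantities scale, here is a small illustrative Python snippet (not from the thesis) that evaluates the single-layer maximum from equation 2.12, the trivial 2^N bound, and the architecture-independent scale N^{n_0}/n_0! suggested by Theorem 2.7 when the constant T is dropped, for the 60-neuron, two-dimensional setting used later in Chapter 4.

    from math import comb, factorial

    def zaslavsky_bound(m, n):
        # Equation 2.12: maximal regions formed by m hyperplanes in R^n
        return sum(comb(m, i) for i in range(n + 1))

    def trivial_bound(N):
        # One binary code per neuron: |R(f)| <= 2^N
        return 2 ** N

    def expected_scale(N, n0):
        # N^{n_0} / n_0!, the architecture-independent scale of Theorem 2.7 (constant T dropped)
        return N ** n0 / factorial(n0)

    N, n0 = 60, 2
    print(zaslavsky_bound(N, n0))   # 1831  (single-layer maximum)
    print(trivial_bound(N))         # 1152921504606846976  (2^60, vacuous)
    print(expected_scale(N, n0))    # 1800.0 (matches equation 4.1)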

To provide some intuition for Theorems 2.7 and 2.8, consider a network f with n_0 = 1. Activation regions of such a network are intervals of R, and their boundaries are points where, for some layer l, the k-th neuron satisfies

    f_k^{(l)}(x) = 0                                                    (2.15)

For this neuron to split an activation region ϕ ∈ R(f ; I), equation 2.15 would have to have a solution in the activation region (i.e. in the interval ϕ). For this neuron to split many activation regions, it would need to have multiple solutions to equation 2.15 over I, thus crossing zero many times, meaning f_k^{(l)} must fluctuate a lot over I. For modern initialization schemes this is not the case, and so over I there should not be many solutions to equation 2.15. This line of reasoning was first explained in [13, 21].

When n_0 = 2, a similar intuitive argument applies. Activation regions are convex polygons in R^2, and are cut by the m-th neuron at layer l when there is some x ∈ ϕ ∈ R(f ; C) such that equation 2.15 has a solution and some two vertices have different signs when used as arguments to 2.15. In R^2 the hyperplane f_m^{(l)} can fluctuate in two directions, but still, at initialization, its gradient and total variation are not large, and so it is not likely that this hyperplane fluctuates a lot over some rectangle of R^2. Thus, while the upper bounds for the number of activation regions scale more rapidly with depth than with width, at initialization this is not the case.

After training, there is no guarantee that the conditions in Theorem 2.7 will hold; nor is it guaranteed that the neurons at later layers will not be highly oscillatory. Empirically, however, the conditions seem to be true after training [13].

It is also important to reiterate that Theorems 2.7 and 2.8 do not apply to the whole input space, though one could select a large enough hypercube to encompass the data when counting. Furthermore, counting away from the data (for example on images with some pixel intensities larger than 255) is not particularly interesting.

Although it is tempting to apply Theorem 2.7 to CNNs by converting a CNN to a fully connected net, it does not hold as the converted net would have shared weights, thus invalidating an assumption of the theorem.

In future chapters, Theorems 2.7 and 2.8 will serve as the main benchmarks against which results are compared. This is because they are proven to hold at initialization in [13, 21], and thus form a good starting point from which to examine any deviations.

Chapter 3 Methods

There are many ways to count the activation regions of fully connected neural networks. One way to count all regions is to formulate counting as a mixed-integer linear program (MILP), as in [20, 31]. However, optimization of MILPs is NP-hard. An iterative method for exact counting, which this work uses and describes in this chapter, may be faster for small networks, though it becomes slower as the input dimension grows. This work does not aim to find empirical bounds on the number of activation regions over the whole input space, since this is too computationally expensive. Instead, this work focuses on methods to approximate the density of activation regions in certain areas of input space. The central method involves counting the regions traversed by lines in input space. This proxy, or variants of it, for approximating the density of activation regions is used by, for example, Raghu et al. [17] and Novak et al. [14] as a measure of complexity. Furthermore, Theorem 2.8 gives a bound in expectation which is applicable even if the line lies in a higher-dimensional input space.

3.1 Motivation

Exactly counting all regions in input space has a history of use as a measure of complexity in neural networks with piecewise linear activation functions (see e.g., [18, 19]). A piecewise linear function approximating a non-linear function should have lower approximation error if it has many more linear regions: the function is still piecewise linear, but if there are enough linear pieces, it should be able to approximate the non-linear function better. This is the general intuition behind using linear regions or activation regions as a proxy for the complexity of neural networks.

Exactly counting all regions is prohibitively expensive in terms of compute resources; it is therefore reasonable to count regions along a lower-dimensional subspace. This work chooses to count regions along lines, which has previously been used as a measure of complexity and sensitivity [14, 17, 21]. Novak et al. [14] remark that counting the number of regions along some path is an appropriate approximation of the curvature of the network, since piecewise linear functions have constant derivative except at transition boundaries.

3.2 Exact counting

To count all of the activation regions within a hypercube of R^{n_0}, the counting algorithm must maintain a set of vertices for all regions and sequentially go through each neuron, layer by layer, to check if the hyperplane it defines splits the existing regions. Upon a split, the algorithm cuts the region and adds the updated set of vertices to its collection. The algorithm also needs to maintain, as properties of each activation region, the set of its vertices and edges and the activation pattern for each layer that defines the region. This same method is used by Hanin and Rolnick [13].

Concretely, the procedure starts with the collection of activation regions as a single region, with the edges and vertices of the given cube, which has 2^{n_0} vertices in R^{n_0}. Starting at the first hidden layer and moving to the last, neuron by neuron, for each activation region in the collection of regions calculate the effective weight and bias terms w*, b* for the current activation region. For the first layer, the terms are given by the rows of W_1 and b_1. Knowing the activation pattern, for a particular layer l of f the effective W* and b* (for all neurons) can be found by iteratively multiplying by a diagonal matrix S_l whose diagonal is the string of 0s and 1s of the current activation pattern. Letting W*_i and b*_i denote the effective parameters for layer i, the outputs at layer l are

    f^{(1)}(x) = W_1 x + b_1                                            (3.1)
    f^{(2)}(x) = W_2 r(W_1 x + b_1) + b_2                               (3.2)
              = W_2 S_1 W_1 x + W_2 S_1 b_1 + b_2                       (3.3)
              = W*_2 x + b*_2                                           (3.4)
    f^{(l)}(x) = W_l S_{l−1} W*_{l−1} x + W_l S_{l−1} b*_{l−1} + b_l    (3.5)
              = W*_l x + b*_l                                           (3.6)

so

    W*_l = W_l S_{l−1} W*_{l−1},    b*_l = W_l S_{l−1} b*_{l−1} + b_l   (3.7)
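The recursion in equation 3.7 is compact enough to state as code. The following numpy sketch (an illustration under the notation above, not the thesis implementation) returns the effective parameters W*_l, b*_l of layer l for a region whose activation pattern is given as a list of 0/1 vectors a_1, . . . , a_{l−1}.

    import numpy as np

    def effective_params(Ws, bs, pattern, l):
        """Effective parameters (W*_l, b*_l) of layer l (1-indexed) inside the
        activation region whose layer-wise pattern is `pattern` = [a_1, ..., a_{l-1}]."""
        W_eff, b_eff = Ws[0], bs[0]                 # layer 1: W*_1 = W_1, b*_1 = b_1
        for i in range(1, l):                       # fold in layers 2 .. l
            S = np.diag(pattern[i - 1])             # S_i = diag(a_i)
            W_eff = Ws[i] @ S @ W_eff               # W*_{i+1} = W_{i+1} S_i W*_i
            b_eff = Ws[i] @ S @ b_eff + bs[i]       # b*_{i+1} = W_{i+1} S_i b*_i + b_{i+1}
        return W_eff, b_eff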

The effective parameters thus determine the hyperplanes defined by the neurons at each layer for a particular activation region. Next, the procedure checks all of the region's vertices v by evaluating f^{(l)}(v); if the signs are not all the same, then the region is cut, since at least one of the vertices is on the other side of the hyperplane defined by f^{(l)}(v) = 0. To determine which new vertices are created and added to the two regions, the algorithm needs to find where the current hyperplane w*_l^T x + b*_l = 0 intersects the others. This amounts to solving the system of equations

    [ w*_l^T ]     [ −b*_l ]
    [ w*^T   ] x = [ −b*   ]                                            (3.8)

where w*, b* are the effective parameters of one of the other hyperplanes which define the current activation region. The system has two equations and n_0 unknowns, and so is underdetermined when n_0 > 2. If A ≜ [w*_l, w*]^T does not have full row rank, then the two hyperplanes must be parallel and do not intersect. If A has full row rank, there are infinitely many solutions of the form x = x_p + N(A) y, with y ∈ R^{n_0−2} and N(A) ∈ R^{n_0 × (n_0−2)}, where x_p is a particular solution and the columns of N(A) span the null space of A. The vertices to add are given by checking where x_p + N(A) y intersects the lines of the hyperplane given by parameters w*, b*. This process is then repeated for the remainder of the network. The procedure is summarized in Algorithm 3.1.

3.3 Line counting

Line counting refers to taking two points in the input space, and counting how many activation regions were crossed when traversing linearly from one point to the second point. Even though this method counts along a linear trajectory, with a high enough sampling rate, non-linear trajectories can be approximated with many linear components.

The general procedure is as follows. Given two points x_0 and x_1, compute the activation patterns of f at each point. Starting from x_0, move just enough in the direction d toward x_1 so that the current activation pattern changes (here, ‖d‖_2 = 1). This marks a traversal from the activation region in which x_0 is located into an adjacent one. Repeat this process until the current activation pattern matches the activation pattern of the activation region that x_1 is in. The number of activation regions along the line is the number of times the activation pattern changed, plus one.

Finding the activation pattern of a point corresponds to a single feed-forward operation.

Algorithm 3.1 Exact counting of activation regions

    1:  function Effective-Params(f, l, region)
    2:      W ← f.W_1, b ← f.b_1
    3:      for k = 2 . . . l do
    4:          S ← diag(region.act_pattern[k − 1])
    5:          W ← f.W_k · S · W
    6:          b ← f.W_k · S · b + f.b_k
    7:      return W, b
    8:  function Split-Region(region, W, β, l, n)
    9:      let w be the n-th row of W and b the n-th entry of β
    10:     let s_0 be the set of vertices of region for which w^T x + b < 0
    11:     let s_1 be the set of vertices of region for which w^T x + b ≥ 0
    12:     if |s_1| > 0 ∧ |s_0| > 0 then                  ▷ region is cut
    13:         solve equation 3.8 to get two new regions
    14:         return the two regions which result from a cut by w, b
    15:     else                                           ▷ region is not cut
    16:         return region
    17: function Find-Activation-Regions(f, bounds)
    18:     let regions be a list containing a single region: the cube given by bounds, with its vertices and an empty activation pattern
    19:     for l = 1 . . . length(f.layers) do
    20:         for n = 1 . . . f.n_l do
    21:             for m = 1 . . . length(regions) do
    22:                 W, b ← Effective-Params(f, l, regions[m])
    23:                 new_regions ← Split-Region(regions[m], W, b, l, n)
    24:                 regions[m] ← new_regions[1]
    25:                 if length(new_regions) = 2 then
    26:                     regions.add(new_regions[2])
    27:     return regions

To find the amount λ by which x_0 needs to be moved in the direction d toward x_1, one goes layer by layer, calculating W* and b* as in equation 3.7, and then for each layer l solves the problem

    minimize    λ
    subject to  (−1)^{1_l} ⊙ (W*_l (x + λd) + b*_l) > 0                 (3.9)
                ε < λ

letting x = x_0, and where 1_l is the activation pattern of layer l; thus (−1)^{1_l} ∈ {1, −1}^{n_l} and is multiplied elementwise to flip the inequalities. The optimization problem is to break the inequalities which define the polytope that x is in, so that the inequality is reversed. Since the problem amounts to finding the smallest value of λ which breaks a strict inequality, computationally there needs to be a tolerance of ε to push x by enough to move over the nearest boundary. If a neuron which defines a hyperplane w^T x + b, where w is a row of W*_l and b is an entry of b*_l, has activation 0 (i.e. w^T x + b < 0), then the problem is

    minimize    λ
    subject to  w^T (x + λd) + b > 0                                    (3.10)
                ε < λ

There are three cases for solutions. If w^T d > 0 then the solution is

    λ > (−b − w^T x) / (w^T d)                                          (3.11)

and if the above λ < ε the solution is ignored. If w^T d = 0, then there is no solution, and if w^T d < 0, then the inequality flips and the solution has the form

    λ < (−b − w^T x) / (w^T d)                                          (3.12)

Here, there are two cases. The first occurs when the right hand side (RHS) is < ε, in which case the best λ is the RHS and it is ignored since it is below the tolerated threshold for region width. The second occurs when RHS ≥ ε, in which case the solution (numerically) is ε and again it is ignored since the region has width less than or equal to ε. If the neuron has activation 1, then a similar line of reasoning follows, and in either case the accepted solution is taken as

    λ = (−b − w^T x) / (w^T d) + ε                                      (3.13)

This work chooses ε to be 10^{−9}, which is larger than machine epsilon so as to avoid any numerical precision issues; empirically, smaller values did not cause region counts to be larger. Algorithm 3.2 summarizes the procedure.

Algorithm 3.2 Exact counting of activation region transitions along a line

    1:  function Find-Lambda(f, x, d, p)
    2:      W ← W_1, b ← b_1, λ ← ∞, j ← 1
    3:      for l = 1 . . . f.layers do
    4:          candidate_lambdas ← (−b − W x) / (W d)
    5:          candidate_lambdas[candidate_lambdas < ε] ← ∞
    6:          m ← min(candidate_lambdas)
    7:          if m < λ then
    8:              λ ← m, j ← l
    9:          W, b ← Update-Effective-Params(f, p, l, l + 1, W, b)
    10:     return λ, j
    11: function Find-Num-Transitions(f, x_0, x_1)
    12:     p ← Activation-Pattern(f, x_0)
    13:     p_1 ← Activation-Pattern(f, x_1)
    14:     d ← (x_1 − x_0) / ‖x_1 − x_0‖_2
    15:     n ← 0, x ← x_0
    16:     while p_1 ≠ p do
    17:         λ, j ← Find-Lambda(f, x, d, p)
    18:         if λ = ∞ then
    19:             break
    20:         x ← x + λd
    21:         n ← n + 1
    22:         if ‖x_0 − x_1‖_2 < ‖x − x_0‖_2 then        ▷ overshot x_1
    23:             break
    24:         p ← Activation-Pattern(f, x, j)
    25:     return n

In Algorithm 3.2, the function Update-Effective-Params simply computes equation 3.7 from layer l to layer l + 1. Line 24 can be optimized, since not all activation patterns need to be updated: only the activation patterns of the layer that corresponds to the change in neuron activations for the smallest value of λ, and of later layers, need to be updated.
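The core of Find-Lambda (lines 4–8 of Algorithm 3.2) can be written compactly by computing the candidate crossing point of equations 3.11–3.13 for every neuron of a layer at once and discarding the invalid cases. The following numpy sketch is illustrative only and assumes the effective parameters W*, b* of the layer have already been computed as in equation 3.7.

    import numpy as np

    EPS = 1e-9  # the tolerance epsilon used in the text

    def layer_lambda(W_eff, b_eff, x, d):
        """Smallest step lambda > EPS along unit direction d at which some neuron of
        this layer changes sign, i.e. w^T (x + lambda d) + b crosses zero."""
        num = -(W_eff @ x + b_eff)                   # -b - w^T x for every row w of W_eff
        den = W_eff @ d                              # w^T d for every row
        with np.errstate(divide="ignore", invalid="ignore"):
            lam = num / den                          # candidate crossing point per neuron
        lam[~np.isfinite(lam)] = np.inf              # w^T d = 0: hyperplane never crossed
        lam[lam < EPS] = np.inf                      # boundary behind x, or closer than tolerated
        return lam.min() + EPS                       # nudge just past the boundary (equation 3.13)

Taking the minimum of layer_lambda over all layers gives the λ and layer index j returned by Find-Lambda; if every candidate is infinite, the line never leaves the current activation region in the direction d.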

3.4 A simpler counting strategy?

Of course, a much simpler way of counting the activation regions defined by a network is to just compute the activation patterns of a large number of points in the input space and count the number of unique activation patterns. One could impose a grid over the input space with points spaced equidistantly for some resolution, perform a single feed-forward operation per point to obtain the activation patterns, and then efficiently calculate the number of unique activation patterns.
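A hedged sketch of this simple strategy (illustrative only; activation_pattern is the helper sketched in Chapter 2, and the two-dimensional domain and 0.005 resolution mirror Figure 3.1):

    import numpy as np

    def grid_count(Ws, bs, resolution=0.005, lo=-1.0, hi=1.0):
        """Approximate |R(f)| on [lo, hi]^2 by counting unique activation patterns
        over an equispaced grid; regions smaller than the spacing are missed."""
        ticks = np.arange(lo, hi + resolution, resolution)
        patterns = set()
        for x1 in ticks:
            for x2 in ticks:
                p = activation_pattern(Ws, bs, np.array([x1, x2]))
                patterns.add(tuple(np.concatenate(p)))   # flatten {a_l} into one hashable code
        return len(patterns)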

The issue with this strategy is that, even within a small subset of the input space, some networks do have many regions, so the resolution of such a grid would need to be very fine-grained. Figure 3.1 shows an example of the two approaches. The first problem is that the simple strategy undercounts the number of activation regions even for a small network on a small subset of input space. It undercounts because the resolution is not high enough to capture some small activation regions. This disparity is magnified for larger networks and larger input dimension. The second issue is that the simple strategy over the domain [−1, 1]^2 with resolution 0.005 has a grid of size 200 × 200 and so needs to compute activation patterns for 40,000 points. In contrast, the exact counting method in Algorithm 3.1 only needs, for each neuron in the network, to check for intersecting hyperplanes in the previously defined regions. Thanks to Theorem 2.7, in expectation there are O(N^{n_0}/n_0!) operations which check for intersecting hyperplanes for exact counting, versus Θ(c) feed-forward operations for the simple approach, where c is the number of points to sample.

The same difficulties may be encountered when counting along lines, though because c does not scale with the input dimension, the difference is smaller. In similar fashion to exact counting, thanks to Theorem 2.8 there are O(N) line intersection operations in expectation versus Θ(c) feed-forward operations for the approximate method. To count regions along a line of length ℓ, the simple strategy would need to perform ℓ/ε feed-forward operations to match the resolution of Algorithm 3.2.

Therefore, for all experiments this thesis uses the exact counting Algorithms 3.1 and 3.2.

[Figure 3.1: two panels showing the activation regions found by the two counting strategies.]

Figure 3.1: Two different strategies to count the activation regions of a two-layer network with 10 neurons at each layer on [−1, 1]^2. Left: a simple counting scheme wherein points are sampled and then counted by checking how many unique activation patterns exist; a resolution of 0.005 is used, and some polygon edges look curved or jagged because the resolution is too low to capture some details. Right: counting exactly as per Algorithm 3.1. Regions are colored randomly.

3.5 Data

This work uses two datasets: a synthetic spiral dataset and MNIST [32]. The spiral dataset is a binary classification dataset with two-dimensional input, and is sampled 100 times (50 points for each class) from the parametric curve

    (x, y) = (α cos(t + ρ_i) e^{βt}, α sin(t + ρ_i) e^{βt})             (3.14)

with α = 0.8, β = 0.25, ρ = (0, π) and t = ln 0.5 / ln 1.33 . . . ln 2π / ln 1.33. The test set is 15 points and is generated by sampling points at random from the curve. For all experiments the test set is the same, since the random number generator used is seeded consistently. The data is then standardized to be centered around the origin. Figure 3.2 shows the dataset. There is no additive noise in the dataset.
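For concreteness, a sketch of the spiral generator implied by equation 3.14 is given below (illustrative only; the exact sampling of t, the random test split and the seed used in the thesis are not reproduced here).

    import numpy as np

    def spiral_dataset(n_per_class=50, alpha=0.8, beta=0.25):
        """Two-class spiral from equation 3.14; rho = 0 for class 0 and pi for class 1."""
        t = np.linspace(np.log(0.5) / np.log(1.33),
                        np.log(2 * np.pi) / np.log(1.33), n_per_class)
        X, y = [], []
        for label, rho in enumerate((0.0, np.pi)):
            x1 = alpha * np.cos(t + rho) * np.exp(beta * t)
            x2 = alpha * np.sin(t + rho) * np.exp(beta * t)
            X.append(np.stack([x1, x2], axis=1))
            y.append(np.full(n_per_class, label))
        X, y = np.concatenate(X), np.concatenate(y)
        X = X - X.mean(axis=0)        # center the data around the origin
        return X, y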

The spiral dataset provides a good testing ground for experiments as it is not very costly to train models on. It is possible to do exact counting for many models in a reasonable amount of time, without excessive computational resources. It also makes it easy to bypass any expressivity issues and focus on the evolution of activation regions over training. Most importantly, it is not costly to train models where N ≫ n_0, thereby remaining applicable to Theorem 2.7 when doing exact counting. However, it is possible that some results may not carry over to the regime wherein n_0 ≫ N.

[Figure 3.2: scatter plot of the synthetic spiral dataset over (x1, x2).]

Figure 3.2: The synthetic spiral dataset.

The MNIST dataset contains 28 × 28 images of handwritten digits (0-9).

Its training set is 60,000 images and the test set has 10,000 images. This work standardizes the dataset elementwise to be centered around the origin, and has the same validation set of size 1000 images across all runs.

This work investigates the effect of corrupting labels on both datasets. This is done by selecting a random percentage of the labels and, for each selected label, uniformly choosing an incorrect label from among the remaining classes.

3.6 Line types

It is important to note that, to be applicable to Theorem 2.8, the lines selected for counting must not be biased toward certain areas of the input space. For this purpose, this work chooses lines which pass through a random training point and the origin and then extend from the origin to a length of 2‖m‖_2, where m is the point in the dataset with the largest two-norm. By counting along these lines, Theorem 2.8 can be evaluated and used as a reference.

Even though Theorem 2.8 is a good result to keep in mind, it does not fully describe the layout of activation regions, since different parts of the input space may have different arrangements of activation regions. For example, the MNIST dataset has an input dimension of 28^2 = 784, and a large majority of the input space (i.e. of possible inputs) is just noise. To investigate any difference in the distribution of activation regions in the input space, this work also reports region-count densities along paths generated by linear interpolation in the latent space of an autoencoder. The purpose of this is to generate trajectories which stay closer to the data manifold when compared to straight lines through the data.

This is by no means a perfect solution, as autoencoders do not always generate perfect images. Still, the paths generated this way are visually more similar to images than those obtained by linear interpolation between data points. The problem of generating or sampling values in input space does not enjoy a one-size-fits-all solution. Empirically, this work found that this approximation worked well enough to make some broad conclusions.

Figure 3.3 shows sample paths generated from an autoencoder whose encoder has three hidden layers of dimension 128, 32, 12 (i.e. a 12-dimensional latent space) and whose decoder layers are reversed. This autoencoder was trained with the Adam optimizer [33] with learning rate α = 10^{−3} as well as the default values (β_1, β_2) = (0.9, 0.999) and ε = 10^{−8} suggested in the original work [33]. The autoencoder loss function is the binary cross-entropy loss and the model is trained for 20 epochs. This same autoencoder is used for all experiments described in Chapter 4.
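As an illustration of how such paths are produced (a sketch, not the thesis code; encode and decode stand for the two halves of the trained autoencoder, and the resolution of 12 matches the setting used in Chapter 4), linear interpolation happens in latent space and each intermediate code is decoded back to input space; region counting then applies the line counting of Section 3.3 to each consecutive pair of decoded points.

    import numpy as np

    def latent_path(encode, decode, x_start, x_end, resolution=12):
        """Piecewise-linear path between two inputs obtained by interpolating their
        latent codes and decoding every intermediate code back to input space."""
        z0, z1 = encode(x_start), encode(x_end)
        alphas = np.linspace(0.0, 1.0, resolution)
        return [decode((1.0 - a) * z0 + a * z1) for a in alphas]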

The experiments presented normalize the counts by the Euclidean length of the total path, since it is possible (and in general empirically true) that the paths from the autoencoder have a different length than the interpolated paths. That is to say, the counts over a line ℓ are reported as

    R(f ; ℓ) / ‖ℓ‖_2                                                    (3.15)

Figure 3.3: Six sample paths between random MNIST training points. Top: linear interpolation between the points. Images which fall along the line between the points, especially if they do not belong to the same class, are visually not representative of actual training data. Bottom: linear interpolation in the latent space of an autoencoder. Images along this path look visually closer to being real training images.


Chapter 4 Results

This chapter reports on two main types of experiments that change the architecture of ReLU nets. The first experiment operates on a fixed budget of neurons and varies the depth, with the width chosen so that all layers in a given network have the same width. The second investigates the effects of adding layers of fixed width; here, networks vary in the total number of neurons. Both experiments are run on the spiral and MNIST datasets, and with various levels of label corruption. Where possible, this work tries to select network sizes and hyperparameters (chosen via hyperparameter tuning beforehand) such that as many of the configurations as possible can overfit, or achieve similar accuracy, on the training set, so as to avoid entangling the results with expressivity.

4.1 Fixed neuron budget

This experiment operates on a fixed budget of N neurons and considers depths k = 1 . . . L; a network of depth k has N/k neurons at each layer. For simplicity, N and L are chosen such that every k = 1 . . . L divides N.
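As an illustration of this setup (a sketch, not the thesis code; PyTorch is assumed here purely for convenience, and the two-class output for the spiral data is an assumption), the following builds the fully connected ReLU configurations for a budget of N neurons and depths k = 1 . . . 5.

    import torch.nn as nn

    def fixed_budget_mlp(n_in, n_out, budget, depth):
        """Fully connected ReLU net with `depth` hidden layers of width budget/depth."""
        assert budget % depth == 0, "depth must divide the neuron budget"
        width, layers, prev = budget // depth, [], n_in
        for _ in range(depth):
            layers += [nn.Linear(prev, width), nn.ReLU()]
            prev = width
        layers.append(nn.Linear(prev, n_out))      # linear output layer f_out
        return nn.Sequential(*layers)

    # e.g. the five spiral configurations with a budget of N = 60 neurons:
    models = {k: fixed_budget_mlp(2, 2, 60, k) for k in range(1, 6)}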

4.1.1 Spiral dataset

For the spiral dataset, the two-dimensional activation regions are counted exactly using Algorithm 3.1. Here n_0 = 2 and N = 60; therefore, Theorem 2.7 predicts the number of regions to be bounded in expectation by

    60^2 / 2! = 1,800                                                   (4.1)

[Figure 4.1: four panels plotting the number of regions against the update step and against train accuracy, for 0.00% and 15.00% label noise, with one curve per depth (1 to 5 layers).]

Figure 4.1: Region counting for five models with a fixed budget of 60 neurons on the spiral dataset with no label noise and 15% label noise, trained for 5,000 epochs. All networks except the single-layer network can achieve 100% training accuracy. Training is over 5 independent runs and averaged; the shading around the average is one standard deviation. The red line marks 1,800 regions. Please consult Figure A.1 in Appendix A for additional plots and noise levels.

All networks are trained for 5,000 epochs (regardless of final accuracy) with standard stochastic gradient descent (SGD), learning rate 0.0018 and batch size 4. All network weights are initialized with uniform He initialization [34] and variance 2/fan-in, as suggested in [35], to avoid a failure mode of ReLU nets due to extreme values of the mean activation length. The network biases are initialized by sampling from a normal distribution with mean zero and variance 10^{−6}. The results are averaged over 5 independent runs.
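A hedged sketch of this initialization (illustrative only; it reuses the fixed_budget_mlp helper sketched above, and the mapping to PyTorch initializers is an assumption rather than the thesis code):

    import math
    import torch.nn as nn

    def init_weights(module):
        """Uniform weights with variance 2/fan-in and tiny zero-mean Gaussian biases."""
        if isinstance(module, nn.Linear):
            fan_in = module.weight.shape[1]
            bound = math.sqrt(6.0 / fan_in)          # Uniform(-a, a) has variance a^2/3 = 2/fan-in
            nn.init.uniform_(module.weight, -bound, bound)
            nn.init.normal_(module.bias, mean=0.0, std=math.sqrt(1e-6))

    model = fixed_budget_mlp(2, 2, 60, 3)
    model.apply(init_weights)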

Figure 4.1 shows the evolution of the number of activation regions over time, as well as how accuracy on the training set correlates with the number of activation regions. It shows that, for a fixed number of neurons, the number of regions is well predicted by Theorem 2.7 and is far below the maximum from Theorem 2.6. This is clear even after training. With label noise, all networks but the single-layer network still overfit the data, and after training there is generally a larger number of activation regions, though the growth is not exponential.

4.1.2 MNIST dataset

For the MNIST dataset, this work reports on two sizes of networks. The first considers a budget of 180 neurons and depths k = 1 . . . 5. Networks are trained over 100 epochs with batch size 100 and learning rate 0.0005, which were selected by hyperparameter tuning. All networks were trained with SGD with a cyclical learning rate with maximum 0.01 over 4 cycles, as in [36]. A cyclical learning rate was used as it required less hyperparameter tuning and sped up training, especially for larger models. For the small configuration, results are averaged over three training runs, and for the large configuration there is only a single training run. For all MNIST runs across all experiments, line counts are averaged across 30 lines, and for the autoencoder paths a resolution of 12 is used.
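One way to realize such a schedule (a sketch only; the base learning rate of the cycle and the triangular cycle shape are assumptions, and PyTorch is used just for illustration) is with a cyclical learning-rate scheduler whose half-cycle length is chosen so that the run contains four full cycles.

    import torch
    from torch.optim.lr_scheduler import CyclicLR

    model = fixed_budget_mlp(784, 10, 180, 3)             # one of the 180-neuron MNIST configs
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)

    steps_per_epoch = 60000 // 100                        # training images / batch size
    total_steps = steps_per_epoch * 100                   # 100 epochs
    scheduler = CyclicLR(optimizer, base_lr=0.0005, max_lr=0.01,
                         step_size_up=total_steps // (2 * 4),   # 4 full cycles over the run
                         cycle_momentum=False)
    # call scheduler.step() after every optimizer.step()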

Figure 4.2 shows that there is not too much difference in the number of regions along a line for the various network depths. Moreover, the number of regions generally remains constant near 180. Figure 4.3 highlights that the paths generated by the autoencoder are less dense in the number of regions when compared to the linear trajectories. Nevertheless, the pattern remains that the number of regions is more dense with less noise. The region density follows a U-shaped curve when plotted against accuracy, which reflects the initial drop in density found in the first column of Figure 4.3. This peculiar trend has been observed before, for example in [21], though it remains unexplained.

In an effort to investigate whether training networks to near 100% accuracy with larger amounts of label noise leads to higher region counts, another set of experiments is conducted, this time with a budget of 360 neurons and k = 1 . . . 5, for 500 epochs with 20 learning rate cycles. Otherwise, the training regime is unchanged. Figures 4.7 and 4.4 show the results. Surprisingly, even when achieving high accuracy on MNIST with 20% label noise, the number of regions does not greatly exceed 360. Moreover, the trends in densities are consistent with the smaller models.

4.2 Adding layers

4.2.1 Spiral dataset

In this experiment, for the spiral dataset, each network has a fixed width of 18 neurons and varies in depth from k = 1 . . . 5. The same hyperparameters and training regime as in the fixed neuron budget experiment are used here. Theorem 2.7 gives an upper bound in expectation of 162, 648, 1,458, 2,592, and 4,050 regions for k = 1 . . . 5, respectively.
