Navigating Deep Classifiers

(1)

IN

DEGREE PROJECT ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

,

STOCKHOLM SWEDEN 2020

Navigating Deep Classifiers

A Geometric Study Of Connections Between Adversarial Examples And Discriminative Features In Deep Neural Networks

JOHANNES LEONHARD RÜTHER

(2)

Abstract

Although deep networks are powerful and effective in numerous applications, their high vulnerability to adversarial perturbations remains a critical limitation in domains such as security, personalized medicine or autonomous systems. While the sensitivity to adversarial perturbations is generally viewed as a bug of deep classifiers, recent research suggests that they are actually a manifestation of non-robust features that deep classifiers exploit for predictive accuracy. In this work, we therefore systematically compute and analyze these perturbations to understand how they relate to discriminative features that models use.

Most of the insights obtained in this work take a geometrical perspective on classifiers, specifically the location of decision boundaries in the vicinity of samples. Perturbations that successfully flip classification decisions are conceived as directions in which samples can be moved to transition into other classification regions. Thereby we reveal that navigating classification spaces is surprisingly simple: Any sample can be moved into a target region within a small distance by following a single direction extracted from adversarial perturbations. Moreover, we reveal that for simple data sets such as MNIST, discriminative features used by deep classifiers with standard training are indeed composed of elements found in adversarial examples.

Finally, our results also demonstrate that adversarial training fundamentally changes classifier geometry in the vicinity of samples, yielding more diverse and complex decision boundaries.

Abstract in Swedish

Även om djupa neurala nät är kraftfulla och effektiva i många användningar, är deras stora sår- barhet för medvetna störningar (adversarial perturbations) fortfarande en kritisk begränsning inom områden som säkerhet, individanpassad medicin eller autonoma system. Även om känsligheten för medvetna störningar i allmänhet betraktas som en brist hos klassifierare baserade på djupa nät, tyder färsk forskning på att de i själva verket är ett uttryck för orobusta features som klassifierarna utnyttjar för att göra exakta prediktioner. I detta arbete beräknar och analyserar vi därför system- atiskt dessa störningar för att förstå hur de förhåller sig till diskriminativa features som modellerna använder.

De flesta insikter som erhålls i detta arbete har ett geometriskt perspektiv på klassificerare, särskilt placeringen av beslutsgränserna i närheten av datasamplen. Störningar som framgångsrikt kan ändra på klassificeringsbeslut utformas som riktning där datasamplen kan flyttas in till andra klassificeringsregioner. På så sätt avslöjar vi att det är förvånansvärt enkelt att navigera i klassi- ficeringsrymden: Ett godtyckligt sampel kan flyttas till en annan närliggande klassificeringsregion genom att man följer riktningen som extraherats från medvetna störningar. Dessutom avslöjar vi att när det gäller enkla datauppsättningar som MNIST, består de diskriminerande features som an- vänds av djupa klassifierare, tränade med standardmetoder, faktiskt av element som återfinns bland de medvetna störningsexemplen.

Slutligen visar våra resultat också att medvetna störningar i grunden förändrar klassificerargeometrin i närheten av datasampel, vilket ger mer varierande och komplexa beslutsgränser.

(3)

1 Introduction

Deep neural networks are currently the state-of-the-art in various domains and tasks, like image classification. Despite their impressive performance, they have been proved extremely vulnerable to adversarial perturbations: small alterations of the input images, often imperceptible to the human eye, that surprisingly suﬀice to change a network’s classification decision with high confidence [24].

Designed as attacks on classification systems, these perturbations can pose critical threats in domains such as security, personalized medicine, or autonomous systems.

Original Sample Label: frog

Adversarial Noise kδk²= 0.5

Perturbed Sample Label: airplane

Figure 1.1: Example of an adversarial attack on ResNet18 with standard training on CIFAR10. The added perturbation (center) is scaled up for better visibility.

The phenomenon of adversarial attacks is a symptom of two central shortcomings in state-of-the-art classification models:

1. Lack of Robustness:

“Networks do not discriminate the way humans do.”

From a human perspective, robustness implies that changes to samples that flip the model’s classification decision align with what “actually” changes the class when properly representing the problem. Adversarial examples, in contrast, change the classifier’s decision despite being semantically meaningless or even imperceptible.

2. Lack of Interpretability:

“We do not know how networks discriminate.”

If the decision guidelines of neural network models were perfectly interpretable, we would know exactly why and how adversarial examples work. The fact that we cannot explain their decisions makes it almost impossible to understand when and why their reasoning is in error.

Although adversarial perturbations are generally viewed as bugs of deep classifiers, recent research ([10], [9]) suggests that they actually are a manifestation of non-robust features that deep classifiers exploit for predictive accuracy. These findings spark the interest in studying adversarial examples explicitly and how they relate to the discriminative features that models use. While many previous works are aimed at building mechanisms to prevent attacks, this thesis focuses on what conclusions we can draw from investigating perturbations themselves. In this project, we therefore propose to systematically compute and analyze adversarial perturbations to better understand their nature.

Many of these insights are obtained by taking a geometrical perspective on classifiers and the location of decision boundaries with respect to samples. We conceive perturbations as directions in which samples can be moved to transition into other classification regions. By doing that, we reveal that navigating to target regions is surprisingly simple in standard models.

(5)

Moreover, we attempt to answer the question whether these directions describe the image features which networks consider discriminative. Thereby, we take the idea of “adversarial examples as features” presented in [10] and [9] to the limit: If adversarial perturbations incorporate the features that need to be changed to push samples over decision boundaries, do they constitute the elements that networks really use to distinguish classes? Can we consequently compute bases for subspaces in which deep classifiers really base their decisions?

The main contribution of this work lies in a thorough analysis of the geometrical properties of adversarial examples and how they are connected to discriminativeness. We show that for simple data sets such as MNIST, the discriminative elements used by deep classifiers with standard training are indeed constituted by the elements found in adversarial examples. Our analysis also reveals that in standard classifiers, perturbations that push samples of any class into the same decision region are highly correlated. This allows the computation of directions along which any sample can be moved into a target region within a small distance, thus making them targeted universal perturbations.

2 Literature Review

2.1 Interpretability

Irrespective of the high predictive accuracy of a model, it is advantageous and sometimes critical to confirm that it represents the classification problem properly. Tools for interpretability can also serve to extract patterns that provide new insights instead of just automating decisions based on prior knowledge. Many steps have therefore been taken towards more interpretable models [16].

A first look “into” deep neural networks is provided by visualizing units at different layers of neural network models to see what is happening “inside” of the model. This approach finds visualizations of features by tweaking inputs to maximize the activation of neurons representing them [3].

The activation maximization approach can be enhanced by introducing regularizers that make these visualizations resemble the original data as good as possible (e.g.[14], [19]). Although these methods visualize the internal representations of images by a network in general, they do not deliver explanations for individual classification decisions.

Explaining a model’s decision means to answer what criteria a data sample fulfills to be of a certain class. Viewing a sample as a collection of features, we can describe a decision by decomposing the combination of features that were relevant to the decision; that is we assign a score to each feature of a given data point, i.e. to each pixel in an image.

Most simply, one can compute the gradient of the model’s output function with respect to every input component (i.e. pixel). This sensitivity analysis has been used to create saliency maps of deep neural networks [23]. Given that this method works with derivatives of the classification function, it helps us understand “what makes this image more/less a car” rather than the more basic question

“what makes this image a car”, as [16] points out.

The idea of relevance scores has been leveraged further by decomposing how relevance scores propa- gate throughout the network ([12],[1]). Pooling scores along multiple pixels allows further abstraction by connecting relevant regions and drawing bounding boxes around elements that are important to the model.

The two above-mentioned techniques of neuron-visualization and attribution can be combined into “semantic dictionaries”. They visualize features represented by specific neurons and how strongly the network detects them at a particular position in an image [20].

(6)

2.2 Robustness

The development of attack and defense mechanisms to make networks more resilient against adversarial attacks has been pursued with much effort in recent years. Adversarial training [15] has been shown to produce models that are more robust and reliable under attacking scenarios than classically trained models.

Although other methods for defending against adversarial attacks have been proposed, many of them are able to protect models only against specific types of attacks, as e.g. [2] shows with the example of distillation [8]. Adversarial training, therefore, continues to be the mainstay of robust model building for now. Experiments in this work compare the adversarial examples produced by standard and adversarially trained models, outlining some of the major differences.

In [9], researchers claimed that adversarial examples are not bugs of the system, but features that the model exploits for improving the predictive accuracy. According to their reasoning, models use both robust and non-robust features, all of which are useful for generalization. Using an adversarially trained model they were able to disentangle the robust and the non-robust features of a dataset. The idea of adversarial examples being features is one of the main inspirations for this work and pushes its ideas further (see Sec. 3). They showed that training a model on a version of the dataset that contains only robust features is enough to achieve good robust accuracy. They also showed that using only non-robust features is enough to train a model that generalizes well to the original data (with low robustness though).

Recently, another work [22] showed that adversarially trained classifiers also have benefits in terms of interpretability compared to their cleanly trained counterparts. This illustrates, how closely connected robustness and interpretability are and that adversarial examples are a connecting element between the two. Moreover, adversarially trained models are better suited for computer vision tasks like image generation, inpainting, or denoising. According to this line of reasoning, they argued that adversarial training not only improves robustness, but also promotes the system to use human- aligned features.

2.3 Geometry of Neural Network Classifiers

The existence of adversarial examples implies that data samples lie very close to the decision boundary of neural network classifiers with standard training. The local geometric properties of these boundaries can be investigated by looking at the directions of minimal adversarial perturbation, which are orthogonal to the boundary [4]. In [5] researchers found that classification boundaries are relatively flat along most directions of the input space, but strongly curved in some directions.

They empirically measured that the curvature profile of boundaries is highly sparse. Moreover, they showed that classifiers are most vulnerable to adversarial perturbations in these curved directions.

Another property of decision boundaries is revealed through the scope of universal adversarial perturbations [17], which flip a classifier’s decision on any sample using a “one-fits-all” perturbation.

These attacks are based on the notion that the orientations of decision boundaries in the vicinity of data samples are highly correlated. Therefore it is possible to compute one direction that brings any sample out of its original classification region within a small distance. The subspace of directions in which decision boundaries can be successfully crossed is relatively low-dimensional, as was measured in [17]. This work is going to extend the notion of correlated directions to universal targeted perturbations: Sec. 7 will illustrate the existence of shared directions that transition into any chosen target class instead of just leaving the region of the true class.

During the literature review towards the end of this project, I have become aware of [10], which contains findings that are similar to my thesis. In particular the authors describe the notion of

(7)

adversarial examples spanning low-dimensional subspaces of discriminative features. They validate this idea by projecting samples onto suspected discriminative subspaces, similarly as is in this work.

There is a general alignment of results in [10] and Sec. 6, although some experiments made in [10]

will be critically discussed and improved. Additionally, this work is based on the more simple, straightforward technique of computing suspected discriminative subspaces directly from adversarial examples. Moreover, this work provides additional relevant geometrical insights (see Sec. 5) and investigates the properties of robust models in this regard.

Finally, the authors of [21] have shown that the margin between samples and the closest class boundaries is small in directions that a model considers discriminative [21]. In consequence, adding samples with small perturbations to the training set caused the model to develop larger margins in the direction of the perturbations.

(8)

3 Motivation

In [9], the authors suggested that adversarial examples are a result of highly predictive, but non- robust features. To provide an intuition on why adversarial examples point at discriminative directions, consider this simple example:

Suppose we want to classify a two-dimensional dataset with two classes. Samples of both classes are Gaussian-distributed and have different means such that a simple linear classifier separates both classes perfectly(see Fig. 3.1).

Discriminative direction

Dec ision

boun dary

Figure 3.1: 2D Data distribution consisting of two Gaussians, separated by a Linear SVM. High- lighted points and connected arrows represent adversarial attacks. From these adversarial directions we can derive a general discriminative direction and an orthogonal, non-discriminative direction.

If we were to compute adversarial attacks on samples, they would point in the discriminative direction of the classifier, perpendicular to the decision boundary. We could thereby find the basis for a discriminative subspace, spanned by the one direction that all perturbations that flip the clas- sification decision have in common. Projecting all samples onto this subspace would be enough to successfully classify them, because the predictive information in this subspace is high. Knowing this discriminative subspace would allow us to interpret the decision that a classifier made on a specific sample because we would know exactly in what subspace it was based. Additionally, we would know what subspace did not play a role in the decision finding of the classifier, namely the one that is orthogonal to the discriminative subspace.

The research explained in Sec. 2.3 suggests that the geometry of deep classifiers might be relatively simple, exhibiting sparse curvature and connected classification regions [5]. Assuming that adversarial directions actually point at discriminative features in the data, similarly as in the example above, it seems promising to systematically discover these directions, along with the subspaces they live in. Being aware of all discriminative and non-discriminative directions around a data point could enable us to explain a particular classification decision if we have access to the relevant geometric features of the network around it.

In this project we aim to systematically analyze the bundle properties of adversarial examples of a network by looking at their main geometrical features. In this context, we will analyze the ergodicity of the attacking process: Do multiple orthogonal attacks on few samples look different

(9)

than single attacks on all available samples? Do they use different subspaces to transition samples out of their original decision region? We are going to discuss these questions in Sec. 5.4.

If adversarial examples really express discriminative information of a dataset, can we extract a set of discriminative features by collecting all adversarial directions?

By abstracting the common information in these directions, can we find a meaningful basis that can characterize the subspaces in which the investigated neural networks base their decisions? These questions are going to be elaborated on in Sec. 6.

The authors of [17] have found that in the vicinity of natural images, the orientation of close decision boundaries is very correlated for many data points, which is why adversarial attacks often exploit the same direction to flip the label. Therefore, they were able to find directions in which all samples quickly reach a decision boundary, regardless of their location. The perspective of the researchers in [17] focuses on the decision boundary as the border of true classification region. In this project, instead of asking whether there are common directions to leave a classification region, we are looking for those that let us enter a new region. If these directions are correlated, could they be used as universal targeted perturbations? The possible phenomenon on universal targeted attacks will be investigated in Sec. 7.

(10)

4 Methodology

4.1 Systematic Discovery of Adversarial Directions

Let (x, y) ∈ D be some data sample belonging to a classification dataset D with c classes. Also, let f : R^d→ R^c denote the function realized by a neural network before the softmax layer. We call the function, which assigns a label to a sample x hypothesis function hθ.

h_θ(x) = arg max

i

[f (x)]i. (1)

Following the intuition of [21], models consider information from particular subspaces to make their classification decision. Other subspaces are less relevant to the decision as the classifier is in- variant to changes in these subspaces. If adversarial examples point at discriminative features, as the authors of [9] suggest, the collection of many adversarial examples for every sample might reveal the full subspace that a model uses to discriminate one class from another. To increase the information out our disposal, we will develop a methodology to compute multiple orthogonal adversarial attacks for each sample and hence augment our view of the neural network landscape.

Now, let S(x) ⊆ R^d be the subspace of discriminative features used to correctly classify x by f. We approximate this subspace using Algorithm 1. A similar construction was used in [6] to find the curvature profile of f around x and study universal adversarial perturbations. Here, we propose interpreting S(x) as the full subspace of discriminative features for interpretability. The strategy to compute each individual attack is described in Sec. 4.2. Examples of multiple attacks on the same sample can be be found in Fig.4.1.

True label: 7 Label: 9 Label: 9 Label: 3

(a) LeNet, standard training on MNIST

True label: 7 Label: 9 Label: 9 Label: 7

(b) LeNet, adversarial training on MNIST True label: horse Label: bird Label: dog Label: bird

(c) ResNet18, standard training on CIFAR10

True label: horse Label: bird Label: deer Label: deer

(d) ResNet18, adversarial training on CIFAR10 Figure 4.1: Examples of multiple orthogonal attacks on single samples of MNIST and CIFAR10.

More examples can be found in Fig. A.2

4.2 Projected Gradient Descent on Subspaces

Individual adversarial attacks are found by maximizing the loss ℓ of the classifier using an additive perturbation δ.

maximize

δ∈∆ ℓ(hθ(x + δ), y) (2)

where ∆ denotes the set of allowed perturbations. In this project, ∆ has two main constraints.

First, we bound the perturbation δ to lie within an ℓ2 ball of size ε around the sample x. Second, δ must lie in the subspace that is orthogonal to the subspace spanned by all previously found attacks

(11)

Algorithm 1 Discriminative Subspace Discovery Algorithm.

1: Input: f, (x, y), ϵ, kmax 2: Output: S(x)

3: S(x) ← {0}, k ← 0

4: while k ≤ kmaxdo

5: N ← R^d\ S(x)

6: r←SubspacePGD(f, x, y; ϵ, N ) ▷ PGD attack [15] on subspace N (see Sec. 4.2)

7: S(x) ← S(x) ⊕ span({r})

8: k← k + 1

9: end while

on a sample. We find the solution to (2) iteratively through Projected Gradient Descent (PGD) [15]

with additional constraints. At every iteration we take a step into the direction of the steepest ascent of ℓ:

δ^′= PN(∇δℓ(hθ(x + δ), y)) (3)

The projection PN onto the subspace N = R^d\ S(x)ensures that the update lies in the subspace that is orthogonal to all previous attacks. The update δ^′ is normalized by its ℓ²-norm and multiplied with the step size α before it is added to δ:

δ= Pε

(

δ+ α δ^′

∥δ^′∥

) (4)

Pε represents the projection onto the ℓ²-ball of size ε. We update δ until the maximum of specified iterations is reached. The number of iterations is chosen empirically and should be high enough such that the perturbation can reach its maximum norm ε.

4.3 Attacked Datasets and Models

Throughout this work, CIFAR10 and MNIST are used as example datasets. Attacks will be computed on two different architectures: ResNet18 [7] one trained on CIFAR10 [11] and LeNet trained on MNIST [13].

While CIFAR10 consists of 32x32 RGB-images, MNIST are of size 28x28 and grayscale. Both sets comprise 50,000 samples for training and 10,000 samples for testing. Orthogonal attacks and their analysis are calculated solely on test samples.

For each trained network there exists a “robust” version based on L2 adversarial training with ε = 1 and 2 for CIFAR10 and MNIST respectively. Throughout the following experiments, classifiers with standard and adversarial training will be compared regarding the properties of adversarial attacks against them.

(12)

truck ship horse frog dog deer cat bird automobile airplane

truck: 4.9 % ship: 2.7 % horse: 1.6 % frog: 26.8 % dog: 3.4 % deer: 8.0 % cat: 9.7 % bird: 24.2 % automobile: 12.2 % airplane: 6.5 % Unsuccessful: 0.0 %

(a) ResNet18, standard training on CIFAR10

9 8 7 6 5 4 3 2 1 0

9: 15.6 % 8: 13.3 % 7: 8.8 % 6: 3.4 % 5: 14.8 % 4: 9.6 % 3: 14.5 % 2: 10.2 % 1: 2.2 % 0: 4.0 % Unsuccessful: 3.6 %

(b) LeNet, standard training on MNIST

truck ship horse frog dog deer cat bird automobile airplane

truck: 11.6 % ship: 8.9 % horse: 8.5 % frog: 12.0 % dog: 9.9 % deer: 7.5 % cat: 7.0 % bird: 8.6 % automobile: 8.5 % airplane: 8.6 % Unsuccessful: 9.0 %

(c) ResNet18, adversarial training on CIFAR10

9 8 7 6 5 4 3 2 1 0

9: 8.2 % 8: 8.1 % 7: 6.1 % 6: 2.8 % 5: 6.4 % 4: 6.8 % 3: 7.4 % 2: 6.8 % 1: 1.3 % 0: 2.3 % Unsuccessful: 43.7 %

(d) LeNet, adversarial training on MNIST Figure 5.1: Outcome of 100,000 adversarial attacks (10 PGD attacks (ε = 4) on each of 10,000 test samples) on CIFAR10 and MNIST. True classes of samples are on the left side of each Sankey diagram, labels of perturbed samples on the right. If true class and label of perturbed sample are equal, the attack counts as unsuccessful.

5 Geometrical Properties Derived From Adversarial Attacks

5.1 Attack Statistics And Preferred Target Labels

The first insight on the collection of adversarial attacks is obtained from looking at the labels that classifiers assign to perturbed samples. This provides us with an intuition on the vulnerability of networks trained with different settings and data sets in the allowed attack range (ε = 4). The results of all 100,000 attacks (10 attacks on 10,000 test samples) are summarized in Fig. 5.1.

As expected, networks with standard training are very vulnerable to attacks, leading to a low rate of unsuccessful perturbations. Given the large attack radius of ε = 4, the rate of successful attacks stays relatively high in adversarially trained models, although robustness increases significantly. Adversarial training yields a particularly strong improvement for LeNet on MNIST, where 43.7%of attacks are unsuccessful despite a large ε (see Fig. 5.1d). The big advantage in robustness over ResNet18 on CIFAR10 can be ascribed to the simplicity of handwritten digits in comparison to natural images.

While all classes usually appear similarly vulnerable to attacks, we can identify preferred targets, i.e. labels that are assigned more often to perturbed samples than others. This behavior is generally expected, since some classes are more similar than others. E.g. on MNIST, 3’s and 6’s share image

(13)

(a) Color-coding by target label (b) Color-coding by true label

Figure 5.2: Separability of successful adversarial attacks on a ResNet18 with standard training on CIFAR10, visualized by t-SNE dimensionality reduction. Color-coding by the label assigned to the perturbed sample in (a) and by the true label of the perturbed sample in (a).

elements with 5’s (the bottom arc), which is why even in a perfectly human-aligned classifier, it should take only small changes to flip the label from 3 to 5.

Source classes on MNIST all have different preferred targets that are somewhat interpretable due to semantic proximity. For a standard ResNet18 on CIFAR10 however, all classes share the same preferred targets: More than 50% of all perturbed samples are labeled as bird or frog, while other labels (e.g. horse, ship) rarely occur at all. Moreover, preferred targets can generally not be interpreted from a semantic similarity perspective (with a few exceptions).

This observation corroborates the findings described in [17] which suggest that in standard classifiers dominant labels occupy large areas of the decision landscape and are therefore in close reach of every sample. This idea is going to be revisited later on in Sec. 7.1.1 and 7.2.3.

Adversarial training strongly mitigates the effect of dominant preferred targets, as Fig. 5.1c illustrates. In the robust ResNet18, many strong connections between labels of original and perturbed samples seem well justifiable: automobiles should be easily transitioned into trucks, airplanes into ships and cats into dogs. Flipped decisions from perturbations are often interpretable on this clas- sifier because they introduce semantically meaningful changes. See e.g. attacks in Fig. 4.1d, where antlers are added to a horse and Fig. A.2, where airplane wings are added to a ship. This observation confirms the notion that robust models use more human-aligned features, as discussed in [22]. It is particularly visible in the shown attacks because of the high norm of allowed perturbation.

5.2 Separability of Perturbations

In standard classifiers, the collection of adversarial noise is very easy to separate by the target of the attack, i.e. the label assigned to the perturbed sample. This property can be visualized by generating a t-distributed Stochastic Neighbor Embedding (t-SNE, [25]), reducing the set of perturbations to two dimensions. Color-coding the t-SNE plot by the target of attacks exposes quite distinct clusters that are shown in Fig. 5.2a for a standard ResNet18 on CIFAR10. Note, that the size of clusters is highly different and reflects the distribution of preferred targets seen in Fig. 5.1a, such that the dominant classes have particularly large clusters.

(14)

(a) Standard LeNet on MNIST (b) Robust LeNet on MNIST (c) Robust ResNet18 on CIFAR10 Figure 5.3: Separability of successful adversarial attacks on different models, visualized by t-SNE dimensionality reduction. Color-coding by labels assigned to perturbed samples.

In contrast this, color-coding data points by the true class of perturbed samples appears much less significant for separation (see Fig. 5.2a). This establishes the assumption, that regardless of the original class, perturbations that move a sample into the same target decision region have similar properties.

This effect is similarly visible in attacks on a standard LeNet for MNIST (see Fig. 5.3a). The visual assessment is confirmed by training a simple classifier on perturbations and their labels (targets). Fitting a linear SVM lets one predict the target of an attack given the perturbation with over 97% and 92% for CIFAR10 and MNIST respectively. That means that we can predict the label of a perturbed sample just by looking at the noise and not the sample itself. For reference, t-SNE plots of the original data sets have been in included in Fig. A.1. They show that on MNIST the separation of elements is generally easy, which is why the high separability of perturbations might not seem special. CIFAR10 however shows very few clusters if any in a t-SNE.

The separability of perturbations by target, regardless of the original class, suggests that the directions in which specific decision boundaries are situated around a sample are very correlated. This implies that a target class is reachable from any sample in the dataset by moving in a target-specific direction or subspace. The authors of [17] have explained that there exist correlated directions that move any sample out of its original decision region within a small distance. Therefore they could compute universal perturbations, that fool the classifier about decisions on any sample. The separability of perturbations hints at a possibility to extend this idea to targeted universal perturbations.

In Sec. 7 we will investigate whether this is feasible and how large these sample-agnostic, targeted perturbations need to be to flip decisions.

Furthermore, this described property gives rise to the idea that all attacks leading to the same target label share important features and might “live” in different subspaces. If that is true, we might be able to extract class-specific discriminative information from all attacks, using the following intuition: If there are distinguishable features whose presence lead the classifier to a certain classification decision, these features might be the discriminative key-features for that class. This idea will be examined in detail in Sec. 6.

In adversarially trained versions of both classifiers, the separability property does not hold:

individual directions from a sample to a new classification region are not as similar as in standard networks. Adversarial examples on robust models thus seem to be more diverse regarding the

(15)

features that they add to or remove from perturbed samples. Comparing the results of the following experiments for standard and robust models will shed more light on this difference.

5.3 Principle Components of Perturbations

The previous section has shown that perturbations that cause a classifier to assign a certain target label when added to the classifier. In the following we are going to study these subsets of perturbations by performing a singular value decomposition on each of them. Fig. 5.4 shows the first principal components and the distribution of singular values for example classes. The components of every target class and different settings can be found in Fig. A.4 - A.7.

In the following, these principle components will be referred to as {t0, t1, ...t_D}. t0 is the component corresponding to the highest singular value (i.e. the most dominant) and tD to the lowest, where D is the dimensionality of samples, i.e. D = 784 for MNIST and D = 3072 for CIFAR10.

Since these components are target-specific, a second subscript is used to denote the target class (e.g.

the top right element in Fig. A.6 is referred to as t⁰,airplane). The subspace that is spanned by the top x components of class airplane is named Tx,airplane.

The composition of attacks are strongly dominated by the first few components, especially on classifiers with standard training. For CIFAR10, the first respective component has the biggest influence on the distribution by far, while for MNIST several components have about similar influence.

In both datasets, 60% of elements of the classification space have no relevance at all. They represent subspaces in which adversarial examples are found very rarely.

In [17] and [18], the authors described that the directions of adversarial attacks are strongly correlated, i.e. the first few principal components are very relevant, while the later ones are not at all.

While confirming this observation for all adversarial attacks (see Fig. A.3), it can be extended to the subsets of attacks that have the same target.

The principal components for attacks on MNIST often display elements that are recognizable to humans. Most perturbed samples labeled as 8’s contain a cross in the center, which is in alignment with human understanding. In the case of class 5 (see Fig. A.4f) it appears intuitive that the top right dash is a discriminative element for this class, as usually only (handwritten) 5’s contain it. On adversarially trained networks, these elements become even more human-aligned, usually contain- ing a relatively complete “summary” of the respective target class. This indicates that adversarial attacks on robust classifiers comprise more human-aligned changes.

The main elements of attacks on ResNet18 with clean training on CIFAR10 are not at all ex- plainable with a human intuition. Given the much larger diversity of the CIFAR10 dataset, it is not surprising that the principal components of CIFAR10 and MNIST attacks differ in interpretability.

The top components of the adversarially trained CIFAR10 classifier however expose some features that look much more interpretable to the human eye. This observation supports the claim made by the authors of [22] and [9] that adversarially trained classifiers give more attention to human-aligned features.

5.4 Ergodicity of the Attacking Process

Sec. 3 has motivated the choice of calculating multiple attacks on samples to achieve richer insights into the geometry around them. This The following experiment tries to assess to what degree this helps the performed analysis. The previous two sections have shown evidence that in standard classifiers, perturbations leading to the same target are strongly correlated. This poses the question whether it is really necessary to compute multiple attacks on the same samples to extract the

(16)

0 500 1000 1500 2000 2500 3000 Principal component

0 500 1000 1500 2000 2500

Singularvalue

0 20

0 1000 2000

(a) ResNet18, standard training on CIFAR10, target class: airplane

0 500 1000 1500 2000 2500 3000

Principal component 0

500 1000 1500

Singularvalue

0 20

0 500 1000 1500

(b) ResNet18, adversarial training on CIFAR10, target class: airplane

0 200 400 600 800

2000 4000 6000 8000 10000

Singularvalue

0 20

0 5000 10000

(c) LeNet, standard training on MNIST, target class: 8

0 200 400 600 800

2000 4000 6000

Singularvalue

0 20

0 2000 4000 6000

(d) LeNet, adversarial training on MNIST, target class: 8

Figure 5.4: Principal components of all perturbations that when added cause a deep classifier to assign a specific label (target). Left: First eight principal components (Note that all principal components have the same norm of 1 and were normalized for visualization). Right: corresponding singular values. See principal components of all classes in Fig. A.4 through A.7,

(17)

(a) N = 10000, K = 10 (b) N = 10000, K = 10

(c) N = 10000, K = 1 (d) N = 10000, K = 1

(e) N = 1000, K = 10 (f) N = 1000, K = 10

Figure 5.5: Most dominant principal components of attacks with the same target cat on a standard ResNet18 (left) and a robust ResNet18 (right). In every row, PCs were computed from K attacks on N samples.

important properties in perturbations. Moreover, do K attacks on the N samples yield the same information as one a attack on N · K samples?

An intuition on this question can be provided by looking at the first principle components {t0,target, t1,target, ...}. An example for standard and robust ResNet18 on CIFAR10 and ti,cat is provided in Fig. 5.5 for three setups: High N and high K, high N and low K and vice versa. Note that only about 10% of perturbed labels are assigned class cat, i.e. the shown PCs are calculated on 0.1 · N · K perturbations. Regarding the PCs obtained from attacks on the standard classifier, the most dominant element is almost exactly the same in all three setups. The subsequent elements are slightly varied and appear in different order (i.e. the corresponding singular values are different) but they still seem very similar visually.

Since a lot of subsequent experiments in Sec. 6 and 7 are based on subspaces spanned by these components, it is unsurprising that many of the following experiments yield similar results for different N and K.

This leads to the conclusion, that adversarial attacks on standard networks resemble ergodic pro- cesses.

For the adversarially trained model, the principal components are more unlike when changing the number of attacks or samples. This supports the previously mentioned notion that attacks on robust models expose higher variance and are not ergodic.

(18)

6 Principal Components of Perturbations as Discriminative Subspaces

Previous research has shown, that adversarial examples point at curved directions in the classification space. Moreover, the directions of curvature are those that are most discriminative. The results of Sec. 5.3 have shown that per target class, a handful of components in the attacks dominate the collection of perturbations. This leads to the assumption that these components suﬀice to convince a classifier that samples are of a certain class. In this section, we are trying to answer the question whether these dominating components indeed span the subspace of discriminative features for each target class. By discriminative subspace we denote a subset of directions in the input space that are relevant to the classification decision. This implies that all other directions contain information that is less or not at all relevant to the classifier.

The discriminative power of the principal components of adversarial noise is going to be assessed by creating two different types of subspaces:

1. Class-specific discriminative subspaces (see Sec. 6.1) 2. Class-agnostic discriminative subspaces (see Sec. 6.2)

6.1 Class-specific Discriminative Subspaces

Class-specific subspaces are spanned by the respective top x principal components of additive perturbations that cause a classifier to assign a specific label (target). There are three basic intuitive approaches to measure the discriminative power of class-respective subspaces:

1. Keep only the top x components of each samples’ respective class (see Sec. 6.1.1):

If the top x class-specific components are discriminative, keeping only these components should suﬀice to classify samples correctly. This experiment is performed by projecting each sample onto the subspace Tx,target that is spanned by the top x principal components {t0...tx} of perturbations that lead to each target.

2. Remove only the top x components of each samples’ respective class (see Sec. 6.1.2):

If the top x class-specific components are discriminative, removing these components should cause the classifier’s accuracy to decrease significantly. This is achieved by projecting on the orthogonal complement of Tx,target.

3. Remove the top x components of one target class from all samples (see Sec.6.1.3):

If the top x class-specific components are discriminative, removing the discriminative elements of one target class from all samples should hurt the accuracy of this target class exclusively.

For each experiment, the class-wise and overall accuracy with respect to the number of top components used to span the subspaces should be assessed. Moreover, the obtained accuracy-curves need to be compared with reasonable baselines to evaluate the actual “diﬀiculty” of the task.

Despite being intuitively reasonable, the first two of the above-mentioned experiments expose some major issues, which we are going to investigate in the following.

6.1.1 Keeping Class-respective Components

If the top x class-specific components are discriminative, the projection of a sample onto the subspace they span should intuitively suﬀice to still classify it correctly. The example in Fig. 6.1 proves however, that this property basically comes for free in practice: Every sample in MNIST that is

(19)

projected onto the top 10 components of class 8, looks like an 8. This happens, because all hand- written digits contain some energy in places that 8’s also have energy in. It is therefore not surprising that all projected samples are classified as 8’s and therefore not proof of the discriminative nature of the class-respective subspaces. It merely depicts that the projection of every sample onto it’s class-respective top components will naturally yield the original results.

However we can draw a promising conclusion from the projections in Fig. 6.1: Out of the projections of all ten random samples, the projection of a sample 8 seems to contain the most energy. This effect is followed up on in a supplementary experiment in Appendix B.

Label: 0 Label: 1 Label: 2 Label: 3 Label: 4 Label: 5 Label: 6 Label: 7 Label: 8 Label: 9

(a) Original samples

(b) Samples projected on top 10 components of class 8

Figure 6.1: Random samples of every class in MNIST projected onto the class-specific subspace spanned by the top 10 principal components of class 8 as described in Sec. 6.1. Note that regardless of the sample class, the projection resembles an 8 and is therefore classified as such.

6.1.2 Removing Class-respective Components

With the previous insight in mind it seems more promising to remove the class-respective elements from all samples. This is done by projecting them onto the orthogonal complement of suspected discriminative subspaces. Removing information from discriminative subspaces should reduce accuracy drastically. Following our general intuition the classification accuracy should decrease more when removing the top 10% class-respective components of attacks than when removing the same amount of random components, if the first components span a discriminative subspace. Although we see this behavior in MNIST-classifiers, the practice looks different for natural images in CIFAR10:

Since the principal components of attacks are made up of relatively high frequencies, natural images contain few energy in the subspaces spanned by them. Removing these high-frequency elements causes perturbations that are relatively small in norm. When removing the same amount of random components, images are degraded far more (see Fig. 6.2). When contrasting the classification accuracy on images with x class-respective and random directions removed we are therefore comparing images that have very different orders of degradation.

As a consequence, we discard this experiment for assessing discriminativeness.

(20)

Label: ship Img. Norm: 26.78

50 PC Label: ship Diff. Norm: 0.74 Proj. Norm: 26.77

100 PC Label: ship Diff. Norm: 1.05 Proj. Norm: 26.76

250 PC Label: frog Diff. Norm: 1.52 Proj. Norm: 26.74

(a) Class-respective PCs of Attacks

Label: ship Img. Norm: 26.78

(b) Random PCs

Figure 6.2: Random sample of class ship in CIFAR10, projected onto the orthogonal complement of subspaces spanned by different amounts of elements. Associated label, norm of the difference between original and projected sample as well as norm of the projected sample indicated above.

Note that the norm of the perturbation is much higher when removing random elements as when removing principal components of attacks.

6.1.3 Removing Target Components From All Samples

The aforementioned issues are avoided when taking a look at “differential” accuracies: Removing discriminative elements of one target class from all samples should degrade the accuracy of this class, but leave other accuracies intact. Removing any element from images naturally degrades image quality. The question is whether the removal of information from a low-dimensional subspace T_x,targetcan hurt the accuracy of the target class but not degrade other images too much to still be classified correctly. If such subspaces exist for every class, these would necessarily contain discriminative information. In this experiment, we assess whether this is the case for subspaces spanned by {t0,target, ...tx,target}.

For every possible target class, all test samples are divided into target and non-target samples.

Classification performance is evaluated for both groups when removing the top x components of perturbations that lead to the target class. If x components of each respective class span discriminative subspaces, accuracy on target samples should go to zero and accuracy on non-target samples should stay at 100% when removing them. Fig. 6.3 displays the measured average accuracies on target and non-target samples for LeNet on MNIST and ResNet18 on CIFAR10 with each standard and adversarial training.

The best available point of comparison for this experiment is the projection on subspaces that are spanned by the principal components of original samples (see Fig. B.2 and B.1).

Results on MNIST

The elbow points of the blue and orange accuracy curves in Fig. 6.3a indicate that for a LeNet with standard training on MNIST, 30 principal components of perturbations leading to a target class span such subspaces: Removing information from T³⁰,target the accuracy on target samples, while maintaining good accuracy on non-target samples. Removing the principal components of original samples hurts target samples similarly, but also harms non-target samples significantly more.

Fig. 6.4 illustrates this on an image level for target class 5: Subspaces spanned by components of perturbations remove discriminative information selectively, in this case the top right dash of 5’s (also visible in Fig. A.4f). In contrast to that, removing the top 30 components of original samples of class 5 degrades all images more (see Fig. 6.4c), causing a wrong classification on some of them.

Interestingly, although in this example the visible 5 is almost entirely removed, the image is still classified correctly. This shows that discriminative information is still present when removing these elements.

(21)

0 20 40 60 80 100 120 140 Top x components

0.0 0.2 0.4 0.6 0.8 1.0

Classificationaccuracy

Original accuracy PCs of Perturbations, Target PCs of Perturbations, Non-Target PCs of Original Data, Target PCs of Original Data, Non-Target

(a) LeNet, standard training on MNIST

0 20 40 60 80 100 120 140

Top x components 0.0

0.2 0.4 0.6 0.8 1.0

Original accuracy PCs of Perturbations, Target PCs of Perturbations, Non-Target PCs of Original Data, Target PCs of Original Data, Non-Target

(b) LeNet, adversarial training on MNIST

0 100 200 300 400 500 600

0.2 0.4 0.6 0.8 1.0

Original accuracy

PCs of Perturbations, Target PCs of Perturbations, Non-Target PCs of Original Data, Target PCs of Original Data, Non-Target

(c) ResNet18, standard training on CIFAR10

0 100 200 300 400 500 600

0.2 0.4 0.6 0.8 1.0

Original accuracy

PCs of Perturbations, Target PCs of Perturbations, Non-Target PCs of Original Data, Target PCs of Original Data, Non-Target

(d) ResNet18, adversarial training on CIFAR10 Figure 6.3: Classification accuracy over number of class-specific directions removed from every sample, comparison of PCs of data vs. PCs of adversarial attacks. If the first x respective components span a discriminative subspace, target accuracy (blue or green curve) is expected to decrease significantly when removing them, while non-target accuracy (orange or red curve) should stay high.

Note that this is only in the case of LeNet with standard training on MNIST (top left). Note that the maximum number of components are 784 and 3072 for MNIST and CIFAR10 respectively and axes were limited for better visibility

(22)

(a) Original samples

(b) Removed top 30 PCs of attacks with target 5

(c) Removed top 30 PCs of all samples of class 5

Figure 6.4: Removing discriminative information of class 5 by projection on orthogonal complement of discriminative subspace. Top row: Original samples. Middle row: Projection of samples on orthogonal complement of subspace spanned by the top 30 principal components of all additive perturbations that lead to classification decision 5. Note that only the suspected discriminative element (top right dash) is removed. Bottom row: Removing principal components of all original 5’s which hurts classification accuracy less selectively.

In conclusion, the top 30 respective components of attacks leading to a target class span class- specific discriminative subspaces for this network that are of crucial relevance to classification performance. Note, that many of the elements visible in Fig. A.4 are understandably discriminative for their class (e.g. bottom right dash for class 2, top left dash for 7, middle cross for 8). It appears reasonable that a network with standard training relies on these predictive features rather than a human-aligned, holistic set of features comprising all elements of a number. The conducted experiment verifies that the network actually relies on these elements for classification.

The adversarially trained LeNet is not susceptible to the previously observed behavior: Fig. 6.3b shows that principal components of perturbations of a target component cannot be removed without also hurting non-target accuracy. Moreover, removing principal components of original samples is more effective in decreasing target accuracy.

Results on CIFAR10

In both CIFAR10 classifiers, there is almost no difference in accuracy on target and non-target samples. This indicates that any suspected low-dimensional subspaces of class-specific discriminative features cannot be discovered using this setup. There are several conceivable explanations for this.

It is possible that the discriminative features for each class are strongly overlapping: While the presence of feature 1 and 2 might be discriminative for class A and feature 1 and 3 for class B, removing the discriminative elements of class A might also hurt B.

Moreover we have not proven that the collected adversarial samples capture all possible changes of discriminative elements. Sec. 5.3 has outlined that attacks are strongly dominated by a small amount of directions who are commonly exploited in attacks. Although these directions are relevant to change classification decisions it is well possible that they are not spanning the full discriminative subspaces. This would mean that removing target-specific directions does not guarantee classification

(23)

0 20 40 60 Top x components 0.0

0.2 0.4 0.6 0.8 1.0

Original accuracy

PCs of Attacks PCs of Original Data Random Basis

(a)

0 20 40 60

0.2 0.4 0.6 0.8 1.0

Original accuracy

(b)

0 100 200 300

0.2 0.4 0.6 0.8 1.0

Original accuracy

(c)

0 100 200 300

0.2 0.4 0.6 0.8 1.0

Original accuracy

(d)

Figure 6.5: Classification accuracy over number of directions per class used to span class-agnostic, discriminative subspaces, comparison of PCs of data, PCs of adversarial attacks and random directions. Note that the subspace used at every x has 10 · x components, since x is counted per class.

(a): LeNet on MNIST, standard training. (b) LeNet on MNIST, adversarial training. (c) ResNet18 on CIFAR10, standard training. (d) ResNet18 on CIFAR10, adversarial training.

failure in target samples and classification success in non-target samples.

6.2 Class-agnostic Discriminative Subspaces

The previous subsection has discussed the question whether the principal components of perturbations leading to a target class are discriminant elements of a class. In this section, we are trying to answer a similar, yet different question: Can we use the collection of adversarial examples to find one class-agnostic subspace in which networks base their decisions?

There are two intuitive options to span such class-agnostic subspaces: First, we could take the principal components of all adversarial attacks, regardless of their target class (see Fig. A.3). The main drawback of this approach is that the distribution of perturbations leading to the same class is highly unbalanced (see Fig. 5.1); e.g. for attacks on ResNet18 with standard training on CIFAR10, over 50% of attacks lead to target classes bird or frog. This means that the principal components of all attacks together would be skewed towards these classes.

Instead we assemble class-agnostic subspaces from the top x class-respective components of each class, which ensures that all classes are equally represented. Respective top x components are con- catenated into one tensor and an orthogonal basis is calculated from them using singular value decomposition.

If the obtained subspace contains all information that a network considers discriminative, the projection of all samples on this subspace should be enough to achieve original accuracy. Subspaces assembled from principal components of the original data and random components serve as a baseline for this experiment. Accuracies for different amounts of components per class are summarized in Fig. 6.5.

Results are similar for both the standard and robust version of LeNet: Original accuracy is reached at a low number of per-class components of perturbations. In a very similar experiment, the authors of [10] concluded from this result that such an assembled subspace had to span the dis- criminative elements that a network exploits for accuracy. Specifically they inferred, that “the most effective directions of adversarial attack are also the directions that contribute the most to the DCN’s classification performance”. However this is also the case for the baseline, i.e. principal components of original samples. Even when using random components, close to original accuracy is reached at only 30 components per class (i.e. a 300-dimensional assembled subspace, original dimensionality of MNIST is 784).

(24)

Label: 4 Img. Norm: 9.12

1 PC Label: 4 Proj. Norm: 2.29

(a) Standard LeNet on MNIST

Label: truck Img. Norm: 33.37

10 PC Label: cat Proj. Norm: 2.16

50 PC Label: frog Proj. Norm: 3.63

300 PC Label: truck Proj. Norm: 33.31

10 PC Label: deer Proj. Norm: 32.91

150 PC Label: automobile Proj. Norm: 23.35

(b) Standard ResNet18 on CIFAR10

Figure 6.6: Projections of samples on assembled class-agnostic subspaces with classifiation decision and norm of projected sample. Respective top row: Subspaces spanned by respective top x components of each class (x = 5 means class-agnostic subspace is spanned by 50 components). Middle row:

Subspaces spanned by principal components of original samples. Bottom row: Subspaces spanned by random components.

On the standard CIFAR10 classifier, many components of perturbations are needed to accom- plish good accuracy (around 2000 of 3072). For this network, projections on principal components of original data are correctly classified using much less elements. Example projections in Fig. 6.6 illustrate a possible explanation, that relates to the phenomenon discussed in Sec. 6.1.2: Since many adversarial perturbations contain high frequencies, projections on these subspaces have very low norm. This raises the question whether the comparison of accuracies in this setup is actually fair.

Principle components of perturbations on the robust CIFAR10 classifier are generally lower in frequency and thus do not expose this behavior.

Navigating Deep Classifiers

Navigating Deep Classifiers

A Geometric Study Of Connections Between Adversarial Examples And Discriminative Features In Deep Neural Networks

JOHANNES LEONHARD RÜTHER

Contents

1 Introduction

2 Literature Review

2.1 Interpretability

2.2 Robustness

2.3 Geometry of Neural Network Classifiers

3 Motivation

4 Methodology

4.1 Systematic Discovery of Adversarial Directions

4.2 Projected Gradient Descent on Subspaces

4.3 Attacked Datasets and Models

5 Geometrical Properties Derived From Adversarial Attacks

5.1 Attack Statistics And Preferred Target Labels

5.2 Separability of Perturbations

5.3 Principle Components of Perturbations

5.4 Ergodicity of the Attacking Process

6 Principal Components of Perturbations as Discriminative Subspaces

6.1 Class-specific Discriminative Subspaces

6.2 Class-agnostic Discriminative Subspaces