DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Behaviour of logits in adversarial examples: a hypothesis

MARTIN SVEDIN
TROLLE GEUNA

KTH ROYAL INSTITUTE OF TECHNOLOGY

Degree project in computer science, first cycle (examensarbete inom datalogi, grundnivå)
Date: June 12, 2017
Supervisor: Pawel Herman
Examiner: Örjan Ekeberg
Swedish title: Beteendet hos logits för kontradiktorisk indata: en hypotes

Abstract

It has been suggested that the existence of adversarial examples, i.e. slightly perturbed images that are classified incorrectly, implies that the theory that deep neural networks learn to identify a hierarchy of concepts does not hold, or that the network has not managed to learn the true underlying concepts. Previous work has however only reported that adversarial examples are misclassified or the output probabilities of the network, neither of which gives a good understanding of the activations inside the network.


Sammanfattning

It has been suggested that the existence of adversarial examples (kontradiktorisk indata), i.e. inputs with a small perturbation that become misclassified, implies that the theory that deep neural networks learn to identify a hierarchy of concepts is incorrect, or that the network has not learned to identify the correct concepts. Previous articles have, however, only reported that adversarial examples are misclassified, or the probabilities that the network outputs. Neither of these measures gives good insight into the activity inside the network.


Contents

1 Introduction
  1.1 Problem definition
  1.2 Scope
  1.3 Outline
2 Background
  2.1 The Imagenet dataset
  2.2 Deep neural networks
    2.2.1 The GoogLeNet architecture
    2.2.2 Probabilities & Logits
  2.3 Hierarchy of concepts
  2.4 Norms
  2.5 Adversarial examples
    2.5.1 Explanations
    2.5.2 Defenses
    2.5.3 Attacks
  2.6 Other related work
3 Method
  3.1 Environment and model
  3.2 Attack algorithm
  3.3 Adversarial examples
  3.4 Measurements
4 Results
5 Discussion
  5.1 Limitations
  5.2 Conclusion
Bibliography
A Appendix


Chapter 1

Introduction

Deep neural networks have in the last few years achieved good results on different types of machine learning tasks, often tasks involving complex inputs and tasks that are easy for humans but have traditionally been hard for a computer, such as problems dealing with images or sound (LeCun, Y. Bengio, and G. Hinton 2015).

A type of deep neural network called a convolutional neural network is often used for image recognition tasks. Convolutional neural networks have over the last few years significantly improved the state of the art on the dataset from the Imagenet Large Scale Visual Recognition Competition (ILSVRC) (Russakovsky et al. 2015; ILSVRC competition 2012), with some well-known results including Krizhevsky, Sutskever, and G. E. Hinton (2012), Christian Szegedy, Liu, et al. (2015), Simonyan and Zisserman (2014), and He et al. (2015).

One of the tasks supported by the ILSVRC dataset is image classification, where the goal is to classify an image as belonging to one of several pre-defined classes; an example image from the dataset is provided in figure 1.1a. Using the official measure of the competition (referred to as top-5 accuracy¹) the model trained by He et al. (2015) achieved an accuracy of 96.43%, which is similar to some estimates of human accuracy at 94.9% (Russakovsky et al. 2015).

¹The algorithm is allowed to make 5 guesses; if the true label is among those 5 guesses, the algorithm classified the image correctly.
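To make the top-5 criterion concrete, the following sketch (our own illustration with made-up scores, not part of any evaluation code referenced in this report) checks whether the true label is among the five highest-scoring classes:

```python
import numpy as np

def top5_correct(logits, true_label):
    """Count a prediction as correct if true_label is among the 5 largest scores."""
    top5 = np.argsort(logits)[-5:]          # indices of the 5 highest-scoring classes
    return true_label in top5

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)              # made-up scores for 1000 classes
print(top5_correct(logits, int(np.argmax(logits))))   # True by construction
```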

While convolutional neural networks generalize well and perform well on the test set, it turns out that by perturbing an image in a very particular way the output of the network can be changed while keeping the two images so similar that they are indistinguishable to a human; see figure 1.1b for an example. We use the term clean image for the original image and adversarial example for the perturbed image. Christian Szegedy, Zaremba, et al. (2013) were the first to observe this phenomenon for convolutional neural networks and provided an algorithm for generating adversarial examples. Furthermore, they observed that adversarial examples generated for one network often are misclassified by other networks, even when the networks are trained on a different dataset or use a different architecture.

Figure 1.1: a) Image from ILSVRC competition (2012). b) Adversarial example which our model classifies as belonging to the class "container ship".

Subsequently, several additional attack algorithms have been developed. Some attack algorithms attempt to cause the image to be classified as a specific target class (Christian Szegedy, Zaremba, et al. 2013; Carlini and Wagner 2016) while other attack algorithms only attempt to cause a misclassification (I. J. Goodfellow, Shlens, and C. Szegedy 2014; Moosavi-Dezfooli, Fawzi, and Frossard 2015). Similar phenomena to adversarial examples have also been found for tasks other than image classification, such as reinforcement learning (S. H. Huang et al. 2017) and semantic segmentation (Fischer et al. 2017).

While the reasons deep neural networks work as well as they do are not yet fully understood, there is some evidence that deep neural networks do learn some form of hierarchy of human-interpretable concepts; in particular, activation in certain parts of the network seems to correspond to the presence of some concept in the input (I. Goodfellow, Y. Bengio, and Courville 2016; LeCun, Y. Bengio, and G. Hinton 2015).

1.1 Problem definition

The output of a convolutional neural network used for classification is usually a probability distribution over the possible classes. Previous work on adversarial examples has either reported that the adversarial examples are misclassified, or has reported the output probabilities of the network.

At first sight, the existence of adversarial examples is hard to understand if it is true that the neural network learns to identify human-interpretable concepts in the input. Indeed, for instance I. J. Goodfellow, Shlens, and C. Szegedy (2014) make this argument and suggest that the network has not managed to learn to identify sensible concepts.

However, it turns out that it is possible to cause a misclassification, and even an increase in the output probability for the target class, without actually increasing activations in the network. We elaborate further on this in section 2.2.2. The network architectures that have been used for studying adversarial examples use a softmax layer at the end of the network in order to output a probability distribution over the different classes. The input to the softmax consists of one real number for each class and we refer to these numbers as the logits.

The logits are more closely related to the hierarchy of concepts view than the probabilities and would thus be more suitable for investigating the relation between adversarial examples and the view that the network learns to identify a hierarchy of concepts. In this report we test the following hypothesis:

When comparing the adversarial example and the clean image, the logit of the target class is unchanged or decreasing.

If this hypothesis holds, the misclassification is thus not caused by the network finding more of the target class concept in the image, but rather by the network finding less of the concepts of the other classes.

1.2 Scope

For time reasons we restrict ourselves to investigating only a single network architecture, a slightly modified version of the GoogLeNet architecture by Christian Szegedy, Liu, et al. (2015), further described in section 2.2.1. We also use a single algorithm for generating adversarial examples, as described in more detail in section 3.2.

Furthermore, only a single model of this architecture is investigated and no investigation of how hyperparameters affect the results is done.

1.3 Outline


Chapter 2

Background

This chapter starts with a description of the dataset used. Section 2.2 provides background on deep neural networks, in particular the GoogLeNet architecture and the relation between probabilities and logits. Then we discuss the hierarchy of concepts view. The next section then introduces different ways of measuring the distance between two images. Section 2.5 contains background on adversarial examples. Finally the chapter ends with other related work.

2.1 The Imagenet dataset

The Imagenet dataset, in particular the version from the object localisation task of the ILSVRC competition (2012), contains 1000 categories and the images are of varying size. Some categories are visually quite similar, further adding to the complexity of the task. Figure 2.1 shows some example images from the dataset. The dataset is split into three parts: the training set contains about 1.2 million images, the validation set consists of 50,000 images and the test set consists of 100,000 images. The true labels for the test set are not publicly available.

2.2 Deep neural networks

This section starts with a quick review of deep neural networks in general, then we describe the modified GoogLeNet architecture used in this work, and the section ends with a discussion of the relation between the logits and the output probabilities of the network.


Figure 2.1: Three examples from the Imagenet dataset, belonging to categories "siberian husky", "eskimo dog" and "container ship", respectively.

We review the basics of deep neural networks in the rest of this subsection, however we are not able to give a complete and pedagogical description of the relevant concepts and refer to the book by I. Goodfellow, Y. Bengio, and Courville (2016) for additional details.

For our purposes it is enough to consider feedforward neural networks. A feedforward neural network consists of a set of layers connected together. Each layer is constructed from some inputs, which are either the input to the network or other layers. The word feedforward signifies that there are no cycles in the way the layers are computed (I. Goodfellow, Y. Bengio, and Courville 2016, p. 168).

Each layer may be viewed as a vector in $\mathbb{R}^n$, but as we shall see, it will be more appropriate for our purposes to look at the individual components of these vectors, which in this context are known as units or neurons (I. Goodfellow, Y. Bengio, and Courville 2016, p. 169).

Some layers contain learnable parameters that are updated when the network is trained. We consider three types of layers: fully-connected layers, convolutional layers and pooling layers. Figure 2.2 shows a simple feedforward neural network employing these layer types.

The fully-connected layer with output y and input x is computed by the formula

$$y = f(Wx + b)$$

where $f$ is a function determined by the network architecture known as the activation function and $W$, $b$ are a matrix and a vector, respectively, with learnable parameters (I. Goodfellow, Y. Bengio, and Courville 2016, p. 192). The term fully-connected is used because every component of the input can affect every component of the output.
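As an illustration of the formula above, here is a minimal NumPy sketch of a fully-connected layer; the weight values and the choice of ReLU as activation function are arbitrary assumptions for the example, not taken from the architecture discussed in this report:

```python
import numpy as np

def fully_connected(x, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Compute y = f(Wx + b); f defaults to ReLU as an example activation."""
    return f(W @ x + b)

x = np.array([1.0, -2.0, 0.5])        # input vector
W = np.array([[0.2, -0.1, 0.4],       # learnable weight matrix (2 x 3)
              [0.0,  0.3, -0.5]])
b = np.array([0.1, -0.2])             # learnable bias vector
print(fully_connected(x, W, b))       # one output value per unit in the layer
```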


Figure 2.2: A simple feedforward neural network with an input layer, a 1-D convolutional layer, a max pooling layer and a fully connected layer. For the fully connected layer there is a weight associated with every arrow. The max pooling layer has no learnable parameters. For the convolutional layer some of the weights are shown; all arrows that point in the same direction share the same weight.

The convolutional layer instead has sparse connections: a certain output is not affected by most inputs, and the learnable parameters are shared between different neurons (I. Goodfellow, Y. Bengio, and Courville 2016, pp. 331-337). In figure 2.2 we display a 1-D convolutional layer; in the rest of this text we will be referring to 2-D convolutional layers where the neurons are arranged in a 2-D grid instead of a line.

Finally, the pooling layer is similar to the convolutional layer in that it employs sparse connections, except that it has no learnable parameters; instead a function fixed by the network architecture maps the inputs to the outputs.

2.2.1 The GoogLeNet architecture

In a traditional architecture a convolutional layer might be followed by a max pooling layer, where these layers are connected in series. With an inception module one could instead connect these two layers in parallel and have them at the same depth. The inception module used in the GoogLeNet architecture is shown in figure 2.3.

Figure 2.3: An inception module with 1x1 convolution, 3x3 convolution, 5x5 convolution and max pooling layers. Additional 1x1 convolutions have been added to manage time complexity and reduce output size.

The GoogLeNet architecture starts with three convolutional layers and two max pooling layers in series. In the original version of the architecture there are also two local response normalization layers in this part of the network; however, they are not present in the version used in this report. After this initial part of the network, the bulk of the network follows in the form of 9 inception modules and two 3x3 max pooling layers connected in series. The network ends with an average pooling layer followed by a fully connected layer, which creates the logits, and finally a softmax layer. The full architecture is shown in figure 2.4.

2.2.2 Probabilities & Logits

Figure 2.4: The full GoogLeNet architecture: the input, an initial series of convolutional and max pooling layers, nine inception modules interleaved with max pooling layers, a 7x7 average pooling layer, a fully connected layer and a softmax layer.

The softmax layer at the end of the network converts the logits into a probability distribution over the classes. If we let $l_i$ denote the logit for class $i$ and $p_i$ the output probability for class $i$, then

$$p_i = \frac{e^{l_i}}{\sum_j e^{l_j}}$$

Firstly, we note that going from logits to probabilities or vice versa preserves order. This follows since the exponential function is strictly increasing and $\sum_j e^{l_j} > 0$, which gives

$$l_i < l_j \iff e^{l_i} < e^{l_j} \iff \frac{e^{l_i}}{\sum_k e^{l_k}} < \frac{e^{l_j}}{\sum_k e^{l_k}} \iff p_i < p_j$$

Thus the class with the largest logit will also be the class with largest probability, and vice versa.

Secondly, we note that some information is lost when going from logits to probabilities. That is, there are many sets of logit values giving the same probabilities. In particular, say we are given the probabilities for all classes and the logit value for some class $a$, i.e. $l_a$; then we can recover the normalizing constant as $\sum_j e^{l_j} = e^{l_a} / p_a$. From this the rest of the logit values follow, i.e., for an arbitrary class $i$ we have $l_i = \ln(p_i \cdot \sum_j e^{l_j})$.

To exemplify how different logit values can give rise to the same probabilities, consider the case with three classes and probabilities

$$p_1 = 0.8, \quad p_2 = 0.1, \quad p_3 = 0.1$$

Then one possible set of logit values is

$$l_1 = 10, \quad l_2 = 7.9, \quad l_3 = 7.9$$

but another possible set is

$$l_1 = 2, \quad l_2 = -0.07, \quad l_3 = -0.07$$

Thus from knowing the output probabilities of the network it is not possible to determine the activations of the logits.
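This loss of information is easy to verify numerically. The sketch below (our own check, not code from the experiments) applies the softmax formula to the two logit vectors from the example and also confirms that adding a constant to all logits leaves the probabilities unchanged:

```python
import numpy as np

def softmax(logits):
    """p_i = exp(l_i) / sum_j exp(l_j), with the usual max-shift for numerical stability."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

l_a = np.array([10.0, 7.9, 7.9])
l_b = np.array([2.0, -0.07, -0.07])
print(softmax(l_a))                                    # approximately [0.80, 0.10, 0.10]
print(softmax(l_b))                                    # nearly the same probabilities
print(np.allclose(softmax(l_a), softmax(l_a + 50.0)))  # True: a constant shift changes nothing
```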

2.3 Hierarchy of concepts

In the hierarchy of concepts view, individual neurons seem to be responsible for identifying the presence of human-interpretable concepts in the input. We are aware of two broad sets of works in this area. The first set is based on finding those images in the dataset that cause a certain neuron to activate. When the images in the dataset that cause a neuron to activate have been found, the simplest approach would be to manually inspect the images for any commonality, and in this fashion try to understand what seems to cause the neuron to activate.

However, some additional techniques have been developed in order to understand more precisely what part of the input causes the activation. One idea by Zeiler and Fergus (2013) is to cover parts of the input with a grey square. By moving the square around and observing how the activation in the neuron changes one can understand what part of the input is important for the activation. This basic idea was later refined by Zintgraf et al. (2017).

Zeiler and Fergus (2013) also present a different, more technical approach for visualizing what part of the input image is responsible for the activation. It is based on what they call a deconvolutional neural network. This allows them to create an approximate inverse of the convolutional neural network and in this way visualize what part of the input was important for causing the activation. Simonyan, Vedaldi, and Zisserman (2013) show that calculating the gradient of the neuron with respect to the network input can be used in a similar way to the deconvolutional network.

The second set of works has focused on generating artificial images that cause activation of a certain neuron. These works do not seem to employ any additional techniques for understanding what part of the generated image is causing the activation, and are instead intended to be used with manual inspection of the images (Simonyan, Vedaldi, and Zisserman 2013; Nguyen et al. 2016).

In both these sets of works we note two ideas. First, the recurring theme is to look for maximal activation in a neuron. Second, the assumption that each neuron can be interpreted independently: the activation of a single neuron is enough and one does not have to consider the activations in all the neurons of a whole layer.

Christian Szegedy, Zaremba, et al. (2013) describe an experiment where they identify those images in the test set which cause maximal activation, not of a single neuron, but of a randomly chosen direction in the space of the neurons in a single layer. They observe that these images seem to possess the same kind of interpretability as those images that cause a single neuron to be activated. We are not aware of any follow-up work on this approach, however.

2.4 Norms

There are several different norms for measuring the distance between two images. Most previous work on adversarial examples has used some form of ℓp-norm. In particular the ℓ0-norm, the ℓ2-norm and the ℓ∞-norm have been used. In this subsection we review these norms.

The ℓ2-norm corresponds to the standard Euclidean distance. The norm is given by

$$\|x\|_2 = \sqrt{\sum_i x_i^2}$$

where we flatten the image to a one-dimensional vector and thus take the sum over all pixels and color channels.

Even though the ℓ2-norm may be familiar, in terms of images it is not always the most suitable norm, since images that humans consider visually similar can have a similar ℓ2-distance as images that are easily distinguishable for a human; see figure 2.5 for an example.

Figure 2.5: The three images on the right have a similar ℓ2-distance to the image on the left. Original image licensed under CC-Attribution by Flickr user Zengame. From left to right: original image, shifted image, black square, darkened.

Another common way to measure distance is the ℓ∞-norm, which is defined by

$$\|x\|_\infty = \max_i |x_i|$$

Again the image is flattened to a one-dimensional vector, which one then takes the maximum over. This is a pessimistic measure: if even a single pixel differs significantly between the two images, the norm will be large. Thus there are images with a large ℓ∞-norm that seem visually similar to humans; on the other hand, if the norm is small, we know that all pixels have similar values in the two images.

Lastly, some previous work has considered the ℓ0-norm¹. The ℓ0-norm only counts the number of pixels that have been changed, but is not affected by the size of the change in those pixels. If we let $0^0 = 0$ then the norm is given by

$$\|x\|_0 = \sum_i x_i^0$$

In terms of images this norm allows there to be large and easily distinguishable differences between two images, but only in a small part of the image.

2.5 Adversarial examples

Christian Szegedy, Zaremba, et al. (2013) showed that while neural networks are stable against random noise, deep neural networks possess the undesirable property that for most images there exist perturbations that, when applied to the image, cause the resulting image to be misclassified, i.e. for most images there exist adversarial examples. Carlini and Wagner (2016) sharpen these results and claim a 100% success rate for creating adversarial examples while making the minimal possible change in the ℓ∞-norm.

Furthermore, Christian Szegedy, Zaremba, et al. (2013) showed that adversarial examples can transfer between different models, i.e. an adversarial example created for one neural network is often adversarial for other neural networks. This holds even if the networks use different architectures or were trained on different datasets. Moosavi-Dezfooli et al. (2016) later showed that there exist universal adversarial perturbations, i.e. perturbations that when applied to almost any image cause the image to be misclassified.

¹As defined here the ℓ0-norm is not a norm as the term is usually defined in mathematics.

2.5.1 Explanations

It is currently poorly understood why a small change in the input can result in a large change in the output of a neural network in this fashion. Christian Szegedy, Zaremba, et al. (2013) focused on the non-linear behaviour of deep neural networks and suggested that adversarial examples are similar to the rationals viewed as a subset of the real numbers, i.e. a subset with small measure but which is still dense in the whole space.

I. J. Goodfellow, Shlens, and C. Szegedy (2014) have instead suggested that adversarial examples are caused by excessive linearity in deep neural networks, pointing out that linear behaviour has been considered desirable in order to simplify the training of the network.

The linearity hypothesis has however recently been challenged by Tanay and Griffin (2016) who argue that the problem is related to overfitting and that the decision boundaries in image space make a very small angle with the submanifold of real images.

2.5.2 Defenses

The most obvious way to try to defend against adversarial examples is to increase the robustness of the network. One idea that has been attempted is to include adversarial examples in the training process; while this improves the robustness, it has not solved the problem.

Papernot et al. (2015) introduced the technically rather complex defence called defensive distillation. This seemed initially promising, however the defence was later broken by the new attack introduced by Carlini and Wagner (2016).


2.5.3 Attacks

Several different attacks have been introduced, and in this section we review those that are most relevant for our purposes.

There are several different properties of an attack that are of interest, such as what the exact goal of the attack is. An attack can either be targeted, i.e. trying to cause the adversarial example to be classified as a certain class, or untargeted, where the goal is just to cause any misclassification. Another property of the attack algorithm is which norm the algorithm uses for measuring the size of the perturbation. Lastly, most attack algorithms are iterative and take several steps. The number of steps is another property of interest.

We use the following notation: $x_{clean}$ denotes the clean image, $\delta x$ denotes the perturbation and $x_{adv}$ denotes the adversarial example. We thus have $x_{adv} = x_{clean} + \delta x$. The true class is denoted $y_{true}$ while the target class is denoted $y_{target}$. Finally we use $J(x, y)$ for the loss function used during training of the network for an image $x$ and class $y$.

L-BFGS

The algorithm by Christian Szegedy, Zaremba, et al. (2013) is a targeted attack that minimizes the ℓ2-norm. The L-BFGS optimization algorithm (Nocedal and Wright 2006) is used as a substep. The attack algorithm repeatedly solves the following optimization problem:

$$\text{minimize} \quad c \cdot \|\delta x\|_2 + J(x_{adv}, y_{target})$$

where $c > 0$ is selected using a line search to be the smallest² value which causes the classification of $x_{adv}$ by the network to be $y_{target}$.

Fast Gradient Sign Method

I. J. Goodfellow, Shlens, and C. Szegedy (2014) introduced a simpler, but untargeted, attack. The attack computes the gradient of the loss function with respect to the input image just a single time. The adversarial example is created as follows:

$$x_{adv} = x_{clean} + \alpha \, \text{sign}(\nabla_x J(x_{clean}, y_{true}))$$

where $\alpha$ is a tunable parameter controlling the size of the step.
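A minimal sketch of the fast gradient sign step in the notation above. The function loss_gradient is a hypothetical placeholder for ∇xJ(x, y), which in practice would be computed by the network framework; everything else follows the formula directly:

```python
import numpy as np

def fgsm(x_clean, y_true, loss_gradient, alpha):
    """Untargeted fast gradient sign step: x_adv = x_clean + alpha * sign(grad)."""
    grad = loss_gradient(x_clean, y_true)       # hypothetical: returns dJ/dx
    return x_clean + alpha * np.sign(grad)

# Toy usage with a fake gradient function standing in for the real network.
fake_grad = lambda x, y: np.ones_like(x)
x_clean = np.zeros((224, 224, 3))
x_adv = fgsm(x_clean, y_true=42, loss_gradient=fake_grad, alpha=1.0)
print(np.max(np.abs(x_adv - x_clean)))          # perturbation size equals alpha
```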

²This is how it is described in Christian Szegedy, Zaremba, et al. (2013).

Iterated (Targeted) Fast Gradient Sign Method

The fast gradient sign method was later extended by Kurakin, Ian J. Goodfellow, and S. Bengio (2016) to an iterated version. We use x(i)

for the image at iteration i. If we run the algorithm for n iterations we thus have xclean = x(0) and xadv = x(n). The image is then updated

according to:

x(i) = Clipxclean,(x (i−1)

+ α sign(∇xJ(x(i−1), ytrue)))

They also provide a targeted version, which updates the image accord-ing to

x(i) = Clipxclean,(x (i−1)

− α sign(∇xJ(x(i−1), ytarget)))

Again α is a tunable parameter controlling the size of each step, the parameter  provides an upper limit on the `∞-norm of the

perturba-tion.
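The iterated targeted update can be sketched in the same style. Again loss_gradient is a hypothetical stand-in for the gradient computed by the network; the clipping keeps every pixel within ε of the clean image and inside a valid intensity range. The [0, 255] range and the default step settings are our assumptions for the example, loosely mirroring the settings described later in section 3.2:

```python
import numpy as np

def iterated_targeted_fgsm(x_clean, y_target, loss_gradient,
                           alpha=1.0, eps=2.0, steps=10):
    """Targeted iterative FGSM: step against the gradient of J(x, y_target),
    clipping to the eps-ball around x_clean after every step."""
    x = x_clean.copy()
    for _ in range(steps):
        grad = loss_gradient(x, y_target)                # hypothetical dJ/dx
        x = x - alpha * np.sign(grad)                    # move towards the target class
        x = np.clip(x, x_clean - eps, x_clean + eps)     # l_inf constraint
        x = np.clip(x, 0.0, 255.0)                       # stay a valid image
    return x

fake_grad = lambda x, y: np.sign(x - 128.0)              # toy gradient for demonstration
x_clean = np.full((224, 224, 3), 100.0)
x_adv = iterated_targeted_fgsm(x_clean, y_target=7, loss_gradient=fake_grad)
print(np.max(np.abs(x_adv - x_clean)))                   # at most eps = 2
```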

Carlini & Wagner

Carlini and Wagner (2016) introduced a set of three attack algorithms, targeting the ℓ0-, ℓ2- and ℓ∞-norms. All attacks are iterative and targeted. Their attacks seem to be the current state of the art when it comes to generating adversarial examples with a small perturbation.

We are not able to provide the full details of their attacks here, but note that they observed that the issue with generating adversarial examples using gradient descent is the box constraint, i.e. the need to constrain the image to lie in a subset of $\mathbb{R}^n$. This is solved by clipping in the Iterative Fast Gradient Sign Method above. They, however, choose a different approach and introduce a variable transform such that the new variable can be an arbitrary real number. The new variables can then be optimized using gradient descent or some variation; they use the Adam optimizer (Kingma and Ba 2014) for performance reasons.
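The change of variables can be illustrated with a small sketch. A commonly used transform of this kind, and to our understanding essentially the one used by Carlini and Wagner, maps an unconstrained variable w to a pixel value in [0, 1] via a scaled tanh, so that gradient descent on w can never leave the valid box; the details below are our own simplification:

```python
import numpy as np

def w_to_image(w):
    """Map unconstrained w in R to a pixel value in [0, 1] via 0.5 * (tanh(w) + 1)."""
    return 0.5 * (np.tanh(w) + 1.0)

def image_to_w(x, eps=1e-6):
    """Inverse transform (arctanh), used to initialise w from the clean image."""
    x = np.clip(x, eps, 1.0 - eps)        # avoid infinities at the box boundary
    return np.arctanh(2.0 * x - 1.0)

x_clean = np.array([0.0, 0.5, 1.0])       # toy pixel values in [0, 1]
w = image_to_w(x_clean)
print(w_to_image(w))                      # approximately recovers x_clean
print(w_to_image(w + 1000.0))             # any step in w still yields values in [0, 1]
```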

2.6 Other related work

Sabour et al. (2015) introduced an attack that targets the intermediate representations of the network rather than its classification. First they choose a guide image and extract the representation for some intermediate layer in the network for this image. This representation will be the target in the algorithm. Secondly, they choose another image and attempt to perturb this image such that the intermediate representation for the perturbed image equals the representation of the guide image. This attack is successful in the sense that the ℓ2-norm between the representation of the guide image and the representation of the perturbed image becomes small.

Chapter 3

Method

This chapter starts with a description of the environment used for running the experiments. Then we discuss our choice of attack algorithm and our process for generating adversarial examples. Finally we discuss how we measure the change in the logits.

3.1 Environment and model

The experiments were conducted on an ordinary desktop computer equipped with a GeForce GTX 970, an AMD Phenom 2 X4 940 processor and 8 gigabytes of RAM. The operating system used was Ubuntu 16.04.2 LTS with Nvidia's proprietary driver version 378.13.

The software used for our experiments was written using the TensorFlow library (Abadi et al. 2016). TensorFlow is a library for constructing and evaluating computational graphs, with good support for executing parts of the graph on a GPU. The library mostly targets machine learning applications, but other types of software could also be written using it. TensorFlow was originally created by Google and released as open source on November 9, 2015. To connect and work with TensorFlow we used the Python API.

Besides TensorFlow we used the TF-Slim library for constructing the network architecture as well as their code for preprocessing images prior to feeding them to the network.

The TF-Slim library also provides pretrained models for several common network architectures, ready to use with their library. We used their pretrained model for the GoogLeNet architecture dated 2016-08-28.


3.2 Attack algorithm

Since we investigate adversarial examples in the context of the hierarchy of concepts, it is important that the clean image and the adversarial example are visually similar to a human, i.e. they should contain the same concepts. By choosing an attack that minimizes the ℓ∞-norm we can avoid visually inspecting the created adversarial examples; the norm being small enough is sufficient to guarantee that the images look indistinguishable.

Since the Imagenet dataset contains a large number of visually similar classes, an untargeted attack would not be suitable. For example an image of an "eskimo dog" might be misclassified as a "Siberian husky", see figure 2.1, and these two classes would seem to share many concepts to a human.

Given these constraints we are left with the Iterative Targeted Fast Gradient Sign Method and the ℓ∞ attack by Carlini and Wagner (2016). We choose the former, in large part because of increased performance and ease of implementation.

We choose the target class uniformly at random among all classes except the top-5 predictions by the network on the clean image. The algorithm is run for 10 steps and we select the ε parameter of the algorithm such that, when the pixel intensities are represented in the range [0, 255], the ℓ∞-norm of the perturbation is at most 2.

3.3 Adversarial examples

In all our experiments we create adversarial examples from the validation set of the ILSVRC 2012 dataset. Since the images in the Imagenet dataset are of varying size we need to generate a set of clean images of the size required by the network before we can generate adversarial examples. This is done using the evaluation version of the preprocessing code provided by TF-Slim for our network. This code performs a center crop of 87.5% of the height and width, i.e. for a 400x200 pixel image we would crop the piece of size 350x175 located at the center of the image. Finally, this crop is resized to a 224x224 pixel image, which is the size of the input to the network.
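A rough NumPy sketch of this crop-and-resize step (our own approximation; the actual experiments used the TF-Slim preprocessing code, and the nearest-neighbour resize below is a simplification of whatever interpolation that code performs):

```python
import numpy as np

def center_crop_and_resize(image, crop_fraction=0.875, out_size=224):
    """Crop the central crop_fraction of height and width, then resize to out_size."""
    h, w, _ = image.shape
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = image[top:top + ch, left:left + cw]
    rows = np.arange(out_size) * ch // out_size   # nearest-neighbour resize,
    cols = np.arange(out_size) * cw // out_size   # a simplification of the real code
    return crop[rows][:, cols]

print(center_crop_and_resize(np.zeros((400, 200, 3))).shape)   # (224, 224, 3)
```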

We discard those clean images where the prediction of the network does not agree with the true label. From the remaining clean images we repeatedly sample images with replacement and apply our attack algorithm, which is described above in section 3.2.

The resulting image produced by the attack algorithm is then rounded to a valid image with integer values in the range [0, 255]. We are only interested in successful attacks and only keep those adversarial examples where the predicted class equals the target class.
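Expressed as code, the rounding and filtering could look roughly as follows, assuming hypothetical attack and predict_class functions that wrap the attack algorithm and the network:

```python
import numpy as np

def generate_adversarial(x_clean, y_target, attack, predict_class):
    """Run the attack, round to a valid integer image in [0, 255], and keep the
    result only if the network now predicts the target class."""
    x_adv = attack(x_clean, y_target)
    x_adv = np.clip(np.rint(x_adv), 0, 255).astype(np.uint8)
    return x_adv if predict_class(x_adv) == y_target else None

# Toy usage with stand-in attack and classifier functions.
toy_attack = lambda x, y: x + 1.7
always_target = lambda x: 7
print(generate_adversarial(np.zeros((2, 2, 3)), 7, toy_attack, always_target).shape)
```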

3.4 Measurements

The simplest way to test our hypothesis would be to look at the direction of change of the logit for the target class. However, this may be too simple a measure, since it is possible for the increase in the logit of the target class to be small compared to the decrease of other logits that had a large value on the clean image. In this case one could still argue that the hypothesis holds.

We are not aware of any way to capture the change in a possibly very large number of logits, and instead we propose to compare the change of the logit for the target class with the logit for the true class. We feel this is a good measure both because the true class has some intrinsic meaning for the image, and because we are only looking at cases where the clean image is handled correctly by the network and thus the logit for the true class is the logit with the largest value.

We investigate the cases where the logit of the target class increases and the logit of the true class decreases further by making two sets of measurements. In order to describe these measurements in more detail we introduce the following notation: we let $l_i^{clean}$ be the logit for class $i$ on the clean image, we let $l_i^{adv}$ be the logit for class $i$ on the adversarial example, and finally we let $\Delta l_i$ be the difference in the logit for class $i$ between the adversarial example and the clean image, that is, $\Delta l_i = l_i^{adv} - l_i^{clean}$.

First we look at the ratio between the increase in the logit for the target class and the difference between the logit for the true class and the logit for the target class on the clean image, i.e.

$$\frac{\Delta l_{target}}{l_{true}^{clean} - l_{target}^{clean}}$$

Second, we compare the increase in the logit for the target class with the decrease in the logit for the true class using the following expression, where arctan denotes the two-argument arctangent expressed in degrees:

$$\arctan(\Delta l_{target}, -\Delta l_{true})$$
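Both measurements are simple functions of three logit values per image. The sketch below (our own, with made-up logits) computes the ratio and the angle; the orientation of the two-argument arctangent, measured in degrees from the Δl_target axis, is our reading of the formula and should be treated as an assumption:

```python
import numpy as np

def ratio(l_target_clean, l_target_adv, l_true_clean):
    """Increase of the target logit relative to its initial gap to the true logit."""
    return (l_target_adv - l_target_clean) / (l_true_clean - l_target_clean)

def angle_deg(delta_l_target, delta_l_true):
    """Angle of the point (increase in target logit, decrease in true logit),
    measured in degrees from the delta_l_target axis."""
    return np.degrees(np.arctan2(-delta_l_true, delta_l_target))

# Made-up logits: true logit falls from 12 to 9, target logit rises from 4 to 13.
print(ratio(l_target_clean=4.0, l_target_adv=13.0, l_true_clean=12.0))   # 1.125 > 1
print(angle_deg(delta_l_target=9.0, delta_l_true=-3.0))                  # about 18 degrees
```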


Chapter 4

Results

Table 4.1 shows the direction of change for the logit of the true class and the logit of the target class. In no case did either logit remain completely unchanged. We note in particular that in no case did we see a decrease in the logit of the target class. In a few cases both logits increased, with the increase in the logit of the target class being so large that it still overtook the logit of the true class. Most of the images however belong to the ambiguous case where the logit of the true class decreased while the logit of the target class increased.

Logit change                                    # Images
True logit increased, target logit increased         631
True logit increased, target logit decreased           0
True logit decreased, target logit increased       79369
True logit decreased, target logit decreased           0
Total images                                        80000

Table 4.1: Number of images when looking at the direction of change between the adversarial example and the clean image.

We investigate the cases where the logit for the target class increased and the logit for the true class decreased in more detail. Figure 4.1 shows the result of looking at the ratio between, on the one hand, the change in the logit for the target class and, on the other hand, the difference between the logit for the true class and the logit for the target class.

In particular we note that

$$\frac{\Delta l_{target}}{l_{true}^{clean} - l_{target}^{clean}} > 1.0 \iff l_{target}^{adv} - l_{target}^{clean} > l_{true}^{clean} - l_{target}^{clean} \iff l_{target}^{adv} > l_{true}^{clean}$$

This occurs in more than 53% of cases. Another observation is that the ratio is less than 0.5 in less than 5% of cases.

Figure 4.1: Comparing the increase in the logit for the target class to the distance between the logit for the true class and the logit for the target class on the clean image (empirical CDF of the ratio, with a 99% confidence band). The confidence band is not visible in this figure; we provide the two graphs as separate figures in appendix A.

Finally we also compare the change in the logit for the target class against the change in the logit for the true class, using the arctan formula described in section 3.4. The results are displayed in figure 4.2. We note that in less than 11% of cases was the decrease in the logit for the true class larger than the increase in the logit for the target class.

Figure 4.2: Empirical CDF of arctan(Δl_target, −Δl_true), with a 99% confidence band.

Chapter 5

Discussion

For our choice of network architecture and attack algorithm the proposed hypothesis is not a viable explanation for the existence of adversarial examples. We did not find a single case where the logit of the target class decreased, and it does not seem like a viable explanation to claim the increase is very small. In fact, it looks like in a non-trivial number of cases the algorithm may be too successful at increasing the logit for the target class, and the image could be detected as being an adversarial example by the logit being unnaturally large. Some further experiments would have to be performed to confirm this hypothesis, however.

Comparing our results with previous work we note that even detecting adversarial examples is at least partially an open problem. It is thus not unexpected that at least for some attacks the logit for the target class does increase, since otherwise adversarial examples could probably be detected simply by having an unnaturally small logit for the predicted class.

Similarly, Sabour et al. (2015) showed that the intermediate representations in the neural network are vulnerable to attack, which might suggest that it is possible to increase the activation in intermediate neurons. However, since their measure is based on the ℓ2-norm in high-dimensional spaces and the distance is not arbitrarily small, it is not clear to us that their results would automatically imply anything regarding the behaviour of the logits.

Compared to Sabour et al. (2015) we also note that they use L-BFGS, which could run for a large number of iterations. The algorithm used in our work only takes ten steps, and thus even this is enough to cause an increase in the logit for the target class. Finally they use a network architecture with several large fully connected layers at the end, and their best results are for one of the final fully connected layers. In comparison, our network architecture mainly consists of convolutional and max pooling layers with a single small fully connected layer at the end.

For the view that the neural network learns to identify a hierarchy of concepts our results show that, as claimed by e.g. Christian Szegedy, Zaremba, et al. (2013), adversarial examples do seem to pose a challenge.

Since we have shown that causing an increase in the logit of the target class seems easy, a natural continuation of this work would be to investigate whether this holds for arbitrary neurons in the network. Similarly, if the network learns a hierarchy of concepts, one would expect that one of the most important causes of the misclassification is the activation of a previously inactive neuron. It would thus be interesting to investigate whether the attack algorithm manages to activate inactive neurons, or if the misclassification is caused by changing the activations in already active neurons.

5.1 Limitations

This work used a pretrained model. This has some benefits, such as making the results easier to reproduce and lowering the computational demands. However, it also has the drawback that we are not even aware of the hyperparameters used for training our model.

5.2 Conclusion

For the network architecture and attack algorithm investigated here, the hypothesis does not hold: in every adversarial example we generated, the logit of the target class increased compared to the clean image.

Bibliography

Abadi, Martín et al. (2016). "TensorFlow: A system for large-scale machine learning". In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Savannah, Georgia, USA.

Carlini, Nicholas and David Wagner (2016). "Towards Evaluating the Robustness of Neural Networks". In: CoRR abs/1608.04644. URL: http://arxiv.org/abs/1608.04644.

Fischer, Volker et al. (2017). "Adversarial Examples for Semantic Image Segmentation". In: arXiv preprint arXiv:1703.01101.

Goodfellow, I. J., J. Shlens, and C. Szegedy (2014). "Explaining and Harnessing Adversarial Examples". In: ArXiv e-prints. arXiv: 1412.6572 [stat.ML].

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.

He, Kaiming et al. (2015). "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385. URL: http://arxiv.org/abs/1512.03385.

Hendrik Metzen, J. et al. (2017). "On Detecting Adversarial Perturbations". In: ArXiv e-prints. arXiv: 1702.04267 [stat.ML].

Huang, Sandy H. et al. (2017). "Adversarial Attacks on Neural Network Policies". In: CoRR abs/1702.02284. URL: http://arxiv.org/abs/1702.02284.

ILSVRC competition (2012). URL: http://www.image-net.org/challenges/LSVRC/2012/ (visited on 06/05/2017).

Kingma, Diederik P. and Jimmy Ba (2014). "Adam: A Method for Stochastic Optimization". In: CoRR abs/1412.6980. URL: http://arxiv.org/abs/1412.6980.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012). "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., pp. 1097-1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Kurakin, Alexey, Ian J. Goodfellow, and Samy Bengio (2016). "Adversarial examples in the physical world". In: CoRR abs/1607.02533. URL: http://arxiv.org/abs/1607.02533.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). "Deep learning". In: Nature 521.7553, pp. 436-444.

Moosavi-Dezfooli, Seyed-Mohsen, Alhussein Fawzi, and Pascal Frossard (2015). "DeepFool: a simple and accurate method to fool deep neural networks". In: CoRR abs/1511.04599. URL: http://arxiv.org/abs/1511.04599.

Moosavi-Dezfooli, Seyed-Mohsen et al. (2016). "Universal adversarial perturbations". In: arXiv preprint arXiv:1610.08401.

Nguyen, Anh Mai et al. (2016). "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks". In: CoRR abs/1605.09304. URL: http://arxiv.org/abs/1605.09304.

Nocedal, Jorge and S. Wright (2006). Numerical optimization. 2nd ed. Springer-Verlag New York.

Papernot, Nicolas et al. (2015). "Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks". In: CoRR abs/1511.04508. URL: http://arxiv.org/abs/1511.04508.

Russakovsky, Olga et al. (2015). "Imagenet large scale visual recognition challenge". In: International Journal of Computer Vision 115.3, pp. 211-252.

Sabour, Sara et al. (2015). "Adversarial Manipulation of Deep Representations". In: CoRR abs/1511.05122. URL: http://arxiv.org/abs/1511.05122.

Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman (2013). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps". In: CoRR abs/1312.6034. URL: http://arxiv.org/abs/1312.6034.

Simonyan, Karen and Andrew Zisserman (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: CoRR abs/1409.1556. URL: http://arxiv.org/abs/1409.1556.

Szegedy, Christian, Wei Liu, et al. (2015). "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Szegedy, Christian, Wojciech Zaremba, et al. (2013). "Intriguing properties of neural networks". In: CoRR abs/1312.6199. URL: http://arxiv.org/abs/1312.6199.

Tanay, Thomas and Lewis D. Griffin (2016). "A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples". In: CoRR abs/1608.07690. URL: http://arxiv.org/abs/1608.07690.

Wasserman, Larry (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

Zeiler, Matthew D. and Rob Fergus (2013). "Visualizing and Understanding Convolutional Networks". In: CoRR abs/1311.2901. URL: http://arxiv.org/abs/1311.2901.

Zintgraf, Luisa M. et al. (2017). "Visualizing Deep Neural Network Decisions: Prediction Difference Analysis". In: CoRR abs/1702.04595. URL: http://arxiv.org/abs/1702.04595.

Appendix A

Appendix

Figure A.1: Figure 4.1 as two separate graphs: the empirical CDF and the 99% confidence band. The 99% confidence band is calculated using the Dvoretzky-Kiefer-Wolfowitz inequality (Wasserman 2013, p. 124).
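For reference, the Dvoretzky-Kiefer-Wolfowitz band is straightforward to compute: for confidence level 1 − α the half-width around the empirical CDF is sqrt(ln(2/α) / (2n)). The sketch below (our own, on made-up data) illustrates this; it is not the code used to produce the figures:

```python
import numpy as np

def ecdf_with_dkw_band(samples, alpha=0.01):
    """Empirical CDF evaluated at the sorted samples, plus a DKW confidence band
    with half-width eps = sqrt(ln(2/alpha) / (2n))."""
    x = np.sort(samples)
    n = len(x)
    cdf = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    lower = np.clip(cdf - eps, 0.0, 1.0)
    upper = np.clip(cdf + eps, 0.0, 1.0)
    return x, cdf, lower, upper

rng = np.random.default_rng(0)
x, cdf, lo, hi = ecdf_with_dkw_band(rng.normal(size=80000))
print(hi[0] - cdf[0])   # band half-width, about 0.006 for n = 80000 and alpha = 0.01
```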

Figure A.2: Figure 4.2 as two separate graphs: the empirical CDF and the 99% confidence band.
