
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Towards Multi-Scale Visual Explainability for Convolutional Neural Networks

YUE SONG

Date: June 22nd, 2020

Supervisors: Kevin Smith, Nicu Sebe
Examiner: Danica Kragic Jensfelt


Sammanfattning

Explainability methods seek to find visual explanations for neural network decisions. Existing techniques fall mainly into two categories: backpropagation-based methods and occlusion-based methods. The former category selectively highlights the computed gradients, while the latter occludes the input to maximally confuse the classifier and visualize the distinctive regions. Motivated by the occlusion methods, we propose an explainability model which, to our knowledge, is the first attempt to extract multi-scale explanations by perturbing the intermediate representations. Furthermore, we present two visualization techniques that can fuse the multi-scale explanations into a single image, and we propose an evaluation metric to assess the quality of the explanations. Both qualitative and quantitative experimental results on several kinds of datasets demonstrate the effectiveness of our model.


Abstract

Explainability methods seek to find visual explanations for neural network decisions. Existing techniques mainly fall into two categories: backpropagation-based methods and occlusion-based methods. The former category selectively highlights the computed gradients, while the latter occludes the input to maximally confuse the classifier and visualize the distinct regions. Motivated by the occlusion methods, we propose an explainability model which, to our knowledge, is the first attempt to extract multi-scale explanations by perturbing the intermediate representations. Furthermore, we present two visualization techniques that can fuse the multi-scale explanations into a single image, and we suggest a general evaluation metric to assess the explanation's quality. Both qualitative and quantitative experimental results on several kinds of datasets demonstrate the efficacy of our model.


Contents

1 Introduction
  1.1 Research Question
  1.2 Contributions
  1.3 Sustainability and Ethics
  1.4 Outline

2 Background
  2.1 Interpretable Deep Learning
  2.2 Visual Interpretability for CNNs
    2.2.1 Backpropagation-based Methods
    2.2.2 Feature Inverting Methods
    2.2.3 Occlusion-based Methods
    2.2.4 Reliability and Quantification

3 Methods
  3.1 Explainability Model
    3.1.1 Multi-scale Explanations Extraction
    3.1.2 Feature Importance Identification
    3.1.3 Dynamic Hyper-parameter
  3.2 Multi-scale Explanatory Map Fusion
  3.3 Evaluation Metric

4 Experiments and Results
  4.1 Datasets
  4.2 Baseline Models
  4.3 Implementation Details
  4.4 Quantitative Evaluation
    4.4.1 Performance Gap
    4.4.2 Inference Speed
  4.5 Qualitative Evaluation
    4.5.1 Multi-class Image Analysis
    4.5.2 Failure Mode With Gradient Methods
    4.5.3 Fine-grained Image Analysis
    4.5.4 Adversarial Image Analysis
    4.5.5 Fused Explanation Examples

5 Conclusion and Discussion
  5.1 Key Findings
  5.2 Conclusion and Future Work


Chapter 1

Introduction

Convolutional Neural Networks (CNNs) have exhibited superior performance on various vision tasks, including image classification [1–3], object detection [4–6], and instance segmentation [6–8]. Due to their data-driven learning nature, CNNs are often regarded as black boxes with weak interpretability.

Except for the final output, it is difficult for people to understand the logic behind a CNN's decisions. In some critical applications like autonomous driving and medical diagnosis, the poor explainability of deep learning models may cause fatal consequences [9]. If the model is explainable, we can interpret its decisions and identify pathological behaviours. For these reasons, research on explainable deep learning models has recently received increasing attention [10–12].

Over the last years, there have been many attempts to improve CNN explainability by either visualizing the explanations for a model's decision [13–15] or explicitly building some form of explainability into the model [12, 16]. However, explicitly built explainable models like Interpretable CNN [12] require a specific architecture and sacrifice performance. An alternative approach involves visualization techniques that intend to explain a network's decision. Although the CNN architecture resembles the human visual system, the learnt representations do not correlate with human-interpretable features [17, 18], which makes the direct visualization of CNN features infeasible. Disentangling the distinctive CNN features into human-interpretable regions using visual explanations provides a more explanatory understanding of network representations. Given an explanatory mask where the salient regions are highlighted, people can quickly grasp where the network pays attention. It is also preferable to have explanatory maps at multiple scales, so that we understand how the CNN representations differ across scales. Most research focuses on identifying explanations in the image space, and relatively little work has been done on inferring multi-scale explanations.

Broadly, there are two approaches to extracting visual explanations from CNNs. One is to selectively visualize the back-propagated gradients; such approaches are often referred to as backpropagation-based methods [19–21]. This kind of approach enjoys fast inference but often poses architectural constraints. The other, occlusion-based methods, aim to find the informative regions crucial to the network's decisions by occluding part of the input image [14, 22]. Compared to the backpropagation methods, occlusion methods require extra trainable models to extract explanations but usually have better visual appeal.

Current state-of-the-art explainability methods mainly evaluate the explanations by referring to localization accuracy [14], comparing with human-labelled segmentation masks [23], or measuring the impact of explanations on the network's decisions [24–26]. However, localization and segmentation are merely human interpretations. As pointed out in [17, 18], CNNs have not learnt the same representations as humans and may use co-occurring context to make predictions. For instance, only the face instead of the body may aid the classification of a person. Hence, localization and segmentation are not likely to capture the attention of CNNs. Measuring the importance of the identified representations can better quantify the explanation's quality. Nonetheless, existing feature importance metrics [24–26] do not take the variation in the size of explanations into account. Larger explanations typically influence the network's predictions more, so the evaluation should account for the size of the explanations. Furthermore, current metrics only measure the impact of explanations in the image space, where all the evaluations are conducted on the same basis. When it comes to explanations at multiple scales, there exists a disproportionate impact of features across layers, i.e. the features at a certain layer contribute more to the decision than those at other layers. To fairly evaluate multi-scale explanations, this impact discrepancy between layers has to be eliminated.


1.1 Research Question

This thesis studies how multi-scale explanations can be inferred from CNNs by occlusion and how to evaluate multi-scale explanations fairly.

In detail, we are trying to answer the following research question:

How can we use an occlusion-based approach to explain the contributions of individual layers in an arbitrary CNN towards the classification decision? Furthermore, is there a way to fairly measure the performance of visual explanation methods that accounts for variations in the size (support) of the explanation?

1.2 Contributions

This thesis makes several contributions to explainable deep learning. We summarize the main contributions as follows:

• We propose a model-agnostic method that occludes the intermediate representations to extract multi-scale explanations, in order to identify the pathological behaviours behind CNN decisions. To our knowledge, it is the first attempt to infer multi-scale explanations by perturbing the intermediate feature maps.

• We propose a dynamic hyper-parameter to control the regularization strength. The dynamic parameter is automatically tuned depending on the image and makes our model free of any manually tuned settings.

• We propose a general evaluation metric that can fairly measure the impact of the explanation independent of its layer and size.

• To help people quickly grasp where the network's attention lies, we propose two visualization methods that can fuse the multi-scale explanations into a single image.

1.3 Sustainability and Ethics

Our project intends to help people understand how neural network decisions are progressively made. It could also potentially promote the application of neural networks, in particular for critical tasks that need visual explanations. The project complies with the Sustainable Development Goals (SDGs) of the United Nations. More specifically, our work aims to contribute to the following SDGs:

• SDG 9.2: Promote inclusive and sustainable industrialization.

• SDG 9.5: Enhance scientific research, upgrade the technological capabilities of industrial sectors.

As discussed above, this work can have a positive effect on our society. Explainability models will help improve transparency and fairness in automated systems, which can hopefully reduce undesirable biases (e.g. racial or gender bias). However, all technological innovations can be used for unethical purposes, and it is our responsibility to use them ethically. We have not identified any specific unethical use of explainability methods.

1.4 Outline

The rest of the thesis is organized as follows:

• Chapter 2 introduces interpretable deep learning and previous work on visual explainability for CNNs.

• Chapter 3 explains and justifies the methods we use.

• Chapter 4 presents both qualitative and quantitative results, as well as comparison with other baselines.

• Chapter 5 concludes the thesis and highlights some key findings.


Chapter 2

Background

This Chapter provides a brief overview of interpretable deep learning along with a review of the work undertaken in visual explainability for CNNs so far. A brief description of all major methods is provided here, and Chapter 3 will further expand on the techniques used in this thesis.

2.1 Interpretable Deep Learning

Although deep learning has gained the ability to learn rich representations, it suffers from poor interpretability. Except for the network’s output, it is difficult for people to understand the logic behind CNN’s decisions. In recent years, researchers have tried to develop methods to interpret the model’s decisions and identify pathological behaviours. We can roughly divide the work into four kinds of approaches.

• Visual explanations [10, 13–15, 22, 27–29]. These techniques seek to identify the regions of each individual input which are most semantically relevant to the classification. In order to extract interpretable explanations, these methods either visualize the distinctive regions of the input [13, 14, 30] or reconstruct the image given the intermediate representations [10, 15]. We give a detailed introduction of these methods in Sec 2.2.

• Finding concept-based global explanations [31–33]. A recent line of research has focused on providing explanations in the form of high-level human concepts. Zhou et al. [31] decompose the prediction for an image into human-interpretable conceptual components pre-trained from a large concept corpus. Their framework can disentangle the evidence encoded in the prediction and quantify the contribution of each piece of evidence. In [32], the authors introduce the concept activation vector and use it to estimate the relative importance of a concept for the classifier's prediction. However, the two methods mentioned above require hand-labelled examples of a concept and may introduce human biases. More recently, Ghorbani et al. [34] cluster similar segments of the input images to automatically extract visual concepts. These methods aid the understanding of higher-level, human-friendly concepts. In contrast, the backpropagation and occlusion explainability methods [14, 28, 30] focus on identifying features important for each individual input.

• Explicitly building explainability into the model [12, 16]. InfoGAN [16], an extension of the Generative Adversarial Network (GAN) [35], maximizes the mutual information between certain dimensions of the latent representation and the image observation. The latent code learns to control some semantics in an unsupervised manner. Zhang et al. [12] developed interpretable CNNs, which add a regularization loss in deep layers to obtain disentangled representations. The loss is used to regularize a feature map towards the representation of a specific object. However, these methods often require a specific architecture or objective function, which sacrifices model performance.

• Disentangling CNN representations into graphical models [36, 37]. These studies disentangle complex representations and transform them into interpretable graphical models. The explanatory graph [36] considers the intermediate representations as a mixture of patterns and uses a graph to explain the knowledge hierarchies. Zhang et al. [37] proposed a decision tree to encode the decision modes in the fully connected layer and quantitatively explain the logic of CNN decisions. Given an input image, the decision tree can tell which filter is used for the CNN prediction and how much that filter contributes to the decision. Nevertheless, these methods add a large computational burden and do not provide an intuitive understanding.


2.2 Visual Interpretability for CNNs

As the most direct approach for interpretable deep learning, visualization techniques fall into three categories: backpropagation-based methods, feature inverting methods, and occlusion-based methods. In the following paragraphs, we elaborate on each type and end this Chapter by discussing the explanation's reliability and quantification.

2.2.1 Backpropagation-based Methods

In general, the gradient-based methods intend to calculate the model's gradients and selectively visualize them. In [13], the authors approximate the CNNs with linear models using the first-order Taylor expansion:

y ≈ w · x + b (2.1)

where x represents the input image, and w and b denote the weight and bias of the model. During the back-propagation phase, the gradient with respect to each pixel can be considered as its weight for the final decision:

w ≈ ∂y/∂x    (2.2)

Afterwards, the pixels which correspond to the maximum gradients are highlighted. As seen in Figure 2.1, the results are basically a collection of isolated points and are sometimes too noisy to interpret.

Figure 2.1: Some visual examples taken from [13]. The explanations consist of isolated points and are sometimes rather noisy.

Zeiler et al. [27] attach a DeConvNet [30], a network that consists of multiple transposed convolution layers, to the neural network and extract the visual explanations. The transposed convolution layer reverses the normal order of a convolution layer and maps the activations back to the input space. In order to perform the reconstruction through the non-invertible max-pooling layers, their method first requires a forward pass of the network to compute 'switches', the positions of the maxima within each pooling region. These switches are then used in the max-unpooling layers to obtain a discriminative reconstruction. During the back-propagation phase, the definition of the ReLU function is modified as:

ReLU(x, p) = p, if p > 0; 0, otherwise    (2.3)

where x represents the forward activation and p denotes the gradient at the activation location. Springenberg et al. [28] proposed guided backpropagation to address the aforementioned issue. They replace all the pooling layers with convolutions of stride larger than one. This architectural modification gradually reduces the dimension of the feature maps without adding a large number of parameters. Furthermore, the removal of the max-pooling layers greatly reduces the non-linearities of the model. Their method thus provides more interpretable visualizations than previous approaches. They also modify the backward ReLU function to compute the imputed gradients:

ReLU(x, p) = p, if p > 0 and x > 0; 0, otherwise    (2.4)

where x represents the forward activation and p denotes the gradient at the activation location. Figure 2.2 displays a visual example of the three methods. Due to the use of 'switches', DeConvNet [27] is conditioned on the image and does not directly visualize the learned features; the explanations it generates are likely to be noisy depending on the image. In addition to highlighting the regions crucial to the network's decision, these three backpropagation methods [13, 27, 28] can be used for feature reconstruction. Their usage will be introduced in Sec 2.2.2.

Figure 2.2: Visual example of three backpropagation methods using the VGG-16 [2] model. From left to right: input image, mask generated by vanilla gradient [13], map produced by DeConvNet [27], and map generated by guided backpropagation [28]. Because the DeConvNet approach [27] uses 'switches' to record the positions of the maxima during the max-pooling operation, the generated mask is conditioned on the image and is likely to be noisy for some images.
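To make the guided-backpropagation rule of Eq. 2.4 concrete, the following is a minimal PyTorch sketch that imposes it via forward and backward hooks on the ReLU layers of a torchvision VGG-16. The model choice, hook names, weight identifier, and the random placeholder input are illustrative assumptions, not the exact setup of [28].

```python
import torch
import torch.nn as nn
from torchvision import models

def forward_hook(module, inputs, output):
    # Cache the forward activation needed by Eq. 2.4.
    module.saved_output = output.detach()

def guided_relu_hook(module, grad_input, grad_output):
    # Eq. 2.4: pass the gradient only where both the forward activation
    # and the incoming gradient are positive.
    mask = (module.saved_output > 0) & (grad_output[0] > 0)
    return (grad_output[0] * mask,)

model = models.vgg16(weights="IMAGENET1K_V1").eval()
for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.inplace = False                     # in-place ReLU interferes with backward hooks
        m.register_forward_hook(forward_hook)
        m.register_full_backward_hook(guided_relu_hook)

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # placeholder input
score = model(image).max()                                # top predicted class score
score.backward()
saliency = image.grad.abs().max(dim=1)[0]                 # pixel-wise explanation map
```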

Figure 2.3: Illustration of Class Activation Mapping (CAM) [23]. The predicted class score is mapped back to the previous convolutional layer to generate the importance maps. The image is taken from [23].

Zhou et al. [23] propose Class Activation Mapping (CAM) to highlight the class-specific image regions. As seen in Figure 2.3, they first perform global average pooling (GAP) on the last convolutional feature map and pass the pooled features to the fully connected layer. Then the weights of the output layer are projected back onto the convolutional feature maps to produce the class activation map. Based on CAM [23], Grad-CAM [19] combines the forward activations with the gradients flowing back to generate multi-scale visual explanations. In [21], the authors use a stochastic sampling process to efficiently integrate the forward activations and back-propagated gradients. Recently, Rebuffi et al. [20] conducted a thorough analysis of the gradient-based explainable methods and proposed a unified framework under which several such methods can be combined.
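For concreteness, the CAM computation described above can be sketched as follows, assuming the last convolutional features have shape (C, H, W) and the classifier is a single fully connected layer after global average pooling; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """CAM as in [23]: weight the last conv feature maps by the FC weights
    of the target class and sum over channels.

    features:  (C, H, W) activations from the last convolutional layer
    fc_weight: (num_classes, C) weight matrix of the final linear layer
    class_idx: index of the target class
    """
    weights = fc_weight[class_idx]                      # (C,)
    cam = torch.einsum("c,chw->hw", weights, features)  # weighted sum over channels
    cam = F.relu(cam)                                   # keep positive evidence only
    cam = cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
    return cam
```

The resulting map is typically upsampled to the input resolution (e.g. with F.interpolate) before being overlaid on the image.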

Backpropagation explainability methods enjoy the benefit of efficient inference and do not require extra trainable models. However, as discussed in [20], backpropagation-based methods pose architectural constraints and need dedicated, hand-designed rules to combine backward gradients and forward activations for each network component. The designed rules work with most modern CNN architectures (e.g. VGG [2] and AlexNet [1]). But with ResNet [3], one of the most widely used deep models nowadays, the generated explanatory masks may suffer from "checkerboard" artifacts. This implies that the backpropagation-based methods are model-specific and may rely on the classifier choice. We explain this problem using visual examples in Sec 4.5.2.

Figure 2.4: Reconstructed features that contribute most to the network's prediction using different feature inverting methods [13, 27, 28]. We use Grad-CAM [19] to indicate the predicted class for the corresponding explanations. As the modified ReLU in [28] depends on both forward activations and backward gradients, its reconstructions have clearer patterns.

2.2.2 Feature Inverting Methods

The feature inverting methods either try to reconstruct the features that contribute most to the network's decision [13, 27, 28] or aim at reconstructing the image given the learned representations [10, 15]. The former category [13, 27, 28] is mostly used to visualize the important regions, but their initial purpose is to reconstruct the crucial features. Figure 2.4 displays two examples of reconstructed features that contribute most to the class predictions (tiger cat and boxer). As we can see, guided backpropagation [28] provides clearer patterns than the other two methods [13, 27]. The failure mode is mainly because [13] does not constrain the backward gradients, and [27] conditions the feature reconstructions on the image due to the use of 'switches'.

Given the representations, the latter category [10, 15] aims to directly reconstruct the original image from the features. Mahendran et al. [15] use the image representation and natural image priors to compute an approximation of the inverse representation. Given the feature Φ0 and the model f(·), their method tries to find an image x:

x = arg min_x l(f(x), Φ0) + λ · R(x)    (2.5)

where the loss l(·) compares the image representation f(x) to the target Φ0, and R(·) is a regularizer that captures a natural image prior. Instead of optimizing in the representation space, another work [10] directly targets the image reconstruction error. Given a training set of images and their features {xi, Φi}, they train a DeConvNet [30] f(·) with weights w to minimize the reconstruction loss:

ŵ = arg min_w Σ_i (xi − f(Φi, w))²    (2.6)

The DeConvNet [30] f(·) can therefore directly output images from the features. In addition to inverting neural network representations, both methods [10, 15] can also invert traditional hand-crafted features such as SIFT [38] and HOG [39, 40]. Since the objective of feature-inverting methods is to minimize the reconstruction or representation error, both methods generate realistic reconstructions similar to the original image (see Figure 2.5). Nevertheless, the reconstructions do not indicate the contribution of the learnt features, and this kind of approach is seldom used in practice.
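A minimal sketch of the optimization in Eq. 2.5 is given below, using a total-variation term as a stand-in for the natural image prior R(x); the callable model_feat, the step count, and the weighting are illustrative assumptions.

```python
import torch

def invert_features(model_feat, target_phi, image_shape=(1, 3, 224, 224),
                    steps=200, lr=0.1, lam=1e-4):
    """Gradient-based feature inversion in the spirit of Eq. 2.5.

    model_feat: callable mapping an image tensor to its representation f(x)
    target_phi: the representation Phi_0 to invert
    lam:        weight of the total-variation regularizer R(x)
    """
    x = torch.zeros(image_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rep_loss = torch.nn.functional.mse_loss(model_feat(x), target_phi)
        # Total variation as a simple natural-image prior R(x).
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        loss = rep_loss + lam * tv
        loss.backward()
        opt.step()
    return x.detach()
```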

Figure 2.5: Reconstructions from layers of AlexNet [1] with [10] (top) and [15] (bottom). The image is taken from [10].


2.2.3 Occlusion-based Methods

The occlusion-based approach to explainability operates on the assumption that perturbing or removing salient information will degrade the performance of the classifier. Fong et al. [14] propose to apply a perturbation mask m to the image x such that the neural network f(x, m) can no longer perform a correct classification:

min_{m∈[0,1]} λ1 · ||m||_1 + λ2 · Σ_u ||∇m(u)||_β^β + f(x, m)    (2.7)

where the first term controls the mask size, the second term penalizes the gradients of the mask, and the last term measures the performance drop of the classifier. They aim to identify the smallest regions that are most responsible for the classifier's decisions. Furthermore, they assess the impact of various kinds of perturbations, including Gaussian blur, setting to a constant, and adding random noise. The perturbation operations are defined as:

Perturb(x, m) = ∫ g(m) · x_v dv            (Gaussian blur)
                m · x + (1 − m) · µ        (setting to constant)
                m · x + (1 − m) · η(m)     (adding noise)    (2.8)

where g(·) is an isotropic Gaussian kernel, µ is the average color intensity of the image, and η(·) is a Gaussian noise sample. Their experimental results show that Gaussian blur occlusion with a large variance hurts the classification accuracy most. The algorithm is an iterative optimization process and does not require any trainable models. However, the optimization is time-consuming and requires a large number of iterations to reach the objective minimum.
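The three perturbation operators of Eq. 2.8 can be sketched as follows; note that the Gaussian-blur case here composites a uniformly blurred image with the mask, a simplification of the spatially varying blur used in [14], and all names and constants are illustrative.

```python
import torch
import torchvision.transforms.functional as TF

def perturb(x, m, mode="blur", sigma=10.0):
    """The three occlusion operators of Eq. 2.8 (sketch).

    x: image tensor of shape (1, 3, H, W), values in [0, 1]
    m: mask tensor of shape (1, 1, H, W), 1 = keep, 0 = occlude
    """
    if mode == "blur":
        blurred = TF.gaussian_blur(x, kernel_size=51, sigma=sigma)
        return m * x + (1 - m) * blurred
    if mode == "constant":
        mu = x.mean(dim=(2, 3), keepdim=True)      # average color intensity
        return m * x + (1 - m) * mu
    if mode == "noise":
        eta = torch.randn_like(x) * 0.2            # Gaussian noise sample
        return m * x + (1 - m) * eta
    raise ValueError(mode)
```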

In [22], the authors argue that only a subset of the input pixels is sufficient for classification. Based on an encoder-decoder framework, they aim to extract an explanatory mask that satisfies the following two properties:

Smallest sufficient region (SSR): The decoder extracts the smallest explanations to ensure that the perturbed image where the identified region is kept is sufficient for correct classification.

Smallest destroying region (SDR): The decoder extracts the smallest explanations to ensure that the encoder cannot recognize the object in the perturbed image where the identified region is deleted.

The two properties are directly embraced in their objective. Besides, they smooth the explanatory maps by adding a mask variation loss that limits their total variation. Similar to [14], they also use Gaussian blur as the perturbation. As can be observed in Figure 2.6, their method achieves real-time explanation inference and can extract sharp, easily interpretable explanations. Nevertheless, their method has many hyper-parameters, which are often non-trivial to tune and have a substantial impact on the explanation's quality and size. Moreover, the mask variation loss can smooth the map to some extent but suffers from poor explainability.

Figure 2.6: Visual comparison of different methods, taken from [22]. From left to right: input image, explanatory maps generated using [22], explanatory maps produced by vanilla gradient [13], and explanatory masks obtained using [14].

More recently, Zolna et al. [29] argue that distinct classifiers use different parts of the image for their decisions, so explanatory maps extracted from a single classifier may not contain all the classification clues. To obtain more generic masks, they maintain a pool of classifiers. In each iteration, a classifier is randomly sampled and used to determine whether the object can still be recognized in the occluded image. More specifically, they use adversarial training to extract the explanatory map:

L = −lCE(y, f(x ⊙ (1 − m))) + λ · R(m)    (2.9)

Here, f(·) represents the randomly sampled classifier, ⊙ denotes the element-wise product, and R(·) is the regularization term. The property SSR in [22] is excluded, as the informative regions for different classifiers are not consistent. Unlike the aforementioned occlusion methods [14, 22], the explanations are removed by multiplication with the mask instead of Gaussian blur. Their regularization term R(m) is calculated as:

R(m) = ReLU(Mean(y + 1 − yout) − Mean(m))    (2.10)

where yout is the class prediction for the masked image x ⊙ (1 − m). Again, the model relies on the hyper-parameter choices: the regularization strength needs to be carefully tuned for every dataset and classifier. Furthermore, the regularization term compares the mask value with the class prediction, which is a hand-engineered setting rather than a principled choice, and it may not work when applied to other kinds of datasets (e.g. multi-class datasets).

Different from the gradient methods, occlusion methods need trainable models to extract explanations, which brings about the extra issues of choosing a network architecture and tuning hyper-parameters. The hyper-parameters need to be carefully tuned for every dataset and often have a huge impact on the quality and size of the explanations. On the other hand, occlusion methods are model-agnostic and flexible with respect to the classifier. Moreover, the extra trainable model generates explanations by occluding part of the image, which is often a connected patch. Compared to the explanations consisting of isolated points generated by backpropagation methods, their explanations often have better visual appeal.

2.2.4 Reliability and Quantification

Qualitatively, explainability methods often extract visual explanations that seem semantically relevant to the classification. However, several studies reveal that some explainable methods are sensitive to factors irrelevant to the model's predictions, and that visual assessment can be misleading [34, 41, 42]. Kindermans et al. [41] show that simply applying a constant shift to the input may cause significant changes in the explanations of some methods. In [34], the authors also show that explainable methods can be easily confused by input perturbations. [42] introduces the sanity check, a test that measures how sensitive an explainable method is to the model weights. Under the sanity check, explainability methods that produce similar outputs independent of the classifier (trained or not) are not considered reliable.


To quantify the performance, some methods rely on localization performance [14, 20, 22]. Some methods also use the Pointing Game [23] to compare the explanations with pixel-wise segmentation masks. However, as we have discussed in Sec 1.1, localization or segmentation is merely a proxy for human explanation and may not correctly capture what contributes to the network's prediction [17, 18]. More reasonable metrics should focus on measuring the impact of each learned representation on the network's decisions. Kapishnikov et al. [26] measure the Accuracy Information Curve by blurring the image and gradually adding back the explanations. In [24], the authors delete and preserve the explanations to calculate their importance to the network's decisions. [25] estimates the feature importance by re-training the classifier on the perturbed images. However, these metrics fail to handle explanations of different sizes: larger explanations typically have a higher impact on the classifier's decisions, so the evaluation should account for the size of the explanations. Moreover, these metrics evaluate explanations where the perturbations are solely in the image space. For multi-scale explanations, where perturbations are applied to different layers, these metrics cannot eliminate the disproportionate impact of features across layers. Our evaluation metric addresses these two issues and fairly measures the importance of the identified features across layers.


Chapter 3

Methods

Occlusion-based methods are a popular approach for explainable deep learning. They can work with any classifier and usually generate explanations of good visual appeal. Thus far, occlusion-based methods have concentrated on modifying the input image. This raises the question of whether occlusion-based explainability can be applied to intermediate representations in the network. In this Chapter, we describe our proposed method and the evaluation metric in detail. In addition, we suggest two visualization techniques that fuse the multi-scale explanatory maps into a single image.

3.1 Explainability Model

3.1.1 Multi-scale Explanations Extraction

As shown in the top of Figure 3.1, our framework consists of a backbone classifier and a decoder network. The decoder network is used to extract explanations and acts as a plug-in for any classifier. In the diagram in Figure 3.1, the classifier has five blocks that generate feature maps at five different scales {F_i | i = 1, . . . , 5}, where the representations are downsampled by a factor of two between subsequent scales. During the training, the weights of the classifier are frozen, and we only update the weights of the decoders.

At each scale, the decoder takes in the output representation F_i from the classifier and produces the explanatory mask S_i:

S_i = De_i(F_i), i = 1, . . . , 5    (3.1)


Figure 3.1: The overview of our multi-scale explanatory extraction framework. Top: For each image, the classifier propagates the input signal through multiple convolution blocks at five different scales (shown in red). We feed the output representation of each scale to the decoder network (gold) and generate an explanatory mask. Applying the mask to the corresponding features will confuse the classifier. Bottom: We randomly select one mask from the 5 scales and use it to perturb the feature map at the same scale. If the mask works well, the classifier cannot extract any classification evidence from the occluded representations. During the training, the weights of the classifier are frozen, and we only update the weights of the decoders.


Figure 3.2: The diagram of our decoder. The feature map (shown in green) first goes through a residual block [3] to refine the representations and squeeze the channels. Subsequently, the 1 × 1 convolution and a logistic function are used to produce the binary explanatory mask. The 1×1 convolutional weights are shared among all the decoders.

where De_i(·) denotes the decoder at scale i. The detailed architecture is shown in Figure 3.2. We first use a residual block [3] to refine the representation and squeeze it to a fixed number of channels. Then a 1 × 1 convolution is applied to produce the explanatory map. The weights of the 1 × 1 convolution layer are shared among all the decoders: that convolution is simply a linear mapping from the refined feature map to the explanatory mask, and weight sharing reduces the number of parameters. A logistic function is further applied to approximate a binary mask. Previous explainable methods [14, 21, 22] enforce soft explanatory masks, i.e. each value in the mask is a continuous variable between 0 and 1. However, a soft mask does not completely remove the informative features: after soft masking, the non-zero occluded features can still contribute to the classifier's prediction. Furthermore, the soft mask usually comes with high-frequency noise around the object, which severely deteriorates the visual appeal of the explanations. We argue that a hard mask (0 or 1) can eliminate the noise and provide clearer and more interpretable patterns. An ideal binary function has zero gradient everywhere except at the threshold, where the gradient explodes, resulting in numerically unstable training. Instead, we adopt a steep logistic function to approximate the binary function and obtain an approximately binary mask (see Figure 3.3).
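A minimal sketch of such a per-scale decoder is shown below, assuming VGG-16 block channel widths and a steepness of 5 for the logistic function; the exact residual block configuration is an illustrative assumption, not the thesis implementation.

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Per-scale decoder sketch in the spirit of Figure 3.2.

    A small residual block refines the input features and squeezes them to a
    fixed number of channels; a shared 1x1 convolution followed by a steep
    sigmoid then produces an (approximately) binary explanatory mask.
    """
    def __init__(self, in_channels, hidden=64, shared_1x1=None, steepness=5.0):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden),
        )
        self.skip = nn.Conv2d(in_channels, hidden, 1)
        # The 1x1 head can be shared across all decoders.
        self.head = shared_1x1 if shared_1x1 is not None else nn.Conv2d(hidden, 1, 1)
        self.steepness = steepness

    def forward(self, feat):
        h = torch.relu(self.residual(feat) + self.skip(feat))
        logits = self.head(h)
        # Steep logistic function approximating a hard binary mask.
        return torch.sigmoid(self.steepness * logits)

# Example: one decoder per scale, sharing the 1x1 head (VGG-16 channel widths assumed).
shared_head = nn.Conv2d(64, 1, 1)
decoders = nn.ModuleList(
    [MaskDecoder(c, shared_1x1=shared_head) for c in (64, 128, 256, 512, 512)]
)
```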


Figure 3.3: A logistic function is used to approximate the binary function and produce binary maps. The steepness parameter is manually set to 5.

3.1.2 Feature Importance Identification

The bottom of Figure 3.1 illustrates the process of identifying the feature importance. After the maps are produced, we randomly select one map S_i and use it to mask the feature map F_i at the same scale. Afterwards, the perturbed representation is fed into the classifier, which generates its class prediction y_i^out as:

y_i^out = f(F_i ⊙ (1 − S_i)), i ∼ Uniform(1, . . . , 5)    (3.2)

where f(·) is the classifier and ⊙ represents the element-wise multiplication.

By evaluating whether a correct classification can still be performed on the perturbed representations, the classifier forces the decoders to identify the informative regions crucial to the network's decisions. The loss function is defined as:

L = −l_CE(y_i^out, y) + λ · r(a_i)    (3.3)

where l_CE(·) represents the cross-entropy loss, y denotes the ground-truth class label, a_i defines the activation ratio (i.e. the percentage of activated pixels) of the map S_i, and r(·) denotes the non-linear regularization function. The first term encourages the decoders to remove the classification clues from the feature map and confuse the classifier. The second, non-linear regularization term prevents the explanatory mask from converging to the trivial solution of being all high responses.

One crucial principle in occlusion-based methods is that we should use explanations of minimum size to confuse the classifier maximally. However, including any feature could lower the classification accuracy, and there is no criterion for choosing an appropriate size for the explanations. In practice, the problem is a trade-off between the size of the explanations and the decrease in classification accuracy. If the regularization strength is large, the explanations will only contain the most important features, but the classification accuracy will be only slightly affected. On the other hand, if the regularization is small, the classification accuracy will be heavily influenced, but the explanations will include some useless regions.

Figure 3.4: The classification accuracy of VGG [2] on the Pascal VOC dataset [43] under different amounts of perturbation of the feature map at scale 3. We adopt both random perturbation, which randomly sets features to zero, and meaningful perturbation, which intends to remove the salient features.

One possible approach to address the trade-off is to balance the two terms in Eq 3.3: the regularization strength is tuned against the classification loss. In this way, the identified features are likely to be the most relevant to the classification, and explanations of minimum size can confuse the classifier maximally. Previous occlusion methods [14, 22] often use an l1 regularization loss, which penalizes all possible activation ratios linearly. However, there is no evidence that the classification performance drops linearly as the activation ratio increases. To evaluate how the classification performance changes with the amount of perturbation of the features, we design an experiment that applies different amounts of perturbation to the representations and observe how the classifier performance varies. As seen in Figure 3.4, a small amount of perturbation does not hurt the classifier predictions much, while a large amount of perturbation causes the classification performance to drop dramatically. Hence, we propose a non-linear regularization term r(a_i) defined as:

r(a_i) = log((1 + a_i) / (1 − a_i))    (3.4)

Figure 3.5 plots the curve of our regularization loss as a function of activation ratio. Our goal with this non-linear regularization is to incur little cost for small initial masks, with a near-linear increase in the middle regime, followed by increasingly steeper costs as the mask grows larger.

Figure 3.5: Non-linear regularization loss versus the activation ratio of the explanatory map. The positive part of this curve is defined by a logit function. We incur little cost for small masks and apply near-linear strength in the middle regime. As the mask grows larger, the loss becomes increasingly steep.
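For reference, the regularizer of Eq. 3.4 and the loss of Eq. 3.3 can be sketched as follows for the single-label case (a multi-label setting would replace the cross-entropy with a sigmoid-based loss); the function names are illustrative.

```python
import torch

def activation_ratio(mask):
    """Fraction of activated pixels in an (approximately) binary mask."""
    return mask.mean()

def nonlinear_regularizer(a, eps=1e-6):
    """Eq. 3.4: r(a) = log((1 + a) / (1 - a)), increasingly steep as a -> 1."""
    a = a.clamp(0.0, 1.0 - eps)
    return torch.log((1.0 + a) / (1.0 - a))

def explanation_loss(logits_perturbed, target, mask, lam):
    """Eq. 3.3: confuse the classifier while keeping the mask small."""
    ce = torch.nn.functional.cross_entropy(logits_perturbed, target)
    return -ce + lam * nonlinear_regularizer(activation_ratio(mask))
```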

3.1.3 Dynamic Hyper-parameter

The occlusion-based methods often explicitly or implicitly set the regularization hyper-parameter to a fixed constant [22, 29, 44]. However, this setting is sub-optimal, since different images differ a lot in object size. Also, human bias is introduced in the process: a constant regularization hyper-parameter would force the model to produce masks of almost the same size for every image. Choe et al. [17] pointed out that such empirical settings rely on the validation set and thus allow for implicit supervision. To avoid this issue, we propose a dynamic hyper-parameter

λ = (1/C) Σ y · (ŷ − y_i^out)    (3.5)

where ŷ represents the class prediction for the clean input image, and C denotes the number of classes the image contains. The term ŷ − y_i^out calculates the performance drop between the clean and perturbed image for every possible class in the dataset. However, the image is not likely to contain all the classes, so we include the ground-truth label y to select the classes that the image actually contains and measure the actual performance drop. The hyper-parameter λ therefore measures the normalized average performance drop between the class prediction for the image and the class prediction for the occluded features. When the decoder outputs good explanations, the parameter λ regularizes more, shrinking the generated map and removing useless regions. On the other hand, when inadequate explanations are generated, λ is relatively small, which encourages the decoder network to collect more classification evidence and increase the size of the mask. In Figure 3.7, we illustrate how the dynamic hyper-parameter works in detail. Note that no backward gradient is calculated for λ during the back-propagation phase; otherwise it would push y_i^out to increase and could harm the decoder's performance. One major advantage of the dynamic hyper-parameter is that we do not rely on the validation set to empirically choose an appropriate regularization value. During training, the parameter is automatically tuned depending on the image and the classifier's decision. Accordingly, the decoders can produce maps of different sizes (see Figure 3.6). As the parameter is normalized by the number of classes that the image contains, it works for both multi-class and single-class image classification datasets.

Figure 3.6: Some visual examples of explanations of different sizes. Our model allows for explanations of different sizes depending on the objects. The images are taken from the Pascal VOC [43] dataset, and we use VGG [2] as the backbone classifier.


Figure 3.7: Illustration of how the dynamic hyper-parameter works. Suppose we have an image that contains the two classes horse and dog, and the label is defined as y = [1, 1, 0]. For simplicity, we assume the image is taken from a dataset that has only three classes. The classifier predicts the class scores of the clean image: ŷ = [0.9, 0.8, 0.2]. In the first iteration, the network generates a large explanation, and the perturbation removes all the salient features. The classifier cannot correctly classify the perturbed image, as there is hardly any evidence left. The parameter λ is therefore kept large, which shrinks the map and tries to remove the irrelevant regions. The explanation gets smaller in the second iteration but also omits some classification clues (the head and feet of the horse are not included). Thus, the classifier outputs a class prediction with higher confidence. Our regularization strength λ is then kept small and encourages the network to expand the explanation. After several rounds of iterations, the size of the explanations is appropriately tuned, and each identified feature is deemed important to the network's decision.
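The dynamic hyper-parameter of Eq. 3.5 amounts to a few lines; the sketch below assumes multi-hot label vectors and detaches λ so that no gradient flows through it, as noted above.

```python
import torch

def dynamic_lambda(y, y_hat, y_out):
    """Eq. 3.5 (sketch): normalized performance drop on the ground-truth classes.

    y:     multi-hot ground-truth label vector, shape (num_classes,)
    y_hat: class scores for the clean image
    y_out: class scores for the image with perturbed features
    """
    num_classes_in_image = y.sum().clamp(min=1.0)
    drop = (y * (y_hat - y_out)).sum() / num_classes_in_image
    # No gradient should flow through lambda during back-propagation.
    return drop.detach()
```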


3.2 Multi-scale Explanatory Map Fusion

Figure 3.8: Illustration of how the multi-scale explanatory maps are fused. Left: how the multi-scale explanations are fused into an RGB image; each channel of the fused RGB image represents the explanations of a certain level. Right: how the colormap is fused from the explanations at multiple resolutions; the color intensity in the fused image indicates the number of scales that deem the feature important.

Although some backpropagation-based methods can successfully visualize the intermediate layers [20, 21], no attempts have been made to fuse the multi-scale explanations. We propose two visualization techniques that combine the explanations into a single image, in order to help people quickly grasp where the network pays attention throughout the layers. As shown on the right of Figure 3.8, the first technique aims at visualizing the importance of a feature throughout the network. To this end, we count how many scales contribute to the explanations and use a colormap for visualization based on the counted number of scales. The fusion process is defined as:

S_fused^color = S_1 + S_2 + S_3 + S_4 + S_5    (3.6)

The second fusion approach is illustrated on the left of Figure 3.8. The explanatory maps of the first two scales are combined into a low-level mask by choosing the maximal value at each pixel location. The mid-level mask is generated similarly using the maps at scale 3 and scale 4. As for the high-level explanations, we directly take the map at scale 5. The fusion process is denoted as:

S_r = max(S_1, S_2),  S_g = max(S_3, S_4),  S_b = S_5    (3.7)

Afterwards, we fuse the explanations of the three levels into one single RGB image by channel-wise concatenation:

S_fused^rgb = concat(S_r, S_g, S_b)    (3.8)

The first visualization method indicates the importance of a feature by counting how many scales view the feature as crucial to the network's decisions. It creates a colormap which shows the feature importance and is easily interpretable. However, it does not tell us which scale pays attention or how the attention shifts across layers. The second visualization approach creates an RGB image where each channel represents the explanations of a certain scale. The contribution of each level can be clearly observed in the fused image, and we can see how the focus of the network shifts across layers and how the decision is progressively made. Nonetheless, the fused RGB image is less intuitive and interpretable: people may spend some time figuring out which color stands for which kind of explanations (e.g. low level, middle level, or their combination).
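Both fusion schemes can be sketched as follows; upsampling every mask to the image resolution before fusing is an assumption made for illustration, since the per-scale masks have different spatial sizes.

```python
import torch
import torch.nn.functional as F

def fuse_maps(masks, image_size=(224, 224)):
    """Fuse five per-scale explanatory masks (Eqs. 3.6-3.8, sketch).

    masks: list of five tensors of shape (1, 1, h_i, w_i), values in [0, 1]
    Returns the count-based colormap and the RGB fusion, both at image size.
    """
    up = [F.interpolate(m, size=image_size, mode="bilinear", align_corners=False)
          for m in masks]
    s1, s2, s3, s4, s5 = up

    color = s1 + s2 + s3 + s4 + s5                     # Eq. 3.6: 0..5 "votes" per pixel
    rgb = torch.cat([torch.max(s1, s2),                # low-level channel
                     torch.max(s3, s4),                # mid-level channel
                     s5], dim=1)                       # high-level channel (Eqs. 3.7-3.8)
    return color, rgb
```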

3.3 Evaluation Metric

As discussed in Sec 2.2.4, we would like to measure the importance of the identified representations to the network's decisions. Some metrics have been proposed to evaluate feature importance [24–26]. Nevertheless, none of these metrics can be applied to our method, as they are designed to measure the impact of perturbations in the image space. When perturbing the images, the occlusions are always on the image, and the evaluations are conducted on the same basis. However, when perturbing the intermediate representations, the occlusions are at different scales and may introduce a disproportionate impact of features across layers. Besides, these metrics do not take the size of the explanations into consideration, and larger explanations typically influence the network's decisions more. Thus, the previous metrics cannot be adopted.

To address the issues above, we propose a general evaluation metric which can fairly measure the impact of the explanations regardless of their layer and size. Our evaluation metric computes the classification performance gap between a generated mask and its shuffled version. Given the CNN model f(·), the feature maps F_i at scale i, and the corresponding mask S_i, we calculate the class prediction on the occluded representations as:

y_i^out = f(F_i ⊙ (1 − S_i))    (3.9)

where ⊙ denotes element-wise multiplication. Since larger explanations have a higher effect, and occluding features in different layers has a disproportionate impact on the classification accuracy, we have to consider the size of the explanations and eliminate the feature discrepancy across layers. As seen in Figure 3.9, we shuffle the explanatory mask and generate its random counterpart R_i:

R_i = Shuffle(S_i)    (3.10)

Then the impact of the random mask on the network's decisions is computed as:

y_i^rand = f(F_i ⊙ (1 − R_i))    (3.11)

and the evaluation metric is defined by the performance gap:

Performance Gap = y_i^rand − y_i^out    (3.12)

Our quantitative metric eliminates the disproportionate impact of features across layers, accounts for the variation in the size of the explanations, and provides an unbiased measure of the impact of the network's attention.

Figure 3.9: The evaluation procedure of our metric. We first shuffle the explanatory mask to generate a random mask. Then the classification accuracy of the image perturbed by the explanatory mask and by the random mask is calculated respectively. Finally, the gap between the two classification accuracies is used as our evaluation metric.
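A sketch of the metric in Eqs. 3.9-3.12 for a single scale is given below; classifier_tail, which maps scale-i features to class scores, is an assumed helper rather than part of the thesis implementation.

```python
import torch

def performance_gap(classifier_tail, features, mask, target_class):
    """Eqs. 3.9-3.12 (sketch): score drop caused by the real mask versus a
    spatially shuffled version of the same mask.

    classifier_tail: callable mapping scale-i features to class scores
    features:        tensor of shape (1, C, H, W) at scale i
    mask:            explanatory mask of shape (1, 1, H, W)
    """
    # Shuffle the mask spatially while preserving its size (activation ratio).
    flat = mask.flatten()
    shuffled = flat[torch.randperm(flat.numel())].view_as(mask)

    y_out = classifier_tail(features * (1 - mask))[0, target_class]
    y_rand = classifier_tail(features * (1 - shuffled))[0, target_class]
    return (y_rand - y_out).item()
```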


Another evaluation approach is to qualitatively inspect the visual appeal of the explanations. Due to the attention difference between humans and CNNs [17, 18], the generated map does not necessarily need to cover the whole object, but it should highlight the most informative region. We can visually observe whether the explanations fall on any part of the object.

In this Chapter, we have introduced our proposed model and evaluation metrics. In the next Chapter, we will validate our method on different datasets and compare it with other baseline models.


Chapter 4

Experiments and Results

In this Chapter, we explain the setup of our experiments. Results are presented and compared with other baselines, along with analysis.

4.1 Datasets

We conduct experiments on several benchmark datasets.

• Multi-class Classification Datasets: Pascal VOC [43] and MS-COCO [45]. In these datasets, one image may contain multiple classes, which requires finding explanations for different classes in a single mask. This poses a significant challenge for explainable methods.

• Fine-grained Classification Datasets: Oxford Flowers (OF) [46] and Caltech-UCSD Birds (CUB) [47]. These two datasets provide images belonging to multiple subordinate categories of a super-category (flower or bird).

• Natural Adversarial Image Dataset: ImageNet-A [48]. The dataset consists of natural adversarial images with complex scenes that can fool deep models. On this dataset, classifiers pre-trained on ImageNet [49] obtain a classification accuracy below 10%.

Table 4.1 summarizes the number of classes and the size of these datasets.

Dataset | Number of Classes | Training Set Size | Validation Set Size
Pascal VOC 2012 [43] | 20 | 5,847 | 5,847
MS-COCO 2017 [45] | 80 | 118,287 | 5,000
CUB [47] | 200 | 5,774 | 5,994
OF [46] | 120 | 7,372 | 819
ImageNet-A [48] | 200 | 6,750 | 750

Table 4.1: Publicly available datasets used in our experiments. For the OF [46] and ImageNet-A [48] datasets, we split 90% of the images into the training set and the rest into the validation set.

For the datasets OF [46] and ImageNet-A [48], we use the first 90% of the images as the training set and the last 10% as the validation set. The neural networks pre-trained on ImageNet [49] are fine-tuned and evaluated on the same dataset. The output dimension of the last fully connected layer is changed to fit the number of classes, and we fine-tune the parameters of the whole network. For the multi-label classification datasets [43, 45], where one image can have more than one class, we use a sigmoid activation as the last output function, while softmax is adopted for the other datasets. The classification accuracy of the models after fine-tuning is displayed in Table 4.2.

Accuracy (%) | VGG-16 [2] | ResNet-50 [3]
Pascal VOC 2012 [43] | 85.19 | 88.89
MS-COCO 2017 [45] | 76.18 | 76.55
CUB [47] | 83.47 | 84.87
OF [46] | 99.39 | 99.40
ImageNet-A [48] | 98.28 | 99.18

Table 4.2: Classification accuracy of the models after fine-tuning on the same dataset.

4.2 Baseline Models

We compare our model with other explainable models that, like ours, can create multi-scale visualizations within the network. Note that all these approaches belong to the backpropagation-based explainability methods.

• Gradient-weighted Class Activation Mapping (Grad-CAM) [19] calculates the backward gradients for each class with respect to the feature maps. These gradients flowing back are pooled over the width and height of the feature map to obtain the pixel importance weights. A weighted combination of the forward activations is then performed to generate the explanations.

• Excitation BackProp (EBP) [21] integrates backward gradients and forward activations to compute the probability of each pixel. Then the pixels are selectively highlighted to produce the explanatory maps.

• Norm-Grad [20] first computes the spatial contributions of the gradient of the convolutional weights, and subsequently transforms these contributions into explanatory maps using an aggregation function.

4.3 Implementation Details

All the experiments were implemented using PyTorch. VGG-16 [2] and ResNet-50 [3] pre-trained on ImageNet [49] were fine-tuned on the specific dataset and used as our backbone classifiers. Throughout all the fine-tuning experiments, we used the Adam [50] optimizer with learning rate 0.00005, set the batch size to 32, and trained for 15 epochs. The input images are normalized and reshaped to the spatial size of 224 × 224.
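The fine-tuning setup described above roughly corresponds to the following sketch (data loading and the training loop omitted); the torchvision weight identifier and the multi-label loss choice are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Replace the classifier head and set up the optimizer as described above.
num_classes = 20                      # e.g. Pascal VOC
model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
criterion = nn.BCEWithLogitsLoss()    # sigmoid outputs for the multi-label datasets
```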

For training the decoders, the batch size was set to 4, and the Adam [50] optimizer was used with learning rate 0.00001, β1 = 0.9 and β2 = 0.999. During training, the weights of the classifier are frozen and are not modified in any way. Before being fed into the classifier, all the images were normalized and resized to the spatial size of 224 × 224. Accordingly, as displayed in Table 4.3, the spatial dimension of the feature maps is downsampled by a factor of 2 between subsequent scales. Notice that different networks use different modules to reduce the feature dimension: VGG [2] applies a 2 × 2 max-pooling operation, while ResNet [3] uses convolutions with stride 2 between convolution layers.

Image | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5
224×224 | 112×112 | 56×56 | 28×28 | 14×14 | 7×7

Table 4.3: The spatial dimension of the image and the feature map at each scale. Between subsequent scales, the spatial dimensions are downsampled by a factor of 2.
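Putting Sec 3.1 together, one decoder training step could look like the sketch below; the VGG-16 block boundaries, the frozen-classifier handling, and the single-label loss are illustrative assumptions rather than the exact implementation.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# Block boundaries for VGG-16 and all names below are illustrative assumptions.
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
for p in vgg.parameters():
    p.requires_grad_(False)                    # classifier weights stay frozen

blocks = [vgg.features[:5], vgg.features[5:10], vgg.features[10:17],
          vgg.features[17:24], vgg.features[24:]]
head = nn.Sequential(vgg.avgpool, nn.Flatten(), vgg.classifier)

def training_step(image, label, decoders, optimizer, lam):
    """One decoder update (Figure 3.1, bottom), single-label case for brevity."""
    feats, x = [], image
    for block in blocks:                       # forward pass, caching per-scale features
        x = block(x)
        feats.append(x)

    i = random.randrange(5)                    # randomly pick one scale
    mask = decoders[i](feats[i])               # explanatory mask S_i
    x = feats[i] * (1 - mask)                  # occlude the features (Eq. 3.2)
    for block in blocks[i + 1:]:               # propagate through the remaining blocks
        x = block(x)
    logits = head(x)

    a = mask.mean()                            # activation ratio a_i
    # Eq. 3.3 with the regularizer of Eq. 3.4; lam can come from dynamic_lambda().
    loss = -F.cross_entropy(logits, label) + lam * torch.log((1 + a) / (1 - a + 1e-6))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: optimizer = torch.optim.Adam(decoders.parameters(), lr=1e-5)
```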
