Notes for <Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps>
This paper addresses the visualisation of image classification models. It considers two visualisation techniques, both based on computing the gradient of the class score with respect to the input image. The first generates an image that maximises the class score, thus visualising the notion of the class captured by a ConvNet. The second computes a class saliency map specific to a given image and class.
Class Model Visualisation
Given a learnt classification ConvNet and a class of interest, the visualisation method consists in numerically generating an image that is representative of the class in terms of the ConvNet's class scoring model:
$$\arg\max_{I}\; S_c(I) - \lambda \|I\|_{2}^{2}$$
Here, $S_c(I)$ is the score of the class $c$, computed by the classification layer of the ConvNet for an image $I$, and $\lambda$ is a regularisation parameter. We would like to find an L2-regularised image such that the score $S_c$ is high.
A locally-optimal $I$ can be found by the back-propagation method. The procedure is related to the ConvNet training procedure, where back-propagation is used to optimise the layer weights. The difference is that in our case the optimisation is performed with respect to the input image, while the weights are fixed to those found during the training stage. We initialised the optimisation with the zero image (in our case, the ConvNet was trained on zero-centred image data), and then added the training set mean image to the result.
Use the (unnormalised) class scores $S_c$ rather than the class posteriors. The reason is that the maximisation of the class posterior can be achieved by minimising the scores of other classes; optimising $S_c$ instead ensures that the optimisation concentrates only on the class in question, $c$.
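As a concrete illustration, here is a minimal sketch of the gradient-ascent procedure in PyTorch, assuming a pretrained torchvision model; the step count, learning rate, and regularisation weight are illustrative choices rather than the paper's settings, and the image is scored with the final (pre-softmax) layer, as discussed above.

```python
# Minimal sketch: class model visualisation by gradient ascent on the input image.
# Assumes a pretrained torchvision ConvNet; hyperparameters are illustrative only.
import torch
import torchvision.models as models

def visualise_class(model, class_idx, steps=200, lr=1.0, lam=1e-4):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                   # weights stay fixed; only the image is optimised
    # Start from the zero image (the net is assumed to take zero-centred inputs).
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)
    optimiser = torch.optim.SGD([img], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        score = model(img)[0, class_idx]          # unnormalised class score S_c(I)
        loss = -score + lam * img.pow(2).sum()    # maximise S_c(I) - lambda * ||I||_2^2
        loss.backward()                           # gradient with respect to the image
        optimiser.step()
    return img.detach()

model = models.vgg16(weights="IMAGENET1K_V1")
canvas = visualise_class(model, class_idx=130)    # any ImageNet class index
# For display, add back the training-set mean image (un-normalise) before plotting.
```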
Image-Specific Class Saliency Visualisation
A classification ConvNet can be queried about the spatial support of a particular class in a given image.
Intuition: magnitude of the derivative indicates which pixels need to be changed the least to affect the class score the most. One can expect that such pixels correspond to the object location in the image.
- Class Saliency Extraction: compute the gradient of $S_c$ with respect to the image; the saliency at each pixel is the maximum of the absolute gradient values over the colour channels (see the sketch after this list).
- Weakly Supervised Object Localisation
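A minimal sketch of the saliency extraction step, assuming the same kind of pretrained model and an already-preprocessed input tensor; the single backward pass and the max over colour channels follow the paper, everything else is illustrative.

```python
# Minimal sketch: image-specific class saliency via one backward pass.
# Assumes `model` is a pretrained ConvNet and `img` a preprocessed (1, 3, H, W) tensor.
import torch

def class_saliency(model, img, class_idx):
    model.eval()
    img = img.clone().requires_grad_(True)
    score = model(img)[0, class_idx]          # unnormalised class score for the given image
    score.backward()                          # dS_c / d(image) in a single backward pass
    grad = img.grad[0]                        # shape (3, H, W)
    # Saliency at each pixel: maximum of the absolute gradient over colour channels.
    return grad.abs().max(dim=0).values       # shape (H, W)
```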
Given an image and the corresponding class saliency map, we compute the object segmentation mask using GraphCut colour segmentation. The use of colour segmentation is motivated by the fact that the saliency map might capture only the most discriminative part of an object, so saliency thresholding might not highlight the whole object. Therefore, it is important to be able to propagate the thresholded map to other parts of the object, which we aim to achieve here using colour continuity cues. Foreground and background colour models were set to be Gaussian Mixture Models. The foreground model was estimated from the pixels with saliency higher than a threshold, set to the 95% quantile of the saliency distribution in the image; the background model was estimated from the pixels with saliency smaller than the 30% quantile.
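A sketch of the localisation step under a substitution: the paper uses GraphCut colour segmentation with GMM foreground/background colour models, while here OpenCV's GrabCut (which also fits GMM colour models) stands in for it, seeded from the 95% and 30% saliency quantiles as described above. Function and variable names are my own.

```python
# Sketch: weakly supervised object localisation from a saliency map.
# GrabCut (GMM-based colour segmentation in OpenCV) substitutes for the paper's GraphCut.
import cv2
import numpy as np

def localise(image_bgr, saliency):
    """image_bgr: (H, W, 3) uint8 image; saliency: (H, W) float array."""
    fg_thr = np.quantile(saliency, 0.95)      # foreground seeds: top 5% most salient pixels
    bg_thr = np.quantile(saliency, 0.30)      # background seeds: bottom 30%
    mask = np.full(saliency.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    mask[saliency >= fg_thr] = cv2.GC_FGD
    mask[saliency <= bg_thr] = cv2.GC_BGD
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)
    # Pixels labelled (probable) foreground form the object segmentation mask.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```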
Relation to Deconvolutional Networks
DeconvNet-based reconstruction of the $n$-th layer input $X_n$ is either equivalent or similar to computing the gradient of the visualised neuron activity $f$ with respect to $X_n$, so the DeconvNet effectively corresponds to gradient back-propagation through a ConvNet.
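The one place where the two differ is the ReLU: gradient back-propagation zeroes the signal where the layer's forward input was negative, whereas the DeconvNet zeroes it where the backward signal itself is negative. A tiny NumPy illustration of the two gating rules (my own, not from the paper's code):

```python
# Toy illustration of the ReLU difference between back-propagation and the DeconvNet.
import numpy as np

x = np.array([-2.0, -0.5, 1.0, 3.0])   # forward input to a ReLU
r = np.array([ 0.7, -1.2, -0.3, 0.9])  # signal arriving from the layer above

backprop  = r * (x > 0)   # gate by the sign of the forward input   -> [0., 0., -0.3, 0.9]
deconvnet = r * (r > 0)   # gate by the sign of the backward signal -> [0.7, 0., 0., 0.9]
```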