Table of Contents
  1. Pipeline
  2. Some findings
  3. Experiments

A deconvnet can be thought of as a convnet that uses the same components (filtering, pooling) in reverse: instead of mapping pixels to features, it maps features back to pixels.
  * Original Paper
  * One implementation
  * One blog

Pipeline

An input image is presented to the convnet and features are computed throughout the layers. To examine a given convnet activation, we set all other activations in that layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify, and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is repeated until input pixel space is reached.

  1. Unpooling: max pooling is non-invertible, so during the forward pass we save *switches* that record the location of the local max in each pooling region. Unpooling places each value back at its recorded location and sets all other locations to zero.

  2. Rectification: after unpooling (strictly, the negative values come from the transposed filtering of the layer above), the reconstructed feature map can contain negative values, whereas the convnet's ReLU guarantees the original feature maps are >= 0. To match this, the reconstruction is also passed through a ReLU.

  3. Filtering (deconv): the convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet applies transposed versions of the same filters to the rectified maps, not to the output of the layer beneath. In practice this means flipping each filter vertically and horizontally. A minimal code sketch of the whole round trip follows this list.
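Here is a minimal PyTorch sketch of one conv → deconv round trip (my own illustration; the layer sizes are arbitrary, not the paper's model):

```python
import torch
import torch.nn as nn

torch.set_grad_enabled(False)  # inference only

conv = nn.Conv2d(3, 16, kernel_size=5, padding=2)
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)  # keep the switches
relu = nn.ReLU()

x = torch.randn(1, 3, 32, 32)
feat = relu(conv(x))
pooled, switches = pool(feat)  # switches record the argmax locations

# Examine one activation: zero out everything else in the layer.
probe = torch.zeros_like(pooled)
idx = pooled.flatten().argmax()
probe.view(-1)[idx] = pooled.view(-1)[idx]

# Deconvnet pass: (i) unpool with the saved switches, (ii) rectify,
# (iii) filter with transposed versions of the same learned weights.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
up = relu(unpool(probe, switches, output_size=feat.shape))
deconv = nn.ConvTranspose2d(16, 3, kernel_size=5, padding=2, bias=False)
deconv.weight.data = conv.weight.data  # weight shapes (16, 3, 5, 5) match
recon = deconv(up)                     # reconstruction in pixel space
print(recon.shape)                     # torch.Size([1, 3, 32, 32])
```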

Some findings

  1. Feature Visualization: for each neuron, find the nine validation images with the top activation scores, use the deconvnet to project each activation back to input pixel space, and show the corresponding image patch (this is how the figures in the paper are produced). Layers 2, 3, and 4 appear to describe increasingly complex attributes.

  2. Feature Evolution during Training: the lower layers of the model converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50).

  3. Feature Invariance: images are translated, rotated, and scaled by varying degrees while observing the changes in the feature vectors at the top and bottom layers of the model, relative to the untransformed image. Small transformations have a dramatic effect in the first layer but a far smaller impact at the top feature layer, where the response is quasi-linear for translation and scaling. The network output is stable to translations and scalings; in general, it is not invariant to rotation.

  4. Architecture Selection: while visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. For example, the first-layer filters are a mix of extremely high and low frequency information with little coverage of the mid frequencies, and the second-layer visualization shows aliasing artifacts caused by the large stride of 4 in the first-layer convolutions. To remedy these problems, the authors (i) reduced the first-layer filter size from 11x11 to 7x7 and (ii) changed the convolution stride from 4 to 2, which improves classification performance (a configuration sketch follows this list).

  5. Occlusion Sensitivity: answers whether the model is truly identifying the location of the object in the image or just using the surrounding context, by occluding different portions of the input image with a grey square and monitoring the output of the classifier (see the sketch after this list).

  6. Correspondence Analysis: answers whether a CNN implicitly models correspondence between specific object parts in different images. Take five randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image. Then compute the difference between the layer-l feature vectors of the original and occluded versions of each image, and check how consistent that difference is across images (the measure is written out after this list).
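For finding 4, the architecture change amounts to a different first-layer configuration (PyTorch sketch; the 96 filters follow the AlexNet-style model the paper modifies):

```python
import torch.nn as nn

conv1_old = nn.Conv2d(3, 96, kernel_size=11, stride=4)  # extreme frequencies, aliasing
conv1_new = nn.Conv2d(3, 96, kernel_size=7, stride=2)   # smaller filters, denser stride
```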
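For finding 5, a minimal sketch of the occlusion sweep (my own illustration; `model` is any classifier over a CHW `image` tensor, and the patch size, stride, and grey fill value are assumptions):

```python
import torch

def occlusion_map(model, image, target_class, patch=32, stride=16, fill=0.5):
    """Slide a grey square over the image and record the target-class probability."""
    model.eval()
    _, H, W = image.shape
    heat = []
    with torch.no_grad():
        for top in range(0, H - patch + 1, stride):
            row = []
            for left in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = fill  # grey square
                prob = model(occluded.unsqueeze(0)).softmax(dim=1)[0, target_class]
                row.append(prob.item())
            heat.append(row)
    return torch.tensor(heat)  # low values mark regions the classifier relies on
```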
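For finding 6, the consistency measure (as I read it in the paper; notation assumed): with $x_i^l$ and $\tilde{x}_i^l$ the layer-$l$ feature vectors of the original and occluded image $i$,

```latex
\epsilon_i^l = x_i^l - \tilde{x}_i^l, \qquad
\Delta_l = \sum_{i,j=1,\; i \neq j}^{5} \mathcal{H}\big(\operatorname{sign}(\epsilon_i^l),\, \operatorname{sign}(\epsilon_j^l)\big)
```

where $\mathcal{H}$ is Hamming distance; a lower $\Delta_l$ means the occlusion changes the features in the same way across images, i.e. better part correspondence.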

Experiments

  1. Changing the structure according to the visualizations improves the classification accuracy on ImageNet 2012.
  2. Varying the model size: by adjusting the size of layers, or removing them entirely. Removing the fully connected layers (6, 7) gives only a slight increase in error, which is surprising given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolutional layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse, suggesting that the overall depth of the model is important for obtaining good performance. In the modified model (Fig. 3 of the paper), changing the size of the fully connected layers makes little difference to performance (the same holds for the model of Krizhevsky et al. [18]). However, increasing the size of the middle convolutional layers does give a useful gain in performance, but increasing these while also enlarging the fully connected layers results in over-fitting.

  3. Feature Generalization: the pretrain-then-finetune recipe works well on Caltech-101 and Caltech-256 but less well on PASCAL 2012, since PASCAL images differ substantially from ImageNet images (a transfer sketch follows this list).
  4. Feature Analysis: training a linear SVM or softmax classifier on the features from each layer shows that discriminative power increases steadily with depth; the deepest layers give the best accuracy.
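For experiment 3, a minimal transfer sketch (my own illustration; `torchvision`'s AlexNet stands in for the paper's model, which is not packaged):

```python
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1")  # ImageNet-pretrained stand-in
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained features
model.classifier[6] = nn.Linear(4096, 101)       # new softmax head for Caltech-101
# Train only the new head on the target dataset; the frozen features carry
# over, which is why this works well when the domains are similar.
```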