[cs231n] Transfer Learning and Fine-tuning Convolutional Neural Networks
Lecture Note
Lecture Slides (contains extra material)
Reference
- Caffe Model Zoo: where people share their trained networks.
- Visualizing and Understanding Convolutional Networks: helps understand the source of ConvNet performance.
- CNN Features off-the-shelf: an Astounding Baseline for Recognition trains SVMs on features from an ImageNet-pretrained ConvNet and reports several state-of-the-art results.
- DeCAF reported similar findings in 2013. The framework in this paper (DeCAF) was a Python-based precursor to the C++ Caffe library.
- How transferable are features in deep neural networks? studies the transfer learning performance in detail, including some unintuitive findings about layer co-adaptations.
Transfer Learning
It is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images in 1000 categories), and then use the ConvNet either as an initialization or as a fixed feature extractor for the task of interest. The three major transfer learning scenarios look as follows:
- ConvNet as fixed feature extractor
Remove the last fully-connected layer (whose outputs are the 1000 class scores for the original task, e.g. ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. In an AlexNet, this computes a 4096-D vector for every image, containing the activations of the hidden layer immediately before the classifier. We call these features CNN codes. It is important for performance that these codes are ReLUd (i.e. thresholded at zero) if they were also thresholded during the training of the ConvNet on ImageNet (as is usually the case). Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) on the new dataset.
- Fine-tuning the ConvNet
Use a small (or zero) learning rate for the lower convolutional layers and a larger learning rate for the fully-connected layers.
The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but also to fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful for many tasks, while later layers become progressively more specific to the details of the classes in the original dataset.
- When and how to fine-tune?
- New dataset is small and similar to the original dataset: directly use the CNN codes to train a linear classifier (e.g. SVM); fine-tuning the whole network risks overfitting.
- New dataset is large and similar to the original dataset: fine-tune the full network.
- New dataset is small but very different from the original dataset: it might work better to train an SVM classifier on activations from somewhere earlier in the network.
- New dataset is large and very different from the original dataset: train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model.
Practical Advice
Learning rates. It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly-initialized) weights for the new linear classifier that computes the class scores of your new dataset.
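This per-layer learning-rate advice maps directly onto optimizer parameter groups (a sketch; the tiny backbone and the 1e-4 / 1e-3 values are illustrative assumptions, not from the notes — the point is the ~10x gap between pretrained and fresh weights):

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a freshly initialized classifier.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(8, 10)

# Two parameter groups: fine-tuned ConvNet weights take much smaller
# steps than the randomly initialized classifier on top.
optimizer = torch.optim.SGD([
    {"params": backbone.parameters(), "lr": 1e-4},    # pretrained: small steps
    {"params": classifier.parameters(), "lr": 1e-3},  # new head: larger steps
], momentum=0.9)

print([g["lr"] for g in optimizer.param_groups])  # [0.0001, 0.001]
```

Setting the backbone's learning rate to zero recovers the fixed-feature-extractor scenario as a special case.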
Summary: What makes ConvNets tick?
- depth
- small filter sizes
- Conv layers > FC layers