Table of Contents
  1. Rules of thumb for setting the learning rate
  2. Default parameter settings of the Convolution layer
  3. Loss functions
  4. Data Layers

Convnet Benchmark
The link above compares the speed of existing CNN implementations.

Rules of thumb for setting the learning rate

The relevant solver hyperparameters are the learning rate $ \alpha $ and the momentum $ \mu $.

A good strategy for deep learning with SGD is to initialize the learning rate $ \alpha $ to a value around $ \alpha \approx 0.01 = 10^{-2} $, and drop it by a constant factor (e.g., 10) throughout training whenever the loss reaches an apparent "plateau", repeating this several times.
Generally, you probably want to use a momentum $ \mu = 0.9 $ or a similar value.
By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both more stable and faster.

To use a learning rate policy like this, you can put the following lines somewhere in your solver prototxt file:

base_lr: 0.01     # begin training at a learning rate of 0.01 = 1e-2

lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
                  # by a factor of gamma every stepsize iterations

gamma: 0.1        # drop the learning rate by a factor of 10
                  # (i.e., multiply it by a factor of gamma = 0.1)

stepsize: 100000  # drop the learning rate every 100K iterations

max_iter: 350000  # train for 350K iterations total

momentum: 0.9

Under the above settings, we'll always use momentum $ \mu = 0.9 $.
We'll begin training at a base_lr of $ \alpha = 0.01 = 10^{-2} $ for the first 100,000 iterations, then multiply the learning rate by gamma ($ \gamma $) and train at $ \alpha' = \alpha \gamma = (0.01)(0.1) = 0.001 = 10^{-3} $ for iterations 100K-200K, then at $ \alpha'' = 10^{-4} $ for iterations 200K-300K, and finally train until iteration 350K (since we have max_iter: 350000) at $ \alpha''' = 10^{-5} $.
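
To make the schedule concrete, here is a minimal sketch (plain Python, not Caffe's solver code) of the "step" policy, which amounts to lr = base_lr * gamma^floor(iter / stepsize):

# Minimal sketch of the "step" learning rate policy described above;
# an illustration, not Caffe's actual implementation.

def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100000):
    """Learning rate at a given iteration under the "step" policy."""
    return base_lr * (gamma ** (iteration // stepsize))

if __name__ == "__main__":
    for it in (0, 99999, 100000, 200000, 300000, 349999):
        print(it, step_lr(it))
    # 0.01 up to iteration 99,999, then 0.001, 0.0001, and finally 1e-05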

Note that the momentum setting $ \mu $ effectively multiplies the size of your updates by a factor of $ \frac{1}{1 - \mu} $ after many iterations of training, so if you increase $ \mu $, it may be a good idea to decrease $ \alpha $ accordingly (and vice versa).

For example, with $ \mu = 0.9 $, we have an effective update size multiplier of $ \frac{1}{1 - 0.9} = 10 $.
If we increased the momentum to $ \mu = 0.99 $, we’ve increased our update size multiplier to 100, so we should drop $ \alpha $ (base_lr) by a factor of 10.
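
Where the $ \frac{1}{1 - \mu} $ factor comes from, as a quick sketch (assuming, for illustration only, a roughly constant gradient $ g $ across iterations): the momentum update accumulates past gradients as a geometric series,

$$ v_{t} = \mu v_{t-1} + \alpha g \;\;\Rightarrow\;\; v_{\infty} = \alpha g \left(1 + \mu + \mu^{2} + \cdots\right) = \frac{\alpha g}{1 - \mu}, $$

so the plain SGD step $ \alpha g $ ends up scaled by $ \frac{1}{1 - \mu} $ (10 for $ \mu = 0.9 $, 100 for $ \mu = 0.99 $).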

Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even work at all!) in every situation.
If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.

Default parameter settings of the Convolution layer

  • weight_filler [default type: ‘constant’ value: 0]
  • bias_term [default true]: specifies whether to learn and apply a set of additive biases to the filter outputs
  • pad (or pad_h and pad_w) [default 0]: specifies the number of pixels to (implicitly) add to each side of the input
  • stride (or stride_h and stride_w) [default 1]: specifies the intervals at which to apply the filters to the input
  • group (g) [default 1]: If g > 1, we restrict the connectivity of each filter to a subset of the input. Specifically, the input and output channels are separated into g groups, and the i-th output group's channels will only be connected to the i-th input group's channels.
layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  blobs_lr: 1          # learning rate multiplier for the filters
  blobs_lr: 2          # learning rate multiplier for the biases
  weight_decay: 1      # weight decay multiplier for the filters
  weight_decay: 0      # weight decay multiplier for the biases
  convolution_param {
    num_output: 96     # learn 96 filters
    kernel_size: 11    # each filter is 11x11
    stride: 4          # step 4 pixels between each filter application
    weight_filler {
      type: "gaussian" # initialize the filters from a Gaussian
      std: 0.01        # distribution with stdev 0.01 (default mean: 0)
    }
    bias_filler {
      type: "constant" # initialize the biases to zero (0)
      value: 0
    }
  }
}
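
When choosing kernel_size, pad, and stride, it helps to know the spatial output size they imply. The sketch below is plain Python (not Caffe code) using the standard output-size formula, applied to the conv1 example above; the 227x227 input is an assumption (as in AlexNet-style networks), not something stated in this post.

# Minimal sketch (not Caffe code): spatial output size of a convolution layer.
# output = floor((input + 2 * pad - kernel_size) / stride) + 1

def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    return (input_size + 2 * pad - kernel_size) // stride + 1

# conv1 above: kernel_size 11, stride 4, pad 0 (the default).
# Assuming a 227x227 input (an assumption, not given in this post):
print(conv_output_size(227, kernel_size=11, stride=4, pad=0))  # -> 55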

Loss functions

  1. Softmax (LayerType: SOFTMAX_LOSS)
  2. Sum-of-Squares / Euclidean (LayerType: EUCLIDEAN_LOSS)
  3. Hinge / Margin (LayerType: HINGE_LOSS)
  4. Sigmoid Cross-Entropy (LayerType: SIGMOID_CROSS_ENTROPY_LOSS)
  5. Infogain (LayerType: INFOGAIN_LOSS)
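
As a rough illustration of what the first of these, SOFTMAX_LOSS, computes (a numpy sketch, not Caffe's implementation): the layer applies a softmax to the raw class scores and then averages the negative log-probability of the correct class over the batch.

import numpy as np

# Numpy sketch of softmax followed by multinomial logistic loss;
# illustration only, not Caffe code.

def softmax_loss(scores, labels):
    """scores: (N, C) raw class scores; labels: (N,) integer class indices."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
print(softmax_loss(scores, labels))  # small loss: both examples score their true class highest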

Data Layers

Data enters Caffe through data layers: they lie at the bottom of nets. Data can come from efficient databases (LevelDB or LMDB), directly from memory, or, when efficiency is not critical, from files on disk in HDF5 or common image formats.

Common input preprocessing (mean subtraction, scaling, random cropping, and mirroring) is available by specifying TransformationParameters.
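
As a rough numpy sketch of what those preprocessing steps amount to (an illustration only, not Caffe's actual data transformer; the crop size and mean values below are arbitrary assumptions):

import numpy as np

# Illustration of the common input transformations listed above
# (random crop, random mirror, mean subtraction, scaling) applied
# to an (H, W, C) float image; not Caffe's DataTransformer.

def transform(image, mean, scale=1.0, crop_size=227, mirror=True):
    h, w, _ = image.shape
    # random crop_size x crop_size crop
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    image = image[top:top + crop_size, left:left + crop_size, :]
    # random horizontal mirroring
    if mirror and np.random.rand() < 0.5:
        image = image[:, ::-1, :]
    # per-channel mean subtraction, then scaling
    return (image - mean) * scale

img = np.random.rand(256, 256, 3).astype(np.float32)
out = transform(img, mean=np.array([0.5, 0.5, 0.5]), scale=1.0)
print(out.shape)  # (227, 227, 3)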
