[cs231n] Neural Networks Part 3: Learning and Evaluation
- 1. Gradient Check
- 2. Before learning: sanity checks Tips/Tricks
- 3. Babysitting the learning process
- 4. Parameter updates
Lecture Note
Hinton Note on the same topic
Reference
Read
- SGD tips and tricks from Leon Bottou
- Efficient BackProp (pdf) from Yann LeCun
- Practical Recommendations for Gradient-Based Training of Deep Architectures from Yoshua Bengio
About Nesterov’s Accelerated Gradient (NAG)
Advances in optimizing Recurrent Networks
Ilya Sutskever’s thesis
L-BFGS VS SGD
- On Optimization Methods for Deep Learning from Le et al. is a paper from 2011 comparing SGD vs. L-BFGS. Some of its conclusions have since been challenged.
- Large Scale Distributed Deep Networks is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
- The SFO algorithm strives to combine the advantages of SGD with the advantages of L-BFGS.
Gradient Check
Use the centered formula:
$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}
$$
$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}
$$
Use the relative error for the comparison:
$$
\frac{\mid f'_a - f'_n \mid}{\mid f'_a \mid + \mid f'_n \mid}
$$
where \(f'_a\) is the analytic gradient, \(f'_n\) is the numerical gradient, and \(h\) in the centered formula above is a very small number, in practice approximately 1e-5 or so.
Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to add both to make it symmetric and to prevent underflow in the case where one of the two is zero. However, one must explicitly keep track of the case where both are zero (this can often be the case with ReLUs for example) and pass the gradient check in that edge case. In practice:
- relative error > 1e-2 usually means the gradient is probably wrong
- 1e-2 > relative error > 1e-4 should make you feel uncomfortable
- 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
- 1e-7 and less you should be happy.
- the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10 layer network, a relative error of 1e-2 might be okay because the errors build up on the way.
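Putting the centered formula and the relative error together, here is a minimal sketch on a toy scalar function (the function `f` and check point `x` are purely illustrative):

```python
def rel_error(fa, fn):
    # symmetric relative error; both terms in the denominator to prevent
    # underflow when one of the two is zero
    denom = abs(fa) + abs(fn)
    return 0.0 if denom == 0 else abs(fa - fn) / denom  # both zero -> treat as passing

def numerical_gradient(f, x, h=1e-5):
    # centered difference formula: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2.0 * h)

# toy check: f(x) = x**2 with analytic gradient 2x, evaluated at x = 3.0
f = lambda x: x ** 2
x = 3.0
print(rel_error(2 * x, numerical_gradient(f, x)))  # expect well below 1e-7
```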
Kinks in the objective:
Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), the SVM loss, or Maxout neurons. If the finite-difference evaluation crosses a kink, the numerical gradient will not be exact, and in a large network this is not a rare corner case since such max terms are everywhere.
Be careful with the step size h: smaller is not always better, since a very small h runs into numerical precision problems. If the gradient check fails, try changing h to 1e-4 or 1e-6.
Gradcheck during a “characteristic” mode of operation: check multiple points instead of just one. It is sometimes better to use a short burn-in time, during which the network is allowed to learn, and perform the gradient check after the loss starts to go down.
Don’t let the regularization overwhelm the data: one danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently.
Remember to turn off dropout/augmentations: when performing the gradient check, turn off any non-deterministic effects in the network, such as dropout and random data augmentations.
Before learning: sanity checks Tips/Tricks
- Look for correct loss at chance performance: at initialization the weights are small random values, so the loss should match chance performance. Compute the initial loss and check it against the expected value; for example, a Softmax classifier over 10 classes should start near \(-\ln(0.1) \approx 2.302\) (see the sketch after this list).
- Overfit a tiny subset of data: before training on the full dataset, try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost (turn off regularization).
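A minimal sketch of the chance-loss check for a 10-class softmax classifier (the random scores and labels below are illustrative placeholders for the output of an untrained network):

```python
import numpy as np

num_classes, batch_size = 10, 100
# small random weights give near-uniform scores, so every class gets
# probability ~1/num_classes and the softmax loss is about -ln(1/10) ~= 2.302
scores = 0.001 * np.random.randn(batch_size, num_classes)
y = np.random.randint(num_classes, size=batch_size)  # random labels, just for the check

shifted = scores - scores.max(axis=1, keepdims=True)                   # numeric stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)   # softmax
loss = -np.log(probs[np.arange(batch_size), y]).mean()                 # data loss only

print(loss, -np.log(1.0 / num_classes))  # both should be close to 2.302
```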
Babysitting the learning process
Monitor useful quantities during training of a neural network.
Epochs: The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size.
Loss Functions
The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. The shape of the loss curve over time can tell you a lot, especially about the learning rate (see the cartoon diagram in the original notes).
Train/Val accuracy
The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
Ratio of weights:updates
Track the ratio of the update magnitudes to the parameter value magnitudes (the updates, not the raw gradients; e.g. in vanilla SGD this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this, the learning rate might be too low; if it is higher, the learning rate is likely too high.
Alternatively, you can track the norm or the (min, max) of W; for details, see the original notes.
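A minimal sketch of this check for a vanilla SGD update; W and dW stand for a parameter array and its gradient, and the random arrays below are only placeholders so the snippet runs on its own:

```python
import numpy as np

learning_rate = 1e-3
W = 0.01 * np.random.randn(100, 10)   # placeholder parameters
dW = 0.01 * np.random.randn(100, 10)  # placeholder gradient from backprop

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW           # vanilla SGD update
update_scale = np.linalg.norm(update.ravel())
W += update                            # the actual parameter update
print(update_scale / param_scale)      # rough heuristic: should be around 1e-3
```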
Activation / Gradient distributions per layer
Keep track of the variance of the activations and their gradients in each layer. With a poor initialization the activations can vanish extremely quickly in the higher layers of the network, which in turn leads to weight gradients near zero, since during backprop they depend multiplicatively on the activations. With correctly normalized weights, the variances look much more uniform across layers.
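A minimal sketch of this kind of monitoring, assuming each layer's activations have been collected into a dict during a forward pass (the layer names and random arrays below are illustrative):

```python
import numpy as np

# illustrative per-layer activations; in practice these come from a forward pass
activations = {
    'layer1': np.random.randn(128, 512) * 1.0,
    'layer2': np.random.randn(128, 512) * 0.5,
    'layer3': np.random.randn(128, 512) * 0.1,
}

for name in sorted(activations):
    a = activations[name]
    # watch for means drifting and variances collapsing toward zero with depth
    print('%s: mean %+.4f, std %.4f' % (name, a.mean(), a.std()))
```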
First-layer Visualizations
Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
Parameter updates
SGD and bells and whistles
Vanilla update: fixed learning rate.
```python
# Vanilla update
x += - learning_rate * dx
```
Momentum update: almost always enjoys better convergence rates than the vanilla update on deep networks. Momentum simply adds a fraction m of the previous weight update to the current one. It is often necessary to reduce the global learning rate µ when using a lot of momentum (m close to 1): if you combine a high learning rate with a lot of momentum, you will rush past the minimum with huge steps!
When the gradient keeps changing direction, momentum smooths out the variations. This is particularly useful when the network is not well-conditioned. (When the gradient direction keeps flipping and the updates zigzag, adding momentum speeds up convergence.)
Here v is a variable that is initialized at zero, and mu is an additional hyperparameter; when cross-validated, it is usually set to values such as [0.5, 0.9, 0.95, 0.99].
```python
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
```
Nesterov Momentum: slightly better in practice than the momentum update. The idea is to update the parameter using the gradient evaluated at the predicted (looked-ahead) position.
```python
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
```
In practice, people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible, by writing it in terms of x_ahead instead of x. This gives:
```python
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
```
Annealing the learning rate: the intuition is simple: take large steps at first, then smaller ones.
Three common ways of implementing learning rate decay:
- Step decay. Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
- Exponential decay. has the mathematical form \(\alpha = \alpha_0 e^{-k t}\), where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number (but you can also use units of epochs).
- 1/t decay has the mathematical form \(\alpha = \alpha_0 / (1 + k t )\) where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number.
In practice, we find that step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter \(k\).
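A minimal sketch of the three schedules, with \(\alpha_0\) written as lr0 (the decay constants below are illustrative, not recommendations):

```python
import numpy as np

lr0 = 1e-2  # initial learning rate (illustrative)

def step_decay(epoch, drop=0.5, epochs_per_drop=5):
    # reduce the learning rate by a factor `drop` every `epochs_per_drop` epochs
    return lr0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(t, k=1e-3):
    # alpha = alpha_0 * exp(-k t)
    return lr0 * np.exp(-k * t)

def one_over_t_decay(t, k=1e-3):
    # alpha = alpha_0 / (1 + k t)
    return lr0 / (1.0 + k * t)

print(step_decay(12), exponential_decay(1000), one_over_t_decay(1000))
```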
Second order methods
Newton or L-BFGS
In practice, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily.
Per-parameter adaptive learning rates (Adagrad, RMSProp)
Adagrad (see the original notes):
```python
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
```
RMSprop (see the Hinton slides):
```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
```
Hyperparameter optimization (see the original notes)
- the initial learning rate
- learning rate decay schedule (such as the decay constant)
- regularization strength (L2 penalty, dropout strength)
strategies:
- Prefer one validation fold to cross-validation.
- Hyperparameter ranges: search for hyperparameters on a log scale. For example, a typical sampling of the learning rate would look as follows: learning_rate = 10 ** uniform(-6, 1). That is, we sample an exponent uniformly and then raise 10 to that power. The same strategy should be used for the regularization strength. As mentioned, for some parameters such as momentum, it is more common to search over a fixed set of values such as [0.5, 0.9, 0.95, 0.99] (see the sketch after this list).
- Prefer random search to grid search
- Careful with best values on border: if the best value found lies on the border of the search range, you probably need to widen the range.
- Stage your search from coarse to fine.
- Bayesian Hyperparameter Optimization? (It does not yet seem to be commonly used for ConvNets.)
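A minimal sketch of random search with log-scale sampling; the ranges and the commented-out train_and_eval call are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def sample_hyperparams():
    learning_rate = 10 ** np.random.uniform(-6, 1)        # log-scale sampling
    reg = 10 ** np.random.uniform(-5, 5)                   # log-scale for L2 strength too
    momentum = np.random.choice([0.5, 0.9, 0.95, 0.99])    # fixed candidate set
    return learning_rate, reg, momentum

for trial in range(5):
    lr, reg, mu = sample_hyperparams()
    print('trial %d: lr %.2e, reg %.2e, momentum %.2f' % (trial, lr, reg, mu))
    # val_acc = train_and_eval(lr, reg, mu)  # hypothetical training/evaluation call
```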
Evaluation
Model ensembles give better performance.
- Same model, different initializations
- Top models discovered during cross-validation.
- Different checkpoints of a single model.
One disadvantage of model ensembles is that they take longer to evaluate on a test example. An interested reader may find the recent work from Geoff Hinton on “Dark Knowledge” inspiring, where the idea is to “distill” a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
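A minimal sketch of the averaging step at test time, assuming each ensemble member exposes a predict_proba(X) method returning class probabilities (an illustrative interface, not a specific library's API):

```python
import numpy as np

def ensemble_predict(models, X):
    # average the predicted class probabilities over all ensemble members,
    # then pick the most likely class for each example
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```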