[cs231n] Neural Networks Part 3: Learning and Evaluation
- 1. Gradient Check
- 2. Before learning: sanity checks Tips/Tricks
- 3. Babysitting the learning process
- 4. Parameter updates
Lecture Note
Hinton Note on the same topic
Reference
Read
- SGD tips and tricks from Leon Bottou
- Efficient BackProp (pdf) from Yann LeCun
- Practical Recommendations for Gradient-Based Training of Deep Architectures from Yoshua Bengio
About Nesterov’s Accelerated Gradient (NAG)
Advances in optimizing Recurrent Networks
Ilya Sutskever’s thesis
L-BFGS VS SGD
- On Optimization Methods for Deep Learning from Le et al. is a paper from 2011 comparing SGD vs. L-BFGS. Some of its conclusions have since been challenged.
- Large Scale Distributed Deep Networks is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
- The SFO algorithm strives to combine the advantages of SGD with the advantages of L-BFGS.
Gradient Check
Use the centered formula:
$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}
$$
$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}
$$
Use the relative error for the comparison:
$$
\frac{\mid f'_a - f'_n \mid}{\mid f'_a \mid + \mid f'_n \mid}
$$
where \(f'_a\) is the analytic gradient, \(f'_n\) is the numerical gradient, and \(h\) in the centered formula above is a very small number, in practice approximately 1e-5 or so.
Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to add both to make it symmetric and to prevent underflow in the case where one of the two is zero. However, one must explicitly keep track of the case where both are zero (this can often be the case with ReLUs for example) and pass the gradient check in that edge case. In practice:
- relative error > 1e-2 usually means the gradient is probably wrong
- 1e-2 > relative error > 1e-4 should make you feel uncomfortable
- 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
- 1e-7 and less you should be happy.
- the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10 layer network, a relative error of 1e-2 might be okay because the errors build up on the way.
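Putting the centered formula and the relative error together, here is a minimal sketch on a toy scalar function (the function `f` and check point `x` are purely illustrative):

```python
def rel_error(fa, fn):
    # symmetric relative error; both terms in the denominator to prevent
    # underflow when one of the two is zero
    denom = abs(fa) + abs(fn)
    return 0.0 if denom == 0 else abs(fa - fn) / denom  # both zero -> treat as passing

def numerical_gradient(f, x, h=1e-5):
    # centered difference formula: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2.0 * h)

# toy check: f(x) = x**2 with analytic gradient 2x, evaluated at x = 3.0
f = lambda x: x ** 2
x = 3.0
print(rel_error(2 * x, numerical_gradient(f, x)))  # expect well below 1e-7
```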
Kinks in the objective:
Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), the SVM loss, or Maxout neurons. If the finite-difference evaluation crosses a kink, the numerical gradient will not be exact, and in a large network this is not a rare corner case since such max terms are everywhere.
Be careful with the step size h: smaller is not always better, since a very small h runs into numerical precision problems. If the gradient check fails, try changing h to 1e-4 or 1e-6.
Gradcheck during a “characteristic” mode of operation: check multiple points instead of just one. It is sometimes better to use a short burn-in time, during which the network is allowed to learn, and perform the gradient check after the loss starts to go down.
Don’t let the regularization overwhelm the data: one danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently.
Remember to turn off dropout/augmentations: when performing the gradient check, turn off any non-deterministic effects in the network, such as dropout and random data augmentations.
Before learning: sanity checks Tips/Tricks
- Look for correct loss at chance performance: at initialization the weights are small random values, so the loss should match chance performance. Compute the initial loss and check it against the expected value; for example, a Softmax classifier over 10 classes should start near \(-\ln(0.1) \approx 2.302\) (see the sketch after this list).
- Overfit a tiny subset of data: before training on the full dataset, try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost (turn off regularization).
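A minimal sketch of the chance-loss check for a 10-class softmax classifier (the random scores and labels below are illustrative placeholders for the output of an untrained network):

```python
import numpy as np

num_classes, batch_size = 10, 100
# small random weights give near-uniform scores, so every class gets
# probability ~1/num_classes and the softmax loss is about -ln(1/10) ~= 2.302
scores = 0.001 * np.random.randn(batch_size, num_classes)
y = np.random.randint(num_classes, size=batch_size)  # random labels, just for the check

shifted = scores - scores.max(axis=1, keepdims=True)                   # numeric stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)   # softmax
loss = -np.log(probs[np.arange(batch_size), y]).mean()                 # data loss only

print(loss, -np.log(1.0 / num_classes))  # both should be close to 2.302
```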
Babysitting the learning process
Monitor useful quantities during training of a neural network.
Epochs: The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size.
Loss Functions
The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. The shape of the loss curve over time can tell you a lot, especially about the learning rate (see the cartoon diagram in the original notes).
Train/Val accuracy
The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
Ratio of weights:updates
Track the ratio of the update magnitudes to the parameter value magnitudes (the updates, not the raw gradients; e.g. in vanilla SGD this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this, the learning rate might be too low; if it is higher, the learning rate is likely too high.
Alternatively, you can track the norm or the (min, max) of W; for details, see the original notes.
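A minimal sketch of this check for a vanilla SGD update; W and dW stand for a parameter array and its gradient, and the random arrays below are only placeholders so the snippet runs on its own:

```python
import numpy as np

learning_rate = 1e-3
W = 0.01 * np.random.randn(100, 10)   # placeholder parameters
dW = 0.01 * np.random.randn(100, 10)  # placeholder gradient from backprop

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW           # vanilla SGD update
update_scale = np.linalg.norm(update.ravel())
W += update                            # the actual parameter update
print(update_scale / param_scale)      # rough heuristic: should be around 1e-3
```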
Activation / Gradient distributions per layer
Keep track of the variance of the activations and their gradients in each layer. With a poor initialization the activations can vanish extremely quickly in the higher layers of the network, which in turn leads to weight gradients near zero, since during backprop they depend multiplicatively on the activations. With correctly normalized weights, the variances look much more uniform across layers.
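A minimal sketch of this kind of monitoring, assuming each layer's activations have been collected into a dict during a forward pass (the layer names and random arrays below are illustrative):

```python
import numpy as np

# illustrative per-layer activations; in practice these come from a forward pass
activations = {
    'layer1': np.random.randn(128, 512) * 1.0,
    'layer2': np.random.randn(128, 512) * 0.5,
    'layer3': np.random.randn(128, 512) * 0.1,
}

for name in sorted(activations):
    a = activations[name]
    # watch for means drifting and variances collapsing toward zero with depth
    print('%s: mean %+.4f, std %.4f' % (name, a.mean(), a.std()))
```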
First-layer Visualizations
Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
Parameter updates
SGD and bells and whistles
Vanilla update: fixed learning rate.
```python
# Vanilla update
x += - learning_rate * dx
```
Momentum update: almost always enjoys better convergence rates than the vanilla update on deep networks. Momentum simply adds a fraction m of the previous weight update to the current one. It is often necessary to reduce the global learning rate µ when using a lot of momentum (m close to 1): if you combine a high learning rate with a lot of momentum, you will rush past the minimum with huge steps!
When the gradient keeps changing direction, momentum smooths out the variations. This is particularly useful when the network is not well-conditioned. (When the gradient direction keeps flipping and the updates zigzag, adding momentum speeds up convergence.)
Here v is a variable that is initialized at zero, and mu is an additional hyperparameter; when cross-validated, it is usually set to values such as [0.5, 0.9, 0.95, 0.99].
```python
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
```
Nesterov Momentum: slightly better in practice than the momentum update. The idea is to update the parameter using the gradient evaluated at the predicted (looked-ahead) position.
```python
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
```
In practice, people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible, by writing it in terms of x_ahead instead of x. This gives:
```python
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
```
Annealing the learning rate: the intuition is simple: take large steps at first, then smaller ones.
Three common ways of implementing learning rate decay:
- Step decay. Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
- Exponential decay. has the mathematical form \(\alpha = \alpha_0 e^{-k t}\), where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number (but you can also use units of epochs).
- 1/t decay has the mathematical form \(\alpha = \alpha_0 / (1 + k t )\) where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number.
In practice, we find that step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter \(k\).
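A minimal sketch of the three schedules, with \(\alpha_0\) written as lr0 (the decay constants below are illustrative, not recommendations):

```python
import numpy as np

lr0 = 1e-2  # initial learning rate (illustrative)

def step_decay(epoch, drop=0.5, epochs_per_drop=5):
    # reduce the learning rate by a factor `drop` every `epochs_per_drop` epochs
    return lr0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(t, k=1e-3):
    # alpha = alpha_0 * exp(-k t)
    return lr0 * np.exp(-k * t)

def one_over_t_decay(t, k=1e-3):
    # alpha = alpha_0 / (1 + k t)
    return lr0 / (1.0 + k * t)

print(step_decay(12), exponential_decay(1000), one_over_t_decay(1000))
```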
Second order methods
Newton or L-BFGS
In practice, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily.
Per-parameter adaptive learning rates (Adagrad, RMSProp)
Adagrad (see the original notes):
```python
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
```
RMSprop (see the Hinton slides):
```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
```
Hyperparameter optimization (see the original notes)
- the initial learning rate
- learning rate decay schedule (such as the decay constant)
- regularization strength (L2 penalty, dropout strength)
strategies:
- Prefer one validation fold to cross-validation.
- Hyperparameter ranges: search for hyperparameters on a log scale. For example, a typical sampling of the learning rate would look as follows: learning_rate = 10 ** uniform(-6, 1). That is, we sample an exponent uniformly and then raise 10 to that power. The same strategy should be used for the regularization strength. As mentioned, for some parameters such as momentum, it is more common to search over a fixed set of values such as [0.5, 0.9, 0.95, 0.99] (see the sketch after this list).
- Prefer random search to grid search
- Careful with best values on border: if the best value found lies on the border of the search range, you probably need to widen the range.
- Stage your search from coarse to fine.
- Bayesian Hyperparameter Optimization? (It does not yet seem to be commonly used for ConvNets.)
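A minimal sketch of random search with log-scale sampling; the ranges and the commented-out train_and_eval call are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def sample_hyperparams():
    learning_rate = 10 ** np.random.uniform(-6, 1)        # log-scale sampling
    reg = 10 ** np.random.uniform(-5, 5)                   # log-scale for L2 strength too
    momentum = np.random.choice([0.5, 0.9, 0.95, 0.99])    # fixed candidate set
    return learning_rate, reg, momentum

for trial in range(5):
    lr, reg, mu = sample_hyperparams()
    print('trial %d: lr %.2e, reg %.2e, momentum %.2f' % (trial, lr, reg, mu))
    # val_acc = train_and_eval(lr, reg, mu)  # hypothetical training/evaluation call
```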
Evaluation
Model ensembles give better performance.
- Same model, different initializations
- Top models discovered during cross-validation.
- Different checkpoints of a single model.
One disadvantage of model ensembles is that they take longer to evaluate on a test example. An interested reader may find the recent work from Geoff Hinton on “Dark Knowledge” inspiring, where the idea is to “distill” a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
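A minimal sketch of the averaging step at test time, assuming each ensemble member exposes a predict_proba(X) method returning class probabilities (an illustrative interface, not a specific library's API):

```python
import numpy as np

def ensemble_predict(models, X):
    # average the predicted class probabilities over all ensemble members,
    # then pick the most likely class for each example
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```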