[cs231n] Convolutional Neural Networks: Architectures, Convolution / Pooling Layers
Lecture Note
References:
- CNN in Matlab
- DeepLearning.net tutorial walks through an implementation of a ConvNet in Theano
- cuda-convnet2 by Alex Krizhevsky is a ConvNet implementation that supports multiple GPUs
- ConvNetJS CIFAR-10 demo allows you to play with ConvNet architectures and see the results and computations in real time, in the browser.
- Caffe, one of the most popular ConvNet libraries.
- Example Torch 7 ConvNet that achieves 7% error on CIFAR-10 with a single model
- Ben Graham’s Sparse ConvNet package, which he used to great success to achieve less than 4% error on CIFAR-10.
Every word counts in this lecture; I should really copy the whole lecture here. Whenever you have any doubt about CNNs, go back to the original notes. Here I am just writing down some points for memorization.
Convolutional Layer
Overview: The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume (each CNN filter patch is local along width and height, but in depth it always spans the entire depth of the input volume!!!).
Use of zero-padding. In general, setting zero padding to be \(P = (F - 1)/2\) when the stride is \(S = 1\) ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way.
In numpy the code looks like this:
import numpy as np
# zero-pad only the two spatial dimensions of x with shape (N, C, H, W)
padded_x = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant', constant_values=0)
Constraints on strides. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size \(W = 10\), no zero-padding is used \(P = 0\), and the filter size is \(F = 3\), then it would be impossible to use stride \(S = 2\), since \((W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5\), i.e. not an integer, indicating that the neurons don’t “fit” neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library would likely throw an exception.
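As a quick sanity check of this constraint, here is a small helper of my own (not from the lecture) that computes the output size and complains when the neurons don’t fit:

def conv_output_size(W, F, S, P):
    # spatial output size of a conv layer; raise if the filters do not tile the input cleanly
    out = (W - F + 2 * P) / S + 1
    if out != int(out):
        raise ValueError("hyperparameters do not fit neatly across the input")
    return int(out)

# conv_output_size(10, 3, 2, 0) raises, matching the example above
print(conv_output_size(227, 11, 4, 0))  # 55, the AlexNet first-layer case used later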
Parameter Sharing: the CNN assumption is that an object can occur anywhere in the image (the general image setting). However, when the input is known to be centered in the image (e.g. faces), you might expect that different eye-specific or hair-specific features could (and should) be learned at different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.
Implementation as Matrix Multiplication, using im2col (or, alternatively, an FFT for efficiency).
e.g.
The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process over the input at a stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
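A toy numpy sketch of this scheme (my own unoptimized illustration; only the names X_col and W_row follow the note):

import numpy as np

def conv_forward_im2col(x, w, stride):
    # x: input volume (C, H, W); w: filters (K, C, F, F); no padding, for brevity
    C, H, W = x.shape
    K, _, F, _ = w.shape
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    cols = []
    for i in range(out_h):          # im2col: stretch every receptive field into a column
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+F, j*stride:j*stride+F]
            cols.append(patch.reshape(-1))
    X_col = np.stack(cols, axis=1)  # shape (C*F*F, out_h*out_w), e.g. [363 x 3025]
    W_row = w.reshape(K, -1)        # every filter stretched into a row, e.g. [96 x 363]
    out = W_row @ X_col             # the convolution is now one big matrix multiply
    return out.reshape(K, out_h, out_w)

x = np.random.randn(3, 227, 227)
w = np.random.randn(96, 3, 11, 11)
print(conv_forward_im2col(x, w, stride=4).shape)  # (96, 55, 55)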
Backpropagation. The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters).
Summary. To summarize, the Conv Layer:
- Accepts a volume of size \(W_1 \times H_1 \times D_1\)
- Requires four hyperparameters:
- Number of filters \(K\),
- their spatial extent \(F\),
- the stride \(S\),
- the amount of zero padding \(P\).
- Produces a volume of size \(W_2 \times H_2 \times D_2\) where:
- \(W_2 = (W_1 - F + 2P)/S + 1\)
- \(H_2 = (H_1 - F + 2P)/S + 1\) (i.e. width and height are computed equally by symmetry)
- \(D_2 = K\)
- With parameter sharing, it introduces \(F \cdot F \cdot D_1\) weights per filter, for a total of \((F \cdot F \cdot D_1) \cdot K\) weights and \(K\) biases.
- In the output volume, the \(d\)-th depth slice (of size \(W_2 \times H_2\)) is the result of performing a valid convolution of the \(d\)-th filter over the input volume with a stride of \(S\), and then offset by the \(d\)-th bias.
A common setting of the hyperparameters is \(F = 3, S = 1, P = 1\). However, there are common conventions and rules of thumb that motivate these hyperparameters; see the ConvNet Architectures section below.
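Plugging the AlexNet first-layer numbers from the im2col example above into these formulas (a worked check, nothing more):

W1, H1, D1 = 227, 227, 3            # input volume
K, F, S, P = 96, 11, 4, 0           # number of filters, spatial extent, stride, zero padding
W2 = (W1 - F + 2 * P) // S + 1      # 55
H2 = (H1 - F + 2 * P) // S + 1      # 55
D2 = K                              # 96
print(W2, H2, D2, F * F * D1 * K + K)   # 55 55 96 34944 (weights plus biases)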
Pooling Layer
Overview:
The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged.
- Accepts a volume of size \(W_1 \times H_1 \times D_1\)
- Requires two hyperparameters:
- their spatial extent \(F\),
- the stride \(S\),
- Produces a volume of size \(W_2 \times H_2 \times D_2\) where:
- \(W_2 = (W_1 - F)/S + 1\)
- \(H_2 = (H_1 - F)/S + 1\)
- \(D_2 = D_1\)
- Introduces zero parameters since it computes a fixed function of the input
- Note that it is not common to use zero-padding for Pooling layers
It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with \(F = 3, S = 2\) (also called overlapping pooling), and more commonly \(F = 2, S = 2\). Pooling sizes with larger receptive fields are too destructive.
Backpropagation. Recall that the backward pass for a max(x, y) operation simply routes the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
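A toy numpy sketch of 2x2, stride-2 max pooling that records the switches and routes the gradient back through them (my own illustration, not the lecture’s code):

import numpy as np

def maxpool_forward(x, F=2, S=2):
    # x: (D, H, W). Returns the pooled output plus the argmax "switches" for backprop.
    D, H, W = x.shape
    out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((D, out_h, out_w))
    switches = np.zeros((D, out_h, out_w), dtype=int)  # flat index of the max in each window
    for d in range(D):
        for i in range(out_h):
            for j in range(out_w):
                window = x[d, i*S:i*S+F, j*S:j*S+F]
                switches[d, i, j] = np.argmax(window)
                out[d, i, j] = window.max()
    return out, switches

def maxpool_backward(dout, switches, x_shape, F=2, S=2):
    # route each upstream gradient only to the position that was the max in the forward pass
    dx = np.zeros(x_shape)
    D, out_h, out_w = dout.shape
    for d in range(D):
        for i in range(out_h):
            for j in range(out_w):
                r, c = np.unravel_index(switches[d, i, j], (F, F))
                dx[d, i*S + r, j*S + c] += dout[d, i, j]
    return dx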
Normalization Layer
Used in AlexNet, but not used anymore: in practice its contribution has been shown to be minimal.
Fully-connected layer
Converting FC layers to CONV layers is used a lot in papers on localization and segmentation. Please see the original lecture for details!
ConvNet Architectures
ConvNets are commonly made up of only three layer types: CONV, POOL (we assume max pooling unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies an elementwise non-linearity.
The most common ConvNet architecture follows the pattern:
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
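As a rough illustration of how the pattern expands, a small helper of my own (purely schematic; the layers are just names here):

def build_pattern(N, M, K):
    # expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC into a layer list
    layers = ['INPUT']
    for _ in range(M):
        layers += ['CONV', 'RELU'] * N
        layers.append('POOL')       # the '?' means POOL is optional; included in this sketch
    layers += ['FC', 'RELU'] * K
    layers.append('FC')             # final FC computes the class scores
    return layers

print(build_pattern(N=2, M=3, K=1))
# e.g. [CONV-RELU-CONV-RELU-POOL] x 3 followed by FC-RELU-FC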
Prefer a stack of small filter CONV to one large receptive field CONV layer.
Stacking CONV layers with tiny filters, as opposed to having one CONV layer with large filters, allows us to express more powerful features of the input with fewer parameters (a worked comparison follows below; see the original lecture for the full argument). As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
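For example, comparing three stacked 3x3 CONV layers against a single 7x7 CONV layer with the same effective receptive field (all volumes assumed to have C channels, biases ignored; the C = 64 value is just an example):

C = 64                               # assume every volume has C channels
one_7x7 = C * (7 * 7 * C)            # single 7x7 CONV layer: 49 C^2 = 200704 parameters
three_3x3 = 3 * (C * (3 * 3 * C))    # three stacked 3x3 CONV layers: 27 C^2 = 110592 parameters
print(one_7x7, three_3x3)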
In this scheme, all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially.
input layer (that contains the image): its size should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), 224 (e.g. common ImageNet ConvNets), 384, and 512. If it is not, warp the image to the desired size.
conv layers: should use small filters (e.g. 3x3 or at most 5x5), a stride of \(S = 1\), and, crucially, pad the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. That is, when \(F = 3\), using \(P = 1\) retains the original size of the input; when \(F = 5\), use \(P = 2\). For a general \(F\), it can be seen that \(P = (F - 1)/2\) preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that looks at the input image.
pool layers: it is very uncommon to see receptive field sizes for max pooling that are larger than 3, because the pooling then becomes too lossy and aggressive. This usually leads to worse performance.
Compromising based on memory constraints. Memory builds up very quickly; one compromise might be to use a first CONV layer with filter sizes of 7x7 and a stride of 2 (as seen in a ZF net). As another example, AlexNet uses filter sizes of 11x11 and a stride of 4.
Why use padding? In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.
Why use a stride of 1 in CONV? Stride 1 with \(P = (F - 1)/2\) preserves the input size; this leaves all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.
Computational Considerations
The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:
- From the intermediate volume sizes: These are the raw number of activations at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
- From the parameter sizes: These are the numbers that hold the network parameters, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so.
- Every ConvNet implementation has to maintain miscellaneous memory, such as the image data batches, perhaps their augmented versions, etc.
Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn’t fit, a common heuristic to “make it fit” is to decrease the batch size, since most of the memory is usually consumed by the activations.
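A back-of-the-envelope helper along these lines (my own sketch; the 4 bytes per float, the doubling for activation gradients, and the roughly x3 factor for parameters all follow the text above):

def estimate_memory_gb(activation_values, parameter_values, misc_values=0, bytes_per_value=4):
    # activations plus their gradients (x2), parameters plus gradients and step cache (x3)
    total_values = 2 * activation_values + 3 * parameter_values + misc_values
    return total_values * bytes_per_value / (1024 ** 3)

# purely illustrative numbers, not taken from any particular network
print(round(estimate_memory_gb(activation_values=20_000_000, parameter_values=100_000_000), 2))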