[cs231n]Beyond Image Classification
Localization: Model must output:
- class (integer)
- x1,y1,x2,y2 bounding box coordinate
Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., 2014
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
Idea: train a Localization net. Take out Softmax loss, swap in L2 (regression) loss, fine-tune the classification network
predictions: instead of class scores, now interpreted as the 4 bounding box coords also 4D vector from net.
In practice:
- It works better to predict a 4D vector for every class (e.g. 4000D vector for 1000 ImageNet classes). During
training only backprop the loss for the correct class - apply at multiple locations and scales
Detection
Model must output a set of detections:
Each detection has:
- confidence
- class (integer)
- x1,y1,x2,y2 bounding box coordinates
Segmentation
Fully Convolutional Networks for Semantic Segmentation Long, Shelhamer, Darrell
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network [Eigen et al.], 2014
Video Classification
Two-Stream Convolutional Networks for Action Recognition in Videos [Simonyan et al.], 2014
Long-term Recurrent Convolutional Networks for Visual Recognition and Description [Donahue et al.], 2014
Large-scale Video Classification with Convolutional Neural Networks [Karpathy et al.], 2014
Image Captioning
Generating Sequences With Recurrent Neural Networks[Alex Graves, 2014]
Recurrent Neural Network Based Language Model [Tomas Mikolov, 2010]
Sequence to Sequence Learning with Neural Networks [Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014]
Image Sentence Datasets:
Microsoft COCO[Tsung-Yi Lin et al. 2014] mscoco.org
currently:
~120K images
~5 sentences each