[cs231n] Beyond Image Classification

Contents

1. Localization
2. Detection
3. Segmentation
4. Video Classification
5. Image Captioning

Localization

Model must output:

class (integer)
x1,y1,x2,y2 bounding box coordinates

Very Deep Convolutional Networks for Large-Scale Image Recognition [Simonyan et al.], 2014
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks [Sermanet et al.]

Idea: train a localization net. Start from the classification network, take out the Softmax loss, swap in an L2 (regression) loss, and fine-tune.
Predictions: instead of class scores, the network's output is a 4D vector, interpreted as the 4 bounding box coordinates.
In practice:

It works better to predict a 4D vector for every class (e.g. a 4000D vector for the 1000 ImageNet classes). During training, only backprop the loss for the correct class.
Apply the network at multiple locations and scales.
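
The per-class regression trick can be sketched in a few lines. This is a minimal illustration, not the implementation from any of the papers above: the function name `per_class_box_l2_loss`, the flat output layout (4 consecutive values per class), and the unnormalized sum-of-squares form of the L2 loss are all assumptions made for the example.

```python
def per_class_box_l2_loss(pred, true_class, true_box, num_classes):
    """L2 box regression loss where only the correct class's box is penalized.

    pred       : flat list of 4 * num_classes floats, one (x1, y1, x2, y2)
                 box per class (assumed layout for this sketch)
    true_class : integer index of the ground-truth class
    true_box   : ground-truth (x1, y1, x2, y2)
    Returns (loss, grad) where grad has the same length as pred.
    """
    if len(pred) != 4 * num_classes:
        raise ValueError("pred must hold 4 values per class")
    if not 0 <= true_class < num_classes:
        raise ValueError("true_class out of range")

    start = 4 * true_class          # offset of the correct class's box
    grad = [0.0] * len(pred)        # boxes of all other classes get zero gradient
    loss = 0.0
    for i in range(4):
        diff = pred[start + i] - true_box[i]
        loss += diff * diff         # squared error on one coordinate
        grad[start + i] = 2.0 * diff
    return loss, grad


# Toy example with 3 classes instead of 1000.
pred = [0.0] * 12
pred[4:8] = [1.0, 1.0, 3.0, 3.0]    # predicted box for class 1
loss, grad = per_class_box_l2_loss(pred, 1, [0.0, 0.0, 2.0, 2.0], 3)
print(loss)                         # 4.0
```

Because the gradient is zero everywhere except the four entries of the correct class, the boxes predicted for the other classes are left untouched by that training example.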

Detection

Model must output a set of detections:
Each detection has:

confidence
class (integer)
x1,y1,x2,y2 bounding box coordinates
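
To make the output format concrete, here is a small sketch of how one detection could be represented. The `Detection` class, its field names, and the `keep_confident` helper with its threshold are hypothetical names chosen for illustration; the notes only specify that each detection carries a confidence, a class, and a box.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    confidence: float                        # detector's score for this box
    class_id: int                            # class (integer)
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def keep_confident(detections: List[Detection], threshold: float) -> List[Detection]:
    """Drop detections below `threshold` and sort the rest by confidence.

    The threshold is an assumption of this sketch, not part of the task
    definition: the model's output is simply the full set of detections.
    """
    kept = [d for d in detections if d.confidence >= threshold]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# Unlike localization, the number of detections per image is not fixed.
detections = [
    Detection(0.30, 7, (10.0, 10.0, 50.0, 60.0)),
    Detection(0.95, 3, (20.0, 15.0, 80.0, 90.0)),
    Detection(0.70, 3, (100.0, 40.0, 160.0, 120.0)),
]
for d in keep_confident(detections, 0.5):
    print(d.class_id, d.confidence, d.box)
```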

Segmentation

Fully Convolutional Networks for Semantic Segmentation [Long, Shelhamer, Darrell]
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network [Eigen et al.], 2014

Video Classification

Two-Stream Convolutional Networks for Action Recognition in Videos [Simonyan et al.], 2014
Long-term Recurrent Convolutional Networks for Visual Recognition and Description [Donahue et al.], 2014
Large-scale Video Classification with Convolutional Neural Networks [Karpathy et al.], 2014

Image Captioning

Generating Sequences With Recurrent Neural Networks [Alex Graves, 2014]
Recurrent Neural Network Based Language Model [Tomas Mikolov, 2010]
Sequence to Sequence Learning with Neural Networks [Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014]
Image Sentence Datasets:
Microsoft COCO [Tsung-Yi Lin et al. 2014] mscoco.org
Currently:
~120K images
~5 sentences each
