Contents
  1. Localization: Model must output:
  2. Detection
  3. Segmentation
  4. Video Classification
  5. Image Captioning

slides

Localization: Model must output:

  • class (integer)
  • x1,y1,x2,y2 bounding box coordinates

Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., 2014
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Idea: train a localization net. Take out the Softmax loss, swap in an L2 (regression) loss, and fine-tune the classification network.
Predictions: instead of class scores, the 4D output vector of the net is now interpreted as the 4 bounding box coordinates.
In practice:

  • It works better to predict a 4D vector for every class (e.g. a 4000D vector for the 1000 ImageNet classes). During
    training, backpropagate the loss only for the correct class.
  • Apply the network at multiple locations and scales.
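
The per-class regression idea above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the slides or the cited papers: the function name `per_class_l2_loss`, the flat layout of the prediction vector (4 consecutive entries per class), and the unnormalized sum-of-squares form of the L2 loss are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def per_class_l2_loss(
    pred: Sequence[float],
    target_box: Sequence[float],
    true_class: int,
    num_classes: int,
) -> Tuple[float, List[float]]:
    """L2 box-regression loss restricted to the correct class.

    pred is assumed to be a flat vector of length 4 * num_classes, where
    entries [4*c, 4*c + 4) hold the (x1, y1, x2, y2) prediction for class c.
    Returns the loss and its gradient with respect to pred.
    """
    assert len(pred) == 4 * num_classes
    assert len(target_box) == 4
    assert 0 <= true_class < num_classes

    start = 4 * true_class
    grad = [0.0] * len(pred)  # stays zero for every class except the true one
    loss = 0.0
    for i in range(4):
        diff = pred[start + i] - target_box[i]
        loss += diff * diff
        grad[start + i] = 2.0 * diff  # derivative of diff**2
    return loss, grad


# Toy example with 3 classes (12D output); the true class is 1.
pred = [0.5] * 12
pred[4:8] = [0.1, 0.2, 0.6, 0.9]
loss, grad = per_class_l2_loss(pred, [0.1, 0.2, 0.5, 0.7], true_class=1, num_classes=3)
```

Only the four entries belonging to the correct class receive a nonzero gradient, which is what "only backprop the loss for the correct class" amounts to in this sketch.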

Detection

Model must output a set of detections:
Each detection has:

  • confidence
  • class (integer)
  • x1,y1,x2,y2 bounding box coordinates
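
As a rough illustration of this output format, a single detection can be represented as a small record and the model output as a list of such records. The names `Detection` and `keep_confident`, and the idea of filtering by a confidence threshold, are assumptions for the example and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Detection:
    confidence: float                        # detector score for this box
    class_id: int                            # integer class label
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def keep_confident(detections: Iterable[Detection], threshold: float) -> List[Detection]:
    """Keep detections scoring at least `threshold`, most confident first."""
    kept = [d for d in detections if d.confidence >= threshold]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# A model's output for one image is a variable-length set of detections.
outputs = [
    Detection(0.30, 7, (10.0, 20.0, 50.0, 80.0)),
    Detection(0.95, 3, (100.0, 40.0, 180.0, 160.0)),
    Detection(0.60, 3, (15.0, 25.0, 55.0, 85.0)),
]
confident = keep_confident(outputs, threshold=0.5)
```

Unlike localization, the number of records is not fixed in advance, so the output is modeled as a list rather than a fixed-size vector.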

Segmentation

Fully Convolutional Networks for Semantic Segmentation [Long, Shelhamer, Darrell]
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network [Eigen et al.], 2014

Video Classification

Two-Stream Convolutional Networks for Action Recognition in Videos [Simonyan et al.], 2014
Long-term Recurrent Convolutional Networks for Visual Recognition and Description [Donahue et al.], 2014
Large-scale Video Classification with Convolutional Neural Networks [Karpathy et al.], 2014

Image Captioning

Generating Sequences With Recurrent Neural Networks [Alex Graves, 2014]
Recurrent Neural Network Based Language Model [Tomas Mikolov, 2010]
Sequence to Sequence Learning with Neural Networks [Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014]
Image Sentence Datasets:
Microsoft COCO [Tsung-Yi Lin et al., 2014], mscoco.org
Currently ~120K images with ~5 sentences each.
