Notes on the paper “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”


Donahue, Jeffrey, et al. “Long-term recurrent convolutional networks for visual recognition and description.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

  • Basics
  • Main methods
    • Deep hierarchical visual feature extractor
      • Video Classification
        • Sequential input, fixed output.
        • Late fusion merges the per-timestep predictions into a single video-level prediction. The CNN part can be initialized from an ImageNet pre-trained model. Both RGB and optical flow inputs are considered. Flow is computed and transformed into a “flow image” by centering the x and y flow values around 128 and multiplying by a scalar such that the flow values fall between 0 and 255. A third channel for the flow image is created by calculating the flow magnitude.
      • Image Captioning
        • Fixed input, sequential output.
        • At each timestep, both the image features and the previous word are provided as inputs to the LSTM.
      • Video Captioning
        • Sequential input, sequential output.
        • This is the encoder-decoder style. For the first T timesteps the input is processed and the decoder outputs are ignored; for the following T′ timesteps the predictions are made and the “dummy” inputs are ignored.
  • Take home messages
  • Other me
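
The video classification notes above describe two small computations: turning optical flow into a “flow image” and late-fusing per-timestep predictions. Below is a minimal sketch of both in plain Python, not the authors' code. The function names, the `scale` value, and the scaling of the magnitude channel are assumptions for illustration (the notes only say that a scalar is used), and late fusion is assumed here to be a simple average over timesteps.

```python
import math


def flow_to_pixel(dx, dy, scale=16.0):
    """Map one optical-flow vector (dx, dy) to a 3-channel flow-image pixel.

    `scale` is a hypothetical value: the notes only say that flow is
    multiplied by a scalar so that values fall between 0 and 255.
    """
    def to_byte(v):
        # Round to the nearest integer, then clamp to the 8-bit range.
        return max(0, min(255, int(round(v))))

    x = to_byte(128 + scale * dx)  # x flow centered around 128
    y = to_byte(128 + scale * dy)  # y flow centered around 128
    # Third channel: flow magnitude (reusing `scale` here is an assumption).
    mag = to_byte(scale * math.hypot(dx, dy))
    return (x, y, mag)


def late_fusion(per_step_probs):
    """Average per-timestep class distributions into one video-level distribution.

    per_step_probs: list of T lists, each holding one probability per class.
    """
    num_steps = len(per_step_probs)
    num_classes = len(per_step_probs[0])
    return [
        sum(step[c] for step in per_step_probs) / num_steps
        for c in range(num_classes)
    ]


def predict_label(per_step_probs):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(per_step_probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, `flow_to_pixel(0.0, 0.0)` gives the neutral pixel `(128, 128, 0)`, and `predict_label([[0.5, 0.5], [0.25, 0.75]])` picks class 1. In practice the per-pixel function would be vectorized over the whole flow field with an array library.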
