Paper:
Donahue, Jeffrey, et al. “Long-term recurrent convolutional networks for visual recognition and description.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
- Basics
- Main methods
- Deep hierarchical visual feature extractor
- Video Classification
- Sequential input, fixed output.
- We use late fusion to merge per-timestep predictions. The CNN part can be initialized from an ImageNet pre-trained model. We also consider both RGB and optical flow inputs. Flow is computed and transformed into a “flow image” by multiplying the x and y flow values by a scalar and centering them around 128, so that the values fall between 0 and 255. A third channel for the flow image is created by calculating the flow magnitude.
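The two steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `flow_to_image` and `late_fusion`, the default `scale` value, the clipping to [0, 255], the scaling of the magnitude channel, and the use of a plain average for fusion are not given in these notes.

```python
import math


def flow_to_image(flow_x, flow_y, scale=16.0):
    """Turn per-pixel x/y flow (2-D lists of floats) into a 3-channel flow image.

    `scale` is a hypothetical value; the notes only say "a scalar".
    """
    def to_byte(value):
        # Assumption: values that still fall outside [0, 255] are clipped.
        return int(min(255, max(0, round(value))))

    image = []
    for row_x, row_y in zip(flow_x, flow_y):
        row = []
        for fx, fy in zip(row_x, row_y):
            magnitude = math.hypot(fx, fy)
            row.append((
                to_byte(fx * scale + 128),   # x flow, centered around 128
                to_byte(fy * scale + 128),   # y flow, centered around 128
                to_byte(magnitude * scale),  # third channel: flow magnitude
            ))
        image.append(row)
    return image


def late_fusion(per_step_probs):
    """Average class probabilities over timesteps and pick the best class."""
    num_steps = len(per_step_probs)
    num_classes = len(per_step_probs[0])
    averaged = [
        sum(step[c] for step in per_step_probs) / num_steps
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=lambda c: averaged[c])
    return label, averaged
```

For example, `late_fusion([[0.2, 0.8], [0.6, 0.4]])` averages the two timesteps and selects class 1.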
- Image Captioning
- Fixed input, sequential output.
- At each timestep, both the image features and the previous word are provided as inputs to the LSTM.
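A minimal sketch of how the per-timestep LSTM inputs could be assembled for captioning. The name `captioning_inputs`, the one-hot word encoding, the start token `bos_id`, and the choice to concatenate the two inputs into one vector are assumptions made for illustration; the notes only state that both the image features and the previous word are provided at each step.

```python
def captioning_inputs(image_features, word_ids, vocab_size, bos_id=0):
    """Build one input vector per timestep: image features + previous word.

    The first step has no previous word, so a hypothetical start token
    `bos_id` stands in for it.
    """
    previous_words = [bos_id] + list(word_ids[:-1])
    steps = []
    for word in previous_words:
        one_hot = [0.0] * vocab_size
        one_hot[word] = 1.0
        # The same image features are repeated at every timestep.
        steps.append(list(image_features) + one_hot)
    return steps
```

The returned list has one entry per caption word, so the LSTM predicts word t from the image and word t-1.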
- Video Captioning
- Sequential input, sequential output.
- This is the encoder-decoder style. For the first T timesteps, the input is processed and the decoder outputs are ignored; for the latter T′ timesteps, the predictions are made and the “dummy” inputs are ignored.
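The timestep layout described above can be sketched as a schedule of inputs plus a mask saying which outputs count. The helper name `build_schedule` and the use of all-zero vectors as the “dummy” inputs are assumptions for illustration.

```python
def build_schedule(frame_features, num_output_steps):
    """Lay out T encoding steps followed by T' decoding steps.

    Returns (inputs, use_output): `inputs` holds one vector per timestep,
    and `use_output[t]` says whether the prediction at step t is kept.
    """
    feature_dim = len(frame_features[0])
    # Assumption: dummy inputs are zero vectors of the same size.
    dummy = [0.0] * feature_dim

    inputs = [list(f) for f in frame_features]
    inputs += [list(dummy) for _ in range(num_output_steps)]

    # Outputs are ignored while reading the T frames, then used for T' steps.
    use_output = [False] * len(frame_features) + [True] * num_output_steps
    return inputs, use_output
```

With three frames and two output steps, the mask is `[False, False, False, True, True]` and the last two inputs are dummy vectors.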
- Take home messages
- Other me