Paper: Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in neural information processing systems. 2014.
- Basics
- Main methods
- Two complementary cues: recognize actions from still-frame appearance, and from motion across frames
- Spatial stream: recognizes objects and scene context in a single video frame; the spatial ConvNet is essentially an image classification architecture
- Temporal stream: recognizes motion from stacked optical flow
- Input: different kinds of optical-flow-based input are considered (see the sketch after this list)
- All variants stack optical flow displacement fields computed between consecutive frames
- Optical flow stacking
- Optical flow stacking samples the displacement vectors at the same pixel location across L consecutive frames, giving a 2L-channel input (x and y components per field)
- Trajectory stacking
- Trajectory stacking samples the displacement vectors along the motion trajectory: the sampling point for each next field is wherever the flow has carried it
- Bi-directional optical flow: also stacks backward flows (computed from a later frame back to the previous one) alongside the forward flows
- Mean flow subtraction: a simple form of camera motion compensation
- Camera motion compensation matters because the global motion between frames is often dominated by the camera; the mean displacement vector of each field is subtracted from that field
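A minimal NumPy sketch of the two stacking schemes and mean flow subtraction; the function names, the (H, W, 2) flow layout, and the nearest-neighbour trajectory sampling are illustrative assumptions, not the paper's code:

```python
import numpy as np

def optical_flow_stack(flows, mean_subtract=True):
    """Stack L displacement fields into a 2L-channel input volume.
    flows: list of L arrays of shape (H, W, 2) holding (dx, dy) per pixel.
    Every field is sampled at the same pixel location (plain flow stacking)."""
    channels = []
    for d in flows:
        if mean_subtract:
            # Mean flow subtraction: crude camera-motion compensation by
            # removing the global mean displacement of each field.
            d = d - d.reshape(-1, 2).mean(axis=0)
        channels.extend([d[..., 0], d[..., 1]])
    return np.stack(channels, axis=-1)  # (H, W, 2L)

def trajectory_stack(flows):
    """Sample displacement vectors along the motion trajectory: the sampling
    point for the next field is wherever the current flow carried it."""
    H, W, _ = flows[0].shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    channels = []
    for d in flows:
        yi = np.clip(np.rint(ys), 0, H - 1).astype(int)
        xi = np.clip(np.rint(xs), 0, W - 1).astype(int)
        dx, dy = d[yi, xi, 0], d[yi, xi, 1]
        channels.extend([dx, dy])
        xs, ys = xs + dx, ys + dy  # follow the trajectory to the next frame
    return np.stack(channels, axis=-1)  # (H, W, 2L)
```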
- Multi-task learning: addresses the shortage of labelled video training data by combining several datasets (e.g. UCF-101 and HMDB-51) in a single training run
- The additional task acts as a regulariser: two softmax classification layers (one per dataset/task) sit on top of the shared network, each with its own loss function
- The overall training loss is computed as the sum of the individual task losses (sketched below)
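A PyTorch-style sketch of the multi-task setup, assuming a shared trunk that emits 4096-d features and a task id per sample; the class name, dimensions, and masking scheme are hypothetical, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """One softmax classification layer per dataset/task on top of a
    shared feature trunk; each head has its own loss."""
    def __init__(self, feat_dim=4096, n_classes=(101, 51)):  # e.g. UCF-101, HMDB-51
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, c) for c in n_classes)

    def forward(self, features, labels, task_ids):
        total = features.new_zeros(())
        for t, head in enumerate(self.heads):
            mask = task_ids == t
            if mask.any():
                # Per-task softmax loss on the samples from that dataset;
                # the overall training loss is the sum over tasks.
                total = total + F.cross_entropy(head(features[mask]), labels[mask])
        return total
```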
- Temporal stream and spatial stream fusion methods
- Averaging the softmax scores of the two streams
- Training a multi-class linear SVM on the stacked, L2-normalised softmax scores (see the sketch below)
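A sketch of both fusion schemes on precomputed per-video softmax scores; the toy data and variable names are my own, and in practice the SVM would be trained on held-out scores:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-ins for the (N, C) softmax outputs of the two streams.
rng = np.random.default_rng(0)
scores_spatial = rng.random((200, 101))
scores_temporal = rng.random((200, 101))
labels = rng.integers(0, 101, size=200)

# (a) Fusion by averaging the class scores of the two streams.
pred_avg = (scores_spatial + scores_temporal).argmax(axis=1)

# (b) Fusion with a multi-class linear SVM trained on the stacked,
#     L2-normalised softmax scores.
X = np.hstack([scores_spatial, scores_temporal])
X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
svm = LinearSVC().fit(X, labels)
pred_svm = svm.predict(X)
```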
- Take-home message:
- Optical-flow-based input is an effective motion representation for ConvNets
- Use a spatial stream based on image classification and a temporal stream based on optical flow, then fuse the two
- Other methods mentioned
- Shallow high-dimensional encodings of local spatio-temporal features (spatio-temporal cuboids)
- Detect sparse spatio-temporal interest points and describe them with HOG (Histogram of Oriented Gradients) and HOF (Histogram of Optical Flow)
- Trajectory-based pipeline: dense points tracked using optical flow
- Motion boundary histogram (MBH): a gradient-based descriptor computed on the horizontal and vertical components of the optical flow (see the sketch after this list)
- Compensation of camera motion.
- Fisher vector encoding, and a deeper variant of it
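A rough NumPy sketch of the MBH idea: HOG-style orientation histograms computed on the spatial gradients of each optical flow component. The bin count, normalisation, and single dense histogram are illustrative simplifications:

```python
import numpy as np

def mbh_histogram(flow_component, n_bins=8):
    """Orientation histogram of the spatial gradient of one flow
    component (u or v), weighted by gradient magnitude."""
    gy, gx = np.gradient(flow_component)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-12)

# The MBH descriptor concatenates the histograms of the horizontal (u)
# and vertical (v) flow components; taking gradients of the flow
# suppresses constant (e.g. camera) motion, keeping motion boundaries.
flow = np.random.randn(64, 64, 2).astype(np.float32)  # toy flow field
mbh = np.concatenate([mbh_histogram(flow[..., 0]), mbh_histogram(flow[..., 1])])
```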
- Deep architecture for video recognition.
- Input is a stack of consecutive video frames.
- First layers compute spatio-temporal features; e.g. an HMAX architecture with predefined spatio-temporal filters in the first layer, later combined with a spatial HMAX model to form spatial and temporal recognition streams