Paper:
Wang, Yunbo, et al. “Spatiotemporal Pyramid Network for Video Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
- Basics
- Main methods
- Spatio-temporal pyramid network — Reinforce each other
- Hierarchical strategies for fusion — Using a unified spatiotemporal loss — To maximally complement each other.
- Address the main weakness of two-stream methods — In most misclassified cases, one stream fails while the other is correct. — Present an end-to-end pyramid architecture that lets the two streams reinforce each other.
- Temporal part:
- To learn more global video features (two actions may be indistinguishable over a short interval), use multi-path temporal sub-networks to sample optical flow across a long sequence. — Use different fusion methods to combine the temporal information.
- Enlarge the video chunks by using multiple CNNs with shared network parameters
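The multi-path idea above can be sketched in plain Python. This is only an illustration under assumptions: `chunk_starts`, `temporal_path`, the even spacing of chunks, and the linear-plus-ReLU stand-in for a temporal CNN are hypothetical choices, not details taken from the paper.

```python
import random

def chunk_starts(num_frames, num_paths, chunk_len):
    """Start indices of `num_paths` evenly spaced chunks, each `chunk_len` frames long."""
    last_start = num_frames - chunk_len
    if last_start < 0:
        raise ValueError("video is shorter than one chunk")
    if num_paths == 1:
        return [0]
    return [round(i * last_start / (num_paths - 1)) for i in range(num_paths)]

def temporal_path(chunk, weights):
    """Stand-in for one temporal CNN path: a linear layer followed by ReLU."""
    flat = [v for frame in chunk for v in frame]
    return [max(0.0, sum(w * x for w, x in zip(row, flat))) for row in weights]

random.seed(0)
num_frames, values_per_frame = 60, 8   # each frame: a tiny flattened flow field
chunk_len, num_paths, feat_dim = 10, 3, 4

flow = [[random.gauss(0, 1) for _ in range(values_per_frame)]
        for _ in range(num_frames)]

# One weight matrix reused by every path: this is the parameter sharing.
shared_weights = [[random.gauss(0, 0.1) for _ in range(chunk_len * values_per_frame)]
                  for _ in range(feat_dim)]

starts = chunk_starts(num_frames, num_paths, chunk_len)
path_features = [temporal_path(flow[s:s + chunk_len], shared_weights) for s in starts]
```

Because every path reuses `shared_weights`, adding paths widens the temporal coverage without adding parameters; the per-path features still need to be fused afterwards.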
- Spatial part:
- If two videos have similar backgrounds, the spatial stream cannot tell them apart, because the background is the strongest feature.
- Use the temporal part as guidance — Inform the spatial network where the motion happens (this helps extract the salient locations on the spatial network's feature maps)
- Joint optimization of the temporal and spatial streams
- Compact bilinear fusion strategy.
- Details for compact fusion
- Maximize the information preserved from both parts while capturing their interactions.
- Bilinear fusion — Leads to high-dimensional representations. — Use spatiotemporal compact bilinear (STCB) fusion to project them to a lower dimension.
- STCB can preserve the temporal cues to supervise the spatiotemporal attention module.
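Compact bilinear methods are commonly built on Count Sketch projections, where the sketch of an outer product equals the circular convolution of the two individual sketches. The following is a minimal sketch of that general technique, assuming it is close to what STCB does; the function names, dimensions, and random inputs are all hypothetical, and the convolution is written directly rather than with an FFT to keep the example dependency-free.

```python
import random

def make_hash(n, d, rng):
    """Random Count Sketch parameters: a bucket in [0, d) and a sign in {-1, +1} per input index."""
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s

def count_sketch(x, h, s, d):
    """Project x to d dimensions by adding signed entries into hashed buckets."""
    out = [0.0] * d
    for i, v in enumerate(x):
        out[h[i]] += s[i] * v
    return out

def circular_conv(a, b):
    """Circular convolution; practical implementations usually compute this with an FFT."""
    d = len(a)
    return [sum(a[j] * b[(k - j) % d] for j in range(d)) for k in range(d)]

def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """d-dimensional sketch of the outer product of x and y, without forming the outer product."""
    return circular_conv(count_sketch(x, hx, sx, d), count_sketch(y, hy, sy, d))

rng = random.Random(0)
n_spatial, n_temporal, d = 6, 5, 8   # the full outer product would have 6 * 5 = 30 entries
spatial = [rng.gauss(0, 1) for _ in range(n_spatial)]
temporal = [rng.gauss(0, 1) for _ in range(n_temporal)]
hx, sx = make_hash(n_spatial, d, rng)
hy, sy = make_hash(n_temporal, d, rng)

fused = compact_bilinear(spatial, temporal, hx, sx, hy, sy, d)
```

The output size `d` is chosen independently of the input sizes, which is what keeps the fused representation low-dimensional.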
- Spatiotemporal Attention
- Taking advantage of the motion information to locate salient regions on the feature maps.
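A small illustration of motion-guided attention: locations with strong motion receive larger weights when the spatial feature map is pooled. The scoring rule here (squared magnitude of the motion feature) is a hypothetical stand-in for the learned layers, and `attention_pool` is not a name from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(spatial_map, motion_map):
    """Pool per-location spatial features, weighted by motion strength at each location.

    spatial_map, motion_map: lists of per-location feature vectors of equal length.
    The score (motion-feature energy) is an assumed stand-in for learned layers.
    """
    scores = [sum(v * v for v in m) for m in motion_map]
    weights = softmax(scores)
    channels = len(spatial_map[0])
    pooled = [sum(w * feat[c] for w, feat in zip(weights, spatial_map))
              for c in range(channels)]
    return weights, pooled

# Four locations; only location 2 contains motion.
spatial_map = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
motion_map = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [0.0, 0.0]]

weights, pooled = attention_pool(spatial_map, motion_map)
```

In this toy input the three static locations share the same "background" feature, yet the pooled vector is dominated by the feature at the moving location, which is the effect the attention module is meant to have.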
- Integrate all the techniques mentioned above in the pyramid architecture.
- Use STCB three times
- Bottom of the pyramid, combine multiple optical flow representations from longer videos. — More global temporal features.
- Spatiotemporal attention subnet — Fuse the spatial feature maps with the motion representations.
- Top, fuse all.
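The three fusion stages can be laid out as a data-flow sketch. Everything here is an assumption for illustration: the `STCB` class is a hypothetical multi-input Count Sketch fusion, the random vectors stand in for CNN outputs, the attention subnet is reduced to its fusion step, and the exact inputs of the top-level fusion are my reading of "fuse all".

```python
import random
from functools import reduce

def count_sketch(x, h, s, d):
    """Project x to d dimensions by adding signed entries into hashed buckets."""
    out = [0.0] * d
    for i, v in enumerate(x):
        out[h[i]] += s[i] * v
    return out

def circular_conv(a, b):
    """Circular convolution; usually computed with an FFT in practice."""
    d = len(a)
    return [sum(a[j] * b[(k - j) % d] for j in range(d)) for k in range(d)]

class STCB:
    """Hypothetical multi-input compact bilinear fusion."""

    def __init__(self, input_dims, out_dim, rng):
        self.d = out_dim
        # One (bucket, sign) pair of hash tables per input.
        self.params = [([rng.randrange(out_dim) for _ in range(n)],
                        [rng.choice((-1, 1)) for _ in range(n)])
                       for n in input_dims]

    def __call__(self, *inputs):
        sketches = [count_sketch(x, h, s, self.d)
                    for x, (h, s) in zip(inputs, self.params)]
        return reduce(circular_conv, sketches)

rng = random.Random(0)
feat, d = 6, 8

def rand_vec(n):
    return [rng.gauss(0, 1) for _ in range(n)]

# Stand-ins for CNN outputs: one spatial feature, three temporal-path features.
spatial = rand_vec(feat)
temporal_paths = [rand_vec(feat) for _ in range(3)]

# 1) Bottom: fuse the temporal paths into a more global motion feature.
motion = STCB([feat] * 3, d, rng)(*temporal_paths)

# 2) Middle: fuse spatial and motion features (the input to the attention subnet).
attended = STCB([feat, d], d, rng)(spatial, motion)

# 3) Top: fuse spatial, motion, and attended features into the final representation.
video_repr = STCB([feat, d, d], d, rng)(spatial, motion, attended)
```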
- Take home messages
- Other methods mentioned
- C3D: 3D convolution filters and 3D pooling layers that operate over space and time simultaneously.
- Two stream networks.
- Note: the Related works part is very good!
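For the C3D entry in the list above, a toy example of what "operating over space and time simultaneously" means: a single kernel slides along the time axis as well as height and width. This is a naive single-channel sketch with a hand-picked kernel, not C3D's actual layers or filter sizes.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width), single channel."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [[[sum(video[t + a][y + b][x + c] * kernel[a][b][c]
                  for a in range(kt) for b in range(kh) for c in range(kw))
              for x in range(W - kw + 1)]
             for y in range(H - kh + 1)]
            for t in range(T - kt + 1)]

# A 4-frame, 3x3 clip in which every pixel of frame t has value t.
video = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]

# Temporal-difference kernel spanning 2 frames and a 2x2 spatial window:
# it averages the next frame's window and subtracts the current frame's average.
kernel = [[[-0.25, -0.25], [-0.25, -0.25]],
          [[0.25, 0.25], [0.25, 0.25]]]

response = conv3d_valid(video, kernel)  # shape: 3 time steps x 2 x 2
```

Every entry of `response` is 1.0 because the clip brightens by exactly 1 per frame; a purely 2D filter applied frame by frame could not measure that change.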