Notes for Paper “Spatiotemporal Pyramid Network for Video Action Recognition”

Paper:

Wang, Yunbo, et al. “Spatiotemporal Pyramid Network for Video Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

  • Basics
  • Main methods
    • Spatiotemporal pyramid network: the spatial and temporal streams reinforce each other.
    • Hierarchical fusion strategies trained with a unified spatiotemporal loss, so that the two streams complement each other as much as possible.
    • Addresses a weakness of two-stream methods: in most misclassified cases, one stream fails while the other is correct. The paper presents an end-to-end pyramid architecture that lets the two streams help each other.
    • Temporal part:
      • To learn more global video features (two actions may be indistinguishable over a short clip), use multi-path temporal sub-networks to sample optical flow over a long sequence, then combine the temporal information with different fusion methods.
      • Enlarge the video chunks by using multiple CNNs with shared network parameters.
    • Spatial part:
      • If two videos have similar backgrounds, the spatial stream cannot tell them apart, because the background is the strongest feature.
      • Use the temporal part as guidance: it tells the spatial network where the motion happens, which helps pick out the significant locations on the spatial network's feature maps.
    • Joint optimization of the temporal and spatial parts
      • Compact bilinear fusion strategy.
    • Details for compact fusion
      • Retain as much information as possible from both parts while maximizing the interactions between them.
      • Bilinear fusion leads to high-dimensional representations; spatiotemporal compact bilinear (STCB) fusion projects them to a lower dimension.
      • STCB preserves the temporal cues needed to supervise the spatiotemporal attention module.
    • Spatiotemporal Attention
      • Takes advantage of the motion information to locate salient regions on the feature maps.
    • Integrate all the techniques above into the pyramid architecture.
      • STCB is used three times:
        • Bottom of the pyramid: combine multiple optical flow representations from longer videos to obtain more global temporal features.
        • Spatiotemporal attention subnet: fuse the spatial feature maps with the motion representations.
        • Top: fuse all of the above.
  • Take home messages
  • Other methods mentioned
    • C3D: 3D convolution filters and 3D pooling layers operating over space and time simultaneously.
    • Two stream networks.
    • Note: the Related Work section is very good!
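To make the compact fusion notes concrete, here is a minimal NumPy sketch of compact bilinear pooling in the Count Sketch + FFT style. It is only an illustration under assumptions: the function names (`make_sketch_params`, `count_sketch`, `compact_bilinear`), the output dimension `d`, and the use of plain 1-D feature vectors are all hypothetical choices for these notes, not the paper's implementation. The idea is that the full outer product of the input features is never formed; each feature is hashed into `d` dimensions, and the sketches are multiplied element-wise in the frequency domain.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Draw the random hash h: [n] -> [d] and signs s in {-1, +1} (fixed once, not learned)."""
    h = rng.integers(0, d, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    return h, s

def count_sketch(x, h, s, d):
    """Project x (length n) to length d: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)  # unbuffered add, so colliding indices accumulate
    return out

def compact_bilinear(features, params, d):
    """Fuse several feature vectors into one d-dimensional vector.

    The element-wise product of the FFTs equals the circular convolution
    of the count sketches, which is itself a count sketch of the outer product.
    """
    prod = np.ones(d // 2 + 1, dtype=complex)
    for x, (h, s) in zip(features, params):
        prod *= np.fft.rfft(count_sketch(x, h, s, d))
    return np.fft.irfft(prod, n=d)

# Hypothetical sizes: a 512-d spatial feature and a 512-d temporal feature
# fused into 1024 dimensions instead of a 512 * 512 outer product.
rng = np.random.default_rng(0)
spatial_feat = rng.standard_normal(512)
temporal_feat = rng.standard_normal(512)
params = [make_sketch_params(512, 1024, rng), make_sketch_params(512, 1024, rng)]
fused = compact_bilinear([spatial_feat, temporal_feat], params, 1024)
```

Because `compact_bilinear` takes a list, the same function covers more than two inputs, which is convenient for the bottom of the pyramid where several optical flow representations are combined.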
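The spatiotemporal attention idea (motion tells the spatial network where to look) can be sketched in the same spirit. This is a simplified assumption-laden version, not the paper's module: the per-location score here is a single hypothetical linear projection `w` of the motion features, whereas the paper derives its attention from fused features. What the sketch shows is the general mechanism: a softmax over spatial locations, followed by attention-weighted pooling of the spatial feature maps.

```python
import numpy as np

def spatial_softmax(scores):
    """Softmax over all H*W locations of a 2-D score map."""
    flat = scores.reshape(-1)
    flat = flat - flat.max()  # subtract the max for numerical stability
    e = np.exp(flat)
    return (e / e.sum()).reshape(scores.shape)

def attention_pool(spatial, motion, w):
    """Pool spatial feature maps using attention derived from motion features.

    spatial: (C, H, W) feature maps from the spatial stream
    motion:  (Cm, H, W) motion features aligned to the same grid
    w:       (Cm,) hypothetical projection giving one score per location
    Returns the pooled (C,) feature and the (H, W) attention map.
    """
    scores = np.tensordot(w, motion, axes=([0], [0]))  # (H, W)
    attn = spatial_softmax(scores)
    pooled = (spatial * attn[None, :, :]).sum(axis=(1, 2))
    return pooled, attn
```

With all-zero motion features the attention map is uniform and the result reduces to average pooling; strong motion at one location pulls the pooled feature toward the spatial features at that location, which is how the background-dominance problem noted above gets reduced.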

 
