Notes for Paper “Spatiotemporal Pyramid Network for Video Action Recognition”


Wang, Yunbo, et al. “Spatiotemporal Pyramid Network for Video Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

  • Basics
  • Main methods
    • Spatio-temporal pyramid network: make the spatial and temporal streams reinforce each other.
    • Hierarchical fusion strategies trained with a unified spatiotemporal loss, so that the two streams maximally complement each other.
    • Tackles a weakness of the two-stream method: in most misclassification cases, one stream fails while the other is correct. The paper presents an end-to-end pyramid architecture that lets the two streams facilitate each other.
    • Temporal part:
      • To learn more global video features (two actions may be indistinguishable over a short window), use multi-path temporal sub-networks that sample optical flow over long sequences, and combine the temporal information with different fusion methods.
      • Enlarge the temporal extent of the video chunks by using multiple CNNs with shared network parameters.
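The multi-path idea above can be sketched as follows. In the paper each path is a CNN over stacked optical-flow frames; here a single shared linear layer stands in, and the feature sizes and max fusion are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared parameters for one temporal sub-network
# (flow has 2 channels; 10 frames of 8x8 maps per chunk -- assumed sizes).
W = rng.standard_normal((256, 2 * 10 * 8 * 8))

def temporal_subnet(flow_chunk):
    """One path of the multi-path temporal network; all paths share W."""
    return np.maximum(W @ flow_chunk.reshape(-1), 0.0)  # linear + ReLU

# Sample several chunks from a long optical-flow sequence.
chunks = [rng.standard_normal((2, 10, 8, 8)) for _ in range(3)]

# Each chunk goes through the same shared-parameter sub-network ...
features = [temporal_subnet(c) for c in chunks]

# ... and the paths are fused (elementwise max shown; averaging or
# learned fusion are other options compared in the paper).
fused = np.maximum.reduce(features)
print(fused.shape)  # (256,)
```

Because the paths share parameters, covering a longer temporal span costs no extra weights, only extra forward passes.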
    • Spatial part:
      • If two videos have similar backgrounds, the spatial stream cannot tell them apart, because the background is its strongest feature.
      • Use the temporal stream as guidance: inform the spatial network where the motion happens (this helps extract the salient locations on the spatial feature maps).
    • Joint optimization of the temporal and spatial streams
      • Compact bilinear fusion strategy.
    • Details for compact fusion
      • Maximize the information preserved from both streams while maximizing their interactions.
      • Plain bilinear fusion leads to very high-dimensional representations; spatiotemporal compact bilinear fusion (STCB) projects them to a low dimension.
      • STCB preserves the temporal cues that supervise the spatiotemporal attention module.
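A minimal sketch of compact bilinear fusion in the Tensor Sketch style: each stream's feature is count-sketched, and the circular convolution of the sketches (an elementwise product in the FFT domain) approximates the full outer-product bilinear feature at a fraction of its dimensionality. The feature sizes and this NumPy stand-in are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sketch(d_in, d_out, rng):
    """Random Count Sketch parameters: a hash bucket and a sign per input dim."""
    h = rng.integers(0, d_out, size=d_in)
    s = rng.choice([-1.0, 1.0], size=d_in)
    return h, s

def count_sketch(x, h, s, d_out):
    y = np.zeros(d_out)
    np.add.at(y, h, s * x)  # scatter-add signed entries into buckets
    return y

def compact_bilinear(x_spatial, x_temporal, d_out, rng):
    """Approximate the huge outer-product feature in only d_out dims.
    Circular convolution of the sketches = elementwise product after FFT."""
    h1, s1 = make_sketch(x_spatial.size, d_out, rng)
    h2, s2 = make_sketch(x_temporal.size, d_out, rng)
    fx = np.fft.rfft(count_sketch(x_spatial, h1, s1, d_out))
    fy = np.fft.rfft(count_sketch(x_temporal, h2, s2, d_out))
    return np.fft.irfft(fx * fy, n=d_out)

x_s = rng.standard_normal(512)   # spatial-stream feature (size is illustrative)
x_t = rng.standard_normal(512)   # temporal-stream feature
z = compact_bilinear(x_s, x_t, 4096, rng)
print(z.shape)  # full bilinear would be 512 * 512 = 262144 dims
```

In practice the sketch parameters are drawn once and kept fixed, so the projection is consistent across training samples.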
    • Spatiotemporal Attention
      • Takes advantage of the motion information to locate salient regions on the feature maps.
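The attention idea can be sketched as scoring each spatial location by its agreement with a projection of the motion feature, then pooling the map with the resulting weights. The projection matrix, map size, and channel count below are assumed for illustration (the projection would be learned jointly in the real network).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical shapes: a 7x7 spatial feature map with 512 channels (one row
# per location), and a motion feature from the temporal stream.
spatial_maps = rng.standard_normal((49, 512))
motion_feat  = rng.standard_normal(512)
W_att = rng.standard_normal((512, 512)) * 0.01  # assumed projection, learned in practice

# Score each location against the projected motion feature, then pool the
# map with the attention weights so motion-salient regions dominate.
scores   = spatial_maps @ (W_att @ motion_feat)   # (49,) one score per location
weights  = softmax(scores)                        # attention over locations
attended = weights @ spatial_maps                 # (512,) attention-pooled feature
print(attended.shape)
```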
    • Integrate all the techniques mentioned above in the pyramid architecture.
      • Use STCB three times:
        • Bottom of the pyramid: combine multiple optical-flow representations from longer video spans into more global temporal features.
        • Middle: the spatiotemporal attention subnet fuses the spatial feature maps with the motion representations.
        • Top: fuse everything into the final representation.
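The three-level wiring can be sketched as below. The `stcb` placeholder uses a plain circular convolution standing in for the sketched bilinear product, and all names, dimensions, and the random stand-in features are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stcb(a, b, d_out=1024):
    """Placeholder fusion: circular convolution via FFT, standing in for the
    count-sketched compact bilinear product described above."""
    fa = np.fft.rfft(a, n=d_out)
    fb = np.fft.rfft(b, n=d_out)
    return np.fft.irfft(fa * fb, n=d_out)

# 1) Bottom: fuse multi-path optical-flow features into a global temporal feature.
flow_paths = [rng.standard_normal(1024) for _ in range(3)]
temporal = stcb(stcb(flow_paths[0], flow_paths[1]), flow_paths[2])

# 2) Middle: the attention subnet fuses spatial maps with the motion
#    representation (the attended spatial feature is stubbed here).
spatial_attended = stcb(rng.standard_normal(1024), temporal)

# 3) Top: fuse everything into the final representation fed to the classifier.
final = stcb(spatial_attended, temporal)
print(final.shape)  # (1024,)
```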
  • Take-home messages
  • Other methods mentioned
    • C3D: 3D convolution filters and 3D pooling layers operating over space and time simultaneously.
    • Two stream networks.
    • Note: the Related Work section of the paper is very good!

