Category Archives: Paper

Notes for Paper “Lattice Long Short-Term Memory for Human Action Recognition”


Sun, Lin, et al. “Lattice long short-term memory for human action recognition.” arXiv preprint arXiv:1708.03958 (2017).

  • Basics
    • CNN methods for spatial appearance
    • RNN methods (LSTM) for temporal dynamics. Naively applying an RNN is only suitable for short-term motions.
  • Main methods
    • Lattice-LSTM: extends LSTM by learning independent hidden-state transitions of memory cells for individual spatial locations.
      • Control gates are shared between the RGB and optical-flow streams.
      • Greatly enhances the capacity of the memory cell to learn motion dynamics.
    • Multi-modal training procedure: trains both the input gates and the forget gates in one network (other two-stream networks train the two streams separately).
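The lattice idea can be sketched as a toy cell in which the memory transition has its own learned weight at every spatial location, while the gate parameters are shared by the RGB and flow inputs. A minimal sketch, not the paper's exact cell: the scalar gate weights and the additive fusion of the two streams are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LatticeCellSketch:
    """Toy sketch of the Lattice-LSTM idea: the memory-cell transition
    uses weights that differ per spatial location, while the gate
    parameters are shared by both input streams."""

    def __init__(self, h, w, seed=0):
        rng = np.random.default_rng(seed)
        # per-location recurrent weights for the cell transition
        # (h*w independent parameters, the "lattice")
        self.w_cell = rng.standard_normal((h, w)) * 0.1
        # gate weights shared across locations and across the two streams
        self.w_i, self.w_f = 0.5, 0.5

    def step(self, x_rgb, x_flow, c_prev):
        # gates computed from the fused streams with *shared* parameters
        fused = x_rgb + x_flow
        i = sigmoid(self.w_i * fused)
        f = sigmoid(self.w_f * fused)
        # candidate uses location-dependent recurrent weights
        c_tilde = np.tanh(self.w_cell * c_prev + fused)
        return f * c_prev + i * c_tilde
```

A real model would learn these parameters end to end; the point is only that `w_cell` is indexed by spatial position instead of being shared, which is what lets each location track its own motion dynamics.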


  • Take home message
  • Other methods mentioned
    • Extensions of CNNs: C3D learns space and time jointly, but only covers a short range of the sequence.
    • Training another neural network on optical flow.
    • Methods for obtaining a better combination of appearance and motion: spatial-temporal features extracted with a sequential procedure, combining 2D spatial (short) and 1D temporal (long) information.
    • ResNets
    • RNN, LSTM — encoder and decoder



Notes for Paper “Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks”


Wang, Hongsong, and Liang Wang. “Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks.” The Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

  • Basics
    • Skeleton based action recognition
    • Two-stream RNN
    • Two architectures for temporal streams
      • Stacked RNN
      • Hierarchical RNN
    • Model spatial structure by converting spatial graph into a sequence of joints.
    • Obtain 3D skeletons from depth images.
  • Main method
    • End-to-end two-stream RNN
    • Fusion is performed by combining the softmax class posteriors from the two networks.
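The late fusion described here is just a weighted combination of the two streams' class posteriors. A minimal sketch, assuming an equal weighting of the streams (the paper may weight them differently):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_posteriors(logits_temporal, logits_spatial, w=0.5):
    """Late fusion: combine the softmax class posteriors of the
    temporal and spatial RNN streams with weight w (an assumption)."""
    p_t = softmax(logits_temporal)
    p_s = softmax(logits_spatial)
    return w * p_t + (1.0 - w) * p_s
```

The fused vector is still a valid distribution over classes, so the predicted action is simply its argmax.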
    • Temporal channel: concatenate the 3D coordinates of the different joints at each time step and model the resulting sequence with an RNN.
      • Stacked RNN
        • Feed the concatenated coordinates of all joints into the RNN, stacking two layers. Adding more layers does not improve performance.
      • Hierarchical RNN
        • Divide human skeleton into 5 parts.
        • Use hierarchical RNN to model the motions of different parts of the body (first layer) and the whole body (second layer).
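The two-level structure can be sketched with a minimal tanh RNN: one RNN per body part in the first layer, then one RNN over the concatenated part states in the second layer. The random weights, hidden size, and the particular 5-way joint split in the test are illustrative assumptions; a real model learns the weights.

```python
import numpy as np

def rnn_states(seq, w_x, w_h):
    """Minimal tanh RNN; returns the hidden state at every time step.
    seq: (T, d_in); w_x: (d_in, d_h); w_h: (d_h, d_h)."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x in seq:
        h = np.tanh(x @ w_x + h @ w_h)
        states.append(h)
    return np.stack(states)

def hierarchical_rnn(skeleton, parts, d_h=8, seed=0):
    """Sketch of the hierarchical idea: per-part RNNs (layer 1), then
    an RNN over the concatenated part states for the whole body (layer 2).
    skeleton: (T, J, 3) joint coordinates; parts: list of joint-index lists."""
    rng = np.random.default_rng(seed)
    part_states = []
    for idx in parts:                                 # layer 1: part motion
        seq = skeleton[:, idx, :].reshape(len(skeleton), -1)
        w_x = rng.standard_normal((seq.shape[1], d_h)) * 0.1
        w_h = rng.standard_normal((d_h, d_h)) * 0.1
        part_states.append(rnn_states(seq, w_x, w_h))
    whole = np.concatenate(part_states, axis=1)       # (T, len(parts)*d_h)
    w_x = rng.standard_normal((whole.shape[1], d_h)) * 0.1
    w_h = rng.standard_normal((d_h, d_h)) * 0.1
    return rnn_states(whole, w_x, w_h)[-1]            # layer 2: whole body
```

The final whole-body state would then feed a classifier; the per-part factorization is what keeps layer 1 focused on local motion before layer 2 integrates it.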
    • Spatial RNN
      • Nodes denote the joints and edges denote the physical connections.
        • An action appears as varied patterns in the spatial structure of this undirected graph.
      • At each time step, select a temporal window centered on it and feed in the coordinates of each joint inside the window, to model the spatial relationships of the joints.
      • Three graph representations
        • Undirected graph
        • Chain sequence
        • Traversal sequence
      • The spatial RNN can recognize actions based on just one graph representation.
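Two of the graph-to-sequence conversions above are easy to sketch: a chain sequence is a fixed joint ordering, while a traversal sequence walks the skeleton graph depth-first and records joints again on backtracking, so physically adjacent joints stay adjacent in the sequence. The toy 5-joint star skeleton below is an assumption for illustration, not a real skeleton layout.

```python
def traversal_sequence(adj, root):
    """Depth-first traversal that also records a joint when returning
    to it, keeping connected joints adjacent in the output sequence."""
    order, seen = [], set()

    def dfs(u):
        order.append(u)
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs(v)
                order.append(u)   # revisit on the way back
    dfs(root)
    return order

# toy skeleton: joint 0 is the torso, connected to 4 limb joints
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
chain = sorted(adj)               # chain sequence: a fixed joint order
trav = traversal_sequence(adj, 0)
```

Either sequence can then be fed to the spatial RNN in place of the raw graph.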
    • Data augmentation
      • 3D transformation of skeletons
        • Rotation
        • Scaling
        • Shear
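The three 3D skeleton transformations listed above are plain linear maps on the joint coordinates. A minimal sketch (rotation here is about the z axis and the shear mixes y into x; the paper's exact axes and parameter ranges are not specified in these notes):

```python
import numpy as np

def rotate_z(points, theta):
    """Rotate (N, 3) joint coordinates by theta around the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    r = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return points @ r.T

def scale(points, factor):
    """Uniformly scale the skeleton about the origin."""
    return points * factor

def shear_xy(points, k):
    """Shear: add k * y to each x coordinate."""
    m = np.array([[1, k, 0], [0, 1, 0], [0, 0, 1]])
    return points @ m.T
```

Applying these with randomly drawn parameters to every frame of a sequence yields new, label-preserving training samples.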




  • Take home messages
  • Other methods mentioned
    • Body part based action recognition and Joint based action recognition.
      • Based on hand-crafted low level features, use Markov Random Fields.
      • Fully connected deep LSTM network with regularization terms to learn co-occurrence features of joints.
      • —  These methods
    • RGB based action recognition
      • Hierarchical RNN, RNN with regularization, differential RNN, and part-aware Long Short-Term Memory

Notes for Paper “Spatiotemporal Pyramid Network for Video Action Recognition”


Wang, Yunbo, et al. “Spatiotemporal Pyramid Network for Video Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

  • Basics
  • Main methods
    • Spatiotemporal pyramid network, in which the spatial and temporal streams reinforce each other.
    • Hierarchical fusion strategies with a unified spatiotemporal loss, so that the two streams maximally complement each other.
    • Tackles a key problem of two-stream methods: in most misclassification cases one stream fails while the other is correct. Presents an end-to-end pyramid architecture that lets the two streams facilitate each other.
    • Temporal part:
      • To learn more global video features (two actions may not be distinguishable in the short term), use multi-path temporal sub-networks that sample optical flow over a long sequence, and combine the temporal information with different fusion methods.
      • Enlarge the video chunks by using multiple CNNs with shared network parameters.
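The multi-path sampling reduces to choosing where each path's optical-flow chunk starts within the long sequence. A tiny sketch; the even spacing and the parameter names are my assumptions about the sampling scheme, which these notes do not pin down.

```python
def chunk_starts(num_frames, num_paths, chunk_len):
    """Evenly spaced start indices for the multi-path temporal
    sub-networks: each path sees a different optical-flow chunk of the
    long sequence. Requires num_paths >= 2 and chunk_len <= num_frames."""
    span = num_frames - chunk_len
    return [round(i * span / (num_paths - 1)) for i in range(num_paths)]
```

Each chunk is then processed by a CNN with shared parameters, and the per-path features are fused downstream.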
    • Spatial part:
      • If two videos have a similar background, the spatial stream cannot tell them apart, because the background is the strongest feature.
      • Use the temporal part as guidance: inform the spatial network where the motion happens (this helps extract the significant locations on the spatial network's feature maps).
    • Joint optimization of the temporal and spatial streams
      • Compact bilinear fusion strategy.
    • Details for compact fusion
      • Preserve as much information from both streams as possible while maximizing their interactions.
      • Bilinear fusion leads to high-dimensional representations; spatiotemporal compact bilinear (STCB) pooling projects them to a low dimension.
      • STCB can preserve the temporal cues to supervise the spatiotemporal attention module.
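The standard trick behind compact bilinear pooling is the Tensor Sketch: count-sketch each feature vector, multiply the sketches in the Fourier domain, and transform back, which approximates the full outer product in a much lower dimension. Whether STCB uses exactly this variant is my assumption; the sketch below shows the mechanism.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d bins via random hash h with random signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear(x, y, d=64, seed=0):
    """Tensor Sketch approximation of the bilinear (outer-product)
    feature of x and y, in d dimensions instead of x.size * y.size."""
    rng = np.random.default_rng(seed)
    hx = rng.integers(0, d, x.size)
    hy = rng.integers(0, d, y.size)
    sx = rng.choice([-1.0, 1.0], x.size)
    sy = rng.choice([-1.0, 1.0], y.size)
    # elementwise product of FFTs == circular convolution of the sketches
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fy = np.fft.fft(count_sketch(y, hy, sy, d))
    return np.real(np.fft.ifft(fx * fy))
```

With the hashes fixed, the output is bilinear in its inputs, which is exactly the property the fusion relies on; only the dimensionality is compressed.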
    • Spatiotemporal Attention
      • Taking advantage of the motion information to locate salient regions on the feature maps.
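Motion-guided attention amounts to scoring every spatial location from the motion features, normalizing the scores with a softmax, and pooling the spatial feature map with those weights. A minimal sketch; the scoring function here (dot product with a mean motion vector) is a placeholder assumption, not the paper's exact module.

```python
import numpy as np

def attention_pool(spatial_feats, motion_feats):
    """Sketch of motion-guided spatial attention.
    spatial_feats, motion_feats: (H*W, C) flattened feature maps.
    Returns a (C,) descriptor pooled over attended locations."""
    query = motion_feats.mean(axis=0)        # global motion cue (assumption)
    scores = spatial_feats @ query           # (H*W,) per-location scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax over locations
    return w @ spatial_feats                 # attention-weighted pooling
```

Locations where the motion cue agrees with the appearance features dominate the pooled descriptor, which is how the temporal stream points the spatial stream at where the action happens.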
    • Integrate all the techniques mentioned above in the pyramid architecture.
      • Use STCB three times
        • Bottom of the pyramid, combine multiple optical flow representations from longer videos. — More global temporal features.
        • Spatiotemporal attention subnet — Fuse the spatial feature maps with the motion representations.
        • Top, fuse all.
  • Take home messages
  • Other methods mentioned
    • C3D: 3D convolution filters and 3D pooling layers operating over space and time simultaneously.
    • Two stream networks.
    • Note: the Related works part is very good!