
Notes for Paper “Lattice Long Short-Term Memory for Human Action Recognition”

Paper:

Sun, Lin, et al. “Lattice long short-term memory for human action recognition.” arXiv preprint arXiv:1708.03958 (2017).

  • Basics
    • CNN methods for spatial appearance
    • RNN methods (LSTM) for temporal dynamics. Naively applying an RNN is only suitable for short-term motions.
  • Main methods
    • Lattice-LSTM: extends LSTM by learning independent hidden-state transitions of memory cells for individual spatial locations.
      • Control gates are shared between the RGB and optical-flow streams.
      • Greatly enhances the capacity of the memory cell to learn motion dynamics.
    • Multi-modal training procedure: trains both input gates and forget gates in the network (other two-stream networks train the two streams separately).
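
The per-location memory idea can be sketched roughly as follows: an LSTM step whose hidden-to-gate transition uses a separate weight for each spatial location, so each position of the feature map keeps its own motion dynamics. This is a minimal numpy sketch of the idea, not the paper's exact formulation; all shapes and names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lattice_lstm_step(x, h, c, W, U_local, b):
    """One recurrent step where the hidden-to-gate transition is applied
    per spatial location, so each memory-cell position learns its own
    dynamics instead of mixing all locations through one dense matrix.

    x, h, c : (H, W_sp, D) feature maps (input, hidden state, memory cell)
    W       : (D, 4*D) input-to-gate weights, shared across locations
    U_local : (H, W_sp, D, 4*D) per-location hidden-to-gate weights
    b       : (4*D,) bias
    """
    # Shared input transform plus an independent transition per location.
    gates = x @ W + np.einsum('hwd,hwdk->hwk', h, U_local) + b
    i, f, o, g = np.split(gates, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new
```

The only change from a plain convolutional LSTM is the `U_local` tensor: it gives every spatial cell its own hidden-state transition, which is the "lattice" of independent memory dynamics.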


  • Take home message
  • Other methods mentioned
    • Extensions of CNN: C3D learns both space and time, but only covers a short range of the sequence.
    • Training another neural network on optical flow.
    • Methods for obtaining a better combination of appearance and motion: spatial-temporal features using a sequential procedure, combining 2D spatial (short-range) and 1D temporal (long-range) information.
    • ResNets
    • RNN, LSTM — encoder and decoder


Notes for Paper “Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks”

Paper:

Wang, Hongsong, and Liang Wang. “Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks.” The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

  • Basics
    • Skeleton based action recognition
    • Two-stream RNN
    • Two architectures for temporal streams
      • Stacked RNN
      • Hierarchical RNN
    • Model spatial structure by converting spatial graph into a sequence of joints.
    • Obtain 3D skeletons from depth images.
  • Main method
    • End-to-end two-stream RNN.
    • Fusion is performed by combining the softmax class posteriors from the two networks.
    • Temporal channel: concatenate the 3D coordinates of the different joints at each time step, then model the generated sequence with an RNN.
      • Stacked RNN
        • Feed the concatenated coordinates of all joints into the RNN and stack two layers. Adding more layers does not improve performance.
      • Hierarchical RNN
        • Divide human skeleton into 5 parts.
        • Use hierarchical RNN to model the motions of different parts of the body (first layer) and the whole body (second layer).
    • Spatial RNN
      • Nodes denote the joints and edges denote the physical connections.
        • An action appears as varied patterns of spatial structure in this undirected graph.
      • Select a temporal window centered at each time step and feed the coordinates of one joint inside the window at a time, to model the spatial relationships between joints.
      • Three graph representations
        • Undirected graph
        • Chain sequence
        • Traversal sequence
      • The spatial RNN can recognize actions from just one of the graph representations.
    • Data augmentation
      • 3D transformation of skeletons
        • Rotation
        • Scaling
        • Shear
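
The late fusion above (combining softmax class posteriors from the two streams) can be sketched as a weighted average of class probabilities. This is a minimal numpy sketch; the weight `w` is an assumption, not a value from the paper:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(temporal_logits, spatial_logits, w=0.5):
    """Late fusion: combine class posteriors of the two RNN streams.
    `w` weights the temporal stream (0.5 = plain averaging; assumption)."""
    p_t = softmax(temporal_logits)
    p_s = softmax(spatial_logits)
    return w * p_t + (1 - w) * p_s
```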
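
Converting the spatial graph into a joint sequence for the spatial RNN might look like the following depth-first walk; the joint names, edges, and traversal order here are illustrative, not the paper's exact skeleton definition:

```python
# Undirected skeleton graph: joints as nodes, physical connections as edges.
# This small skeleton is illustrative only.
EDGES = {
    "hip": ["spine", "l_knee", "r_knee"],
    "spine": ["hip", "head", "l_elbow", "r_elbow"],
    "head": ["spine"], "l_elbow": ["spine"], "r_elbow": ["spine"],
    "l_knee": ["hip"], "r_knee": ["hip"],
}

def traversal_sequence(root="hip"):
    """Depth-first walk that also records the way back to the parent,
    so edges are visited in both directions (one way to linearise the
    graph into a joint sequence)."""
    seq, seen = [], set()
    def dfs(j):
        seq.append(j)
        seen.add(j)
        for n in EDGES[j]:
            if n not in seen:
                dfs(n)
                seq.append(j)  # return to parent: the "traversal" variant
    dfs(root)
    return seq
```

Dropping the `seq.append(j)` on the way back would give a plain chain-like linearisation instead of the bidirectional traversal.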
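
The 3D skeleton augmentations listed above (rotation, scaling, shear) are plain linear transforms of the joint coordinates. A minimal numpy sketch, where the choice of axes and parameterization is an assumption:

```python
import numpy as np

def rotate_z(theta):
    """Rotation about the z axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def scale(sx, sy, sz):
    """Per-axis scaling."""
    return np.diag([sx, sy, sz])

def shear_xy(k):
    """Shear in the x-y plane: x' = x + k * y (one of several planes)."""
    M = np.eye(3)
    M[0, 1] = k
    return M

def augment(skeleton, M):
    """Apply a 3x3 transform M to a (J, 3) array of joint coordinates."""
    return skeleton @ M.T
```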


  • Take home messages
  • Other methods mentioned
    • Body part based action recognition and Joint based action recognition.
      • Based on hand-crafted low level features, use Markov Random Fields.
      • Fully connected deep LSTM network with regularization terms to learn co-occurrence features of joints.
      • —  These methods
    • RGB based action recognition
      • Hierarchical RNN, RNN with regularizations, differential RNN, and part-aware Long Short-Term Memory.

Notes for Paper “Spatiotemporal Pyramid Network for Video Action Recognition”

Paper:

Wang, Yunbo, et al. “Spatiotemporal Pyramid Network for Video Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

  • Basics
  • Main methods
    • Spatiotemporal pyramid network: the spatial and temporal streams reinforce each other.
    • Hierarchical fusion strategies with a unified spatiotemporal loss, so that the streams maximally complement each other.
    • Tackles a problem of the two-stream method: in most misclassification cases, one stream fails while the other is correct. Presents an end-to-end pyramid architecture that lets the two streams facilitate each other.
    • Temporal part:
      • To learn more global video features (two actions may not be distinguishable in the short term), use multi-path temporal sub-networks to sample optical flow over long sequences, and use different fusion methods to combine the temporal information.
      • Enlarge the video chunks by using multiple CNNs with shared network parameters.
    • Spatio part:
      • If two videos have similar backgrounds, the spatial stream cannot distinguish them, because the background is the strongest feature.
      • Use the temporal part as guidance: inform the spatial network where the motion happens (helping it extract the significant locations on its feature maps).
    • Joint optimization of the temporal and spatial streams
      • Compact bilinear fusion strategy.
    • Details for compact fusion
      • Maximize the information from both parts while maximizing their interaction.
      • Bilinear fusion leads to high-dimensional representations; spatiotemporal compact bilinear (STCB) fusion maps them to a low dimension.
      • STCB preserves the temporal cues that supervise the spatiotemporal attention module.
    • Spatiotemporal Attention
      • Taking advantage of the motion information to locate salient regions on the feature maps.
    • Integrate all the techniques mentioned above in the pyramid architecture.
      • Use STCB three times
        • Bottom of the pyramid: combine multiple optical-flow representations from longer videos, for more global temporal features.
        • Spatiotemporal attention subnet: fuse the spatial feature maps with the motion representations.
        • Top of the pyramid: fuse everything.
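
Compact bilinear fusion approximates the outer product of the two feature vectors in a low dimension via count sketch plus FFT circular convolution. Below is a rough numpy sketch of that general idea, not the paper's STCB implementation; dimensions and seeding are illustrative:

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x (D,) down to d dims: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # unbuffered add handles repeated indices
    return y

def compact_bilinear(x, y, d, seed=0):
    """Approximate the outer product x (x) y in d dimensions:
    sketch both vectors, then circularly convolve the sketches via FFT."""
    rng = np.random.default_rng(seed)
    hx = rng.integers(0, d, x.size)
    sx = rng.choice([-1.0, 1.0], x.size)
    hy = rng.integers(0, d, y.size)
    sy = rng.choice([-1.0, 1.0], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)
```

The payoff is that a full bilinear (outer-product) feature of two D-dimensional vectors costs D^2 dimensions, while this sketch costs only d.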
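
The motion-guided attention can be sketched as scoring each spatial location using its appearance feature concatenated with a pooled motion feature. A single-layer scorer as below is an assumption for illustration; the paper's attention subnet is more elaborate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def motion_guided_attention(spatial_maps, motion_vec, W):
    """spatial_maps: (H*W, D) appearance features, one row per location;
    motion_vec: (M,) pooled temporal feature; W: (D+M,) scoring weights.
    Returns attention weights over locations and the attended feature."""
    HW, D = spatial_maps.shape
    # Pair every location with the same global motion descriptor.
    joint = np.concatenate(
        [spatial_maps, np.tile(motion_vec, (HW, 1))], axis=1)  # (HW, D+M)
    alpha = softmax(joint @ W)        # (HW,) weights: where the motion is
    attended = alpha @ spatial_maps   # (D,) motion-weighted appearance
    return alpha, attended
```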
  • Take home messages
  • Other methods mentioned
    • C3D: 3D convolution filters and 3D pooling layers operating over space and time simultaneously.
    • Two stream networks.
    • Note: the Related Work section is very good!