Notes for Paper “Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks”


Wang, Hongsong, and Liang Wang. “Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

  • Basics
    • Skeleton based action recognition
    • Two-stream RNN
    • Two architectures for temporal streams
      • Stacked RNN
      • Hierarchical RNN
    • Model spatial structure by converting spatial graph into a sequence of joints.
    • Obtain 3D skeletons from depth images.
  • Main method
    • End-to-end two-stream RNN
    • Fusion is performed by combining the softmax class posteriors from the two nets.
    • Temporal channel: concatenate the 3D coordinates of all joints at each time step and model the resulting sequence with an RNN.
      • Stacked RNN
        • Feed the concatenated coordinates of all joints into the RNN and stack two layers. Adding more layers does not improve performance.
      • Hierarchical RNN
        • Divide human skeleton into 5 parts.
        • Use hierarchical RNN to model the motions of different parts of the body (first layer) and the whole body (second layer).
    • Spatial RNN
      • Nodes denote the joints and edges denote the physical connections.
        • When an action takes place, the undirected graph displays varied patterns of spatial structure.
      • Select a temporal window centered at the current time step and use each joint's coordinates inside that window as the joint's input, so the RNN can model the spatial relationships between joints.
      • Three graph representations
        • Undirected graph
        • Chain sequence
        • Traversal sequence
      • The spatial RNN can recognize actions from just one graph representation.
    • Data augmentation
      • 3D transformation of skeletons
        • Rotation
        • Scaling
        • Shear
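The three transformations in the data augmentation list can each be written as a 3×3 matrix applied to every joint. Below is a minimal NumPy sketch under stated assumptions: the `(frames, joints, 3)` array layout, the function names, the order in which the matrices are composed, and the sampling ranges in `random_augment` are all illustrative choices, not values taken from the paper.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation by alpha, beta, gamma (radians) about the x, y, z axes."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rz @ ry @ rx  # composition order is an assumption

def scaling_matrix(sx, sy, sz):
    """Independent scaling factor along each axis."""
    return np.diag([sx, sy, sz]).astype(float)

def shear_matrix(sh_xy=0.0, sh_xz=0.0, sh_yx=0.0, sh_yz=0.0, sh_zx=0.0, sh_zy=0.0):
    """Shear: each coordinate is offset by a multiple of the other two."""
    return np.array([[1, sh_xy, sh_xz],
                     [sh_yx, 1, sh_yz],
                     [sh_zx, sh_zy, 1]], dtype=float)

def transform_skeleton(seq, matrix):
    """Apply a 3x3 matrix to every joint of a (frames, joints, 3) sequence."""
    return seq @ matrix.T

def random_augment(seq, rng, max_angle=np.pi / 6, scale_range=(0.9, 1.1), max_shear=0.1):
    """Draw one random transform per sequence (ranges are illustrative only)."""
    rot = rotation_matrix(*rng.uniform(-max_angle, max_angle, size=3))
    scale = scaling_matrix(*rng.uniform(*scale_range, size=3))
    shear = shear_matrix(*rng.uniform(-max_shear, max_shear, size=6))
    return transform_skeleton(seq, rot @ scale @ shear)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skeleton = rng.normal(size=(30, 25, 3))  # 30 frames, 25 joints (made-up data)
    print(random_augment(skeleton, rng).shape)
```

The same matrix is applied to every frame of a sequence, so the motion stays consistent over time while the viewpoint and body proportions change.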

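The fusion step noted under the main method combines the softmax class posteriors of the temporal and spatial networks. A minimal sketch of one way to do this is a weighted average of the two posterior vectors. The `temporal_weight` parameter and its default of 0.5 are assumptions for illustration; the notes do not record how the two streams are weighted.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_posteriors(temporal_logits, spatial_logits, temporal_weight=0.5):
    """Weighted average of the two streams' class posteriors.

    temporal_weight is a hypothetical knob: 1.0 uses only the temporal
    stream, 0.0 uses only the spatial stream.
    """
    p_temporal = softmax(temporal_logits)
    p_spatial = softmax(spatial_logits)
    fused = temporal_weight * p_temporal + (1.0 - temporal_weight) * p_spatial
    return fused, fused.argmax(axis=-1)

if __name__ == "__main__":
    # Made-up scores for one sample and three action classes.
    temporal = np.array([[2.0, 0.0, 0.0]])
    spatial = np.array([[0.0, 1.0, 0.0]])
    fused, predicted = fuse_posteriors(temporal, spatial)
    print(fused, predicted)
```

Because each stream's output is already a probability distribution, the weighted average is one as well, so the fused scores can be read as class probabilities.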

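The spatial RNN needs the skeleton graph flattened into an ordered list of joints, which is what the chain and traversal sequences provide. The sketch below shows one plausible reading of both: a chain that simply concatenates body parts in a fixed order, and a traversal that walks the graph depth-first and revisits a joint when backtracking, so neighbouring entries in the sequence are always physically connected. The toy skeleton, joint names, part order and root joint are hypothetical and much smaller than a real skeleton.

```python
# Toy skeleton: joint -> physically connected joints (hypothetical layout).
TOY_SKELETON = {
    "spine": ["neck", "left_hip", "right_hip"],
    "neck": ["spine", "head", "left_hand", "right_hand"],
    "head": ["neck"],
    "left_hand": ["neck"],
    "right_hand": ["neck"],
    "left_hip": ["spine", "left_foot"],
    "left_foot": ["left_hip"],
    "right_hip": ["spine", "right_foot"],
    "right_foot": ["right_hip"],
}

def chain_sequence(parts):
    """Concatenate body parts in a fixed order; each joint appears once."""
    return [joint for part in parts for joint in part]

def traversal_sequence(adjacency, root):
    """Depth-first tour that records the parent again after each branch."""
    order = [root]
    visited = {root}

    def visit(joint):
        for neighbor in adjacency[joint]:
            if neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                visit(neighbor)
                order.append(joint)  # step back along the same edge

    visit(root)
    return order

if __name__ == "__main__":
    parts = [["left_hand"], ["right_hand"], ["head", "neck", "spine"],
             ["left_hip", "left_foot"], ["right_hip", "right_foot"]]
    print(chain_sequence(parts))
    print(traversal_sequence(TOY_SKELETON, "spine"))
```

The chain is shorter but places joints next to each other that are not physically connected (for example the two hands). The traversal is longer because every edge is walked twice, but it keeps adjacency intact.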


  • Take home messages
  • Other methods mentioned
    • Body part based action recognition and Joint based action recognition.
      • Based on hand-crafted low level features, use Markov Random Fields.
      • Fully connected deep LSTM network with regularization terms to learn co-occurrence features of joints.
      • —  These methods
    • RGB based action recognition
      • Hierarchical RNN, RNN with regularization, differential RNN, and part-aware Long Short-Term Memory (LSTM)
