Notes on the paper "Two-Stream Convolutional Networks for Action Recognition in Videos"

Paper: Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems (NIPS), 2014.

  • Basics
  • Main methods
    • Recognize action from still frames (appearance) vs. recognize action from motion
    • Spatial stream: recognizes objects in a single video frame; the spatial ConvNet is essentially an image classification architecture
    • Temporal stream.
      • Input: several kinds of optical-flow-based input are considered
        • Stacking optical flow displacement fields between consecutive frames
          • Optical flow stacking
            • Samples the displacement vectors at the same image location across multiple frames
          • Trajectory stacking
            • Samples the displacement vectors along the motion trajectory
        • Bidirectional optical flow: compute flow in both temporal directions, forward to the next frame and backward to the previous frame
        • Mean flow subtraction: camera motion compensation is important; as a simple approximation, subtract the mean displacement vector from each displacement field
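The stacked-flow input above can be sketched in NumPy (function name and shapes are my own; the paper stacks L = 10 displacement fields into a 2L-channel volume and uses mean flow subtraction for camera-motion compensation):

```python
import numpy as np

def stack_optical_flow(flows, subtract_mean=True):
    """Stack L displacement fields of shape (H, W, 2) into an
    (H, W, 2L) input volume for the temporal ConvNet."""
    channels = []
    for f in flows:
        f = np.asarray(f, dtype=np.float64)
        if subtract_mean:
            # mean flow subtraction: remove the per-field mean
            # displacement vector (rough camera-motion compensation)
            f = f - f.reshape(-1, 2).mean(axis=0)
        channels.append(f[..., 0])  # horizontal component d_x
        channels.append(f[..., 1])  # vertical component d_y
    return np.stack(channels, axis=-1)

# e.g. 10 flow fields at 224x224 -> a 224x224x20 network input
flows = [np.random.randn(224, 224, 2) for _ in range(10)]
volume = stack_optical_flow(flows)
```

Note that with mean subtraction a spatially constant field (pure camera translation) maps to all zeros, which is exactly the intended effect.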
    • Multi-task learning: addresses the shortage of labeled video data by combining several datasets in a single training run
      • The additional task acts as a regulariser: two softmax classification layers (one per dataset/task), each with its own loss function
      • The overall training loss is computed as the sum of the individual losses
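A toy NumPy sketch of the summed multi-task loss (the head names, class counts, and feature dimension are illustrative, not taken from the paper):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    # numerically stable softmax cross-entropy for one example
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
D = 128  # shared-feature dimension (illustrative)
heads = {  # one softmax classification layer per dataset/task
    "ucf101": rng.standard_normal((D, 101)) * 0.01,
    "hmdb51": rng.standard_normal((D, 51)) * 0.01,
}

def multi_task_loss(batch):
    """Each sample carries the name of the dataset it came from and
    is scored only by that dataset's head; the overall training loss
    is the sum of the per-task losses over the batch."""
    return sum(softmax_cross_entropy(feat @ heads[task], label)
               for task, feat, label in batch)

batch = [("ucf101", rng.standard_normal(D), 5),
         ("hmdb51", rng.standard_normal(D), 12)]
loss = multi_task_loss(batch)
```

Because only the matching head produces a loss for each sample, the shared layers see gradients from both datasets while each classifier stays dataset-specific.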
    • Fusion methods for the temporal and spatial streams
      • Averaging the softmax scores
      • Training a multi-class linear SVM on the stacked softmax scores
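A minimal sketch of score-level fusion by averaging the two streams' softmax outputs (the SVM variant would instead train a multi-class linear SVM on the concatenated scores):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def fuse_by_averaging(spatial_logits, temporal_logits):
    """Average the class posteriors of the spatial and temporal
    ConvNets; the prediction is the argmax of the fused scores."""
    return (softmax(spatial_logits) + softmax(temporal_logits)) / 2.0

spatial = [2.0, 0.5, 0.1]   # per-class scores from the spatial stream
temporal = [0.3, 2.5, 0.2]  # per-class scores from the temporal stream
fused = fuse_by_averaging(spatial, temporal)
```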
  • Take-home message:
    • The temporal-stream input is based on optical flow
    • Use a spatial stream built on image classification and a temporal stream built on optical flow, then fuse them
  • Other methods mentioned
    • Spatio-temporal cuboids: shallow high-dimensional encodings of local spatio-temporal features
      • Detect sparse spatio-temporal interest points and describe them with HOG and HOF (Histograms of Oriented Gradients / Optical Flow)
    • Trajectory-based pipeline, built on optical flow
      • Motion Boundary Histogram (MBH): a gradient-based feature computed on the horizontal and vertical components of the optical flow
      • Compensation of camera motion
      • Fisher vector encoding, and a deeper variant of it
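The MBH idea can be sketched as an orientation histogram over the spatial gradient of one flow component (a toy version; real MBH works on dense per-cell descriptors). Because differentiation removes constant flow, locally constant camera motion cancels out:

```python
import numpy as np

def mbh_histogram(flow_component, n_bins=8):
    """Magnitude-weighted orientation histogram of the spatial
    gradient of one optical-flow channel (d_x or d_y). A toy
    version of the Motion Boundary Histogram descriptor."""
    gy, gx = np.gradient(np.asarray(flow_component, dtype=np.float64))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                       # in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())     # weight by magnitude
    total = hist.sum()
    return hist / total if total > 0 else hist
```

A spatially constant flow field (pure camera translation) has zero gradient everywhere, so it yields an empty histogram, which is why MBH is robust to camera motion.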
    • Deep architectures for video recognition
      • Input is a stack of consecutive video frames
        • The first layer learns spatio-temporal features; an alternative (HMAX) uses predefined filters to obtain separate spatial and temporal recognition streams
