Notes for Paper “Stacked Hourglass Networks for Human Pose Estimation”

Paper:

Newell, Alejandro, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation.” European Conference on Computer Vision. Springer, Cham, 2016.

Summary:

Architecture, intermediate supervision, multi-scale feature learning, small number of parameters

Monday, January 8, 2018 2:01 AM

  • Demo
  • Code
  • Basics
    • Hourglass allows inference across scales.
      • Local evidence is needed to identify features such as faces.
      • A global, full-body view is needed for the orientation of the person, the arrangement of limbs, and the relationships between joints.
    • Repeated bottom-up, top-down processing.
    • Intermediate supervision
    • Successive pooling and upsampling
  • Main method
    • Pipeline:
      • Conv and max-pooling to process features down to a very low resolution.
        • The network reaches its lowest resolution at 4x4 pixels, which allows smaller spatial filters to compare features across the entire image.
      • At each max-pooling step, branch off and apply more conv at the original pre-pooled resolution.
      • Upsampling and combination of features across scales.
        • To combine features across two adjacent resolutions, do nearest-neighbor upsampling of the lower resolution, then an elementwise addition of the two sets of features.
      • After reaching the output resolution of the network, two rounds of 1x1 conv produce the final network predictions: a set of heatmaps.
        • Output: heatmaps that predict the probability of a joint’s presence at each pixel.
      • Detailed layer design involves residual modules.
    • Stacked hourglass with intermediate supervision
      • Feed the output of one hourglass as input into the next.
        • This gives repeated bottom-up, top-down inference, so initial estimates and features can be reevaluated.
        • Key: intermediate heatmaps are predicted so that a loss can be applied to them.
        • High-level features are processed again to capture higher-order spatial relationships.
        • Note: most higher-order features are present only at lower resolutions.
      • For hourglass, local and global cues are integrated within each hourglass module.
      • Maintain precise location information.
      • Note: weights are not shared across hourglass modules.
      • Note: a loss is applied to the predictions of every hourglass, using the same ground truth.
      • Loss during training: mean squared error (MSE) between the predicted and ground-truth heatmaps.
    • Limitations:
      • Highest resolution at the hourglass input and output: 64x64.
      • Occlusion and multiple people remain difficult.
        • Makes no use of the additional visibility annotations in the dataset.
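
The pipeline and loss described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' Torch7 code. Feature maps are single-channel 2D lists whose sides are divisible by 2^depth. `process` is a hypothetical identity stand-in for the conv and residual layers, so there are no weights here; a real network would give each hourglass its own weights. The Gaussian target (`gaussian_heatmap`, `sigma`) is an assumed form of the ground-truth heatmap that these notes do not specify, and `stacked_loss` skips the 1x1 conv prediction heads. All function names are made up for illustration.

```python
import math

def max_pool_2x2(x):
    """Halve resolution: max over each non-overlapping 2x2 window."""
    return [[max(x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1])
             for c in range(0, len(x[0]), 2)]
            for r in range(0, len(x), 2)]

def upsample_nearest_2x(x):
    """Double resolution: repeat every value into a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out

def add(a, b):
    """Elementwise addition of two equally sized feature maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def process(x):
    """Hypothetical stand-in for the conv/residual layers (identity here)."""
    return [list(row) for row in x]

def hourglass(x, depth):
    """One hourglass: pool down `depth` times, then upsample and merge."""
    skip = process(x)                    # branch kept at the pre-pooled resolution
    low = process(max_pool_2x2(x))       # bottom-up step
    if depth > 1:
        low = hourglass(low, depth - 1)  # recurse to lower resolutions
    return add(skip, upsample_nearest_2x(low))  # top-down merge

def gaussian_heatmap(size, cx, cy, sigma=1.0):
    """Assumed target: a 2D Gaussian centred on the joint at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def mse(pred, target):
    """Mean squared error between two heatmaps of the same size."""
    diffs = [(p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    return sum(diffs) / len(diffs)

def stacked_loss(x, target, num_stacks, depth):
    """Feed each hourglass output into the next; apply the loss to every one."""
    total = 0.0
    for _ in range(num_stacks):
        x = hourglass(x, depth)   # output of one stack is the input to the next
        total += mse(x, target)   # same ground truth for every stack
    return total
```

For example, `stacked_loss(features, gaussian_heatmap(8, cx=3, cy=4), num_stacks=2, depth=3)` on an 8x8 `features` map pools down to 1x1 inside each hourglass and sums one MSE term per stack.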

 

  • Take home message
  • Other methods mentioned
    • Heatmap-based deep networks
      • Contain a graphical model that learns typical spatial relationships between joints.
      • Or: generate unary scores and compare adjacent joints pairwise.
      • Or: cluster detections into typical orientations, so that a detection gives additional information about the likely location of a neighboring joint.
    • Successive predictions for pose estimation.
      • Iterative Error Feedback: requires multi-stage training, and the weights are shared across iterations.
      • Note: in many failure cases, refining a position within a local window would not offer much improvement, since error cases often consist of either occluded or misattributed limbs.
    • Use of additional features such as depth or motion cues.
    • A single hourglass module (before stacking) is also related to conv-deconv and encoder-decoder architectures.
  • Machine and software
    • Torch7
    • 12 GB NVIDIA TitanX GPU

 

 

  • Evaluation metrics.
    • Percentage of Correct Keypoints (PCK).
      • The percentage of detections that fall within a normalized distance of the ground truth. For FLIC, the distance is normalized by a fraction of the torso size; for MPII, by a fraction of the head size.
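
The metric above can be illustrated with a short, hedged sketch. The function name `pck`, the threshold parameter `alpha`, and the per-keypoint `reference_lengths` list (torso size for FLIC, head size for MPII) are hypothetical choices for this example, not the benchmarks' official evaluation code.

```python
import math

def pck(predictions, ground_truth, reference_lengths, alpha=0.5):
    """Fraction of predicted keypoints within alpha * reference length of the truth.

    predictions, ground_truth: lists of (x, y) pairs, one per keypoint.
    reference_lengths: normalizing length for each keypoint's person.
    alpha: assumed threshold fraction; the right value depends on the benchmark.
    """
    correct = 0
    for (px, py), (gx, gy), ref in zip(predictions, ground_truth, reference_lengths):
        if math.hypot(px - gx, py - gy) <= alpha * ref:
            correct += 1
    return correct / len(predictions)

preds = [(10, 10), (20, 20), (30, 30), (40, 40)]
truth = [(10, 12), (20, 30), (31, 31), (40, 40)]
score = pck(preds, truth, reference_lengths=[10, 10, 10, 10], alpha=0.5)
print(score)  # 0.75: three of the four keypoints are within 5 pixels
```

Normalizing by a body-part length keeps the threshold comparable across people who appear at different scales in the image.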
