Notes for Paper “Stacked Hourglass Networks for Human Pose Estimation”


Newell, Alejandro, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation.” European Conference on Computer Vision. Springer, Cham, 2016.


Architecture, intermediate supervision, multi-scale feature learning, small number of parameters

Monday, January 8, 2018 2:01 AM

  • Demo
  • Code
  • Basics
    • Hourglass allows inference across scales.
      • Local evidence: needed to identify features such as faces.
      • Global, full-body understanding: orientation of the person, arrangement of limbs, relationships between joints.
    • Repeated bottom-up, top-down processing.
    • Intermediate supervision
    • Successive pooling and upsampling
  • Main method
    • Pipeline:
      • Conv and max-pooling to process features down to a very low resolution.
        • At its lowest resolution, the network can use small spatial filters that still compare features across the whole image.
      • At each max-pooling step, branch off and apply more conv at the original pre-pooled resolution.
      • Upsampling and combination of features across scales.
        • To combine features across two adjacent resolutions, apply nearest-neighbor upsampling to the lower resolution, then add the two sets of features elementwise.
      • After reaching the output resolution of the network, apply two rounds of 1×1 conv to produce the final network predictions: a set of heatmaps.
        • Output: for each joint, a heatmap that predicts the probability of the joint's presence at each pixel.
      • Detailed layer design involves residual modules.
    • Stacked hourglass with intermediate supervision
      • Feed the output of one hourglass as input into the next.
        • This gives repeated bottom-up, top-down inference, so initial estimates and features can be reevaluated.
        • Key: intermediate heatmaps are predicted so that a loss can be applied to them.
        • High-level features get processed again, which helps capture higher-order spatial relationships.
        • Note: most higher-order features are present only at lower resolutions.
      • For hourglass, local and global cues are integrated within each hourglass module.
      • Maintain precise location information.
      • Note: weights are not shared across hourglass modules.
      • Note: the loss is applied to the predictions of every hourglass, using the same ground truth.
      • Loss during training: mean squared error (MSE) on the heatmaps.
    • Limitations:
      • The highest resolution inside the hourglass, and therefore the output resolution, is 64×64.
      • Occlusion and multiple people.
        • The method makes no use of the additional visibility annotations in the dataset.
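
To make the pipeline notes above concrete, here is a minimal sketch of the hourglass data flow in plain Python: pool down, keep a branch at the pre-pooled resolution, recurse, upsample with nearest neighbor, and add elementwise. It is not the authors' Torch7 code. The function names are made up for illustration, feature maps are single-channel 2D lists, and `process` is an identity placeholder standing in for the learned layers (residual modules in the paper).

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D list (even height and width assumed)."""
    return [
        [max(x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1])
         for c in range(0, len(x[0]), 2)]
        for r in range(0, len(x), 2)
    ]


def upsample_nearest_2x(x):
    """Nearest-neighbor upsampling: each value is repeated in a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Elementwise addition of two feature maps of the same size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def process(x):
    """Placeholder for learned layers (assumption: identity, so the flow is easy to trace)."""
    return [list(row) for row in x]


def hourglass(x, depth):
    """Recursive hourglass: `depth` pooling steps, each with a skip branch."""
    skip = process(x)                  # branch kept at the pre-pooled resolution
    low = process(max_pool_2x2(x))     # bottom-up: go down one scale
    if depth > 1:
        low = hourglass(low, depth - 1)
    else:
        low = process(low)             # lowest resolution of this module
    # top-down: upsample and merge with the skip branch
    return add(skip, upsample_nearest_2x(low))


features = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
merged = hourglass(features, depth=2)
```

The output has the same spatial size as the input, which is what allows several such modules to be stacked end to end.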


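The intermediate-supervision notes above (a loss on every hourglass, the same ground truth, MSE on heatmaps) can be sketched as follows. The target is a 2D Gaussian centered on the joint, which is how the paper builds its ground-truth heatmaps. The function names, the default `sigma`, and the choice to sum the per-stack losses are assumptions of this sketch rather than details taken from the notes.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Ground-truth heatmap: 2D Gaussian peaked at the joint location (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two heatmaps of the same size."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for row_p, row_t in zip(pred, target)
               for p, t in zip(row_p, row_t)) / n


def intermediate_supervision_loss(stack_predictions, target):
    """Apply the same target to every hourglass output and sum the losses (assumption: plain sum)."""
    return sum(mse(pred, target) for pred in stack_predictions)


target = gaussian_heatmap(8, 8, cx=3, cy=4)
zeros = [[0.0] * 8 for _ in range(8)]
# Two stacked hourglasses: the first predicts nothing, the second is perfect.
loss = intermediate_supervision_loss([zeros, target], target)
```

Because every stack is penalized against the same heatmap, early hourglasses are pushed toward a usable estimate that later ones can refine.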
  • Take home message
  • Other methods mentioned
    • Heat map based deep network
      • Contains a graphical model that learns typical spatial relationships between joints.
      • Or: generate unary scores and compare adjacent joints pairwise.
      • Or: cluster detections into typical orientations, which gives additional information about the likely location of a neighboring joint.
    • Successive predictions for pose estimation.
      • Iterative Error Feedback: requires multi-stage training, and the weights are shared across iterations.
      • Note: for many failure cases, refining the position within a local window would not offer much improvement, since error cases often consist of either occluded or misattributed limbs.
    • Use of additional features such as depth or motion cues.
    • Hourglass module before stacking is also related to conv-deconv and encoder-decoder architectures.
  • Machine and software
    • Torch7
    • 12 GB NVIDIA TitanX GPU



  • Evaluation metrics.
    • Percentage of Correct Keypoints (PCK).
      • The percentage of detections that fall within a normalized distance of the ground truth. For FLIC, the distance is normalized by a fraction of the torso size; for MPII, by a fraction of the head size (PCKh).
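
As a rough illustration of the metric described above, the sketch below counts a predicted joint as correct when its distance to the ground truth is at most `alpha` times a per-person reference length (torso size for FLIC, head size for MPII). The function name, the argument layout, and the default `alpha=0.5` are assumptions made for this example.

```python
import math


def pck(predictions, ground_truth, ref_lengths, alpha=0.5):
    """Fraction of joints whose error is within alpha * reference length.

    predictions, ground_truth: one list of (x, y) joints per person.
    ref_lengths: one normalizing length per person (e.g. torso or head size).
    """
    correct = 0
    total = 0
    for pred_joints, gt_joints, ref in zip(predictions, ground_truth, ref_lengths):
        for (px, py), (gx, gy) in zip(pred_joints, gt_joints):
            dist = math.hypot(px - gx, py - gy)
            if dist <= alpha * ref:
                correct += 1
            total += 1
    return correct / total


# One person, two joints, reference length 10 px: threshold is 5 px at alpha=0.5.
preds = [[(10, 10), (20, 20)]]
truth = [[(11, 10), (30, 20)]]
score = pck(preds, truth, ref_lengths=[10.0])
```

Here the first joint is off by 1 px and counts as correct, while the second is off by 10 px and does not, so `score` is 0.5.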
