Paper:
Newell, Alejandro, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation.” European Conference on Computer Vision. Springer, Cham, 2016.
Summary:
Architecture, intermediate supervision, multi-scale feature learning, small number of parameters
Monday, January 8, 2018 2:01 AM
- Demo
- Code
- Basics
- Hourglass allows inference across scales.
- Local evidence is needed to identify features such as faces.
- A global, full-body view captures the orientation of the person, the arrangement of limbs, and the relationships between joints.
- Repeated bottom-up, top-down processing.
- Intermediate supervision
- Successive pooling and upsampling
- Main method
- Pipeline:
- Conv and max-pooling to process features down to a very low resolution.
- At its lowest resolution, the network can use smaller spatial filters that still compare features across the entire image.
- At each max-pooling step, branch off and apply more conv at the original pre-pooled resolution.
- Upsampling and combination of features across scales.
- To combine features across two adjacent resolutions, apply nearest-neighbor upsampling to the lower resolution and add the two sets of features elementwise.
- After reaching the output resolution of the network, two rounds of 1×1 conv produce the final network predictions: a set of heatmaps.
- Output: for each joint, a heatmap that predicts the probability of the joint's presence at each pixel.
- Detailed layer design involves residual modules.
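The pipeline above can be sketched as a small recursive function. This is a minimal NumPy illustration of the data flow only (pool, branch, upsample, add), not the paper's implementation: `conv_stub` is a hypothetical placeholder for the learned conv/residual blocks and simply returns its input, and the depth, feature size, and channel count are illustrative assumptions.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) array with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_nearest_2x(x):
    """Nearest-neighbor upsampling by a factor of 2 along H and W."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv_stub(x):
    """Hypothetical stand-in for a learned conv/residual block (identity here)."""
    return x

def hourglass(x, depth):
    """One hourglass: pool down `depth` times, then upsample and add skip branches."""
    skip = conv_stub(x)                    # branch kept at the pre-pooled resolution
    low = conv_stub(max_pool_2x2(x))       # bottom-up: halve the resolution
    if depth > 1:
        low = hourglass(low, depth - 1)    # recurse to the next lower resolution
    else:
        low = conv_stub(low)               # processing at the lowest resolution
    up = upsample_nearest_2x(conv_stub(low))  # top-down: back to the skip resolution
    return skip + up                       # elementwise addition across scales

# Assumed sizes for illustration: 64x64 features with 8 channels, 4 pooling steps.
features = np.random.default_rng(0).random((64, 64, 8))
out = hourglass(features, depth=4)
print(out.shape)  # (64, 64, 8)
```

Because every pooling step is mirrored by one upsampling step, the output has the same spatial size as the input, which is what allows hourglasses to be stacked end to end.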
- Stacked hourglass with intermediate supervision
- Feed the output of one hourglass as input into the next. This gives repeated bottom-up, top-down inference, so initial estimates and features can be reevaluated.
- Key: intermediate heatmaps are predicted so that a loss can be applied to them.
- Stacking allows high-level features to be processed again to capture higher-order spatial relationships.
- Note: most higher-order features are present only at lower resolutions.
- Local and global cues are integrated within each hourglass module.
- Maintain precise location information.
- Note: weights are not shared across hourglass modules.
- Note: a loss is applied to the prediction of every hourglass, using the same ground truth.
- Loss during training: mean squared error (MSE) between predicted and ground-truth heatmaps.
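The loss notes above can be made concrete with a short NumPy sketch. It assumes the ground-truth heatmap is a 2D Gaussian centered on the joint, and that the per-hourglass MSE terms are simply summed; the function names, the `sigma` default, and the noisy stand-in predictions are illustrative assumptions, not taken from these notes.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Assumed ground-truth form: a 2D Gaussian centered on joint location (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def intermediate_supervision_loss(stack_predictions, target):
    """Sum of per-hourglass MSE terms, each compared against the same target."""
    return sum(float(np.mean((pred - target) ** 2)) for pred in stack_predictions)

# One joint on a 64x64 heatmap, with two hypothetical stack outputs.
target = gaussian_heatmap(64, 64, cx=20, cy=30)
rng = np.random.default_rng(0)
preds = [target + rng.normal(0.0, s, target.shape) for s in (0.3, 0.1)]
loss = intermediate_supervision_loss(preds, target)
```

Every hourglass contributes its own term, so earlier hourglasses receive a direct training signal rather than only gradients passed back from the final prediction.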
- Limitations:
- Highest resolution of the hourglass (input and output): 64×64.
- Struggles with occlusion and with multiple people in the image.
- Makes no use of the additional visibility annotations in the dataset.
- Take home message
- Other methods mentioned
- Heat map based deep network
- Contains a graphical model that learns typical spatial relationships between joints.
- Or: generate unary scores and compare adjacent joints pairwise.
- Or: cluster detections into typical orientations, which gives additional information about the likely location of a neighboring joint.
- Successive predictions for pose estimation.
- Iterative Error Feedback: requires multi-stage training, and the weights are shared across iterations.
- Note: for many failure cases, refining the position within a local window would not offer much improvement, since error cases often consist of either occluded or misattributed limbs.
- Use of additional features such as depth or motion cues.
- Hourglass module before stacking is also related to conv-deconv and encoder-decoder architectures.
- Machine and software
- Torch7
- 12 GB NVIDIA TitanX GPU
- Evaluation.
- Dataset:
- FLIC https://bensapp.github.io/flic-dataset.html
- 5003 images taken from films.
- MPII human http://human-pose.mpi-inf.mpg.de/
- 25K images containing over 40K people with annotated body joints. Covers 410 human activities.
- Note: in this paper, only the person at the center of the image is chosen as the target person.
- Evaluation metrics.
- Percentage of Correct Keypoints (PCK).
- The percentage of detections that fall within a normalized distance of the ground truth. For FLIC, the distance is normalized by a fraction of the torso size; for MPII, by a fraction of the head size.
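The metric described above can be sketched in a few lines of NumPy. The function name `pck`, the array layout, and the threshold fraction `alpha=0.5` are illustrative assumptions; `norm_size` stands for the per-person torso size (FLIC) or head size (MPII).

```python
import numpy as np

def pck(pred, gt, norm_size, alpha=0.5):
    """Fraction of keypoints whose error is within alpha * norm_size.

    pred, gt:  (N, K, 2) predicted and ground-truth (x, y) keypoints.
    norm_size: (N,) per-person normalization length (torso or head size).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)                    # (N, K) pixel errors
    thresh = alpha * np.asarray(norm_size, dtype=float)[:, None]  # (N, 1) per-person limit
    return float(np.mean(dist <= thresh))

# One person, two joints: errors of 1 px and 10 px against a 5 px threshold.
gt = [[[10, 10], [20, 20]]]
pred = [[[11, 10], [30, 20]]]
score = pck(pred, gt, norm_size=[10], alpha=0.5)
print(score)  # 0.5
```

Normalizing by body size makes the threshold scale with how large the person appears in the image, so the same pixel error counts for less on a close-up than on a distant figure.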