Notes for Paper “Compositional human pose regression”


Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV), 2017.

Key: Structure-aware

      • Performance:
        • 48.3 mm on H3.6M Protocol 1 (average joint error)
        • 59.1 mm on H3.6M Protocol 2 (average joint error)
        • 86.4% PCKh@0.5 on MPII
      • Evaluation
        • Metrics:
          • Absolute:
            • 3D: MPJPE, with and without Procrustes alignment (see the sketch below)
            • 2D: PCK
          • Relative:
            • 2D: mean per-bone position error
            • 3D: bone-length standard deviation and the percentage of illegal joint angles
        • Datasets: MPII, H3.6M
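        • A minimal sketch of the 3D metric, assuming (J, 3) joint arrays in millimeters; the Procrustes step is the standard similarity alignment, not code from the paper:

            import numpy as np

            def mpjpe(pred, gt):
                # Mean per joint position error: average Euclidean distance over joints.
                return np.linalg.norm(pred - gt, axis=-1).mean()

            def procrustes_align(pred, gt):
                # Similarity-align pred to gt (rotation, scale, translation) for the
                # aligned protocol; standard Umeyama alignment, assumed here.
                mu_p, mu_g = pred.mean(0), gt.mean(0)
                p, g = pred - mu_p, gt - mu_g
                U, s, Vt = np.linalg.svd(p.T @ g)   # SVD of the cross-covariance
                R = Vt.T @ U.T
                if np.linalg.det(R) < 0:            # avoid reflections
                    Vt[-1] *= -1
                    s[-1] *= -1
                    R = Vt.T @ U.T
                scale = s.sum() / (p ** 2).sum()
                return scale * p @ R.T + mu_g

            # Aligned protocol: mpjpe(procrustes_align(pred, gt), gt)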
      • Basics
        • Structure-aware approach
        • Uses bones instead of joints as the pose representation.
        • Uses the joint connection structure to define a compositional loss function.
        • Only re-parameterizes the pose representation, so it is compatible with any other algorithm design.
        • Works for both 3D and 2D pose.
      • Main method
        • Uses the L1 norm for joint regression (instead of the squared L2 distance).
        • Bone-based representation.
          • Bones are easier to learn than joints, and they express constraints (e.g., bone length) more easily.
          • Many pose-driven applications only need local bones, not global joint positions.
        • Uses the L1 norm for the bone loss function.
        • A bone is a vector from one joint to its parent joint; the relative position between any two joints is then the sum of the bones along the kinematic path between them (see the sketch after this list).
        • Network
          • ResNet-50 pre-trained on ImageNet
          • The last FC layer outputs the 3D (or 2D) joint coordinates.
          • Fine-tuned on the task
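        • A minimal sketch of the bone re-parameterization and an L1 bone loss, assuming (B, J, D) joint tensors and an illustrative parent array (not the paper's skeleton or code):

            import torch

            # Illustrative 6-joint kinematic tree: joint 0 is the root (its own parent).
            parent = torch.tensor([0, 0, 1, 2, 0, 4])

            def joints_to_bones(joints):
                # Re-parameterize joints (B, J, D) as bones: vectors from parent to child.
                return joints - joints[:, parent, :]

            def bone_l1_loss(pred_joints, gt_joints):
                # L1 loss on the bone representation instead of absolute joint positions.
                return (joints_to_bones(pred_joints) - joints_to_bones(gt_joints)).abs().mean()

            # The relative position of joint 3 w.r.t. the root is the sum of the bones
            # along the path: bones[:, 3] + bones[:, 2] + bones[:, 1].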

      • Other methods mentioned
        • Detection based and regression based
          • The heatmaps are usually noisy and multi-modal.
        • Problem: both simply minimize per-joint location errors independently and ignore the internal structure of the pose.
        • 3D pose estimation
          • Without prior knowledge of a 3D model
            • Use two separate steps: first predict the 2D joints, then reconstruct the 3D pose via optimization or search.
            • [20] (Sparseness Meets Deepness) combines uncertainty maps of the 2D joint locations with a sparsity-driven 3D geometric prior, and infers the 3D joint locations via an EM (expectation-maximization) algorithm.
            • Represent the 3D pose with an over-complete dictionary; use a high-dimensional latent pose representation.
            • Extend Hourglass from 2D to 3D
          • With prior knowledge of a 3D model
            • Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joint positions.
              • The kinematic model parameterization is highly non-linear and its optimization in deep networks is hard.
        • 2D pose estimation
          • Pure graphical / inference models.
            • PS model
          • Graphical model with CNN
      • Evaluation
        • Dataset: H3.6M
        • Metrics:
          • 59.1 mm average joint error (H3.6M)
          • 86.4% PCKh@0.5 (MPII)
      • Coding
        • Caffe
        • Two GPUs

Notes for Paper “A limb based graphical model for human pose estimation”


Liang, Guoqiang, et al. “A limb-based graphical model for human pose estimation.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).

      • Code not available
      • Caffe
      • NVIDIA Tesla K40m GPU
      • Basics
        • New task: Human limb detection
          • Detect and represent the local image appearance.
        • Use human limbs to augment constraints between neighboring human joints.
        • Design a new limb representation: Model a limb as a wide line.
      • Main method: a ConvNet consisting of two modules, a limb-and-joint detector and a limb-based graphical model; both output heatmaps and are trained with a Euclidean (L2) distance loss. (A wide-line limb-heatmap sketch follows this list.)
        • Unified framework detector: VGG16 architecture.
          • Human limb detection combined with joint localization
          • Integrate the two detection processes in a single CNN
        • After the initial detections, a two-step graphical model.
          • Captures the spatial relationships among human joints, and among limbs, in a coarse-to-fine way.
          • First step: a fully-connected graphical model captures the coarse relations between arbitrary pairs of joints.
          • Second step: constructs a new pairwise relation term based on the limbs.
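      • A minimal sketch of the “wide line” limb representation rendered as a heatmap; the half-width and the Gaussian fall-off are assumptions, not the paper's exact recipe:

          import numpy as np

          def limb_heatmap(p, q, h, w, half_width=4.0):
              # Render the limb (segment p -> q, pixel coords) as an (h, w) heatmap.
              ys, xs = np.mgrid[0:h, 0:w]
              pts = np.stack([xs, ys], axis=-1).astype(float)   # (h, w, 2)
              p, q = np.asarray(p, float), np.asarray(q, float)
              d = q - p
              # Project every pixel onto the segment, clamped to the endpoints.
              t = np.clip(((pts - p) @ d) / max(d @ d, 1e-8), 0.0, 1.0)
              dist = np.linalg.norm(pts - (p + t[..., None] * d), axis=-1)
              # Soft wide line: 1 on the segment, decaying past the half-width.
              return np.exp(-0.5 * (dist / half_width) ** 2)

          # Example: a lower-arm limb from elbow (20, 30) to wrist (50, 60) on a 64x64 map.
          hm = limb_heatmap((20, 30), (50, 60), 64, 64)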
      • Other methods mentioned
        • Define the relationship as a geometric constraint on the relative locations of two neighboring joints.
          • Does not use the local appearance (the image region itself) connecting two neighboring joints.
          • Leads to problems: double counting and localization failure.
        • PS model (Pictorial Structures)
          • Most popular and influential model.
          • Model a human limb as a rigid oriented rectangle.
          • Model a human limb as a bar, detected by searching for parallel edges.
          • Model a limb with 2 joints. Or add an extra joint at the middle point.
          • Use image segmentation methods to distinguish limbs from background.
        • ConvNet based pose estimation
          • Extract appearance and type score.
          • Heat-map
            • Heat-map based methods treat pose estimation as a per-pixel classification problem with large contextual information.
          • Use a ConvNet to learn an MRF-based graphical model.
        • Add motion features
        • For Spatial relations:
          • Tree structure.
        • Appearance and relation models.
          • The relations among human parts are defined as geometric constraints on the locations and orientations of parts.
            • Spring-like model
            • Conditional probability of joints location
          • Note: For joints with higher flexibility, the constraint is too weak.
        • Graphical model over parts.
          • Nodes representing parts
          • Edges encoding constraints.
          • Note: limited by hand-crafted features and tree-structured graphical models, the accuracy was not good.


      • Evaluation
        • PCP 74.6 on LSP
        • Datasets: FLIC, LSP

Notes for Paper “Towards Accurate Multi-person Pose Estimation in the Wild”


Papandreou, George, et al. “Towards accurate multi-person pose estimation in the wild.” arXiv preprint arXiv:1701.01779 (2017).


ResNet for keypoints: heatmap and offset

  • Performance
  • Basics
    • No ground truth for the location or scale of the person is assumed.
    • Top-down approach.
  • Main methods
    • Pipeline:
      • Person box detection using Faster-RCNN (ResNet-101).
        • CNN backbone pre-trained on ImageNet
          • No multi-scale evaluation.
      • Person pose estimation: ResNet-101 predicts heatmaps and offsets.
        • K=17 keypoints.
        • Classification && regression
          • First classify, for each position, whether it lies in the neighborhood (a disk) of any keypoint (the heatmap).
          • Then regress a 2D local offset vector pointing to the precise keypoint location. (A simple decoding sketch follows this list.)
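  • A minimal decoding sketch, assuming heatmaps `heat` of shape (K, H, W), offsets `offset` of shape (K, 2, H, W) holding (dx, dy) vectors toward each keypoint, and an assumed output stride; a simplified argmax decoder, not the paper's Hough-voting aggregation:

      import numpy as np

      def decode_keypoints(heat, offset, stride=8):
          # Pick the highest-probability cell per keypoint, then refine it with
          # the predicted local offset; returns (K, 2) image coordinates (x, y).
          K, H, W = heat.shape
          keypoints = np.zeros((K, 2))
          for k in range(K):
              y, x = divmod(heat[k].argmax(), W)
              dx, dy = offset[k, 0, y, x], offset[k, 1, y, x]
              keypoints[k] = (x * stride + dx, y * stride + dy)
          return keypoints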

  • Take home message
  • Other methods mentioned
  • Evaluation
    • Object Keypoint Similarity (OKS)
    • COCO

Notes for Paper “MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild”


Rogez, Grégory, and Cordelia Schmid. “Mocap-guided data augmentation for 3d pose estimation in the wild.” Advances in Neural Information Processing Systems. 2016.


Dummy human pose for augmentation

  • Basics
    • Data augmentation for 3D pose estimation.
    • Input: 3D motion capture (MoCap) data.
    • Combines selected images into a new synthetic image by stitching local image patches under kinematic constraints.
    • Clusters the training data into a large number of pose classes, turning pose estimation into a K-way classification problem.
  • Main methods
    • Cluster 3D poses into K pose classes, then generate “dummy” pose images; it is enough that the shape outline looks like a human pose.
    • Input: two training sources: images with annotated 2D pose && 3D MoCap data.
    • Two processes:
      • MoCap-guided mosaic construction: stitches image patches together.
        • Input: a 3D pose with n joints && its projected 2D joints in one view.
        • Output: for each joint of the pose, a matching local patch found among the 2D-annotated images.
        • Compute the transformation of the joint's location from one pose to another, and measure the similarity between the joint in the second pose and the joint aligned from the first pose to the second.
        • Neighboring joints receive higher weight in the similarity measure.
        • Transform the cropped image to the target pose and select the best patch to form the new image.
      • Pose-aware blending: improves image quality and erases patch seams. (See the blending sketch after this list.)
        • Resolves the boundaries between image regions.
        • For each pixel, select a surrounding square region, evaluate how much each image should contribute to the pixel, and compute the final value as a weighted sum over all aligned images.
    • CNN for full-body 3D pose estimation
      • Shows that with only synthetic data, we can still obtain good performance.
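  • A minimal sketch of the blending step as a per-pixel weighted sum, assuming the aligned images and their per-pixel contribution weights have already been computed by the mosaic step (the weight computation itself is not shown):

      import numpy as np

      def blend(aligned, weights, eps=1e-8):
          # aligned: (N, H, W, 3) aligned images; weights: (N, H, W) contributions.
          w = weights / (weights.sum(axis=0, keepdims=True) + eps)  # normalize per pixel
          return (aligned * w[..., None]).sum(axis=0)               # (H, W, 3) result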
  • Take home messages.
  • Other methods mentioned.
    • Data augmentation
      • Jittering
      • Complex affine transformations
    • 3D pose estimation
      • CNNs — trained on 3D MoCap data in constrained environments.
      • Estimate 3D pose from 2D pose data.
        • 2D pose detector
        • Or jointly learn 2D and 3D pose.
        • Dual source approach — combines 2D pose estimation and 3D pose retrieval.
      • Synthetic pose data.




Notes for Paper “Stacked Hourglass Networks for Human Pose Estimation”


Newell, Alejandro, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation.” European Conference on Computer Vision. Springer, Cham, 2016.


Architecture, intermediate supervision, multi-scale feature learning, small number of parameters


  • Demo
  • Code
  • Basics
    • Hourglass allows inference across scales.
      • Local evidence for identifying features like faces.
      • Global, full-body reasoning: orientation of the person, arrangement of limbs, relationships of joints.
    • Repeated bottom-up, top-down processing.
    • Intermediate supervision
    • Successive pooling and upsampling
  • Main method
    • Pipeline:
      • Conv and max-pooling process features down to a very low resolution.
        • At its lowest resolution, the network can apply small spatial filters that compare features across the whole image.
      • At each max-pooling step, branch off and apply more convolutions at the original pre-pooled resolution.
      • Upsampling and combination of features across scales. (A minimal module sketch follows the Limitations list.)
        • To combine two adjacent resolutions, do nearest-neighbor upsampling of the lower resolution and an elementwise addition of the two sets of features.
      • After reaching the output resolution of the network, two rounds of 1×1 conv produce the final predictions: a set of heatmaps.
        • Output: heatmaps that predict the probability of a joint's presence at each pixel.
      • The detailed layer design uses residual modules.
    • Stacked hourglass with intermediate supervision
      • Feeding the output of one hourglass as input to the next gives repeated bottom-up, top-down inference and lets initial estimates and features be re-evaluated. Key: intermediate heatmaps are predicted after each hourglass so that a loss can be applied. This allows high-level features to be processed again to capture higher-order spatial relationships. Note: most higher-order features are present only at lower resolutions.
      • For hourglass, local and global cues are integrated within each hourglass module.
      • Maintain precise location information.
      • Note: for hourglass, weights are not shared across hourglass modules.
      • Note: the loss is applied to every hourglass, against the same ground truth.
      • Training loss: mean squared error (MSE) on the heatmaps.
    • Limitations:
      • Highest input and output resolution: 64×64.
      • Occlusion and multiple people.
        • Make no use of the additional visibility annotations in the dataset.
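  • A minimal recursive hourglass sketch in PyTorch, using plain conv blocks as stand-ins for the paper's residual modules; it shows the pooling, the pre-pooled skip branch, nearest-neighbor upsampling, and the elementwise addition described above:

      import torch.nn as nn

      def conv_block(c):
          # Stand-in for the paper's residual module.
          return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

      class Hourglass(nn.Module):
          def __init__(self, depth, channels):
              super().__init__()
              self.skip = conv_block(channels)    # branch at the pre-pooled resolution
              self.pool = nn.MaxPool2d(2)
              self.down = conv_block(channels)
              self.inner = Hourglass(depth - 1, channels) if depth > 1 else conv_block(channels)
              self.up = nn.Upsample(scale_factor=2, mode='nearest')

          def forward(self, x):
              low = self.inner(self.down(self.pool(x)))
              return self.skip(x) + self.up(low)  # elementwise addition across scales

      # After each hourglass, 1x1 convs produce per-joint heatmaps (16 MPII joints and
      # 256 channels assumed); stacking feeds each output into the next hourglass,
      # with a loss applied to every intermediate heatmap.
      heatmap_head = nn.Conv2d(256, 16, kernel_size=1)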


  • Take home message
  • Other methods mentioned
    • Heat map based deep network
      • Contains a graphical model that learns typical spatial relationships between joints.
      • Or: unary score generation plus pairwise comparisons of adjacent joints.
      • Or: cluster detections into typical orientations, which gives additional information about the likely location of a neighboring joint.
    • Successive predictions for pose estimation.
      • Iterative Error Feedback: requires multi-stage training, and the weights are shared across iterations.
      • Note: for many failure cases, refining the position within a local window would not offer much improvement, since errors often consist of occluded or misattributed limbs.
    • Use of additional features such as depth or motion cues.
    • Hourglass module before stacking is also related to conv-deconv and encoder-decoder architectures.
  • Machine and software
    • Torch7
    • 12 GB NVIDIA TitanX GPU



  • Evaluation metrics.
    • Percentage of Correct Keypoints (PCK).
      • The percentage of detections that fall within a normalized distance of the ground truth: for FLIC the distance is normalized by a fraction of torso size; for MPII, by a fraction of head size (PCKh). (A small sketch follows.)
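  • A minimal PCK sketch, assuming (N, J, 2) prediction and ground-truth arrays and a per-image normalization length (torso size for FLIC, head size for MPII's PCKh):

      import numpy as np

      def pck(pred, gt, norm_len, alpha=0.5):
          # Fraction of joints whose error falls below alpha * normalization length.
          dist = np.linalg.norm(pred - gt, axis=-1)         # (N, J) joint errors
          return (dist < alpha * norm_len[:, None]).mean()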