Notes for Paper “Associative Embedding: End-to-End Learning for Joint Detection and Grouping”


Newell, Alejandro, Zhiao Huang, and Jia Deng. “Associative embedding: End-to-end learning for joint detection and grouping.” Advances in Neural Information Processing Systems. 2017.

  • Performance
  • Basics
    • Associative embedding
    • Jointly perform detections and grouping using a single-stage deep network trained end-to-end
    • For each detection, introduce a “tag” (a real number) that identifies which group the detection belongs to.
      • Note: There are no ground-truth tags for the network to predict, because what matters is not the particular tag values but only the differences between them.
    • Output: two heatmaps
      • A heatmap of per-pixel detection scores (the detection score at each pixel for each joint).
      • A heatmap of per-pixel identity tags (the tag value at each pixel for each joint).
      • For multi-person pose estimation, output a detection heatmap and a tagging heatmap for each body joint, then group body joints with similar tags into individual people.
    • Two loss functions, used together
      • Detection loss: mean squared error (MSE) between each predicted detection heatmap and its ground-truth heatmap (a 2D Gaussian activation at each keypoint location).
      • Grouping loss: compare the tags within each person and across people. Tags within a person should be the same, while tags across people should be different.
  • Other methods mentioned.
    • Vector embedding
    • Perceptual organization: group pixels of an image into regions, parts and objects.
    • Multi-person pose estimation
    • Instance segmentation
  • Evaluation
  • Questions: How to get the tags of training data?
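A minimal sketch of the grouping loss described above, to make the "same within a person, different across people" idea concrete. The specific form (a squared-distance pull term toward each person's mean tag plus a Gaussian push term between mean tags) is my reading of the paper and should be treated as an assumption; the function name `grouping_loss` and the parameter `sigma` are hypothetical. Note that the loss needs only the predicted tags read out at ground-truth joint locations, never a target tag value.

```python
import math


def grouping_loss(tags_per_person, sigma=1.0):
    """Return (pull, push) for scalar tags.

    tags_per_person: list of lists; each inner list holds the predicted
    tags sampled at one person's ground-truth joint locations.
    Assumed form, not taken verbatim from the paper.
    """
    n = len(tags_per_person)
    # Reference tag of each person: the mean of that person's joint tags.
    means = [sum(tags) / len(tags) for tags in tags_per_person]

    # Pull term: tags of one person should match that person's mean.
    pull = sum(
        sum((t - m) ** 2 for t in tags) / len(tags)
        for tags, m in zip(tags_per_person, means)
    ) / n

    # Push term: mean tags of different people should be far apart.
    push = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = means[i] - means[j]
                push += math.exp(-(diff ** 2) / (2 * sigma ** 2))
    push /= n * n
    return pull, push
```

With well-separated tags such as `[[1.0, 1.0], [5.0, 5.0]]` both terms are close to zero, while tags that mix people, such as `[[1.0, 5.0], [1.0, 5.0]]`, are penalized by both terms.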

Notes for Paper “Compositional human pose regression”


Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV). Vol. 2. 2017.

Key: Structure-aware

      • Performance:
        • 48.3mm on H3.6M Protocol 1 (Avg joint error)
        • 59.1mm on H3.6M Protocol 2 (Avg joint error)
        • PCK(0.5) 86.4 on MPII
      • Evaluation
        • Metrics:
          • Absolute
            • 3D: Procrustes Analysis + MPJPE
            • 2D: PCK
          • Relative:
            • 2D: Mean per bone position error
            • 3D pose: bone length standard deviation and the percentage of illegal joint angles.
        • MPII, H3.6M
      • Basics
        • Structure-aware approach
        • Use bones instead of joints as pose representation.
        • Use joint connection structure to define a compositional loss function.
        • Just re-parameterizes the pose representation. Compatible with any other algorithm design.
        • Both 3D and 2D
      • Main method
        • Use the L1 norm for joint regression (instead of squared distance).
        • Bone-based representation.
          • Bones are easier to learn than joints, and they express constraints more easily.
          • Many pose-driven applications only need local bones, not global joints.
        • Use the L1 norm for the bone loss function.
        • A bone is a vector from one joint to another. The relative position of two joints is then the sum of the bones along the path between them.
        • Network
          • ResNet-50 pre-trained on ImageNet
          • The last FC layer outputs 3D (or 2D) coordinates
          • Fine-tuned on the task
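A small sketch of the bone representation from the "Main method" notes: each bone is the vector from a parent joint to its child, and a root-relative joint position is recovered by summing bones along the path to the root. The 5-joint skeleton in `PARENT` and all function names are hypothetical, chosen only for illustration; the paper's skeleton and exact loss weighting are not reproduced here.

```python
# Hypothetical 5-joint skeleton: PARENT[k] is the parent of joint k, -1 marks the root.
PARENT = [-1, 0, 1, 2, 1]


def joints_to_bones(joints, parent=PARENT):
    """Bone k is the vector from the parent joint to joint k (zero for the root)."""
    bones = []
    for k, p in enumerate(parent):
        if p < 0:
            bones.append([0.0] * len(joints[k]))
        else:
            bones.append([a - b for a, b in zip(joints[k], joints[p])])
    return bones


def bones_to_joints(bones, parent=PARENT):
    """Root-relative joint position = sum of the bones on the path up to the root."""
    joints = []
    for k in range(len(parent)):
        pos = [0.0] * len(bones[k])
        j = k
        while parent[j] >= 0:
            pos = [a + b for a, b in zip(pos, bones[j])]
            j = parent[j]
        joints.append(pos)
    return joints


def l1_loss(pred, target):
    """Sum of absolute coordinate differences over all bones (or joints)."""
    return sum(abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t))
```

The same `l1_loss` can be applied to bones directly or to joint positions rebuilt with `bones_to_joints`, which is one way to read "use the joint connection structure to define a compositional loss".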

      • Other methods mentioned
        • Detection based and regression based
          • The heatmaps are usually noisy and multi-mode
        • Problem: these methods simply minimize per-joint location errors independently and ignore the internal structure of the pose.
        • 3D pose estimation
          • Methods that do not use prior knowledge of a 3D model
            • Use two separate steps: First do 2D joint prediction, then re-construct the 3D pose via optimization or search.
            • [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joints location and a sparsity-driven 3D geometric prior to infer the 3D joint location via an EM (expectation maximization) algorithm
            • Represents 3D pose with an over-complete dictionary, use high-dim latent pose representation
            • Extend Hourglass from 2D to 3D
          • Methods that use prior knowledge of a 3D model
            • Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joints.
              • The kinematic model parameterization is highly non-linear and its optimization in deep networks is hard.
        • 2D pose estimation
          • Pure Graphical models, inference models.
            • PS model
          • Graphical model with CNN
      • Evaluation
        • Dataset: H3.6M
        • Metrics:
          • 59.1 mm Average joint error.
          • 86.4% PCKh@0.5
      • Coding
        • Caffe
        • Two GPUs
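For reference, a minimal sketch of the two absolute metrics named in these notes, using their usual definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PCK counts a joint as correct when its error is within a fraction `alpha` of a reference length (the head size for PCKh). The Procrustes alignment step used in one H3.6M protocol is not included, and the function and parameter names are my own.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of joints whose error is at most alpha * ref_length."""
    hits = sum(math.dist(p, g) <= alpha * ref_length for p, g in zip(pred, gt))
    return hits / len(gt)
```

`mpjpe` works unchanged for 2D or 3D points, since `math.dist` accepts points of any matching dimension.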

Notes for Paper “A limb based graphical model for human pose estimation”


Liang, Guoqiang, et al. “A limb-based graphical model for human pose estimation.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).

      • Code not available
      • Caffe
      • NVIDIA Tesla K40m GPU
      • Basics
        • New task: Human limb detection
          • Detect and represent the local image appearance.
        • Use human limbs to augment constraints between neighboring human joints.
        • Design a new limb representation: Model a limb as a wide line.
      • Main method: The ConvNet consists of two modules: a limb and joint detector, and a limb-based graphical model. Both output heatmaps and are trained with a Euclidean distance loss.
        • Unified framework detector: VGG16 architecture.
          • Human limb detection combined with joint localization
          • Integrate the two detection processes in a single CNN
        • After the initial detections, a two-step graphical model.
          • It captures the spatial relationships among human joints and among limbs in a coarse-to-fine way.
          • First step: A fully connected graphical model is used to capture the coarse relation from an arbitrary
          • Second step: Construct a new pairwise relation term based on limbs.
      • Other methods mentioned
        • Define the relationship as a geometric constraint on the relative locations of two neighboring joints.
          • Does not use the local appearance (the image input itself) of the region connecting two neighboring joints.
          • Leads to problems: double-counting and localization failure.
        • PS model (Pictorial Structures)
          • Most popular and influential model.
          • Model a human limb as a rigid oriented rectangle.
          • Model a human limb as a bar and detect it by searching for parallel edges.
          • Model a limb with 2 joints, or add an extra joint at the midpoint.
          • Use image segmentation methods to distinguish limbs from background.
        • ConvNet based pose estimation
          • Extract appearance and type score.
          • Heat-map
            • Heat-map based methods are per-pixel classification problems with large contextual information.
          • Use a ConvNet to learn an MRF-based graphical model.
        • Add motion feature
        • For Spatial relations:
          • Tree structure.
        • Appearance and relation models.
          • The relation among human parts is defined as geometric constraints on the location and orientation of parts.
            • Spring like model
            • Conditional probability of joints location
          • Note: For joints with higher flexibility, the constraint is too weak.
        • Graphical model over parts.
          • Nodes representing parts
          • Edges encoding constraints.
          • Note: limited by hand-crafted features and tree-based graphical models, the accuracy was not good.


    • Limb modeling:
    • Evaluation
      • PCP 74.6 on LSP
      • Dataset: FLIC, LSP
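To illustrate "model a limb as a wide line", here is a rough sketch that rasterizes a binary limb map: a pixel is on the limb when its distance to the segment between the two joints is at most half the line width. This is only one plausible reading of the representation; the paper's exact target (for example, whether it is binary or smoothed) is not given in these notes, and `limb_map` and `line_width` are hypothetical names.

```python
import math


def limb_map(height, width, joint_a, joint_b, line_width=3.0):
    """Binary map: 1.0 where a pixel lies within line_width / 2 of segment a-b.

    Joints are (x, y) pixel coordinates. Assumed target form, for illustration.
    """
    ax, ay = joint_a
    bx, by = joint_b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if seg_len_sq == 0:
                t = 0.0  # degenerate limb: both joints coincide
            else:
                # Project the pixel onto the segment and clamp to its ends.
                t = ((x - ax) * dx + (y - ay) * dy) / seg_len_sq
                t = max(0.0, min(1.0, t))
            px, py = ax + t * dx, ay + t * dy
            if math.hypot(x - px, y - py) <= line_width / 2:
                out[y][x] = 1.0
    return out
```

A map like this could serve as the regression target for a limb heatmap trained with a Euclidean loss, in the same way Gaussian blobs serve as targets for joints.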

Notes for Paper “Towards Accurate Multi-person Pose Estimation in the Wild”


Papandreou, George, et al. “Towards accurate multi-person pose estimation in the wild.” arXiv preprint arXiv:1701.01779 (2017).


ResNet for keypoint, heatmap and offset

  • Performance
  • Basics
    • Works without being given the ground-truth location or scale of each person.
    • Top-down approach
  • Main methods
    • Pipeline:
      • Person box detection using Faster R-CNN (ResNet-101)
        • CNN backbone pre-trained on ImageNet
          • No multi-scale evaluation.
      • Person pose estimation. Use ResNet-101 to predict a heatmap and an offset.
        • K=17 keypoints.
        • Classification and regression
          • First, classify each pixel (0 or 1) as lying in the neighborhood of a keypoint or not (heatmap).
          • Then predict a 2D local offset vector pointing to the precise keypoint location.
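A toy sketch of the heatmap-plus-offset idea for a single keypoint: the heatmap target is 1 inside a disk around the keypoint, and the offset target at each such pixel is the vector from the pixel to the exact keypoint position. The decoder below simply takes one high-scoring pixel and adds its offset, which is a simplification I am assuming for illustration; the paper aggregates the two outputs in a more robust way that these notes do not detail. The names `make_targets`, `decode`, and `radius` are hypothetical.

```python
import math


def make_targets(height, width, keypoint, radius=2.0):
    """Build a binary disk heatmap and per-pixel offsets to the keypoint (x, y)."""
    kx, ky = keypoint
    heat = [[0.0] * width for _ in range(height)]
    offs = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if math.hypot(x - kx, y - ky) <= radius:
                heat[y][x] = 1.0
                offs[y][x] = (kx - x, ky - y)  # vector from pixel to keypoint
    return heat, offs


def decode(heat, offs):
    """Simplified decoding: best-scoring pixel plus its predicted offset."""
    _, x, y = max(
        (heat[y][x], x, y)
        for y in range(len(heat))
        for x in range(len(heat[0]))
    )
    ox, oy = offs[y][x]
    return (x + ox, y + oy)
```

The point of the offset branch is visible here: the heatmap alone localizes the keypoint only to a pixel neighborhood, while pixel plus offset recovers a sub-pixel position such as `(3.25, 2.5)`.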

  • Take home message
  • Other methods mentioned
  • Evaluation
    • Object Keypoint Similarity
    • COCO
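A short sketch of Object Keypoint Similarity as I understand the COCO definition: each labeled keypoint contributes exp(-d² / (2 s² k²)), where d is the distance between prediction and ground truth, s² is the object area, and k is a per-keypoint constant; the result is the average over labeled keypoints. The constants are passed in through the hypothetical parameter `kappas`, since the official per-keypoint values are not listed in these notes.

```python
import math


def oks(pred, gt, visible, area, kappas):
    """Object Keypoint Similarity averaged over labeled (visible > 0) keypoints.

    pred, gt: lists of (x, y); area: object area (s squared);
    kappas: per-keypoint constants (hypothetical values, supplied by the caller).
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visible, kappas):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2 * area * k ** 2))
            count += 1
    return total / count if count else 0.0
```

OKS plays the role that IoU plays for boxes: a perfect prediction scores 1.0, and the score decays with distance at a rate that depends on the object's size and on how precisely each keypoint type can be annotated.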