Notes for Paper “Compositional human pose regression”

Paper:

Sun, Xiao, et al. “Compositional Human Pose Regression.” Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

Key: Structure-aware

      • Performance:
        • 48.3 mm average joint error on H3.6M (Protocol 1)
        • 59.1 mm average joint error on H3.6M (Protocol 2)
        • 86.4% PCKh@0.5 on MPII
      • Evaluation
        • Metrics:
          • Absolute
            • 3D: Procrustes Analysis + MPJPE
            • 2D: PCK
          • Relative:
            • 2D: Mean per bone position error
            • 3D pose: bone-length standard deviation and the percentage of illegal joint angles.
        • MPII, H3.6M
      • Basics
        • Structure-aware approach
        • Use bones instead of joints as the pose representation.
        • Use the joint connection structure to define a compositional loss function.
        • Only re-parameterizes the pose representation; compatible with any other algorithm design.
        • Works for both 3D and 2D pose estimation.
      • Main method
        • Use the L1 norm for joint regression (instead of squared distance).
        • Bone-based representation.
          • Bones are easier to learn than joints, and bones can express constraints more easily than joints.
          • Many pose-driven applications only need local bones, not global joint positions.
        • Use the L1 norm for the bone loss function.
        • A bone is a vector from one joint to another; a relative joint position is then the sum of the bones along the path between them (see the sketch after this list).
        • Network
          • ResNet-50 pre-trained on ImageNet
          • The last FC layer outputs 3D (or 2D) coordinates for each joint
          • Fine-tuned on the task
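        • A minimal NumPy sketch of the bone re-parameterization and the L1 bone loss (the 5-joint skeleton and parent ordering below are illustrative assumptions, not the paper's skeleton):

```python
import numpy as np

# Hypothetical toy skeleton: joint 0 is the root; PARENT[j] is the parent of
# joint j in the kinematic tree (parents precede children in this ordering).
PARENT = [-1, 0, 1, 0, 3]

def joints_to_bones(joints, parent=PARENT):
    """Bone j = vector from parent(j) to joint j; the root keeps its position."""
    bones = joints.copy()
    for j, p in enumerate(parent):
        if p >= 0:
            bones[j] = joints[j] - joints[p]
    return bones

def bones_to_joints(bones, parent=PARENT):
    """Recover joint positions by summing bones along the path from the root."""
    joints = bones.copy()
    for j, p in enumerate(parent):
        if p >= 0:
            joints[j] = joints[p] + bones[j]
    return joints

def l1_bone_loss(pred_joints, gt_joints):
    """L1 loss on the bone representation instead of raw joint coordinates."""
    return np.abs(joints_to_bones(pred_joints) - joints_to_bones(gt_joints)).sum()
```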

      • Other methods mentioned
        • Detection based and regression based
          • The heatmaps are usually noisy and multi-modal
        • Problem: these methods simply minimize per-joint location errors independently, ignoring the internal structure of the pose.
        • 3D pose estimation
          • Not use prior knowledge in 3D model
            • Use two separate steps: first do 2D joint prediction, then reconstruct the 3D pose via optimization or search.
            • [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joint locations with a sparsity-driven 3D geometric prior to infer 3D joint locations via an EM (expectation-maximization) algorithm
            • Represents the 3D pose with an over-complete dictionary, using a high-dimensional latent pose representation
            • Extend Hourglass from 2D to 3D
          • Use prior knowledge in 3D model
            • Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joints.
              • The kinematic model parameterization is highly non-linear, and optimizing it inside deep networks is hard.
        • 2D pose estimation
          • Pure graphical models and inference models.
            • PS model
          • Graphical model with CNN
      • Evaluation
        • Dataset: H3.6M
        • Metrics:
          • 59.1 mm average joint error.
          • 86.4% PCKh@0.5
      • Coding
        • Caffe
        • Two GPUs

Notes for Paper “A limb based graphical model for human pose estimation”

Paper:

Liang, Guoqiang, et al. “A limb-based graphical model for human pose estimation.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).

      • Code not available
      • Caffe
      • NVIDIA Tesla K40m GPU
      • Basics
        • New task: Human limb detection
          • Detect and represent the local image appearance.
        • Use human limbs to augment constraints between neighboring human joints.
        • Design a new limb representation: model a limb as a wide line (a rendering sketch follows the Main method list).
      • Main method: a ConvNet consisting of two modules, a limb-and-joint detector and a limb-based graphical model; both output heatmaps and are trained with a Euclidean distance loss.
        • Unified framework detector: VGG16 architecture.
          • Human limb detection combined with joint localization
          • Integrate the two detection processes in a single CNN
        • After the initial detections, a two-step graphical model.
          • Captures the spatial relationships among human joints, and among limbs, in a coarse-to-fine way.
          • First step: a fully connected graphical model captures the coarse relations between each joint and all the others.
          • Second step: construct a new pairwise relation term based on limbs.
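        • A minimal sketch of rendering a limb target as a wide line between two joints (the Gaussian falloff and half-width are assumptions; the paper's exact target definition may differ):

```python
import numpy as np

def limb_heatmap(j1, j2, height, width, half_width=4.0):
    """Render the limb between joints j1 and j2 (x, y) as a 'wide line':
    response is highest within half_width pixels of the segment and decays
    with an assumed Gaussian falloff."""
    ys, xs = np.mgrid[0:height, 0:width]
    p = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2) pixel coords
    a, b = np.asarray(j1, dtype=float), np.asarray(j2, dtype=float)
    ab = b - a
    # Project each pixel onto the segment and clamp to its endpoints.
    t = np.clip((p - a) @ ab / (ab @ ab + 1e-8), 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist = np.linalg.norm(p - closest, axis=-1)          # distance to the segment
    return np.exp(-0.5 * (dist / half_width) ** 2)
```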
      • Other methods mentioned
        • Define the relationship as a geometric constraint on the relative locations of two neighboring joints.
          • Does not use the local appearance (the image itself) of the region connecting two neighboring joints
          • Leads to problems: double-counting and localization failures.
        • PS model (Pictorial Structures)
          • The most popular and influential model.
          • Models a human limb as a rigid oriented rectangle.
          • Or models a human limb as a bar and detects it by searching for parallel edges.
          • Model a limb with 2 joints. Or add an extra joint at the middle point.
          • Use image segmentation methods to distinguish limbs from background.
        • ConvNet based pose estimation
          • Extract appearance and type score.
          • Heat-map
            • Heat-map based methods treat pose estimation as a per-pixel classification problem with large contextual information.
          • Use a ConvNet to learn an MRF-based graphical model.
        • Add motion feature
        • For Spatial relations:
          • Tree structure.
        • Appearance and relation models.
            • The relation among human parts is defined as geometric constraints on the location and orientation of parts.
            • Spring like model
            • Conditional probability of joints location
          • Note: For joints with higher flexibility, the constraint is too weak.
        • Graphical model over parts.
          • Nodes representing parts
          • Edges encoding constraints.
          • Note: limited by hand-crafted features and tree-structured graphical models, the accuracy was not good.

      • Evaluation
        • PCP 74.6 on LSP
        • Datasets: FLIC, LSP

Notes for Paper “Towards Accurate Multi-person Pose Estimation in the Wild”

Paper:

Papandreou, George, et al. “Towards Accurate Multi-person Pose Estimation in the Wild.” arXiv preprint arXiv:1701.01779 (2017).

Key:

ResNet for keypoint, heatmap and offset

  • Performance
  • Basics
    • Works without ground truth for the location or scale of the person.
    • Top-down approach
  • Main methods
    • Pipeline:
      • Person box detection using Faster R-CNN (ResNet-101)
        • CNN backbone pre-trained on ImageNet
          • No multi-scale evaluation.
      • Person pose estimation: use ResNet-101 to predict heatmaps and offsets.
        • K=17 keypoints.
        • Classification && Regression
          • First, classify whether each pixel is within the neighborhood of any keypoint (a binary heatmap).
          • Then predict a 2D local offset vector pointing to the precise keypoint location (see the decoding sketch after this list).
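    • A minimal sketch of fusing the heatmap and offsets into a keypoint location (a simplified stand-in for the paper's Hough-voting aggregation; the 0.5 threshold and names are assumptions):

```python
import numpy as np

def decode_keypoint(heatmap, offsets):
    """Fuse a per-pixel keypoint probability map with 2D offset predictions.
    heatmap: (H, W) probabilities; offsets: (H, W, 2) vectors (dx, dy) from
    each pixel to the precise keypoint location."""
    h, w = heatmap.shape
    votes = np.zeros((h, w))
    ys, xs = np.nonzero(heatmap > 0.5)            # pixels inside the keypoint disk
    for y, x in zip(ys, xs):
        vy = int(round(y + offsets[y, x, 1]))     # vote at the offset-corrected spot
        vx = int(round(x + offsets[y, x, 0]))
        if 0 <= vy < h and 0 <= vx < w:
            votes[vy, vx] += heatmap[y, x]
    return np.unravel_index(votes.argmax(), votes.shape)  # (y, x) of the keypoint
```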

  • Take home message
  • Other methods mentioned
  • Evaluation
    • Object Keypoint Similarity (OKS); a computation sketch follows this list.
    • COCO
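    • A minimal sketch of the OKS computation (array shapes are assumptions; the per-keypoint falloff constants are defined by the COCO dataset):

```python
import numpy as np

def oks(pred, gt, vis, scale, kappa):
    """COCO Object Keypoint Similarity: a per-keypoint Gaussian score,
    averaged over labeled keypoints. pred, gt: (K, 2) coordinates;
    vis: (K,) visibility flags; scale: sqrt of the object area;
    kappa: (K,) per-keypoint falloff constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)           # squared distances per keypoint
    ks = np.exp(-d2 / (2 * (scale * kappa) ** 2))
    return ks[vis > 0].mean()
```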

Notes for Paper “MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild”

Paper:

Rogez, Grégory, and Cordelia Schmid. “Mocap-guided data augmentation for 3d pose estimation in the wild.” Advances in Neural Information Processing Systems. 2016.

Key:

Dummy human pose for augmentation

  • Basics
    • Data augmentation for 3D pose estimation.
    • Input: 3D motion capture (MoCap) data.
    • Combines selected images to generate a new synthetic image by stitching local image patches together under kinematic constraints.
    • Clusters the training data into a large number of pose classes, turning pose estimation into a K-way classification problem (see the sketch after the Main methods list).
  • Main methods
    • Cluster 3D poses into K pose classes, then generate “dummy” pose images; it is enough that the shape outline looks like a human pose.
    • Input: two training sources — images with annotated 2D poses && 3D MoCap data
    • Two process
      • MoCap guided mosaic construction  — Stitches image patches together
        • Input: 3D pose with n joints. && projected 2D joints in one view.
        • Output: For an image, we find each joint in the image which corresponds with the pose.
        • Compute the transformation of the joint’s location from one pose to another, and measure the similarity between the joint in the second pose and the aligned joint transferred from the first pose.
        • Increase the weight for the neighboring joints.
        • Transfer the cropped image patches to the new pose and select the best patch for each region to form a new image.
      • Pose-aware blending — improves image quality and erases patch seams.
        • Resolves the boundaries between image regions.
        • Select a surrounding square region; evaluate how much each image should contribute to each pixel; the final value is computed as the weighted sum over all aligned images.
    • CNN for full-body 3D pose estimation
      • Shows that with only synthetic data, we can still obtain good performance.
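    • A minimal sketch of the pose-class construction, assuming scikit-learn is available; the 17-joint skeleton, pose count, and K are illustrative, not the paper's values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in MoCap data: N poses with 17 joints in 3D, flattened to vectors.
poses = np.random.randn(10000, 17 * 3)

# Cluster poses into K classes; pose estimation then becomes a K-way
# classification problem, each class represented by its centroid pose.
K = 2000  # illustrative; the paper uses a large number of pose classes
kmeans = KMeans(n_clusters=K, n_init=1).fit(poses)
class_labels = kmeans.labels_                         # training target per image
centroid_poses = kmeans.cluster_centers_.reshape(K, 17, 3)
```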
  • Take home messages.
  • Other methods mentioned.
    • Data augmentation
      • Jittering
      • Complex affine transformations
    • 3D pose estimation
      • CNNs — trained on 3D MoCap data in constrained environments.
      • Estimate 3D poses from 2D pose data.
        • 2D pose detector
        • Or jointly learn 2D and 3D pose.
        • Dual source approach — combines 2D pose estimation and 3D pose retrieval.
      • Synthetic pose data.

Notes for Paper “Stacked Hourglass Networks for Human Pose Estimation”

Paper:

Newell, Alejandro, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation.” European Conference on Computer Vision. Springer, Cham, 2016.

Summary:

Architecture, intermediate supervision, multi-scale feature learning, small number of parameters

  • Demo
  • Code
  • Basics
    • Hourglass allows inference across scales.
      • Local evidence for identifying the features like faces
      • Global full body: orientation of the person, arrangement of limbs, relationship of joints.
    • Repeated bottom-up, top-down processing.
    • Intermediate supervision
    • Successive pooling and upsampling
  • Main method
    • Pipeline:
      • Conv and max-pooling to process features down to a very low resolution.
        • The network reaches its lowest resolution allowing smaller spatial filters.
      • After max-pooling, branch off and apply more conv at the original pre-pooled resolution.
      • Upsampling and combination of features across scales.
        • For combining features across two adjacent resolutions, do nearest-neighbor upsampling of the lower resolution, then an elementwise addition of the two sets of features.
      • After reaching the output resolution of the network, two rounds of 1×1 convolutions produce the final network predictions — a set of heatmaps.
        • Output: heatmaps that predict the probability of a joint’s presence at each pixel.
      • Detailed layer design involves residual modules.
    • Stacked hourglass with intermediate supervision (a PyTorch sketch follows at the end of this Main method section)
      • Feed the output of one hourglass as input to the next: repeated bottom-up, top-down inference that re-evaluates initial estimates and features.
        • Key: predict intermediate heatmaps at each stage so a loss can be applied.
        • Allows high-level features to be processed again to capture higher-order spatial relationships. Note: most higher-order features are present only at lower resolutions.
      • For hourglass, local and global cues are integrated within each hourglass module.
      • Maintain precise location information.
      • Note: for hourglass, weights are not shared across hourglass modules.
      • Note: Loss is applied for all hourglasses, same ground truth.
      • Loss during training: mean squared error (MSE) on the heatmaps.
    • Limitations:
      • The highest internal (and output) resolution of the hourglass is 64×64.
      • Occlusion and multiple people.
        • Make no use of the additional visibility annotations in the dataset.
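    • A minimal PyTorch sketch of the recursive hourglass and stacked intermediate supervision (channel counts, stack count, and the 1×1 feedback convolutions are simplified assumptions; the paper builds these blocks from residual modules):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """Recursive hourglass: pool down `depth` times, process at the lowest
    resolution, then upsample and add skip features from every scale."""
    def __init__(self, channels, depth=4):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)  # pre-pool branch
        self.down = nn.Conv2d(channels, channels, 3, padding=1)  # post-pool conv
        self.inner = (Hourglass(channels, depth - 1) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)                        # features kept at this scale
        y = self.up(self.inner(self.down(F.max_pool2d(x, 2))))
        y = F.interpolate(y, scale_factor=2, mode="nearest")  # NN upsampling
        return y + skip                            # elementwise addition

class StackedHourglass(nn.Module):
    """Stack hourglasses; each emits an intermediate heatmap for supervision."""
    def __init__(self, channels=64, num_joints=16, num_stacks=2):
        super().__init__()
        self.stages = nn.ModuleList(Hourglass(channels) for _ in range(num_stacks))
        self.heads = nn.ModuleList(nn.Conv2d(channels, num_joints, 1)
                                   for _ in range(num_stacks))
        self.remaps = nn.ModuleList(nn.Conv2d(num_joints, channels, 1)
                                    for _ in range(num_stacks))

    def forward(self, x):                          # x: features at 64x64
        preds = []
        for hg, head, remap in zip(self.stages, self.heads, self.remaps):
            x = hg(x)
            hm = head(x)                           # intermediate heatmap prediction
            preds.append(hm)
            x = x + remap(hm)                      # feed prediction into next stage
        return preds

def stacked_loss(preds, gt_heatmaps):
    # Same ground truth and MSE loss applied to every hourglass's output.
    return sum(F.mse_loss(p, gt_heatmaps) for p in preds)
```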

  • Take home message
  • Other methods mentioned
    • Heatmap-based deep networks
      • Contains a graphical model that learns typical spatial relationships between joints.
      • Or: combine unary score generation with pairwise comparisons of adjacent joints.
      • Or: cluster detections into typical orientations, which gives additional information about the likely location of a neighboring joint.
    • Successive predictions for pose estimation.
      • Iterative Error Feedback: requires multi-stage training, and weights are shared across iterations.
      • Note: for many failure cases, refining a position within a local window would not offer much improvement, since errors often consist of occluded or misattributed limbs.
    • Use of additional features such as depth or motion cues.
    • The hourglass module (before stacking) is also related to conv-deconv and encoder-decoder architectures.
  • Machine and software
    • Torch7
    • 12 GB NVIDIA TitanX GPU

  • Evaluation metrics
    • Percentage of Correct Keypoints (PCK).
      • The percentage of detections that fall within a normalized distance of the ground truth. For FLIC, the distance is normalized by a fraction of the torso size; for MPII, by a fraction of the head size (PCKh). A computation sketch follows.
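    • A minimal sketch of the PCK computation (array shapes and the alpha threshold parameter are assumptions):

```python
import numpy as np

def pck(pred, gt, norm_size, alpha=0.5):
    """Percentage of Correct Keypoints: a detection is correct when its
    distance to ground truth is below alpha * norm_size (torso size for
    FLIC, head size for MPII's PCKh). pred, gt: (N, K, 2); norm_size: (N,)."""
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, K) distances
    return (dist < alpha * norm_size[:, None]).mean()    # fraction correct
```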