Notes for Paper “Associative Embedding: End-to-End Learning for Joint Detection and Grouping”

Paper:

Newell, Alejandro, Zhiao Huang, and Jia Deng. “Associative embedding: End-to-end learning for joint detection and grouping.” Advances in Neural Information Processing Systems. 2017.

  • Performance
  • Basics
    • Associative embedding
    • Jointly perform detections and grouping using a single-stage deep network trained end-to-end
    • For each detection, introduce a “tag” (a number) that identifies which group the detection belongs to.
      • Note: There are no ground-truth tags for the network to predict, because what matters is not the particular tag values, only the differences between them.
    • Output: Two heatmaps
      • A heatmap of per-pixel detection scores (the detection score at each pixel for each joint).
      • A heatmap of per-pixel identity tags (the tag value at each pixel for each joint).
      • For multi-person pose estimation, output a detection heatmap and a tagging heatmap for each body joint, then group body joints with similar tags into individual people.
    • Two loss functions together
      • Detection loss: mean squared error (MSE) between each predicted detection heatmap and its ground-truth heatmap (a 2D Gaussian activation at each keypoint location).
      • Grouping loss: compare the tags within each person and across people. Tags within a person should be the same, while tags across people should be different.
  • Other methods mentioned.
    • Vector embedding
    • Perceptual organization: group pixels of an image into regions, parts and objects.
    • Multi-person pose estimation
    • Instance segmentation
  • Evaluation
  • Question: How are the tags for the training data obtained?
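
The grouping loss described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `grouping_loss`, the `sigma` parameter, and the exact "pull toward the person's mean tag, push means apart with a Gaussian penalty" form are choices made here for the sketch, not details taken from these notes. It does show why no ground-truth tag values are needed: the loss only uses which joints belong to which person.

```python
import math

def grouping_loss(tags_per_person, sigma=1.0):
    """Sketch of a pull/push tag loss (form and sigma are assumptions).

    tags_per_person: one list per person, holding the predicted tag
    values read at that person's ground-truth joint locations.
    """
    n = len(tags_per_person)
    # Reference tag of each person: the mean of that person's joint tags.
    means = [sum(tags) / len(tags) for tags in tags_per_person]

    # Pull term: tags within a person should match the person's mean.
    pull = sum(
        sum((tag - mean) ** 2 for tag in tags) / len(tags)
        for tags, mean in zip(tags_per_person, means)
    ) / n

    # Push term: mean tags of different people should be far apart.
    push = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = means[i] - means[j]
                push += math.exp(-(diff ** 2) / (2 * sigma ** 2))
    push /= n * n

    return pull + push

# Two people with consistent, well-separated tags give a small loss.
good = grouping_loss([[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]])
# Two people whose tags are mixed up give a large loss.
bad = grouping_loss([[1.0, 5.0], [1.0, 5.0]])
```

Only the grouping of joints into people enters the computation, so any set of tag values that is constant within a person and distinct across people scores equally well.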

Notes for Paper “Compositional human pose regression”

Paper:

Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV). Vol. 2. 2017.

Key: Structure-aware

      • Performance:
        • 48.3mm on H3.6M Protocol 1 (Avg joint error)
        • 59.1mm on H3.6M Protocol 2 (Avg joint error)
        • PCKh@0.5 of 86.4 on MPII
      • Evaluation
        • Metrics:
          • Absolute
            • 3D: Procrustes Analysis + MPJPE
            • 2D: PCK
          • Relative:
            • 2D: Mean per bone position error
            • 3D pose: bone length standard deviation and the percentage of illegal joint angles.
        • Datasets: MPII, H3.6M
      • Basics
        • Structure-aware approach
        • Use bones instead of joints as pose representation.
        • Use joint connection structure to define a compositional loss function.
        • Only re-parameterizes the pose representation, so it is compatible with any other algorithm design.
        • Applies to both 3D and 2D pose.
      • Main method
        • Use the L1 norm for joint regression (instead of squared distance).
        • Bone-based representation.
          • Bones are easier to learn than joints, and bones can express constraints more easily than joints.
          • Many pose-driven applications only need local bones, not global joint positions.
        • Use the L1 norm for the bone loss function.
        • A bone is a vector from one joint to another joint. The relative position between two joints is then the sum of the bones along the path connecting them.
        • Network
          • ResNet-50 pre-trained on ImageNet
          • The last FC layer outputs 3D (or 2D) coordinates
          • Fine-tuned on the task

      • Other methods mentioned
        • Detection based and regression based
          • The heatmaps are usually noisy and multi-modal
        • Problem: These methods simply minimize the per-joint location errors independently and ignore the internal structure of the pose.
        • 3D pose estimation
          • Without prior knowledge of a 3D model
            • Use two separate steps: first do 2D joint prediction, then reconstruct the 3D pose via optimization or search.
            • [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joint locations and a sparsity-driven 3D geometric prior to infer the 3D joint locations via an EM (expectation maximization) algorithm
            • Represent 3D pose with an over-complete dictionary; use a high-dimensional latent pose representation
            • Extend Hourglass from 2D to 3D
          • With prior knowledge of a 3D model
            • Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joints.
              • The kinematic model parameterization is highly non-linear and its optimization in deep networks is hard.
        • 2D pose estimation
          • Pure Graphical models, inference models.
            • PS model
          • Graphical model with CNN
      • Evaluation
        • Dataset: H3.6M
        • Metrics:
          • 59.1 mm average joint error.
          • 86.4% PCKh@0.5
      • Coding
        • Caffe
        • Two GPUs
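
The bone representation in the notes above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the skeleton (`PARENT`), the helper names, and the "child minus parent" direction of a bone are all assumptions made here. It shows bones computed from joints, relative joint positions recovered by summing bones along the path to the root, and an L1 loss on either representation.

```python
# Hypothetical 4-joint chain: PARENT[k] is the parent of joint k, -1 marks the root.
# Parents are assumed to be listed before their children.
PARENT = [-1, 0, 1, 2]

def joints_to_bones(joints, parent):
    """Bone k is the vector from the parent joint to joint k (assumed convention)."""
    bones = []
    for k, p in enumerate(parent):
        if p < 0:
            bones.append(tuple(0.0 for _ in joints[k]))  # the root has no bone
        else:
            bones.append(tuple(a - b for a, b in zip(joints[k], joints[p])))
    return bones

def bones_to_relative_joints(bones, parent):
    """Joint position relative to the root = sum of bones along the path to it."""
    rel = [None] * len(parent)
    for k, p in enumerate(parent):
        if p < 0:
            rel[k] = tuple(0.0 for _ in bones[k])
        else:
            rel[k] = tuple(a + b for a, b in zip(rel[p], bones[k]))
    return rel

def l1_loss(pred, target):
    """Sum of absolute coordinate differences over all joints or bones."""
    return sum(abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t))

joints = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (3.0, 2.0)]
bones = joints_to_bones(joints, PARENT)
recovered = bones_to_relative_joints(bones, PARENT)
```

Because the root sits at the origin in this example, `recovered` equals `joints`; in general it equals the joints with the root position subtracted. The same code works for 2D or 3D coordinates.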

Notes for Paper “A limb-based graphical model for human pose estimation”

Paper:

Liang, Guoqiang, et al. “A limb-based graphical model for human pose estimation.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).

      • Code not available
      • Caffe
      • NVIDIA Tesla K40m GPU
      • Basics
        • New task: Human limb detection
          • Detect and represent the local image appearance.
        • Use human limbs to augment constraints between neighboring human joints.
        • Design a new limb representation: Model a limb as a wide line.
      • Main method: The ConvNet consists of two modules: a limb and joint detector, and a limb-based graphical model. Both output heatmaps and are trained with a Euclidean distance loss.
        • Unified framework detector: VGG16 architecture.
          • Human limb detection combined with joint localization
          • Integrate the two detection processes in a single CNN
        • After the initial detections, a two-step graphical model is applied.
          • It captures the spatial relationships among human joints and among limbs in a coarse-to-fine way.
          • First step: A fully connected graphical model is used to capture the coarse relation from an arbitrary
          • Second step: Construct a new pairwise relation term based on limbs.
      • Other methods mentioned
        • Define the relationship as geometric constraint on the relative locations of two neighboring joints.
          • Not using the local appearance (image input itself) of the region connecting two neighboring joints
          • Leads to problems: double-counting and localization failure.
        • PS model (Pictorial Structures)
          • Most popular and influential model.
          • Model a human limb as a rigid oriented rectangle
          • Model a human limb as a bar and detect it by searching for parallel edges.
          • Model a limb with 2 joints. Or add an extra joint at the middle point.
          • Use image segmentation methods to distinguish limbs from background.
        • ConvNet based pose estimation
          • Extract appearance and type score.
          • Heat-map
            • Heat-map based methods are per-pixel classification problems with large contextual information.
          • Use a ConvNet to learn an MRF-based graphical model.
        • Add motion feature
        • For Spatial relations:
          • Tree structure.
        • Appearance and relation models.
          • The relation among human parts is defined as geometric constraints on the location and orientation of parts.
            • Spring-like model
            • Conditional probability of joint locations
          • Note: For joints with higher flexibility, the constraint is too weak.
        • Graphical model over parts.
          • Nodes representing parts
          • Edges encoding constraints.
          • Note: limited by hand-crafted features and tree-based graphical models, the accuracy was not good.

    • Limb modeling:
    • Evaluation
      • PCP 74.6 on LSP
      • Dataset: FLIC, LSP
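
The "limb as a wide line" idea noted above can be illustrated with a small sketch. Everything here is an assumption for illustration: the function name `limb_mask`, the binary (rather than smoothed) target, and the `half_width` parameter are hypothetical and not taken from the paper. A pixel is marked as part of the limb when its distance to the segment joining the two joints is at most `half_width`.

```python
def limb_mask(height, width, joint_a, joint_b, half_width):
    """Binary mask of a limb drawn as a wide line between two joints.

    joint_a, joint_b: (x, y) pixel coordinates of the limb's end joints.
    half_width: half of the line width, in pixels (hypothetical parameter).
    """
    ax, ay = joint_a
    bx, by = joint_b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Project the pixel onto the segment, clamped to its endpoints.
            if seg_len_sq == 0:
                t = 0.0
            else:
                t = ((x - ax) * dx + (y - ay) * dy) / seg_len_sq
                t = max(0.0, min(1.0, t))
            px, py = ax + t * dx, ay + t * dy
            # Inside the limb if close enough to the nearest segment point.
            if (x - px) ** 2 + (y - py) ** 2 <= half_width ** 2:
                mask[y][x] = 1
    return mask

# A horizontal limb from (1, 2) to (5, 2), two pixels thick around its axis.
mask = limb_mask(5, 7, (1, 2), (5, 2), 1)
```

Such a mask could serve as a ground-truth limb heatmap for a Euclidean distance loss, alongside the joint heatmaps.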

Notes for Paper “Towards Accurate Multi-person Pose Estimation in the Wild”

Paper:

Papandreou, George, et al. “Towards accurate multiperson pose estimation in the wild.” arXiv preprint arXiv:1701.01779 (2017).

Key: ResNet for keypoint heatmap and offset prediction

  • Performance
  • Basics
    • Does not assume that the ground-truth location or scale of the person is given.
    • Top-down approach
  • Main methods
    • Pipeline:
      • Person box detection using Faster R-CNN (ResNet-101)
        • CNN backbone pre-trained on ImageNet
          • No multi-scale evaluation.
      • Person pose estimation. Use ResNet-101 to predict a heatmap and offsets.
        • K=17 keypoints.
        • Classification and regression
          • First, classify each pixel (0 or 1) as lying in the neighborhood of a keypoint or not (heatmap).
          • Then predict a 2D local offset vector to obtain the precise keypoint location.

  • Take home message
  • Other methods mentioned
  • Evaluation
    • Metric: Object Keypoint Similarity (OKS)
    • Dataset: COCO
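
The heatmap-plus-offset idea in the pipeline above can be sketched for a single keypoint. This is a deliberately simplified illustration: it takes the highest-scoring heatmap cell and adds that cell's offset, which is an assumption made here and not necessarily how the paper aggregates the two outputs. The function name `decode_keypoint` and the data layout are hypothetical.

```python
def decode_keypoint(heatmap, offsets):
    """Combine a coarse heatmap with local offsets for one keypoint.

    heatmap: 2D list of scores, heatmap[y][x] in [0, 1].
    offsets: 2D list of (dx, dy) pairs, the predicted displacement from
             each cell to the precise keypoint location.
    Returns ((x, y), score) for the refined keypoint position.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    # Coarse step: the cell most likely to lie near the keypoint.
    best_y, best_x = max(
        ((y, x) for y in range(rows) for x in range(cols)),
        key=lambda yx: heatmap[yx[0]][yx[1]],
    )
    # Fine step: shift the coarse cell by its predicted 2D offset.
    dx, dy = offsets[best_y][best_x]
    return (best_x + dx, best_y + dy), heatmap[best_y][best_x]

heatmap = [
    [0.1, 0.2, 0.1],
    [0.2, 0.3, 0.9],
    [0.1, 0.2, 0.4],
]
offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
offsets[1][2] = (0.25, -0.5)
position, score = decode_keypoint(heatmap, offsets)
```

The classification output only needs to be right to within a neighborhood, while the regression output supplies the sub-cell precision.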