Notes for Paper “Associative Embedding: End-to-End Learning for Joint Detection and Grouping”
Paper:
Newell, Alejandro, Zhiao Huang, and Jia Deng. “Associative embedding: End-to-end learning for joint detection and grouping.” Advances in Neural Information Processing Systems. 2017.
- Performance
- Basics
- Associative embedding
- Jointly perform detections and grouping using a single-stage deep network trained end-to-end
- For each detection, introduce a “tag” (a number) that identifies which group the detection belongs to.
- Note: We have no ground truth tags for the network to predict, because what matters is not the particular tag values, only the difference between them.
- Output: Two heatmaps
- A heatmap of per-pixel detection scores (the detection score at each pixel for each joint).
- A heatmap of per-pixel identity tags (the tag value at each pixel for each joint).
- For multi-person pose estimation, output a detection heatmap and a tagging heatmap for each body joint, then group body joints with similar tags into individual people.
- Two loss functions together
- Detection loss: mean squared error (MSE) between each predicted detection heatmap and its ground-truth heatmap, which is a 2D Gaussian activation at each keypoint location.
- Grouping loss: compare the tags within each person and across people. Tags within a person should be the same, while tags across people should be different.
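The grouping idea above (tags within a person agree, tags across people differ) can be sketched as a pull/push loss on scalar tags. This is a minimal illustration, not necessarily the paper's exact formulation: the function name `grouping_loss`, the Gaussian-shaped push term, and the `sigma` parameter are assumptions chosen to match the description in these notes.

```python
import math


def grouping_loss(tags_per_person, sigma=1.0):
    """Hypothetical pull/push loss on scalar tags.

    tags_per_person: one list per person, holding the predicted tag value
    read out at each of that person's annotated joints.
    """
    # Reference tag of each person: the mean of that person's joint tags.
    means = [sum(tags) / len(tags) for tags in tags_per_person]
    n = len(means)

    # Pull term: tags within a person should match the person's mean tag.
    pull = sum(
        (tag - mean) ** 2
        for tags, mean in zip(tags_per_person, means)
        for tag in tags
    ) / n

    # Push term: mean tags of different people should be far apart.
    # The penalty is close to 1 when two means coincide and decays with distance.
    push = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                push += math.exp(-((means[i] - means[j]) ** 2) / (2 * sigma ** 2))
    push /= n * n

    return pull + push
```

Note that no ground-truth tag values appear anywhere: the loss only depends on how the predicted tags relate to each other, which is consistent with the note above that only the differences between tags matter.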
- Other methods mentioned.
- Vector embedding
- Perceptual organization: group pixels of an image into regions, parts and objects.
- Multi-person pose estimation
- Instance segmentation
- Evaluation
- Dataset
- MPII human multi-person http://human-pose.mpi-inf.mpg.de/
- 25K images containing over 40K people with annotated body joints. Covers 410 human activities.
- COCO 2016 keypoints challenge. http://cocodataset.org/#keypoints-challenge2017
- Evaluation metrics
- Average precision (AP)
- Questions: How to get the tags of training data?
Notes for Paper “A Simple, Fast and Highly-Accurate Algorithm to Recover 3D Shape from 2D Landmarks on a Single Image”
Paper:
Zhao, Ruiqi, Yan Wang, and Aleix M. Martinez. “A simple, fast and highly-accurate algorithm to recover 3d shape from 2d landmarks on a single image.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
Notes for Paper “Deep Kinematic Pose Regression”
Paper:
Zhou, Xingyi, et al. “Deep kinematic pose regression.” European Conference on Computer Vision. Springer, Cham, 2016.
Notes for Paper “Structured prediction of 3d human pose with deep neural networks”
Paper:
Tekin, Bugra, et al. “Structured prediction of 3d human pose with deep neural networks.” arXiv preprint arXiv:1605.05180 (2016).
Notes for Paper “Parsing Occluded People by Flexible Compositions”
Paper:
Chen, Xianjie, and Alan Yuille. “Parsing occluded people by flexible compositions.” Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
Notes for Paper “Compositional human pose regression”
Paper:
Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV). Vol. 2. 2017.
Key: Structure-aware
- Performance:
- 48.3mm on H3.6M Protocol 1 (Avg joint error)
- 59.1mm on H3.6M Protocol 2 (Avg joint error)
- PCKh@0.5 86.4 on MPII
- Evaluation
- Metrics:
- Absolute
- 3D: Procrustes Analysis + MPJPE
- 2D: PCK
- Relative:
- 2D: Mean per bone position error
- 3D pose: bone length standard deviation and the percentage of illegal joint angle.
- MPII, H3.6M
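The absolute 3D metric listed above (Procrustes Analysis + MPJPE) can be sketched as follows. This is a minimal illustration assuming NumPy and poses stored as `(num_joints, 3)` arrays; the function names `mpjpe` and `procrustes_align` are hypothetical, and the alignment is the standard similarity Procrustes solution (scale, rotation, translation) rather than code taken from the paper.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def procrustes_align(pred, gt):
    """Align pred to gt with the best similarity transform (least squares)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.ones(p.shape[1])
    if np.linalg.det(u @ vt) < 0:
        d[-1] = -1.0  # avoid returning a reflection
    rot = u @ np.diag(d) @ vt

    # Optimal isotropic scale for the chosen rotation.
    scale = (s * d).sum() / (p ** 2).sum()
    return scale * (p @ rot) + mu_g


# Usage: a prediction that differs from the ground truth only by a
# similarity transform has a large raw error but ~0 error after alignment.
gt_pose = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
rot_z = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
pred_pose = 2.0 * gt_pose @ rot_z + np.array([5.0, -1.0, 2.0])
raw_error = mpjpe(pred_pose, gt_pose)
aligned_error = mpjpe(procrustes_align(pred_pose, gt_pose), gt_pose)
```

Because the alignment removes global scale, rotation, and translation, the aligned error measures only the shape of the pose, which is why it is usually reported alongside the unaligned error.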
- Basics
- Structure-aware approach
- Use bones instead of joints as pose representation.
- Use joint connection structure to define a compositional loss function.
- Just re-parameterizes the pose representation. Compatible with any other algorithm design.
- Both 3D and 2D
- Main method
- Use L1 norm for joint regression. (instead of squared distance)
- Bone based representation.
- Bones are easier to learn than joints, and they express constraints more easily.
- Many pose-driven applications only need local bone, not global joints.
- Use L1 norm for bone loss function.
- A bone is a vector from one joint to another joint. The relative position between two joints is then the sum of the bones along the path connecting them.
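The bone representation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the skeleton is given as a hypothetical `parents` list (with `-1` for the root and every parent index smaller than its child's), bones point from parent to child (the direction is a convention chosen here), and the helper names are not from the paper.

```python
def joints_to_bones(joints, parents):
    """Convert joint positions to bone vectors (child minus parent)."""
    bones = []
    for joint, parent in zip(joints, parents):
        if parent < 0:
            # The root has no parent, so its bone is the zero vector.
            bones.append(tuple(0.0 for _ in joint))
        else:
            bones.append(tuple(c - p for c, p in zip(joint, joints[parent])))
    return bones


def bones_to_root_relative_joints(bones, parents):
    """Recover joint positions relative to the root by summing bones along the path."""
    relative = []
    for k, parent in enumerate(parents):
        if parent < 0:
            relative.append(tuple(0.0 for _ in bones[k]))
        else:
            # Assumes parent < k, so relative[parent] is already computed.
            relative.append(tuple(a + b for a, b in zip(relative[parent], bones[k])))
    return relative


def l1_loss(pred, target):
    """Sum of absolute coordinate differences over all vectors."""
    return sum(abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t))


# Usage: a 4-joint chain in 2D with the root at the origin.
chain_parents = [-1, 0, 1, 2]
chain_joints = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 3.0)]
chain_bones = joints_to_bones(chain_joints, chain_parents)
recovered = bones_to_root_relative_joints(chain_bones, chain_parents)
```

The same `l1_loss` can be applied to bones or to relative joint positions obtained by summing bones, which is the sense in which the representation is a re-parameterization that stays compatible with other design choices.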
- Network
- ResNet-50 pre-trained on ImageNet
- Last FC outputs 3-coordinates (or 2-coordinates)
- Fine-tuned on the task
- Other methods mentioned
- Detection based and regression based
- The heatmaps are usually noisy and multi-mode
- Problem: Simply minimize the per-joint location errors independently but ignore the internal structures of the pose.
- 3D pose estimation
- Without prior knowledge of a 3D model
- Use two separate steps: first predict the 2D joints, then reconstruct the 3D pose via optimization or search.
- [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joints location and a sparsity-driven 3D geometric prior to infer the 3D joint location via an EM (expectation maximization) algorithm
- Represents 3D pose with an over-complete dictionary, use high-dim latent pose representation
- Extend Hourglass from 2D to 3D
- With prior knowledge of a 3D model
- Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joints.
- The kinematic model parameterization is highly non-linear and its optimization in deep networks is hard.
- 2D pose estimation
- Pure Graphical models, inference models.
- PS model
- Graphical model with CNN
- Evaluation
- Dataset: H3.6M
- Metrics:
- 59.1 mm Average joint error.
- 86.4% PCKh@0.5
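The PCK-style number above counts a joint as correct when its prediction lies within a fraction of a reference length from the ground truth. A minimal sketch follows, assuming the reference length (for PCKh, a head-size measure) is supplied by the caller; the function name `pck` and its parameters are hypothetical, and the exact normalization used by each benchmark may differ.

```python
import math


def pck(pred_joints, gt_joints, reference_length, alpha=0.5):
    """Fraction of joints whose error is at most alpha * reference_length."""
    threshold = alpha * reference_length
    correct = sum(
        1
        for pred, gt in zip(pred_joints, gt_joints)
        if math.dist(pred, gt) <= threshold
    )
    return correct / len(gt_joints)


# Usage: with a reference length of 10 px and alpha = 0.5, the threshold is 5 px.
example_pred = [(0.0, 0.0), (10.0, 0.0)]
example_gt = [(0.0, 3.0), (10.0, 8.0)]
example_score = pck(example_pred, example_gt, reference_length=10.0)
```

In this example the first joint is 3 px off (correct) and the second is 8 px off (incorrect), so the score is 0.5.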
- Coding
- Caffe
- Two GPUs
Notes for Paper “A limb based graphical model for human pose estimation”
Paper:
Liang, Guoqiang, et al. “A limb-based graphical model for human pose estimation.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).
- Code not available
- Caffe
- NVIDIA Tesla K40m GPU
- Basics
- New task: Human limb detection
- Detect and represent the local image appearance.
- Use human limbs to augment constraints between neighboring human joints.
- Design a new limb representation: Model a limb as a wide line.
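One way to picture "a limb as a wide line" is a target map that is active for every pixel within half the line width of the segment joining two joints. The sketch below is only an illustration of that idea: the binary mask, the rounded end caps, and the names `point_to_segment_distance` and `limb_mask` are assumptions, not the representation defined in the paper.

```python
import math


def point_to_segment_distance(point, seg_start, seg_end):
    """Euclidean distance from a point to the closest point on a segment."""
    px, py = point
    ax, ay = seg_start
    bx, by = seg_end
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        t = 0.0  # degenerate segment: both joints coincide
    else:
        # Projection parameter, clamped so the closest point stays on the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def limb_mask(height, width, joint_a, joint_b, line_width):
    """Binary map (rows of floats) marking pixels on the wide line joint_a -> joint_b."""
    half_width = line_width / 2.0
    return [
        [
            1.0 if point_to_segment_distance((x, y), joint_a, joint_b) <= half_width else 0.0
            for x in range(width)
        ]
        for y in range(height)
    ]


# Usage: a horizontal limb from (x=1, y=2) to (x=5, y=2), 2 px wide, on a 5x7 grid.
example_mask = limb_mask(5, 7, (1, 2), (5, 2), line_width=2.0)
```

Unlike a pair of joint heatmaps, such a map covers the image region connecting the two joints, which is the local appearance that the notes say earlier joint-only constraints ignore.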
- Main method: the ConvNet consists of two modules: a limb and joint detector, and a limb-based graphical model. Both output heatmaps and are trained with a Euclidean distance loss.
- Unified framework detector: VGG16 architecture.
- Human limb detection combined with joint localization
- Integrate the two detection processes in a single CNN
- After the initial detections, a two-step graphical model is applied.
- It captures the spatial relationship among human joints and among limbs in a coarse-to-fine way.
- First step: a fully connected graphical model is used to capture the coarse relation from an arbitrary
- Second step: Construct a new pairwise relation term based on limbs.
- Other methods mentioned
- Define the relationship as geometric constraint on the relative locations of two neighboring joints.
- Not using the local appearance (image input itself) of the region connecting two neighboring joints
- Lead to problems: double-counting and localization failure.
- PS model (Pictorial Structures)
- Most popular and influential model.
- Model human limb as a rigid oriented rectangle
- Model human limb as bar, detect it by searching parallel edges.
- Model a limb with 2 joints. Or add an extra joint at the middle point.
- Use image segmentation methods to distinguish limbs from background.
- ConvNet based pose estimation
- Extract appearance and type score.
- Heat-map
- Heat-map based methods are per-pixel classification problems with large contextual information.
- Use Conv-Net to learn an MRF-based graphical model.
- Add motion feature
- For Spatial relations:
- Tree structure.
- Appearance and relation models.
- The relation among human parts is defined as geometric constraints on the location and orientation of parts.
- Spring like model
- Conditional probability of joints location
- Note: For joints with higher flexibility, the constraint is too weak.
- Graphical model over parts.
- Nodes representing parts
- Edges encoding constraints.
- Note: limited by hand-crafted features and tree-based graphical models, the accuracy was not good.
- Limb modeling:
- Evaluation
- PCP 74.6 on LSP
- Dataset: FLIC, LSP
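The PCP (Percentage of Correct Parts) score reported above can be sketched as follows. This assumes the strict variant, where a limb counts as correct only if both predicted endpoints lie within half the ground-truth limb length of their targets; other variants exist, and the function name `pcp` and the `alpha` parameter are hypothetical.

```python
import math


def pcp(pred_limbs, gt_limbs, alpha=0.5):
    """Fraction of limbs whose two endpoints are both within alpha * limb length."""
    correct = 0
    for (pred_a, pred_b), (gt_a, gt_b) in zip(pred_limbs, gt_limbs):
        limb_length = math.dist(gt_a, gt_b)
        threshold = alpha * limb_length
        if math.dist(pred_a, gt_a) <= threshold and math.dist(pred_b, gt_b) <= threshold:
            correct += 1
    return correct / len(gt_limbs)


# Usage: the first limb (length 10) tolerates 5 px per endpoint and is correct;
# the second limb (length 4) tolerates 2 px and fails on its first endpoint.
example_gt_limbs = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 0.0), (0.0, 4.0))]
example_pred_limbs = [((1.0, 0.0), (10.0, 3.0)), ((0.0, 3.0), (0.0, 4.0))]
example_pcp = pcp(example_pred_limbs, example_gt_limbs)
```

Because the threshold scales with limb length, short limbs such as forearms are judged more strictly in absolute pixels than long ones.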
Notes for Paper “Towards Accurate Multi-person Pose Estimation in the Wild”
Paper:
Papandreou, George, et al. “Towards accurate multiperson pose estimation in the wild.” arXiv preprint arXiv:1701.01779 (2017).
key:
ResNet for keypoint, heatmap and offset
- Performance
- Basics
- Without ground truth of the location or the scale of the person.
- Top-down approach
- Main methods
- Pipeline:
- Person box detection using Faster-RCNN (ResNet-101)
- CNN backbone pre-trained on ImageNet
- No multi-scale evaluation.
- Person pose estimation. Use ResNet 101 for heatmap and offset.
- K=17 keypoints.
- Classification and regression
- First, classify each pixel as lying (1) or not lying (0) in the neighborhood of any keypoint (heatmap).
- Then predict a 2D local offset vector to recover the precise keypoint location.
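The heatmap-plus-offset idea above can be illustrated with a very simple decoder: pick the most confident pixel for a keypoint, then refine it with the offset predicted at that pixel. This is a simplified stand-in, not the paper's aggregation step (which these notes do not describe); the function name `decode_keypoint` and the data layout (nested lists, offsets as `(dx, dy)` pairs) are assumptions.

```python
def decode_keypoint(heatmap, offsets):
    """Return ((x, y), score) for one keypoint.

    heatmap: rows of per-pixel probabilities of being near the keypoint.
    offsets: rows of (dx, dy) vectors pointing from each pixel to the keypoint.
    """
    best_score = None
    best_x = best_y = 0
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if best_score is None or score > best_score:
                best_score, best_x, best_y = score, x, y

    # Refine the coarse pixel location with the offset predicted at that pixel.
    dx, dy = offsets[best_y][best_x]
    return (best_x + dx, best_y + dy), best_score


# Usage: the peak is at pixel (x=1, y=1); its offset moves the estimate off-grid.
example_heatmap = [
    [0.1, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.0, 0.2, 0.1],
]
example_offsets = [[(0.0, 0.0) for _ in range(3)] for _ in range(3)]
example_offsets[1][1] = (0.25, -0.5)
example_location, example_confidence = decode_keypoint(example_heatmap, example_offsets)
```

The classification output alone is limited to the resolution of the heatmap grid; the regressed offset is what allows a location finer than one grid cell.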
- Take home message
- Other methods mentioned
- Evaluation
- Object Keypoint Similarity
- COCO
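Object Keypoint Similarity (OKS) plays the role for keypoints that IoU plays for boxes. A minimal sketch of the COCO-style definition follows: each labeled keypoint contributes exp(-d^2 / (2 s^2 k^2)), where d is the distance to the ground truth, s^2 is the object area, and k is a per-keypoint constant. The function name `oks` and the example values of `k` are hypothetical; the official evaluation code should be consulted for the constants actually used.

```python
import math


def oks(pred_keypoints, gt_keypoints, visibility, area, k_constants):
    """Average per-keypoint similarity over labeled keypoints (visibility > 0)."""
    total = 0.0
    labeled = 0
    for pred, gt, vis, k in zip(pred_keypoints, gt_keypoints, visibility, k_constants):
        if vis <= 0:
            continue  # unlabeled keypoints do not affect the score
        dist_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
        total += math.exp(-dist_sq / (2.0 * area * k ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0


# Usage: two labeled keypoints on an object of area 100 with k = 0.1 each.
# The second prediction is off by (1, 1), so d^2 = 2 and its term is exp(-1).
example_gt_kpts = [(0.0, 0.0), (10.0, 0.0)]
example_pred_kpts = [(0.0, 0.0), (11.0, 1.0)]
example_oks = oks(example_pred_kpts, example_gt_kpts, [1, 1], area=100.0, k_constants=[0.1, 0.1])
```

Average precision (AP) on COCO keypoints is then computed by thresholding OKS, in the same way that detection AP thresholds IoU.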