Notes for Paper “Associative Embedding: End-to-End Learning for Joint Detection and Grouping”
Paper:
Newell, Alejandro, Zhiao Huang, and Jia Deng. “Associative embedding: End-to-end learning for joint detection and grouping.” Advances in Neural Information Processing Systems. 2017.
 Performance
 Basics
 Associative embedding
 Jointly perform detection and grouping using a single-stage deep network trained end-to-end.
 For each detection, introduce a “tag” (a real number) that identifies which group the detection belongs to.
 Note: We have no ground truth tags for the network to predict, because what matters is not the particular tag values, only the difference between them.
 Output: Two heatmaps
 A heatmap of per-pixel detection scores (a detection score at each pixel for each joint).
 A heatmap of per-pixel identity tags (a tag value at each pixel for each joint).
 For multi-person pose estimation, output a detection heatmap and a tagging heatmap for each body joint, then group body joints with similar tags into individual people.
 Two loss functions together
 Detection loss: mean squared error (MSE) between each predicted detection heatmap and its ground-truth heatmap (a 2D Gaussian activation at each keypoint location).
 Grouping loss: compare the tags within each person and across people. Tags within a person should be the same, while tags across people should be different.
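The two losses above can be sketched as follows. This is a minimal illustration, assuming scalar tags and a pull/push form of the grouping loss (squared distance to each person's mean tag, plus a Gaussian penalty on pairs of mean tags). The function names, the `sigma` defaults, and the choice to drop the constant self-pair terms are assumptions for illustration, not the authors' code.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Ground-truth heatmap: an unnormalized 2D Gaussian peaked at (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def detection_loss(pred, target):
    """Mean squared error between a predicted and a ground-truth heatmap."""
    return float(np.mean((pred - target) ** 2))

def grouping_loss(tags_per_person, sigma=1.0):
    """tags_per_person: list of 1D arrays, one per person, holding the tag
    values read out at that person's ground-truth joint locations."""
    means = np.array([np.mean(t) for t in tags_per_person])
    n = len(means)
    # Pull term: tags within a person should match that person's mean tag.
    pull = sum(np.sum((t - m) ** 2) for t, m in zip(tags_per_person, means)) / n
    # Push term: mean tags of different people should be far apart.
    diff = means[:, None] - means[None, :]
    pairwise = np.exp(-diff ** 2 / (2.0 * sigma ** 2))
    push = (np.sum(pairwise) - n) / n ** 2  # subtract the constant self-pairs
    return float(pull + push)
```

Only differences between tags enter the loss, which is why no ground-truth tag values are needed: any assignment that is constant within a person and well separated across people scores close to zero.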
 Other methods mentioned.
 Vector embedding
 Perceptual organization: group pixels of an image into regions, parts and objects.
 Multi-person pose estimation
 Instance segmentation
 Evaluation
 Dataset
 MPII human multi-person http://humanpose.mpiinf.mpg.de/
 25K images containing over 40K people with annotated body joints. Covers 410 human activities.
 COCO 2016 keypoints challenge. http://cocodataset.org/#keypointschallenge2017
 Evaluation metrics
 Average precision (AP)
 Questions: How to get the tags of training data?
Notes for Paper “A Simple, Fast and Highly-Accurate Algorithm to Recover 3D Shape from 2D Landmarks on a Single Image”
Paper:
Zhao, Ruiqi, Yan Wang, and Aleix M. Martinez. “A simple, fast and highly-accurate algorithm to recover 3d shape from 2d landmarks on a single image.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
Notes for Paper “Deep Kinematic Pose Regression”
Paper:
Zhou, Xingyi, et al. “Deep kinematic pose regression.” European Conference on Computer Vision. Springer, Cham, 2016.
Notes for Paper “Structured prediction of 3d human pose with deep neural networks”
Paper:
Tekin, Bugra, et al. “Structured prediction of 3d human pose with deep neural networks.” arXiv preprint arXiv:1605.05180 (2016).
Notes for Paper “Parsing Occluded People by Flexible Compositions”
Paper:
Chen, Xianjie, and Alan Yuille. “Parsing occluded people by flexible compositions.” Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
Notes for Paper “Compositional human pose regression”
Paper:
Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV). Vol. 2. 2017.
Key: Structure-aware


 Performance:
 48.3mm on H3.6M Protocol 1 (Avg joint error)
 59.1mm on H3.6M Protocol 2 (Avg joint error)
 PCKh@0.5 86.4 on MPII
 Evaluation
 Metrics:
 Absolute
 3D: Procrustes Analysis + MPJPE
 2D: PCK
 Relative:
 2D: Mean per bone position error
 3D pose: bone length standard deviation and the percentage of illegal joint angle.
 MPII, H3.6M
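The absolute metrics listed above can be sketched as follows. This is a minimal illustration, assuming a single pose given as a `(J, 3)` or `(J, 2)` array, a similarity (rotation, scale, translation) Procrustes alignment computed by SVD, and a PCK threshold expressed as a fraction `alpha` of a reference length such as head size. The function names and signatures are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align pred (J, D) to gt (J, D): rotation, scale, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed so we get a proper rotation.
    d = np.ones(p.shape[1])
    if np.linalg.det(u @ vt) < 0:
        d[-1] = -1.0
    rot = u @ np.diag(d) @ vt
    scale = np.sum(s * d) / np.sum(p ** 2)
    return scale * p @ rot + mu_g

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of joints whose error is within alpha * ref_length."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * ref_length))
```

Calling `mpjpe(procrustes_align(pred, gt), gt)` gives the error after alignment, which removes global rotation, scale, and translation, while `mpjpe(pred, gt)` alone measures the raw error.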
 Basics
 Structure-aware approach
 Use bones instead of joints as pose representation.
 Use joint connection structure to define a compositional loss function.
 Just reparameterizes the pose representation. Compatible with any other algorithm design.
 Both 3D and 2D
 Main method
 Use L1 norm for joint regression. (instead of squared distance)
 Bone based representation.
 Bones are easier to learn than joints, and they express constraints more easily.
 Many pose-driven applications only need local bones, not global joints.
 Use L1 norm for bone loss function.
 Bone is a vector from one joint to another joint. Then the relative joint position is the summation of the bones along the path.
 Network
 ResNet-50 pretrained on ImageNet
 Last FC layer outputs 3D (or 2D) coordinates
 Fine-tuned on the task
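The bone representation and the compositional loss described above can be sketched as follows. This is a minimal illustration, assuming a small hypothetical 5-joint tree (`PARENTS`) in which each parent index is smaller than its child's, and a loss that sums the L1 error of relative positions over chosen joint pairs. The names and the toy skeleton are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical 5-joint tree: parent index per joint, -1 marks the root.
PARENTS = np.array([-1, 0, 1, 1, 3])

def joints_to_bones(joints, parents=PARENTS):
    """Bone k is the vector from the parent joint to joint k (root bone is zero)."""
    bones = np.zeros_like(joints)
    for k, p in enumerate(parents):
        if p >= 0:
            bones[k] = joints[k] - joints[p]
    return bones

def bones_to_relative_joints(bones, parents=PARENTS):
    """Joint position relative to the root: the sum of bones along the path
    from the root. Assumes parents[k] < k so parents are filled in first."""
    rel = np.zeros_like(bones)
    for k, p in enumerate(parents):
        if p >= 0:
            rel[k] = rel[p] + bones[k]
    return rel

def compositional_loss(pred_bones, gt_bones, pairs, parents=PARENTS):
    """L1 error of the relative position between each joint pair (u, v),
    with every relative position composed from bones."""
    pred_rel = bones_to_relative_joints(pred_bones, parents)
    gt_rel = bones_to_relative_joints(gt_bones, parents)
    total = 0.0
    for u, v in pairs:
        delta = (pred_rel[u] - pred_rel[v]) - (gt_rel[u] - gt_rel[v])
        total += np.sum(np.abs(delta))
    return float(total)
```

Because a pair's relative position is the signed sum of the bones on the path between the two joints, an error in one bone shows up in every pair whose path crosses it, which is how long-range errors get penalized even though the network predicts only local bones.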



 Other methods mentioned
 Detection based and regression based
 The heatmaps are usually noisy and multi-modal.
 Problem: These methods simply minimize the per-joint location errors independently and ignore the internal structure of the pose.
 3D pose estimation
 Without prior knowledge of a 3D model
 Use two separate steps: first predict 2D joints, then reconstruct the 3D pose via optimization or search.
 [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joint locations and a sparsity-driven 3D geometric prior to infer the 3D joint locations via an EM (expectation-maximization) algorithm.
 Represent the 3D pose with an over-complete dictionary; use a high-dimensional latent pose representation.
 Extend Hourglass from 2D to 3D
 With prior knowledge of a 3D model
 Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joints.
 The kinematic model parameterization is highly nonlinear and its optimization in deep networks is hard.
 2D pose estimation
 Pure Graphical models, inference models.
 PS model
 Graphical model with CNN
 Evaluation
 Dataset: H3.6M
 Metrics:
 59.1 mm Average joint error.
 86.4% PCKh@0.5
 Coding
 Caffe
 Two GPU

Notes for Paper “A limb-based graphical model for human pose estimation”
Paper:
Liang, Guoqiang, et al. “A limb-based graphical model for human pose estimation.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).


 Code not available
 Caffe
 NVIDIA Tesla K40m GPU
 Basics
 New task: Human limb detection
 Detect and represent the local image appearance.
 Use human limbs to augment constraints between neighboring human joints.
 Design a new limb representation: Model a limb as a wide line.
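The "limb as a wide line" idea can be sketched as a target map that marks every pixel close to the segment between two joints. This is a minimal illustration, assuming a binary map and a hypothetical `line_width` parameter in pixels. The notes do not give the paper's exact rendering, so the details here are assumptions.

```python
import numpy as np

def limb_heatmap(height, width, joint_a, joint_b, line_width=3.0):
    """Binary map that is 1 for pixels within line_width / 2 of segment a-b.
    joint_a and joint_b are (x, y) pixel coordinates."""
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs, ys], axis=-1).astype(float)  # (H, W, 2) as (x, y)
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        # Degenerate limb: both joints coincide, so measure distance to a point.
        t = np.zeros((height, width))
    else:
        # Projection parameter of each pixel onto the segment, clamped to [0, 1].
        t = np.clip(((pts - a) @ ab) / denom, 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist = np.linalg.norm(pts - closest, axis=-1)
    return (dist <= line_width / 2.0).astype(np.float32)
```

Unlike a pair of joint heatmaps, this target covers the image region that connects the two joints, so a detector trained on it has to look at the local appearance of the limb itself.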
 Main method: The ConvNet consists of two modules: a limb and joint detector, and a limb-based graphical model. Both output heatmaps and are trained with a Euclidean distance loss.
 Unified framework detector: VGG16 architecture.
 Human limb detection combined with joint localization
 Integrate the two detection processes in a single CNN
 After the initial detections, a two-step graphical model is applied.
 It captures the spatial relationships among human joints and among limbs in a coarse-to-fine way.
 First step: A fully connected graphical model is used to capture the coarse relation from an arbitrary
 Second step: Construct a new pairwise relation term based on limbs.
 Other methods mentioned
 Define the relationship as geometric constraint on the relative locations of two neighboring joints.
 Not using the local appearance (image input itself) of the region connecting two neighboring joints
 Leads to problems: double-counting and localization failure.
 PS model (Pictorial Structures)
 Most popular and influential model.
 Model human limb as a rigid oriented rectangle
 Model human limb as bar, detect it by searching parallel edges.
 Model a limb with 2 joints. Or add an extra joint at the middle point.
 Use image segmentation methods to distinguish limbs from background.
 ConvNet based pose estimation
 Extract appearance and type score.
 Heatmap
 Heatmap-based methods are per-pixel classification problems with large contextual information.
 Use a ConvNet to learn an MRF-based graphical model.
 Add motion feature
 For Spatial relations:
 Tree structure.
 Appearance and relation models.
 The relation among human parts is defined as geometric constraints on the location and orientation of parts.
 Spring-like model
 Conditional probability of joint locations
 Note: For joints with higher flexibility, the constraint is too weak.
 Graphical model over parts.
 Nodes representing parts
 Edges encoding constraints.
 Note: Limited by handcrafted features and tree-based graphical models, the accuracy was not good.


 Limb modeling:

 Evaluation
 PCP 74.6 on LSP
 Dataset: FLIC, LSP
Notes for Paper “Towards Accurate Multi-person Pose Estimation in the Wild”
Paper:
Papandreou, George, et al. “Towards accurate multi-person pose estimation in the wild.” arXiv preprint arXiv:1701.01779 8 (2017).
key:
ResNet for keypoint, heatmap and offset
 Performance
 Basics
 Without ground truth of the location or the scale of the person.
 Top-down approach
 Main methods
 Pipeline:
 Person box detection using Faster R-CNN (ResNet-101)
 CNN backbone pretrained on ImageNet
 No multiscale evaluation.
 Person pose estimation: use ResNet-101 to predict heatmaps and offsets.
 K=17 keypoints.
 Classification && Regression
 First, classify each pixel (0 or 1) by whether it lies in the neighborhood of a keypoint (heatmap).
 Then predict a 2D local offset vector to recover the precise keypoint location.
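One way to combine the two outputs above is to let every confident pixel vote for the location its offset points to. This is a minimal sketch for a single keypoint, assuming a simple heatmap-weighted voting scheme with a hypothetical `threshold`. The notes do not say how the heatmap and offsets are fused, so the voting rule is an assumption for illustration.

```python
import numpy as np

def decode_keypoint(heatmap, offsets, threshold=0.5):
    """heatmap: (H, W) probability that a pixel is near the keypoint.
    offsets: (H, W, 2) predicted (dx, dy) from each pixel to the keypoint.
    Returns the (x, y) pixel with the most heatmap-weighted votes.
    If no pixel passes the threshold, the result defaults to (0, 0)."""
    h, w = heatmap.shape
    votes = np.zeros((h, w))
    ys, xs = np.nonzero(heatmap > threshold)
    for y, x in zip(ys, xs):
        # Each confident pixel votes for the location its offset points to.
        tx = int(np.rint(x + offsets[y, x, 0]))
        ty = int(np.rint(y + offsets[y, x, 1]))
        if 0 <= tx < w and 0 <= ty < h:
            votes[ty, tx] += heatmap[y, x]
    ty, tx = np.unravel_index(np.argmax(votes), votes.shape)
    return int(tx), int(ty)
```

The heatmap only says "a keypoint is nearby", so it can be predicted at coarse resolution, while the offsets carry the sub-region precision.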
 Take home message
 Other methods mentioned
 Evaluation
 Object Keypoint Similarity
 COCO
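Object Keypoint Similarity can be sketched as follows. This is a minimal illustration, assuming the usual COCO-style form in which each labeled keypoint contributes `exp(-d^2 / (2 * area * kappa^2))` and the result is averaged over labeled keypoints. The function name and argument layout are assumptions, and real per-keypoint constants should come from the COCO evaluation code.

```python
import numpy as np

def object_keypoint_similarity(pred, gt, visibility, area, kappas):
    """pred, gt: (K, 2) keypoint coordinates.
    visibility: (K,) flags, > 0 means the keypoint is labeled.
    area: object area in pixels (the squared object scale).
    kappas: (K,) per-keypoint falloff constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    labeled = visibility > 0
    if not np.any(labeled):
        return 0.0
    sim = np.exp(-d2 / (2.0 * area * kappas ** 2))
    # Average only over keypoints that have ground-truth labels.
    return float(np.mean(sim[labeled]))
```

OKS plays the role that IoU plays in box detection: a prediction counts as correct at a given OKS threshold, and average precision is then computed over a range of thresholds.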