A simple baseline for bottom-up human pose estimation. Models trained on the COCO and CrowdPose datasets are available. Contributions to this project are welcome.
Earlier project: SimplePose
Guiding offsets greedily “connect” adjacent keypoints that belong to the same person.
(a): Responses of “left shoulder” (b): Responses of “left hip”
(c): Guiding offsets from “left shoulder” to “left hip” (d): Candidate keypoints and limbs
(e): Greedy keypoint grouping (f): Final result
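The grouping idea shown in panels (a) to (f) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `greedy_group`, the `(x, y, score)` data layout, and the `dist_max` threshold are all hypothetical. Each source keypoint (for example a left shoulder) carries one guiding offset pointing at the expected location of the adjacent keypoint (for example the left hip), and the closest candidate pairs are accepted first.

```python
import math


def greedy_group(src_kpts, dst_kpts, offsets, dist_max=40.0):
    """Greedily pair source keypoints with destination keypoints.

    src_kpts, dst_kpts: lists of (x, y, score) candidates (assumed layout).
    offsets: one (dx, dy) guiding offset per source keypoint.
    Returns a list of (src_index, dst_index) pairs.
    """
    candidates = []
    for i, ((sx, sy, _), (dx, dy)) in enumerate(zip(src_kpts, offsets)):
        gx, gy = sx + dx, sy + dy  # location the guiding offset points to
        for j, (tx, ty, _) in enumerate(dst_kpts):
            dist = math.hypot(gx - tx, gy - ty)
            if dist <= dist_max:  # ignore candidates that are too far away
                candidates.append((dist, i, j))

    # Accept the closest matches first; each keypoint is used at most once.
    candidates.sort()
    used_src, used_dst, pairs = set(), set(), []
    for _, i, j in candidates:
        if i in used_src or j in used_dst:
            continue
        used_src.add(i)
        used_dst.add(j)
        pairs.append((i, j))
    return pairs


# Two people: shoulders and hips are listed in different orders on purpose.
shoulders = [(100.0, 50.0, 0.9), (300.0, 60.0, 0.8)]
hips = [(305.0, 160.0, 0.7), (98.0, 150.0, 0.85)]
offsets = [(0.0, 100.0), (4.0, 98.0)]
pairs = greedy_group(shoulders, hips, offsets)
print(pairs)
```

Sorting by distance before assigning is one simple way to make the matching greedy. A real implementation would likely also take keypoint scores into account.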
- Training Code
- Evaluation Code
- Image Demo
- More (in development)
- Models implemented in PyTorch with automatic mixed precision (using NVIDIA Apex).
- Support for training on multiple GPUs (over 90% utilization on each GPU).
- Fast data preparation and augmentation during training.
- Focal L2 loss for keypoint heatmap regression.
- L1-type loss for guiding offset regression.
- Easy to train and run.
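The two loss terms listed above can be illustrated with a small pure-Python sketch. The exact formulation used in this repository is not given here, so everything below is an assumption. The focal L2 loss is taken to be a squared error scaled by `(1 - s_t) ** gamma`, where `s_t` is the predicted response on foreground pixels and one minus the response on background pixels. The offset loss is taken to be a mean absolute error over valid keypoint locations. The names `focal_l2_loss`, `offset_l1_loss`, and `fg_thresh` are hypothetical, and real training code would operate on tensors rather than Python lists.

```python
def focal_l2_loss(pred, target, gamma=2.0, fg_thresh=0.01):
    """Mean focal-weighted squared error over flat lists of heatmap values.

    Predictions are assumed to lie in [0, 1].
    """
    total = 0.0
    for s, t in zip(pred, target):
        # s_t is high when the prediction already agrees with the target.
        s_t = s if t >= fg_thresh else 1.0 - s
        weight = (1.0 - s_t) ** gamma  # down-weight easy pixels
        total += weight * (s - t) ** 2
    return total / len(pred)


def offset_l1_loss(pred, target, valid):
    """Mean absolute error of (dx, dy) offsets over valid locations only."""
    total, count = 0.0, 0
    for (px, py), (tx, ty), v in zip(pred, target, valid):
        if v:
            total += abs(px - tx) + abs(py - ty)
            count += 1
    return total / max(count, 1)


# An easy background pixel (prediction 0.1) versus a hard one (prediction 0.9).
easy = focal_l2_loss([0.1], [0.0])
hard = focal_l2_loss([0.9], [0.0])
offset = offset_l1_loss(
    [(1.0, 2.0), (5.0, 5.0)], [(0.0, 0.0), (9.0, 9.0)], [True, False]
)
print(easy, hard, offset)
```

With `gamma=2`, the hard pixel contributes far more relative to the easy one than it would under a plain L2 loss, which is the usual motivation for focal weighting.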
- Install the packages listed in requirement.txt. Python 3.6, PyTorch > 1.0, NVIDIA Apex, and other packages are needed.
- Download the COCO and CrowdPose datasets.
- Download the pre-trained models via: GoogleDrive.
- Change the paths in the code according to your environment.
- Refer to the docs cli-help-evaluate.txt and cli-help-train_dist.txt for the hyper-parameter settings and more information about this project.
- The full project is yet to be released. Also refer to the other branches.
python evaluate.py --no-pretrain --initialize-whole False --checkpoint-whole link2checkpoints_storage/PoseNet_77_epoch.pth --resume --sqrt-re --batch-size 8 --loader-workers 4 --thre-hmp 0.06 --topk 32 --headnets hmp omp --dist-max 40 --long-edge 640 --dataset val --flip-test --thre-hmp 0.04 --person-thre 0.04
Hint: if you want to achieve a higher speed (30+ FPS on a 2080 Ti), do not use --flip-test.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.661
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.854
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.714
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.622
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.722
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.702
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.873
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.747
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.644
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.787
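Note that the evaluation command above passes `--thre-hmp` twice (0.06, then 0.04). Assuming `evaluate.py` parses its options with Python's `argparse` and the default `store` action (an assumption, not confirmed here), the last occurrence is the one that takes effect. The toy parser below, which only mimics two of the flags and uses made-up defaults, shows that behaviour:

```python
import argparse

# Hypothetical stand-in for the real parser; the defaults are placeholders.
parser = argparse.ArgumentParser()
parser.add_argument("--thre-hmp", type=float, default=0.1)
parser.add_argument("--person-thre", type=float, default=0.1)

args = parser.parse_args(
    ["--thre-hmp", "0.06", "--thre-hmp", "0.04", "--person-thre", "0.04"]
)
print(args.thre_hmp)  # the later value overrides the earlier one
```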
python evaluate.py --no-pretrain --initialize-whole False --checkpoint-whole link2checkpoints_storage/PoseNet_77_epoch.pth --resume --sqrt-re --batch-size 8 --loader-workers 4 --thre-hmp 0.06 --topk 32 --headnets hmp omp --dist-max 40 --long-edge 640 --dataset test-dev --flip-test --thre-hmp 0.04 --person-thre 0.04
Hint: if you want to achieve a higher speed (30+ FPS on a 2080 Ti), do not use --flip-test.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.647
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.858
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.705
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.607
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.704
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.696
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.886
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.748
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.636
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.779
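For readers unfamiliar with flip testing, the sketch below shows the general technique on keypoint heatmaps: the network is also run on the horizontally mirrored image, its output is mirrored back, left and right channels are swapped, and the two predictions are averaged. This needs a second forward pass, which is presumably why dropping `--flip-test` is faster. The function names, the `[channel][row][col]` layout, and the channel order are assumptions made for illustration. How this repository handles the guiding offsets under flipping is not covered here.

```python
def hflip(heatmap):
    """Mirror one 2-D heatmap (list of rows) horizontally."""
    return [row[::-1] for row in heatmap]


def flip_test_average(hm, hm_flipped, flip_pairs):
    """Average heatmaps from the original and the mirrored image.

    hm, hm_flipped: nested lists indexed as [channel][row][col].
    flip_pairs: (left_channel, right_channel) index pairs to swap.
    """
    # Undo the mirroring so both predictions share one coordinate frame.
    restored = [hflip(channel) for channel in hm_flipped]
    # In the mirrored image a left joint looks like a right joint, so swap.
    for a, b in flip_pairs:
        restored[a], restored[b] = restored[b], restored[a]
    return [
        [[(u + v) / 2 for u, v in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(hm, restored)
    ]


# Channel 0: left shoulder, channel 1: right shoulder (1 x 4 toy heatmaps).
hm = [[[0, 0, 0, 1.0]], [[1.0, 0, 0, 0]]]
hm_flipped = [[[0, 0, 0, 0.5]], [[0.25, 0, 0, 0]]]
avg = flip_test_average(hm, hm_flipped, flip_pairs=[(0, 1)])
print(avg)
```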
python evaluate.py --no-pretrain --initialize-whole False --checkpoint-whole link2checkpoints_storage/PoseNet_77_epoch.pth --resume --sqrt-re --batch-size 8 --loader-workers 4 --thre-hmp 0.06 --topk 32 --headnets hmp omp --dist-max 40 --long-edge 640 --dataset test-dev --flip-test --fixed-height --thre-hmp 0.04 --person-thre 0.04
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.656
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.859
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.713
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.633
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.688
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.702
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.886
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.750
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.659
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.762
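A note on reading these tables: for keypoint evaluation, the COCO tools print "IoU" but the similarity that is thresholded is the object keypoint similarity (OKS). The sketch below follows the standard OKS definition for a single person. The function name and the toy `sigmas` are placeholders, not the official per-keypoint constants.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity between one prediction and one annotation.

    pred, gt: lists of (x, y); visible: per-keypoint visibility flags;
    area: annotated object area; sigmas: per-keypoint falloff constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:  # unlabeled keypoints do not contribute
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * (2.0 * sigma) ** 2))
        count += 1
    return total / count if count else 0.0


gt = [(10.0, 10.0), (20.0, 10.0)]
toy_sigmas = [0.05, 0.05]  # placeholder values for illustration only
perfect = oks(gt, gt, [2, 2], area=400.0, sigmas=toy_sigmas)
shifted = oks([(12.0, 10.0), (20.0, 13.0)], gt, [2, 2], 400.0, toy_sigmas)
print(perfect, shifted)
```

AP at "IoU=0.50:0.95" then averages precision over OKS thresholds from 0.50 to 0.95 in steps of 0.05.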
Please refer to the develop branch. Change the config file to the CrowdPose setting, then run
python evaluate_crowd.py --no-pretrain --initialize-whole False --checkpoint-whole link2checkpoints_storage_crowdpose/PoseNet_190_epoch.pth --resume --sqrt-re --batch-size 4 --loader-workers 4 --thre-hmp 0.04 --topk 32 --headnets hmp omp --dist-max 40 --long-edge 640 --dataset test --person-thre 0.02 --flip-test --fixed-height
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.652
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.859
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.695
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.706
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.892
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.743
Average Precision (AP) @[ IoU=0.50:0.95 | type= easy | maxDets= 20 ] = 0.738
Average Precision (AP) @[ IoU=0.50:0.95 | type=medium | maxDets= 20 ] = 0.662
Average Precision (AP) @[ IoU=0.50:0.95 | type= hard | maxDets= 20 ] = 0.548
In our paper, we fine-tune the pre-trained model multi_pose_hg_3x.pth from CenterNet. For simplicity, you can instead start from our pre-trained models (i.e., train from a checkpoint in GoogleDrive).
Run example:
python -m torch.distributed.launch --nproc_per_node=4 train_dist.py --basenet-checkpoint weights/hourglass_104_renamed.pth --checkpoint-whole link2checkpoints_storage/PoseNet_77_epoch.pth --resume --weight-decay 0 --hmp-loss focal_l2_loss --offset-loss offset_instance_l1_loss --sqrt-re --include-scale --scale-loss scale_l1_loss --lambdas 1 0 0 10000 10 --headnets hmp omp --learning-rate 1.25e-4 --fgamma 2 --drop-amp-state --drop-optim-state
We refer to and borrow some code from SimplePose, OpenPifPaf, CenterNet, etc.
If this work helps your research, please cite the corresponding papers:
@inproceedings{li2020simple,
title={Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation},
author={Li, Jia and Su, Wen and Wang, Zengfu},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={34},
number={07},
pages={11354--11361},
year={2020}
}
@article{li2021greedy,
title={Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation},
author={Li, Jia and Xiang, Linhua and Chen, Jiwei and Wang, Zengfu},
journal={arXiv preprint arXiv:2107.03098},
year={2021}
}
@article{li2022multi,
title={Multi-person pose estimation with accurate heatmap regression and greedy association},
author={Li, Jia and Wang, Meng},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
volume={32},
number={8},
pages={5521--5535},
year={2022},
publisher={IEEE}
}