Annotation repo that uses large ViTPose models alongside YOLOv5 detectors to annotate videos. Currently outputs predictions in the "AlphaPose" format.
We use PyTorch 1.9.0 (or the NGC docker image 21.06) and mmcv 1.3.9 for the experiments.
git clone https://github.com/fan23j/yolov5-vitpose-video-annotator.git
cd yolov5-vitpose-video-annotator
cd mmcv
MMCV_WITH_OPS=1 pip install -e .
cd ..
pip install -v -e .
After installing mmcv and this repo, install timm and einops:
pip install timm==0.4.9 einops
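To sanity-check the installation, a quick import test like the one below should run without errors (version numbers will vary with your environment; this assumes the `pip install -v -e .` step installed the mmpose package, as the ViTPose codebase does):

```bash
# Check that PyTorch and mmcv (built with ops) are importable.
python -c "import torch, mmcv; print('torch', torch.__version__, '| mmcv', mmcv.__version__)"
# Check the pose package installed from this repo (assumed to be mmpose).
python -c "import mmpose; print('mmpose', mmpose.__version__)"
```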
Download a pretrained ViTPose model from the tables below (thanks to the authors of ViTPose).
For ViTPose+ pre-trained models, please first re-organize the pre-trained weights using
python tools/model_split.py --source <Pretrained PATH>
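For example, if the downloaded ViTPose+ checkpoint were saved as `vitpose_plus_base.pth` (a hypothetical filename), the call would be:

```bash
# Hypothetical checkpoint path; point --source at the ViTPose+ weights you downloaded.
python tools/model_split.py --source ./vitpose_plus_base.pth
```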
Or, for ViTPose trained on the Halpe dataset: model
Specify the arguments in video.sh:
- `--pose-config`: path to your ViTPose model config
- `--pose-checkpoint`: path to your pretrained ViTPose model
- `--det-checkpoint`: path to your pretrained YOLOv5 detector model
- `--video-path`: path to your input video for inference
- `--out-video-root`: output path for the annotated video and JSON predictions
Run the script with
sh video.sh
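For reference, a hypothetical video.sh might look like the sketch below. The Python entry script name and every path are placeholders rather than part of this repo's documented interface, so adjust them to your checkout and downloaded checkpoints:

```bash
#!/usr/bin/env bash
# Hypothetical example -- the entry script name and all paths are placeholders.
python <your_inference_script>.py \
    --pose-config <path/to/vitpose_config>.py \
    --pose-checkpoint <path/to/vitpose_checkpoint>.pth \
    --det-checkpoint <path/to/yolov5_checkpoint>.pt \
    --video-path <path/to/input_video>.mp4 \
    --out-video-root <output/dir>
```

The annotated video and the AlphaPose-format JSON predictions are written under `--out-video-root`.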
Using detection results from a detector that obtains 56 mAP on person. The configs here are used for both training and testing.
With classic decoder
Model | Pretrain | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-S | MAE | 256x192 | 73.8 | 79.2 | config | log | Onedrive |
ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive |
ViTPose-L | MAE | 256x192 | 78.3 | 83.5 | config | log | Onedrive |
ViTPose-H | MAE | 256x192 | 79.1 | 84.1 | config | log | Onedrive |
With simple decoder
Model | Pretrain | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-S | MAE | 256x192 | 73.5 | 78.9 | config | log | Onedrive |
ViTPose-B | MAE | 256x192 | 75.5 | 80.9 | config | log | Onedrive |
ViTPose-L | MAE | 256x192 | 78.2 | 83.4 | config | log | Onedrive |
ViTPose-H | MAE | 256x192 | 78.9 | 84.0 | config | log | Onedrive |
Note: * There may be duplicate images between the CrowdPose training set and the validation images of other datasets, as discussed in issue #24. Please be careful when using these models for evaluation. We provide the results without the CrowdPose dataset for reference.
Results on MS COCO val set
Using detection results from a detector that obtains 56 mAP on person. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 77.1 | 82.2 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 78.7 | 83.8 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 79.5 | 84.5 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 81.0 | 85.6 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 77.5 | 82.6 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 79.1 | 84.1 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 79.8 | 84.8 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 75.8 | 82.6 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 77.0 | 82.6 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 78.6 | 84.1 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 79.4 | 84.8 | config | log | Onedrive |
Results on OCHuman test set
Using groundtruth bounding boxes. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 88.0 | 89.6 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 90.9 | 92.2 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 90.9 | 92.3 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 93.3 | 94.3 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 88.2 | 90.0 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 91.5 | 92.8 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 91.6 | 92.8 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 78.4 | 80.6 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.6 | 84.8 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 85.7 | 87.5 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 85.7 | 87.4 | config | log | Onedrive |
Results on MPII val set
Using groundtruth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.
Model | Dataset | Resolution | Mean | config | log | weight |
---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 93.3 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 94.0 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 94.1 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 94.3 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 93.4 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 93.9 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 94.1 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 92.7 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 92.8 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 94.0 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 94.2 | config | log | Onedrive |
Results on AI Challenger test set
Using groundtruth bounding boxes. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 32.0 | 36.3 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 34.5 | 39.0 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 35.4 | 39.9 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 43.2 | 47.1 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 31.9 | 36.3 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 34.6 | 39.0 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 35.3 | 39.8 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 29.7 | 34.3 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 31.8 | 36.3 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 34.3 | 38.9 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 34.8 | 39.1 | config | log | Onedrive |
Results on CrowdPose test set
Using YOLOv3 human detector. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AP(H) | config | weight |
---|---|---|---|---|---|---|
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 74.7 | 63.3 | config | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 76.6 | 65.9 | config | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 76.3 | 65.6 | config | Onedrive |
Results on AP-10K test set
Model | Dataset | Resolution | AP | config | log | weight |
---|---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 71.4 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 74.5 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 80.4 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.4 | config | log | Onedrive |
Results on APT-36K val set
Model | Dataset | Resolution | AP | config | log | weight |
---|---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 74.2 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 75.9 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 80.8 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.3 | config | log | Onedrive |
Model | Dataset | Resolution | AP | config | log | weight |
---|---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 54.4 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 57.4 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 60.6 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 61.2 | config | log | Onedrive |
Model | Dataset | Resolution | AUC | config | weight |
---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+WholeBody | 256x192 | 86.5 | config | Coming Soon |
ViTPose+-B | COCO+AIC+MPII+WholeBody | 256x192 | 87.0 | config | Coming Soon |
ViTPose+-L | COCO+AIC+MPII+WholeBody | 256x192 | 87.5 | config | Coming Soon |
ViTPose+-H | COCO+AIC+MPII+WholeBody | 256x192 | 87.6 | config | Coming Soon |
[2023-01-10] Update ViTPose+! It uses MoE strategies to jointly deal with human, animal, and wholebody pose estimation tasks.
[2022-05-24] Upload the single-task training code, single-task pre-trained models, and multi-task pretrained models.
[2022-05-06] Upload the logs for the base, large, and huge models!
[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on COCO test-dev set!
Applications of ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet
We acknowledge the excellent implementation from mmpose and MAE.
For ViTPose
@inproceedings{
xu2022vitpose,
title={Vi{TP}ose: Simple Vision Transformer Baselines for Human Pose Estimation},
author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
booktitle={Advances in Neural Information Processing Systems},
year={2022},
}
For ViTPose+
@article{xu2022vitpose+,
title={ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation},
author={Xu, Yufei and Zhang, Jing and Zhang, Qiming and Tao, Dacheng},
journal={arXiv preprint arXiv:2212.04246},
year={2022}
}
For ViTAE and ViTAEv2, please refer to:
@article{xu2021vitae,
title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
@article{zhang2022vitaev2,
title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
journal={arXiv preprint arXiv:2202.10108},
year={2022}
}