Annotation repo that uses large ViTPose models alongside YOLOv5 detectors to annotate videos. Currently outputs predictions in the "AlphaPose" format.
We use PyTorch 1.9.0 (or the NGC docker image 21.06) and mmcv 1.3.9 for the experiments.
git clone https://github.com/fan23j/yolov5-vitpose-video-annotator.git
cd yolov5-vitpose-video-annotator
cd mmcv
MMCV_WITH_OPS=1 pip install -e .
cd ..
pip install -v -e .
After installing mmcv and this repo, install timm and einops:
pip install timm==0.4.9 einops
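To sanity-check the installation, a quick import test like the one below should run without errors (version numbers will vary with your environment; this assumes the `pip install -v -e .` step installed the mmpose package, as the ViTPose codebase does):

```bash
# Check that PyTorch and mmcv (built with ops) are importable.
python -c "import torch, mmcv; print('torch', torch.__version__, '| mmcv', mmcv.__version__)"
# Check the pose package installed from this repo (assumed to be mmpose).
python -c "import mmpose; print('mmpose', mmpose.__version__)"
```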
Download a pretrained ViTPose model from the tables below (thanks to the authors of ViTPose).
For ViTPose+ pre-trained models, please first re-organize the pre-trained weights using
python tools/model_split.py --source <Pretrained PATH>
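For example, if the downloaded ViTPose+ checkpoint were saved as `vitpose_plus_base.pth` (a hypothetical filename), the call would be:

```bash
# Hypothetical checkpoint path; point --source at the ViTPose+ weights you downloaded.
python tools/model_split.py --source ./vitpose_plus_base.pth
```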
Or, for ViTPose trained on the Halpe dataset: model
Specify the arguments in video.sh:
- `--pose-config`: path to your ViTPose model config
- `--pose-checkpoint`: path to your pretrained ViTPose model
- `--det-checkpoint`: path to your pretrained YOLOv5 detector model
- `--video-path`: path to your input video for inference
- `--out-video-root`: output path for the annotated video and JSON predictions
Run the script with
sh video.sh
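For reference, a hypothetical video.sh might look like the sketch below. The Python entry script name and every path are placeholders rather than part of this repo's documented interface, so adjust them to your checkout and downloaded checkpoints:

```bash
#!/usr/bin/env bash
# Hypothetical example -- the entry script name and all paths are placeholders.
python <your_inference_script>.py \
    --pose-config <path/to/vitpose_config>.py \
    --pose-checkpoint <path/to/vitpose_checkpoint>.pth \
    --det-checkpoint <path/to/yolov5_checkpoint>.pt \
    --video-path <path/to/input_video>.mp4 \
    --out-video-root <output/dir>
```

The annotated video and the AlphaPose-format JSON predictions are written under `--out-video-root`.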
Using detection results from a detector that obtains 56 mAP on person. The configs here are used for both training and testing.
With classic decoder
Model | Pretrain | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-S | MAE | 256x192 | 73.8 | 79.2 | config | log | Onedrive |
ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive |
ViTPose-L | MAE | 256x192 | 78.3 | 83.5 | config | log | Onedrive |
ViTPose-H | MAE | 256x192 | 79.1 | 84.1 | config | log | Onedrive |
With simple decoder
Model | Pretrain | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-S | MAE | 256x192 | 73.5 | 78.9 | config | log | Onedrive |
ViTPose-B | MAE | 256x192 | 75.5 | 80.9 | config | log | Onedrive |
ViTPose-L | MAE | 256x192 | 78.2 | 83.4 | config | log | Onedrive |
ViTPose-H | MAE | 256x192 | 78.9 | 84.0 | config | log | Onedrive |
Note: * There may be duplicate images between the CrowdPose training set and the validation images of other datasets, as discussed in issue #24. Please be careful when using these models for evaluation. We provide the results without the CrowdPose dataset for reference.
Results on MS COCO val set
Using detection results from a detector that obtains 56 mAP on person. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 77.1 | 82.2 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 78.7 | 83.8 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 79.5 | 84.5 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 81.0 | 85.6 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 77.5 | 82.6 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 79.1 | 84.1 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 79.8 | 84.8 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 75.8 | 82.6 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 77.0 | 82.6 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 78.6 | 84.1 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 79.4 | 84.8 | config | log | Onedrive |
Results on OCHuman test set
Using groundtruth bounding boxes. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 88.0 | 89.6 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 90.9 | 92.2 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 90.9 | 92.3 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 93.3 | 94.3 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 88.2 | 90.0 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 91.5 | 92.8 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 91.6 | 92.8 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 78.4 | 80.6 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.6 | 84.8 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 85.7 | 87.5 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 85.7 | 87.4 | config | log | Onedrive |
Results on MPII val set
Using groundtruth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.
Model | Dataset | Resolution | Mean | config | log | weight |
---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 93.3 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 94.0 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 94.1 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 94.3 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 93.4 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 93.9 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 94.1 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 92.7 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 92.8 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 94.0 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 94.2 | config | log | Onedrive |
Results on AI Challenger test set
Using groundtruth bounding boxes. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII | 256x192 | 32.0 | 36.3 | config | | Onedrive |
ViTPose-L | COCO+AIC+MPII | 256x192 | 34.5 | 39.0 | config | | Onedrive |
ViTPose-H | COCO+AIC+MPII | 256x192 | 35.4 | 39.9 | config | | Onedrive |
ViTPose-G | COCO+AIC+MPII | 576x432 | 43.2 | 47.1 | | | |
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 31.9 | 36.3 | config | | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 34.6 | 39.0 | config | | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 35.3 | 39.8 | config | | Onedrive |
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 29.7 | 34.3 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 31.8 | 36.3 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 34.3 | 38.9 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 34.8 | 39.1 | config | log | Onedrive |
Results on CrowdPose test set
Using YOLOv3 human detector. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AP(H) | config | weight |
---|---|---|---|---|---|---|
ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 74.7 | 63.3 | config | Onedrive |
ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 76.6 | 65.9 | config | Onedrive |
ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 76.3 | 65.6 | config | Onedrive |
Results on AP-10K test set
Model | Dataset | Resolution | AP | config | log | weight |
---|---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 71.4 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 74.5 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 80.4 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.4 | config | log | Onedrive |
Results on APT-36K val set
Model | Dataset | Resolution | AP | config | log | weight |
---|---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 74.2 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 75.9 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 80.8 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.3 | config | log | Onedrive |
Model | Dataset | Resolution | AP | config | log | weight |
---|---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 54.4 | config | log | Onedrive |
ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 57.4 | config | log | Onedrive |
ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 60.6 | config | log | Onedrive |
ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 61.2 | config | log | Onedrive |
Model | Dataset | Resolution | AUC | config | weight |
---|---|---|---|---|---|
ViTPose+-S | COCO+AIC+MPII+WholeBody | 256x192 | 86.5 | config | Coming Soon |
ViTPose+-B | COCO+AIC+MPII+WholeBody | 256x192 | 87.0 | config | Coming Soon |
ViTPose+-L | COCO+AIC+MPII+WholeBody | 256x192 | 87.5 | config | Coming Soon |
ViTPose+-H | COCO+AIC+MPII+WholeBody | 256x192 | 87.6 | config | Coming Soon |
[2023-01-10] Update ViTPose+! It uses MoE strategies to jointly deal with human, animal, and wholebody pose estimation tasks.
[2022-05-24] Upload the single-task training code, single-task pre-trained models, and multi-task pretrained models.
[2022-05-06] Upload the logs for the base, large, and huge models!
[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on COCO test-dev set!
Applications of ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet
We acknowledge the excellent implementation from mmpose and MAE.
For ViTPose
@inproceedings{
xu2022vitpose,
title={Vi{TP}ose: Simple Vision Transformer Baselines for Human Pose Estimation},
author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
booktitle={Advances in Neural Information Processing Systems},
year={2022},
}
For ViTPose+
@article{xu2022vitpose+,
title={ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation},
author={Xu, Yufei and Zhang, Jing and Zhang, Qiming and Tao, Dacheng},
journal={arXiv preprint arXiv:2212.04246},
year={2022}
}
For ViTAE and ViTAEv2, please refer to:
@article{xu2021vitae,
title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
@article{zhang2022vitaev2,
title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
journal={arXiv preprint arXiv:2202.10108},
year={2022}
}