This repository aims to extend the capabilities of the original repository with the CMU dataset. See here for more details! Also see the preliminary results here.
For your convenience, I have been regularly updating the wiki page if you would like to see details specific to the CMU dataset. Feel free to check out my personal website for reflections and overview of this project!
The original repository is the official PyTorch implementation of the paper "Learnable Triangulation of Human Pose" (ICCV 2019, oral). Here we tackle the problem of 3D human pose estimation from multiple cameras. We present 2 novel methods — Algebraic and Volumetric learnable triangulation — that outperform previous state of the art.
If you find a bug, have a question or know to improve the code - please open an issue!
This project doesn't have any special or difficult-to-install dependencies. All installation can be down with:
pip install -r requirements.txt
Note: only Human3.6M dataset training/evaluation is available right now. CMU Panoptic dataset will be added soon.
UPDATE: Evaluation of the CMU Panoptic Studio Dataset is no longer WIP. I have been able to successfully test the dataset based on pre-trained weights from Human3.6M, and the volumetric triangulation algorithm. More discussion here and results here. Currently working on the training of the CMU dataset!
UPDATE: You can now download pretrained labels and data from my Google drive here, with supplementary data and weights from the original author's Google drive here
- Download and preprocess the dataset by following the instructions in mvn/datasets/human36m_preprocessing/README.md.
- Download pretrained backbone's weights from here and place them here:
./data/pretrained/human36m/pose_resnet_4.5_pixels_human36m.pth
(ResNet-152 trained on COCO dataset and finetuned jointly on MPII and Human3.6M). - If you want to train Volumetric model, you need rough estimations of the pelvis' 3D positions both for train and val splits. In the paper we estimate them using the Algebraic model. You can use the pretrained Algebraic model to produce predictions or just take precalculated 3D skeletons.
- Download and preprocess the dataset by following the instructions in mvn/datasets/cmu_preprocessing/README.md.
- The config files can be found at
$THIS_REPOSITORY/experiements/[train|eval]/cmupanoptic
- You can also do a quick evaluation using the provided
./eval_cmu
script - You can view preliminary results here
I tried to create documentation on how you can setup, test and train your own general dataset. You can also refer to the wiki pages for setup and [testing/training].
I was able to evaluate the CMU Panoptic dataset using the same ideas, and an example of that is seen above here. I'm also working on testing the CMU Panoptic dataset which does not have ground truth.
In this section we collect pretrained models and configs. All pretrained weights and precalculated 3D skeletons can be downloaded at once from here and placed to ./data/pretrained
, so that eval configs can work out-of-the-box (without additional setting of paths). Alternatively, the table below provides separate links to those files.
Human3.6M:
Model | Train config | Eval config | Weights | Precalculated results | MPJPE (relative to pelvis), mm |
---|---|---|---|---|---|
Algebraic | train/human36m_alg.yaml | eval/human36m_alg.yaml | link | train, val | 22.5 |
Volumetric (softmax) | train/human36m_vol_softmax.yaml | eval/human36m_vol_softmax.yaml | link | — | 20.4 |
CMU Panoptic Studio:
Model | Train config | Eval config | Weights | Precalculated results | MPJPE (relative to pelvis), mm |
---|---|---|---|---|---|
Algebraic | train/cmu_alg.yaml | eval/cmu_alg.yaml | - | - | ?? |
Volumetric (softmax) | train/cmu_vol_softmax.yaml | eval/cmu_vol_softmax.yaml | - | — | ?? |
Every experiment is defined by .config
files. Configs with experiments from the paper can be found in the ./experiments
directory (see model zoo).
To train a Volumetric model with softmax aggregation using 1 GPU, run:
python3 train.py \
--config experiments/human36m/train/human36m_vol_softmax.yaml \
--logdir ./logs
The training will start with the config file specified by --config
, and logs (including tensorboard files) will be stored in --logdir
.
Multi-GPU training is implemented with PyTorch's DistributedDataParallel. It can be used both for single-machine and multi-machine (cluster) training. To run the processes use the PyTorch launch utility.
To train a Volumetric model with softmax aggregation using 2 GPUs on single machine, run:
python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=2345 \
train.py \
--config experiments/human36m/train/human36m_vol_softmax.yaml \
--logdir ./logs
To watch your experiments' progress, run tensorboard:
tensorboard --logdir ./logs
Alternatively, use the script
./scripts/startTensorboard [logs-dir (./logs by default)]
which also overcomes PermissionError
due to /tmp
directory being blocked
You can also visualise the results without tensorboard.
After training, you can evaluate the model. Inside the same config file, add path to the learned weights (they are dumped to logs
dir during training):
model:
init_weights: true
checkpoint: {PATH_TO_WEIGHTS}
Also, you can change other config parameters like retain_every_n_frames_test
.
For H36M, run:
python3 train.py \
--eval --eval_dataset val \
--config experiments/human36m/eval/human36m_vol_softmax.yaml \
--logdir ./logs
or simply,
./scripts/eval_h36m
For CMU, run
python3 train.py \
--eval --eval_dataset val \
--config experiments/human36m/eval/human36m_vol_softmax.yaml \
--logdir ./logs
or simply,
./scripts/eval_cmu
Argument --eval_dataset
can be val
or train
. Results can be seen in logs
directory or in the tensorboard.
Alternatively, after all the pre-processing steps above have been completed, for a quick evaluation of the datasets, you can run the ./scripts/eval_cmu
and ./scripts/eval_human36m
scripts.
IMPORTANT NOTE: There is a bug with the old Python version where multiprocessing
connections are unable to send more than 2 Gb of data. This is fixed in a pull request for new Python versions here.
Therefore, you may possible run into MemoryError
s if running on Linux machines with Python versions < 3.8. The fix to this is to modify the multiprocessing
library's connection.py
file with the updated file here, which is from the aforementioned pull request.
It is advised that you create a backup of the old connection.py
file in case something goes wrong.
Example of where to find the file:
- If using virtual environment:
~/.pyenv/versions/<your_python_version>/lib/python<python_version>/connection.py
- Otherwise:
/usr/lib/python<python_version>/multiprocessing
*
A python script visualise_results.py
has been created to allow you to better view (and play with) the predicted 3D keypoints after testing/evaluation. After evaluation, a results.pkl
file will be saved to ./logs/<experiment-name>/checkpoints/<checkpoint_number>/results.pkl
. For example, ./logs/[email protected]:13:24/checkpoints/0000/results.pkl
.
To use the script, you will need the aforementioned pickle file, and the config file used in the original experiment, also located conveniently in the ./logs/<experiment-name>/
folder.
python3 visualise_results.py <results_pkl_file> <config_yaml_file_used_in_experiment> [n_images_step=1 [save_images_instead=0]]
You can choose to save the images instead of viewing them via OpenCV viewer. If so, images are saved to a new directory saved_images
, in the directory where results_pkl_file
is found.
Feel free to modify the various parameters of the script (such as which camera views to project onto).
- We conduct experiments on two available large multi-view datasets: Human3.6M [2] and CMU Panoptic [3].
- The main metric is MPJPE (Mean Per Joint Position Error) which is L2 distance averaged over all joints.
- We significantly improved upon the previous state of the art (error is measured relative to pelvis, without alignment).
- Our best model reaches 17.7 mm error in absolute coordinates, which was unattainable before.
- Our Volumetric model is able to estimate 3D human pose using any number of cameras, even using only 1 camera. In single-view setup, we get results comparable to current state of the art [6] (49.9 mm vs. 49.6 mm).
MPJPE relative to pelvis:
MPJPE (averaged across all actions), mm | |
---|---|
Multi-View Martinez [4] | 57.0 |
Pavlakos et al. [8] | 56.9 |
Tome et al. [4] | 52.8 |
Kadkhodamohammadi & Padoy [5] | 49.1 |
Qiu et al. [9] | 26.2 |
RANSAC (our implementation) | 27.4 |
Ours, algebraic | 22.4 |
Ours, volumetric | 20.5 |
MPJPE absolute (scenes with invalid ground-truth annotations are excluded):
MPJPE (averaged across all actions), mm | |
---|---|
RANSAC (our implementation) | 22.8 |
Ours, algebraic | 19.2 |
Ours, volumetric | 17.7 |
MPJPE relative to pelvis (single-view methods):
MPJPE (averaged across all actions), mm | |
---|---|
Martinez et al. [7] | 62.9 |
Sun et al. [6] | 49.6 |
Ours, volumetric single view | 49.9 |
- Our best model reaches 13.7 mm error in absolute coordinates for 4 cameras
- We managed to get much smoother and more accurate 3D pose annotations compared to dataset annotations (see video demonstration)
MPJPE relative to pelvis [4 cameras]:
MPJPE, mm | |
---|---|
RANSAC (our implementation) | 39.5 |
Ours, algebraic | 21.3 |
Ours, volumetric | 13.7 |
We present 2 novel methods of learnable triangulation: Algebraic and Volumetric.
Our first method is based on Algebraic triangulation. It is similar to the previous approaches, but differs in 2 critical aspects:
- It is fully differentiable. To achieve this, we use soft-argmax aggregation and triangulate keypoints via a differentiable SVD.
- The neural network additionally predicts scalar confidences for each joint, passed to the triangulation module, which successfully deals with outliers and occluded joints.
For the most popular Human3.6M dataset, this method already dramatically reduces error by 2.2 times (!), compared to the previous art.
In Volumetric triangulation model, intermediate 2D feature maps are densely unprojected to the volumetric cube and then processed with a 3D-convolutional neural network. Unprojection operation allows dense aggregation from multiple views and the 3D-convolutional neural network is able to model implicit human pose prior.
Volumetric triangulation additionally improves accuracy, drastically reducing the previous state-of-the-art error by 2.4 times! Even compared to the best parallelly developed method by MSRA group, our method still offers significantly lower error of 21 mm.
@inproceedings{iskakov2019learnable,
title={Learnable Triangulation of Human Pose},
author={Iskakov, Karim and Burkov, Egor and Lempitsky, Victor and Malkov, Yury},
booktitle = {International Conference on Computer Vision (ICCV)},
year={2019}
}
- 26 Nov 2019: Updataed precalculated results (see this issue).
- 18 Oct 2019: Pretrained models (algebraic and volumetric) for Human3.6M are released.
- 8 Oct 2019: Code is released!
- [1] R. Hartley and A. Zisserman. Multiple view geometry in computer vision.
- [2] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.
- [3] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. S. Godisart, B. Nabbe, I. Matthews, T. Kanade,S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social interaction capture.
- [4] D. Tome, M. Toso, L. Agapito, and C. Russell. Rethinking Pose in 3D: Multi-stage Refinement and Recovery for Markerless Motion Capture.
- [5] A. Kadkhodamohammadi and N. Padoy. A generalizable approach for multi-view 3D human pose regression.
- [6] X. Sun, B. Xiao, S. Liang, and Y. Wei. Integral human pose regression.
- [7] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation.
- [8] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations.
- [9] H. Qiu, C. Wang, J. Wang, N. Wang and W. Zeng. (2019). Cross View Fusion for 3D Human Pose Estimation, GitHub