Training reproducibility thread #71

Open · hkchengrex opened this issue Mar 12, 2023 · 2 comments

hkchengrex (Owner) commented Mar 12, 2023

This is a centralized thread for discussing training-related reproducibility. I noticed variation in the resulting accuracy while I was developing the model (mean and std are given in TRAINING.md), but there have been reports of consistently worse performance when the model is retrained on a different setup (#68 #60 #50). Granted, those who retrain the model successfully are unlikely to open an issue.

I tried to investigate, and I can confirm that the reproducibility problem exists, in ways that I do not understand. I am sharing my findings here in the hope that they help people who wish to retrain the network. A good network/setup should be stable and insensitive to small environmental variations, but, well, here we are.

1. Default setting: Two A6000 GPUs, PyTorch 1.11, CUDA 11.3

Environment creation:

conda create -n xmem-repro python=3.9
conda activate xmem-repro
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install opencv-python
pip install -r requirements.txt
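
To sanity-check that the install matches the pip list further below, a quick one-liner (illustrative, not part of the repo):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"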

Training command:

python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-a6000 --stage 03

(--stage 03 runs stage 0 followed by stage 3. I Ctrl-C'd this run when it entered stage 3 and resumed stage 3 on another server with the same GPUs, from the saved stage-0 checkpoint, because someone else needed the GPUs on the first server:)

python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-s0-a6000 --stage 3 --load_network saves/Mar09_12.57.58_retrain-a6000_s0/Mar09_12.57.58_retrain-a6000_s0_150000.pth

DAVIS 2017 val at 107K iterations: 86.8
DAVIS 2017 val at 110K iterations: 86.8
Training log: https://drive.google.com/drive/folders/1qBkgIh5a3PMyrt9FFxKEBTC3kUnBABTX?usp=sharing

pip list:

Package                 Version
----------------------- ------------
absl-py                 1.4.0
beautifulsoup4          4.11.2
cachetools              5.3.0
certifi                 2022.12.7
charset-normalizer      3.1.0
filelock                3.9.0
gdown                   4.6.4
gitdb                   4.0.10
GitPython               3.1.31
google-auth             2.16.2
google-auth-oauthlib    0.4.6
grpcio                  1.51.3
h5py                    3.8.0
hickle                  5.0.2
idna                    3.4
importlib-metadata      6.0.0
Markdown                3.4.1
MarkupSafe              2.1.2
numpy                   1.24.2
oauthlib                3.2.2
opencv-python           4.7.0.72
Pillow                  9.4.0
pip                     23.0.1
progressbar2            4.2.0
protobuf                4.22.1
pyasn1                  0.4.8
pyasn1-modules          0.2.8
PySocks                 1.7.1
python-utils            3.5.2
requests                2.28.2
requests-oauthlib       1.3.1
rsa                     4.9
setuptools              65.6.3
six                     1.16.0
smmap                   5.0.0
soupsieve               2.4
tensorboard             2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit  1.8.1
thinplate               1.0.0
torch                   1.11.0+cu113
torchaudio              0.11.0+cu113
torchvision             0.12.0+cu113
tqdm                    4.65.0
typing_extensions       4.5.0
urllib3                 1.26.14
Werkzeug                2.2.3
wheel                   0.38.4
zipp                    3.15.0

2. V100 2-GPU setting: Two V100 GPUs, PyTorch 1.11, CUDA 10.2

Environment creation:

conda create -n xmem-repro python=3.9
conda activate xmem-repro
pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu102
pip install opencv-python
pip install -r requirements.txt

Training command (we start from the pretrained s0 weights):

python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-s0-2gpu --stage 3 --load_network saves/XMem-s0.pth

DAVIS 2017 val at 107K iterations: 86.1
DAVIS 2017 val at 110K iterations: 86.0
Training log: https://drive.google.com/drive/folders/1SDpsbpfnz4rRRNTFrXWr3h3D1-20Vj6s?usp=sharing

pip list:

Package                 Version
----------------------- ------------
absl-py                 1.4.0
beautifulsoup4          4.11.2
cachetools              5.3.0
certifi                 2022.12.7
charset-normalizer      3.1.0
filelock                3.9.0
gdown                   4.6.4
gitdb                   4.0.10
GitPython               3.1.31
google-auth             2.16.2
google-auth-oauthlib    0.4.6
grpcio                  1.51.3
h5py                    3.8.0
hickle                  5.0.2
idna                    3.4
importlib-metadata      6.0.0
Markdown                3.4.1
MarkupSafe              2.1.2
numpy                   1.24.2
oauthlib                3.2.2
opencv-python           4.7.0.72
Pillow                  8.4.0
pip                     23.0.1
progressbar2            4.2.0
protobuf                4.22.1
pyasn1                  0.4.8
pyasn1-modules          0.2.8
PySocks                 1.7.1
python-utils            3.5.2
requests                2.28.2
requests-oauthlib       1.3.1
rsa                     4.9
setuptools              65.6.3
six                     1.16.0
smmap                   5.0.0
soupsieve               2.4
tensorboard             2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit  1.8.1
thinplate               1.0.0
torch                   1.11.0+cu102
torchaudio              0.11.0+cu102
torchvision             0.12.0+cu102
tqdm                    4.65.0
typing_extensions       4.5.0
urllib3                 1.26.14
Werkzeug                2.2.3
wheel                   0.38.4
zipp                    3.15.0

3. V100 2-GPU setting: Two V100 GPUs, PyTorch 1.12.1, CUDA 10.2

(No environment creation commands are available because this is my default development environment, which I have used for a long time.)
Training command:

python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-v100 --stage 03

DAVIS 2017 val at 107K iterations: 86.1
DAVIS 2017 val at 110K iterations: 86.1
Training log: https://drive.google.com/drive/folders/1lKnkKywkOqqBJaRMdei06Z_Cs3ynLIgp?usp=sharing

4. V100 4-GPU setting: Four V100 GPUs, PyTorch 1.11, CUDA 10.2

Environment creation is the same as in (2).
Training command (we start from the pretrained s0 weights):

python -m torch.distributed.run --master_port 25763 --nproc_per_node=4 train.py --exp_id retrain-s0-4gpu --stage 3 --load_network saves/XMem-s0.pth

DAVIS 2017 val at 107K iterations: 85.3
DAVIS 2017 val at 110K iterations: 84.2

5. V100 8-GPU setting: Eight V100 GPUs, PyTorch 1.11, CUDA 10.2

Environment creation is the same as in (2).
Training command (we start from the pretrained s0 weights):

python -m torch.distributed.run --master_port 25763 --nproc_per_node=8 train.py --exp_id retrain-s0 --stage 3 --load_network saves/XMem-s0.pth

DAVIS 2017 val at 107K iterations: 85.5
DAVIS 2017 val at 110K iterations: 85.9
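
For quick comparison, here are the DAVIS 2017 val numbers above in one place:

Setting       GPUs      PyTorch  CUDA  107K  110K
1 (default)   2x A6000  1.11     11.3  86.8  86.8
2             2x V100   1.11     10.2  86.1  86.0
3             2x V100   1.12.1   10.2  86.1  86.1
4             4x V100   1.11     10.2  85.3  84.2
5             8x V100   1.11     10.2  85.5  85.9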

TL;DR: Training on two GPUs seems to give more consistent performance (I used two GPUs most of the time while developing this method).
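
For anyone digging into this further, it is worth ruling out the usual PyTorch nondeterminism sources first. Below is a generic checklist, not something train.py enables by default, and I have not verified that it removes the variation reported here:

# Generic PyTorch determinism knobs -- illustrative only, not part of XMem.
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices
    # Prefer deterministic cuDNN kernels and disable the autotuner,
    # which can pick different (nondeterministic) kernels per run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Even with all of this, runs with different world sizes (2 vs. 4 vs. 8 GPUs) can legitimately diverge: data sharding, per-GPU batch statistics, and gradient reduction order all change with the number of processes.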

Feel free to discuss/share below.

@longmalongma

Hi, I am training with four 3080 Ti GPUs (4 × 12 GB). What changes do I need to make to the default parameters?

@longmalongma

Could you provide the recommended parameters for the 4-GPU and 2-GPU setups, respectively?
