multi-gpu error in full dreambooth finetuning #2953
Unanswered
Artemis1111
asked this question in
Q&A
Why does this error occur?
When I enter '0,1' for gpu_ids as shown in the image below, I get the following error message:
ERROR GPU IDs must be an integer between 0 and 128
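The wording of this message suggests the field is validated as a single integer, in which case a comma-separated value such as '0,1' would be rejected. As an illustration only, here is a minimal sketch of how such a value could be parsed into a list of IDs. `parse_gpu_ids` is a hypothetical helper, not the tool's actual code, and the 0 to 128 range is an assumption taken from the error text.

```python
def parse_gpu_ids(text, max_id=128):
    """Parse a string like '0,1' into [0, 1].

    Hypothetical helper for illustration; the upper bound of 128 is
    assumed from the error message, not from the tool's source code.
    """
    ids = []
    for part in text.split(","):
        part = part.strip()
        # Accept only plain ASCII digits, so '', 'x' and '-1' are rejected.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a GPU ID: {part!r}")
        gpu_id = int(part)
        if gpu_id > max_id:
            raise ValueError(f"GPU ID {gpu_id} is above {max_id}")
        ids.append(gpu_id)
    return ids
```

With this sketch, `parse_gpu_ids("0,1")` yields a two-element list, while a value such as `"0,x"` raises a `ValueError`.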
That error alone would not be a problem, but there is more: fine-tuning seems to start normally at first. Later, however, a message shows up saying it can't locate the GPU, followed by a timeout error.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5658/5658 [00:11<00:00, 507.57it/s]
[rank1]:[W1106 04:51:09.921327685 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5658/5658 [00:11<00:00, 506.52it/s]
[rank0]:[W1106 04:51:10.005160912 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
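The warning itself names two possible remedies: passing `device_ids` to `barrier()`, or calling `init_process_group()` with a `device_id`. A rough sketch of what that could look like in a training script follows. It assumes the launcher sets the `LOCAL_RANK` environment variable (as `torchrun` does) and a recent PyTorch version that accepts the `device_id` argument. The function names are hypothetical, and this is not how the fine-tuning script here is known to initialize its processes.

```python
import os


def local_rank_from_env(env=None):
    """Read this process's local rank; defaults to 0 when LOCAL_RANK is unset."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def init_distributed_pinned():
    """Bind this process to one GPU before the first NCCL barrier.

    Hypothetical sketch: torch is imported here so the helper above
    can be used without PyTorch installed.
    """
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    # Make the rank-to-GPU mapping explicit instead of leaving it unknown.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        device_id=torch.device("cuda", local_rank),
    )
    # Tell the barrier which device to use, as the warning suggests.
    dist.barrier(device_ids=[local_rank])
    return local_rank
```

Whether this removes the later timeout would depend on what actually causes the hang, so it may only silence the warning.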
The steps 'get image size from name of cache files', 'read caption', and 'loading image sizes' completed quickly, and 'caching latents...' also ran to completion. However, the messages above appeared right after 'caching latents...' finished.
Finally, I’ll show you how my actual execution code appeared in the terminal.
P.S.) Finetuning works well when attempted with a single GPU.