[Example] Multi node training on XPU device #9490

Open · zhouyu5 wants to merge 13 commits into master

Conversation

@zhouyu5 (Contributor) commented on Jul 5, 2024

This is a follow-up PR to issue #9464.

I used distributed_sampling_multinode.py as a reference to create an XPU version of multi-node training.

I tested on two nodes, each with one XPU card.

As the screenshot below shows, the example passed the test on an internal cluster.

(screenshot: multi-node XPU training run completing successfully)
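For context, the XPU-specific pieces boil down to roughly the following (a simplified sketch, not the exact code in this PR; it assumes intel_extension_for_pytorch and oneccl_bindings_for_pytorch are installed, and the function name is just for illustration):

```python
import os

import torch
import torch.distributed as dist

import intel_extension_for_pytorch as ipex  # noqa: F401  # enables the torch.xpu namespace
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the 'ccl' backend


def init_distributed() -> torch.device:
    # The launcher (e.g. torchrun) is expected to set these environment variables:
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    local_rank = int(os.environ['LOCAL_RANK'])

    # oneCCL ('ccl') plays the role for XPUs that NCCL plays for CUDA GPUs.
    dist.init_process_group('ccl', rank=rank, world_size=world_size)

    device = torch.device('xpu', local_rank)
    torch.xpu.set_device(device)
    return device
```

Everything downstream (neighbor sampling, DistributedDataParallel wrapping, the training loop) should stay essentially the same as in the CUDA example.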

@zhouyu5 (Contributor, Author) commented on Jul 5, 2024

@DamianSzwichtenberg I plan to add documentation for this. Where should I put it? In the current folder?

@rusty1s (Member) left a comment

Thanks. Is there any reason we cannot merge this with multi_gpu/distributed_sampling.py?

@rusty1s rusty1s changed the title [ISSUE 9464] Enable multi node training on XPU device [Example] Multi node training on XPU device Jul 7, 2024
CHANGELOG.md: review thread (outdated, resolved)
@zhouyu5 (Contributor, Author) commented on Jul 9, 2024

> Thanks. Is there any reason we cannot merge this with multi_gpu/distributed_sampling.py?

@rusty1s Good question. I had the same thought, and I think it is possible to merge it with multi_gpu/distributed_sampling.py.

However, let me explain why I kept them separate. As a new contributor, I tried to follow the repo's existing conventions. For NVIDIA GPUs there are two files, distributed_sampling.py and distributed_sampling_multinode.py, so I created the corresponding pair for XPU: distributed_sampling_xpu.py and distributed_sampling_multinode_xpu.py (note the _xpu suffix).

Additionally, the code under multi_gpu contains quite a bit of duplication; some of the variation comes from different datasets or training configurations. I counted 12 files for NVIDIA GPUs in total. By introducing a few conditional branches, we could likely consolidate many of them down to 3-4 files, for example along the lines of the sketch below. What are your thoughts on this approach?
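To make the idea concrete, here is a rough sketch of the kind of branching I have in mind (the --device flag and helper name are hypothetical, not code from this PR):

```python
import argparse
import os

import torch
import torch.distributed as dist


def select_device_and_backend(device_type: str, local_rank: int):
    # Only this small branch differs between CUDA and XPU runs:
    if device_type == 'xpu':
        import intel_extension_for_pytorch as ipex  # noqa: F401
        import oneccl_bindings_for_pytorch  # noqa: F401
        return torch.device('xpu', local_rank), 'ccl'
    return torch.device('cuda', local_rank), 'nccl'


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--device', choices=['cuda', 'xpu'], default='cuda')
    args = parser.parse_args()

    local_rank = int(os.environ.get('LOCAL_RANK', '0'))
    device, backend = select_device_and_backend(args.device, local_rank)
    dist.init_process_group(backend)
    # ...dataset loading, model setup, and the training loop would be shared.
```

With a flag like this, the single-node and multi-node scripts could each serve both device types, which would already remove most of the duplication.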

@zhouyu5 (Contributor, Author) commented on Jul 12, 2024

@rusty1s Updated with a README and guide. Please take a look. Thanks.

@akihironitta (Member) left a comment

This is great! Do you know which parts of the example script are XPU-specific? We could consider merging this file with distributed_sampling_multinode.py if the only changes are the launch command and the torch.device.
