Fail-safe and partial redundancy for HSDP on unreliable compute #561
Comments
I see it mainly as a complementary addition to the existing torch.distributed.elastic functionality. Also, considering the numerous ways to launch a training job, the core functionality would be restoring all model weights, activations, and optimizer states onto a smaller number of workers (scale-down). With a specific launcher, e.g. torchrun or torchx with a Kubernetes scheduler, there is also the option to fully manage the cluster and replace workers (both scale-up and scale-down). Also, for clusters of thousands of GPUs, the overhead won't be significant: for 64-128 or more nodes, the desired overlapping factor might be 2.5%-5% to guarantee resilience to outages, which is a small cost.
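To make the overhead estimate concrete, a quick back-of-the-envelope calculation; the node counts and factors are the ones mentioned above, while the formula itself is only an assumption about how such a factor might be defined:

```python
import math

def tolerated_failures(num_nodes: int, overlap_factor: float) -> int:
    # Assuming the overlapping factor is the fraction of cluster capacity
    # held as redundant copies, roughly this many node losses can be
    # absorbed before a forced scale-down/reshard.
    return math.floor(overlap_factor * num_nodes)

for nodes in (64, 128):
    for factor in (0.025, 0.05):
        print(f"{nodes} nodes @ {factor:.1%} overlap -> "
              f"absorbs {tolerated_failures(nodes, factor)} node failure(s)")
```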
This is actually a great idea: since ECC errors are quite common in HBM, this could help us avoid restarting the entire job when we encounter a single ECC error. But I'm not sure how well this works with distributed checkpointing.
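For the checkpointing question, one relevant data point is that torch.distributed.checkpoint (DCP) can already reload a sharded checkpoint under a different world size. A minimal sketch, assuming an initialized process group and an FSDP/HSDP-wrapped `model` and `optimizer`:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# Save: get_state_dict returns checkpoint-friendly sharded state dicts.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_01000")

# Restore (possibly on a different number of ranks): DCP reshards the
# saved tensors onto the new layout during load.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_01000")
set_state_dict(model, optimizer,
               model_state_dict=model_sd, optim_state_dict=optim_sd)
```

Whether this resharding path is fast enough to serve as the in-memory fail-over the proposal describes is exactly the open question.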
Thanks for this proposal @evkogs! We would need to get more specific about a design to say for sure, but I think there are largely 2 issues that need to be addressed before this could be feasible.
Well, I don't think that's an issue, as it would be an infrequent event, at most 2-3 times across many nodes in an unreliable setup. So I think the current approach would be absolutely fine for real-world cases (the torch.distributed.elastic docs describe this restart-on-membership-change behavior).
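For reference, the elastic launcher already encodes a min/max membership range and bounded restarts. A sketch using torch.distributed.launcher.api, where the entrypoint, rendezvous endpoint, and node counts are hypothetical:

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    # Hypothetical per-worker entrypoint; expected to resume from the
    # latest checkpoint after every re-rendezvous.
    ...

config = LaunchConfig(
    min_nodes=6,                      # keep going if at least 6 of 8 nodes survive
    max_nodes=8,
    nproc_per_node=8,
    rdzv_backend="c10d",
    rdzv_endpoint="head-node:29500",  # hypothetical rendezvous address
    max_restarts=3,                   # bounded re-rendezvous on failure
)
elastic_launch(config, train)()
```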
That's a very good question! I think there's a place for a unified approach combining all the existing ones. I was also curious to look into the PyTorch H2 2024 roadmap, and saw there are plans to integrate up to 5D model parallelism (whatever that means), so it might get even trickier soon. I feel that if we continue to grow the number of abstractions, it won't end well.
I'd like to propose a feature implementing fail-safe mechanisms and partial redundancy in FSDP2 (arguably not plain FSDP anymore, but rather HSDP) to allow for more robust training on unreliable compute resources, such as cloud spot instances. The main goal is to make training more resilient to node failures, GPU issues, and other interruptions.
Key points:
Use case examples:
This feature would greatly enhance the flexibility and reliability of large-scale distributed training, especially in scenarios where compute resources are not guaranteed to be stable throughout the entire training process.
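For concreteness, a minimal sketch of the HSDP baseline this proposal would extend, using FSDP2's fully_shard over a 2-D device mesh. It assumes an initialized process group with 64 ranks; the mesh sizes are illustrative, and fully_shard currently lives under torch.distributed._composable.fsdp:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# 8 nodes x 8 GPUs: replicate parameters across nodes, shard within a node.
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("replicate", "shard"))

model = torch.nn.Linear(4096, 4096)  # stand-in for a real model
fully_shard(model, mesh=mesh)        # 2-D mesh -> HSDP semantics
```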
A key aspect of this implementation would be an overlapping factor, ranging from 0.0 to 1.0, which determines the degree of redundancy. Take 64 GPUs across 8 nodes as a running example; one possible reading of the factor is sketched below.
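The exact semantics of the factor are up for design; one possible reading, purely as an illustration (the field names and the proportionality assumption are mine, not part of any existing API):

```python
import math

def redundant_layout(num_nodes: int, gpus_per_node: int, overlap: float) -> dict:
    # overlap = 0.0 -> plain HSDP, no redundant shard copies;
    # overlap = 1.0 -> every node's shards fully mirrored elsewhere.
    return {
        "total_gpus": num_nodes * gpus_per_node,
        "nodes_holding_backup_shards": math.ceil(overlap * num_nodes),
        "extra_memory_per_gpu": overlap,  # assumed proportional to the factor
    }

# Running example from above: 64 GPUs across 8 nodes.
print(redundant_layout(num_nodes=8, gpus_per_node=8, overlap=0.25))
```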
The system would need to integrate downscaling with resharding and automatic restoring, as well as upscaling with automatic sharding, all governed by the specified overlapping factor (probably using Kubernetes with torchx, for example); a rough sketch of such a supervisor loop follows.
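Putting the pieces together, the orchestration might look roughly like the loop below. This is a sketch only: `healthy_nodes_fn` and `reshard_fn` are hypothetical callbacks that a scheduler integration (e.g. torchx on Kubernetes) would have to provide.

```python
import math
import time

def supervise(total_nodes: int, overlap: float, healthy_nodes_fn, reshard_fn):
    # Tolerate node losses while the redundant margin holds; only once the
    # margin is exhausted, reshard state onto the surviving nodes.
    margin = math.floor(overlap * total_nodes)
    while True:
        healthy = healthy_nodes_fn()
        if healthy < total_nodes - margin:
            reshard_fn(new_node_count=healthy)  # downscale + restore
            total_nodes = healthy
            margin = math.floor(overlap * total_nodes)
        time.sleep(30)  # polling interval, arbitrary
```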
I'd be happy to discuss this further and provide more details if needed! Looking forward to your thoughts on this proposal!