multi-node training fixes for state tracker #1040

Merged: 6 commits, Oct 11, 2024

Commits on Oct 10, 2024

  1. for distributed training we should save rank-specific files in shared storage

    bghira committed Oct 10, 2024
    f0a76fc
  2. d5c62e7
  3. update header txt

    bghira committed Oct 10, 2024
    a37899d
  4. fix rank retrieval

    bghira committed Oct 10, 2024
    160ed39
  5. save state on all nodes

    bghira committed Oct 10, 2024
    0853ad3
  6. disable validations for deepspeed zero3, enable benchmarking for deepspeed zero1 or 2

    bghira committed Oct 10, 2024
    f7ec503