Checkpointing Example #164

Merged
merged 16 commits into from
Jul 6, 2023

Conversation

Member

@satyaog satyaog commented Mar 14, 2023

No description provided.

@satyaog satyaog force-pushed the checkpoint branch 2 times, most recently from 4d29c95 to 33627fe, March 14, 2023 22:55
@satyaog satyaog marked this pull request as ready for review March 21, 2023 20:47
@btravouillon
Collaborator

@satyaog Is it expected to see an overlap with #163 in the changes? A lot of files include the same changes in both PRs.

@btravouillon
Collaborator

Waiting for merge of #161.

@satyaog
Member Author

satyaog commented Apr 18, 2023

There is a lot of overlap with #163. This one should be rebased on #163 once that one is done.

Contributor

@lebrice lebrice left a comment

Not completely done reviewing, moving to #163 before I complete this.

docs/examples/data/checkpointing/job.sh (outdated, resolved)
  run: python3 -m pip install -r docs/requirements.txt

- name: Run files generation tests
  run: pre-commit run --all-files && [[ -z "$(git status -s)" ]]
Contributor

What's && [[ -z "$(git status -s)" ]] for?

Member Author

It makes sure that nothing changed:

  • && : runs the next command only if the previous one succeeded
  • git status -s : prints only the files that differ (including new and deleted ones)
  • [[ -z ... ]] : checks whether the length of the output or variable is zero, and returns true if it is.
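
For readers more at home in Python than in shell, the same "nothing changed" check can be sketched roughly as below. This is only an illustration: the function names `is_clean` and `worktree_is_clean` are hypothetical and are not part of this PR or of the CI workflow.

```python
import subprocess


def is_clean(status_output: str) -> bool:
    """Mimic `[[ -z "$(...)" ]]`: true when the captured output is empty.

    Shell command substitution drops trailing newlines, so we do the same.
    """
    return status_output.rstrip("\n") == ""


def worktree_is_clean(repo_dir: str = ".") -> bool:
    """Run `git status -s` in repo_dir and report whether it printed nothing."""
    result = subprocess.run(
        ["git", "status", "-s"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails
    )
    return is_clean(result.stdout)
```

In a CI step, a false result from `worktree_is_clean()` would mean that running the hooks modified or created files, which is exactly the condition the shell one-liner turns into a failure.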

Contributor

Note to self: we should do pre-commit run --all-files preprocess-py to run only this particular hook.
(I locally added other hooks to my pre-commit config. If we ever add other hooks for writing the examples, e.g. ruff, black, etc., then we should probably run only the generate hook here.)

docs/examples/preprocess.py (outdated, resolved)
docs/examples/frameworks/pytorch_setup/_index.rst (outdated, resolved)
.pre-commit-config.yaml (outdated, resolved)
docs/examples/data/checkpointing/main.py (outdated, resolved)
docs/examples/data/checkpointing/main.py (outdated, resolved)
docs/examples/data/checkpointing/main.py (outdated, resolved)
docs/examples/data/checkpointing/main.py (outdated, resolved)
docs/examples/data/checkpointing/main.py (outdated, resolved)
@satyaog satyaog force-pushed the checkpoint branch 2 times, most recently from 776bcd7 to ad75d92, May 3, 2023 18:16
@satyaog
Member Author

satyaog commented May 3, 2023

Rebased on #182, which should be merged before this one.

@satyaog satyaog marked this pull request as draft May 31, 2023 17:56
@lebrice lebrice marked this pull request as ready for review June 29, 2023 20:30
@lebrice lebrice requested review from lebrice and obilaniu June 29, 2023 20:30
Contributor

@lebrice lebrice left a comment

(NOTE: I'm approving what has become my own PR; I would need 1-2 other reviews, please.)

Comment on lines 15 to 22
# trap the signal to the main BATCH script here.
sig_handler()
{
    echo "BATCH interrupted"
    wait # wait for all children, this is important!
}

trap 'sig_handler' SIGINT SIGTERM SIGCONT
Contributor

Useless after exec python main.py

Contributor

@lebrice lebrice Jun 29, 2023

Yeah you're right! Thanks @obilaniu. Should we encourage users to do srun python (...) (and trap?) or exec python (...) in your view?

Contributor

This being an example only, I would adopt one of two strategies - either minimize the delta from the foundation mini-example to this one, or minimize the complexity of this one in the absolute.

I would prefer that you drop the trap business, and use the Python equivalent (the signal package).

However, I have no faith in asynchronous checkpointing at all and recommend only synchronous checkpointing. I would use the signal package only to demonstrate in the mini-example that Python can install signal handlers and execute Python code as a result.
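
A minimal sketch of what this suggestion could look like, assuming a simple step-based training loop: the handler only records that a signal arrived, and the checkpoint itself is written synchronously by the loop at a safe point. The names `handle_signal`, `install_handlers`, `train` and the `save_checkpoint` callback are hypothetical, and this is not necessarily what the follow-up commit does.

```python
import signal

# Hypothetical flag: set by the signal handler, polled by the training loop.
stop_requested = False


def handle_signal(signum, frame):
    """Only record that a signal arrived; no I/O happens inside the handler."""
    global stop_requested
    stop_requested = True


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_signal)
    signal.signal(signal.SIGINT, handle_signal)


def train(num_steps, save_checkpoint):
    """Run up to num_steps steps; checkpoint and stop once a signal was seen."""
    for step in range(num_steps):
        # ... one training step would go here ...
        if stop_requested:
            # Synchronous checkpoint: written between steps, never mid-step.
            save_checkpoint(step)
            return step
    return num_steps
```

Calling `install_handlers()` once at startup is enough. Because the handler does nothing beyond setting a flag, the checkpoint is always written from a consistent point in the loop.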

Contributor

Ok, here's my take: 779413b

Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Collaborator

@btravouillon btravouillon left a comment

I'm not able to review the Python code, but the changes look good to me overall.

@obilaniu
Contributor

obilaniu commented Jul 6, 2023

Overall this is fine, however I do not like the strategy for checkpointing. It's not safe.

Your implementation:

  1. For writing:
    1. If it exists, move aside the current checkpoint to a <PATH>.backup file
    2. Write a checkpoint to <PATH>
    3. Unlink <PATH>.backup.
  2. For reading:
    1. If <PATH>.backup exists, load that
    2. If <PATH> exists, load that
    3. Otherwise, start from scratch

Your vulnerability is that if you are killed in step 2 during the first checkpoint's write, the reload will detect no <PATH>.backup but will detect the existence of the (unbeknownst to it, corrupt) checkpoint <PATH>, and fail to reload.
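
The failure mode described above can be reproduced with a small, self-contained sketch. It assumes `pickle` as a stand-in serializer (the format actually used by main.py is not shown in this conversation), and the helper names `load_with_backup`, `interrupted_first_write` and `demo` are hypothetical.

```python
import os
import pickle
import tempfile


def load_with_backup(path):
    """Reader of the backup strategy: prefer <PATH>.backup, then <PATH>."""
    for candidate in (path + ".backup", path):
        if os.path.exists(candidate):
            with open(candidate, "rb") as f:
                return pickle.load(f)
    return None  # nothing on disk: start from scratch


def interrupted_first_write(path, state):
    """Simulate a job killed halfway through writing its first checkpoint."""
    data = pickle.dumps(state)
    with open(path, "wb") as f:
        f.write(data[: len(data) // 2])  # only half of the bytes reach the disk


def demo():
    """Return the name of the error raised when reloading the partial file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.pkl")
        interrupted_first_write(path, {"epoch": 1, "weights": list(range(100))})
        try:
            load_with_backup(path)
        except (pickle.UnpicklingError, EOFError) as err:
            return type(err).__name__
    return "loaded"
```

No backup exists yet during the first write, so the reader falls through to the truncated file and raises instead of starting from scratch.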


My preferred implementation:

  1. For writing:
    1. Write a checkpoint to <PATH>.<tmpXXXXXXXX>. The name should be unique to avoid collisions.
    2. Atomically move <PATH>.<tmpXXXXXXXX> into <PATH> using os.replace().
  2. For reading:
    1. If <PATH> exists, load that
    2. Otherwise, start from scratch.
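
The preferred scheme above could be sketched as follows. This is an illustration under assumptions, not the code merged in this PR: `pickle` stands in for whatever serializer the example uses, and `save_checkpoint` / `load_checkpoint` are hypothetical names. The temporary file is created in the same directory as the target so that `os.replace()` stays on a single filesystem.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(state, path):
    """Write state to a unique temporary file, then atomically move it to path."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic: readers see the old or the new file
    except BaseException:
        # The write or the move failed: drop the partial temporary file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path):
    """Return the saved state if <PATH> exists, otherwise None (start from scratch)."""
    path = Path(path)
    if path.exists():
        with open(path, "rb") as f:
            return pickle.load(f)
    return None
```

With this layout, a job killed during the write leaves at most a stray temporary file behind; <PATH> itself is either absent or a complete checkpoint, so the reader needs no backup logic.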

Signed-off-by: Fabrice Normandin <[email protected]>
@lebrice
Contributor

lebrice commented Jul 6, 2023

Hey @obilaniu , I implemented your suggested changes in 4d3fd15

@lebrice lebrice merged commit 71289ec into mila-iqia:master Jul 6, 2023
3 checks passed