
Add desi_update_processing_table_statuses and update dry_run_level's #2385

Merged: 10 commits merged into main from update_processing_table on Oct 10, 2024

Conversation

@akremin (Member) commented Oct 9, 2024

Summary

This PR adds a script, desi_update_processing_table_statuses, that updates only the STATUS column of the processing table. This matters because NERSC deletes all record of Slurm job statuses after 6 months. The new operating procedure would be to add this as a step in the tsnr afterburner and nightqa script, so that every time we sign off on a night we also run this script and all jobs are reflected as COMPLETED.
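Conceptually, the script does something like the following minimal sketch (not the actual desispec implementation: the sacct invocation is standard Slurm, but the helper names and the job-id column name are stand-ins):

    import subprocess

    def slurm_job_state(jobid):
        """Ask Slurm's sacct for the top-level state of one job.

        Sketch only: the real script must also handle jobs that
        sacct has already purged.
        """
        out = subprocess.run(
            ["sacct", "-j", str(jobid), "-X", "-n", "-o", "State"],
            capture_output=True, text=True, check=True,
        )
        state = out.stdout.strip()
        # States like "CANCELLED by <uid>" carry extra tokens; keep the first
        return state.split()[0] if state else "UNKNOWN"

    def update_statuses(ptable):
        """Refresh only the STATUS column; all other columns are untouched."""
        for row in ptable:  # ptable as a list of dicts, for illustration
            if row["STATUS"] in ("SUBMITTED", "PENDING", "RUNNING"):
                row["STATUS"] = slurm_job_state(row["QID"])
        return ptable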

While adding that, I went down a rabbit hole making the dry_run_level behavior more uniform, and I updated the documentation to match. This created some challenges for the unit tests, since some of them assumed behavior that deviated from the new levels; those tests needed their levels changed to reflect the expected behavior in each case.

New dry_run_level documentation:

dry_run_level : int
    If nonzero, this is a simulated run. Default is 0.
    0 runs the code normally.
    1 writes all files but doesn't submit any jobs to Slurm.
    2 writes tables but doesn't write scripts or submit anything.
    3 doesn't write or submit anything but queries Slurm normally for job status.
    4 doesn't write, submit jobs, or query Slurm.
    5 doesn't write, submit jobs, or query Slurm; instead it makes up the status of the jobs.
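To make these semantics concrete, here is an illustrative mapping from level to allowed actions (a hypothetical helper; the pipeline checks the level inline rather than through any such function):

    def dry_run_permissions(dry_run_level):
        """Map a dry_run_level to what a run is allowed to do (illustrative)."""
        return {
            "write_scripts": dry_run_level <= 1,  # levels 0-1 write job scripts
            "write_tables":  dry_run_level <= 2,  # levels 0-2 write tables
            "submit_jobs":   dry_run_level == 0,  # only a real run submits
            "query_slurm":   dry_run_level <= 3,  # level 3 still queries Slurm
            "fake_status":   dry_run_level == 5,  # level 5 invents job statuses
        }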

The one notable change was in desi_proc_night (actually proc_night.py). The test environment claims it's at NERSC for simulation purposes and wants to write out files to check the outputs, but the tests run off-site don't have access to Slurm. If we used a dry_run_level large enough to avoid querying Slurm, we wouldn't write out the files. To get around this, I've added "if" statements to proc_night.py that skip updating the tables when we aren't submitting jobs to Slurm or when the dry_run_level is large enough to avoid Slurm (dry_run_level = 4 or 5), since we want to differentiate the Slurm querying from the file writing.
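A minimal sketch of that guard, with hypothetical names (the real logic lives inline in proc_night.py):

    def maybe_update_proc_table(ptable, dry_run_level, update_fn):
        """Only call the Slurm-backed updater when the level permits querying Slurm."""
        if dry_run_level < 4:   # levels 0-3 may query Slurm
            return update_fn(ptable)
        return ptable           # levels 4-5: leave STATUS untouched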

Test

I believe the unit tests provide good coverage of the dry_run_level changes. I tested the new desi_update_processing_table_statuses script by running it on the last 6 months of data. There were no errors, and the output tables are updated: there are no remaining "SUBMITTED", "PENDING", or "RUNNING" jobs.

Output tables are here:
/global/cfs/cdirs/desi/spectro/redux/ptab_test/daily/processing_tables/

One example would be to compare the original
/global/cfs/cdirs/desi/spectro/redux/daily/processing_tables/processing_table_daily-20241004.csv
with the updated output in:
/global/cfs/cdirs/desi/spectro/redux/ptab_test/daily/processing_tables/processing_table_daily-20241004.csv

256395|256396|,science,3095,20241004,,all,|,a0123456789,0,241004041,,tilenight,31434381,1728149176,PENDING,,241004019|,31434367|,31434381|
256395|256396|,science,3095,20241004,,all,|,a0123456789,0,241004042,,cumulative,31434382,1728149176,PENDING,,241004041|,31434381|,31434382|
256397|256398|,science,4658,20241004,,all,|,a0123456789,0,241004045,,tilenight,31434383,1728149177,PENDING,,241004019|,31434367|,31434383|
256397|256398|,science,4658,20241004,,all,|,a0123456789,0,241004046,,cumulative,31434384,1728149177,PENDING,,241004045|,31434383|,31434384|

versus

256395|256396|,science,3095,20241004,,all,|,a0123456789,0,241004041,,tilenight,31434381,1728149176,COMPLETED,,241004019|,31434367|,31434381|
256395|256396|,science,3095,20241004,,all,|,a0123456789,0,241004042,,cumulative,31434382,1728149176,COMPLETED,,241004041|,31434381|,31434382|
256397|256398|,science,4658,20241004,,all,|,a0123456789,0,241004045,,tilenight,31434383,1728149177,COMPLETED,,241004019|,31434367|,31434383|
256397|256398|,science,4658,20241004,,all,|,a0123456789,0,241004046,,cumulative,31434384,1728149177,COMPLETED,,241004045|,31434383|,31434384|
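A quick way to spot-check an updated table, assuming the processing tables are plain CSV with a STATUS column, as the rows above suggest:

    import pandas as pd

    path = ("/global/cfs/cdirs/desi/spectro/redux/ptab_test/daily/"
            "processing_tables/processing_table_daily-20241004.csv")
    ptab = pd.read_csv(path)
    print(ptab["STATUS"].value_counts())
    # After the update, nothing should still be queued or running:
    assert not ptab["STATUS"].isin(["SUBMITTED", "PENDING", "RUNNING"]).any()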

@akremin requested a review from sbailey on October 9, 2024 at 20:07
@coveralls commented Oct 9, 2024

Coverage Status: 30.081% (-0.1%) from 30.218%, when pulling 5cb164d on update_processing_table into 41da70a on main.

@sbailey (Contributor) left a comment


There are two crash-inducing "typos" noted inline which appear to be leftovers of your IDE crashing and writing broken files. Please fix those and double-check the rest, since our unit test coverage isn't catching them.

I have a knee-jerk reaction against the long name of this script; I can't even think the words quickly. Consider desi_update_proctable_status? Given that we won't be typing this by hand regularly, it doesn't matter that much.

Do you envision running this as a separate scronjob after the last desi_proc_night call? Or run as part of the same script that runs the tsnr_afterburner and nightqa?

Inline review comments (since resolved) on:
bin/desi_proc_night
py/desispec/scripts/daily_processing.py
bin/desi_update_processing_table_statuses
@akremin (Member, Author) commented Oct 10, 2024

Thanks for these comments. I've resolved all of them, including changing the script name to desi_update_proctable_status and adding an optional --outfile argument.

I envision adding this to the wrapper script that calls the tsnr afterburner and nightqa. That way it is run every morning and each time we do reprocessing on daily.

Documentation tests passed at NERSC, so I can't reproduce the failure GitHub is claiming for the commit.

@akremin requested a review from sbailey on October 10, 2024 at 00:15
@sbailey (Contributor) commented Oct 10, 2024

The GitHub doc test is failing due to this warning:

WARNING: failed to reach any of the inventories with the following issues:
intersphinx inventory 'https://matplotlib.org/stable/objects.inv' not fetchable due to <class 'requests.exceptions.HTTPError'>: 403 Client Error: Forbidden for url: https://matplotlib.org/stable/objects.inv

which is also failing on other PRs, independent of the docstring updates you've made here. I think we can merge this and sort out the GitHub docstring tests separately. Thanks for the updates.
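For context, that inventory URL comes from Sphinx's intersphinx extension; a typical conf.py entry looks like the following (illustrative, not necessarily this repo's exact configuration):

    # docs/conf.py (illustrative)
    extensions = ["sphinx.ext.intersphinx"]
    intersphinx_mapping = {
        "matplotlib": ("https://matplotlib.org/stable/", None),
    }

A 403 when fetching that objects.inv makes Sphinx emit exactly this warning, independent of local docstring changes.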

@sbailey merged commit bd7fd13 into main on Oct 10, 2024
24 of 26 checks passed
@sbailey deleted the update_processing_table branch on October 10, 2024 at 00:53