dws: fix Postrun workflow error hang #169

Merged

Member

@jameshcorbett jameshcorbett commented Jun 29, 2024

Problem: if a workflow goes to status: Error, coral2_dws will raise
an exception. For most states the exception will transition the job
to CLEANUP, which eventually moves the workflow to Teardown.
However, if the job is already in CLEANUP and is being held in the
dws-epilog epilog when the exception is thrown, the exception has
no effect and the epilog will never be finished because the workflow
does not progress.

If a workflow moves to status: Error in PostRun or DataOut states,
move it directly to Teardown.
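
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from coral2_dws: the names `Action`, `EPILOG_STATES`, and `action_on_status` are hypothetical, and the sketch assumes the workflow state and status are available as plain strings. It only models the choice between raising an exception and moving the workflow to Teardown.

```python
# Hypothetical sketch: none of these names come from coral2_dws itself.
from enum import Enum


class Action(Enum):
    NONE = "none"
    RAISE_EXCEPTION = "raise_exception"
    MOVE_TO_TEARDOWN = "move_to_teardown"


# Workflow states in which the job is assumed to be held in the dws-epilog
# already, so an exception would have no effect on it.
EPILOG_STATES = frozenset({"PostRun", "DataOut"})


def action_on_status(state: str, status: str) -> Action:
    """Choose how to react to a workflow's status in a given state."""
    if status != "Error":
        return Action.NONE
    if state in EPILOG_STATES:
        # An exception cannot release the epilog here, so skip ahead.
        return Action.MOVE_TO_TEARDOWN
    # In other states the exception sends the job to CLEANUP, which
    # eventually moves the workflow to Teardown.
    return Action.RAISE_EXCEPTION
```

For example, `action_on_status("PostRun", "Error")` selects the direct move to Teardown, while `action_on_status("PreRun", "Error")` falls back to raising an exception.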

@jameshcorbett jameshcorbett marked this pull request as draft June 29, 2024 03:13
Member

@cmoussa1 cmoussa1 left a comment

LGTM 👍

Problem: there are no tests for copy_in and copy_out directives.
The tests will have to be simple because no global Lustre file
system will exist for doing the actual copying.

Add some basic tests.
@jameshcorbett jameshcorbett marked this pull request as ready for review July 1, 2024 16:30
@jameshcorbett
Member Author

Rebased after #168 was merged and dropped the draft label; no code changes. Thanks for the review! Setting MWP.

@mergify mergify bot merged commit 3dffc03 into flux-framework:master Jul 1, 2024
8 checks passed
@jameshcorbett jameshcorbett deleted the postrun-workflow-error-hang branch July 1, 2024 16:44
jameshcorbett added a commit to jameshcorbett/flux-coral2 that referenced this pull request Jul 17, 2024
Problem: as in flux-framework#169, for most states, raising an exception should
be enough to trigger other logic that eventually moves the workflow
to Teardown. However, if the workflow is in PostRun or DataOut, the
exception won't affect the dws-epilog action holding the job, so
the workflow should be moved to Teardown immediately.

Move workflows that are stuck in TransientCondition in DataOut or
PostRun to Teardown immediately.
2 participants