DWS service crashes #81

Merged: 20 commits merged into flux-framework:master from the dws-service-crashes branch (Jul 20, 2023)
Conversation

@jameshcorbett (Member) commented Jul 20, 2023

This PR is a substantial rewrite of dws-jobtap.c and coral2_dws.py to solve the problem of handling coral2_dws.py crashes. Before this PR, the coral2_dws.py service assumes that its lifetime encompasses the lifetime of any job it sees. It can't handle rabbit jobs that existed before it started, because it keeps an internal data structure of active rabbit jobs, and jobs can only be added to that data structure when they are created.

Furthermore, the dws-jobtap.c plugin sends an RPC to coral2_dws.py when a job is ready (by flux-core standards) to transition states, and coral2_dws.py responds to that RPC only when HPE rabbit software gives the go-ahead. But HPE rabbit software can take a considerable amount of time to give the go-ahead. If the coral2_dws.py service crashes or restarts in the meantime, it won't be able to respond to the RPC it received, and the job will be stranded. In the worst case, the job is stranded in the dws epilog and can't be canceled. That is intensely obnoxious.

Instead of making the coral2_dws.py service hold on to RPCs, this PR makes it respond more or less immediately and then send a new RPC to the jobtap plugin once HPE rabbit software gives the go-ahead.
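
A minimal sketch of the new shape of the exchange, assuming the Flux Python bindings' message watcher, respond(), and rpc() helpers; the topic strings, payloads, and method names are illustrative placeholders, not the names flux-coral2 actually registers:

```python
import flux
import flux.constants

def setup_request_cb(handle, watcher, msg, arg):
    # Old design: stash `msg` and respond only when HPE rabbit software
    # eventually gives the go-ahead; a service crash strands the job.
    # New design: acknowledge right away ...
    handle.respond(msg, {"success": True})

def on_rabbit_go_ahead(handle, jobid):
    # ... and when the go-ahead arrives, send a *new* RPC to a method
    # the jobtap plugin now registers (hypothetical topic string):
    handle.rpc("job-manager.dws.prolog-remove", {"id": jobid}).get()

handle = flux.Flux()
watcher = handle.msg_watcher_create(
    setup_request_cb,
    type_mask=flux.constants.FLUX_MSGTYPE_REQUEST,
    topic_glob="dws.setup",
)
watcher.start()
# handle.reactor_run() would drive the watcher in a real service.
```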

Fixes #64

Problem: there is an unnecessary flux_plugin_set_name
call in cray_pals_port_distributor.c. The plugin
has its name set twice.

Remove the duplicate set_name call.
Problem: a public module-level function has a name
prefixed with an underscore, making it private by
convention.

Remove the underscore to follow the naming
convention for public functions.
Problem: the coral2-dws service caches the computes
section of workflows. This makes the service more
stateful than it needs to be.

Remove the caching.
Problem: the coral2-dws service caches the
directivebreakdowns associated with a workflow,
rendering the service more stateful than it needs to be.
The breakdowns are only needed once.

Fetch the breakdowns when they are needed, and don't
cache them.
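
As with the computes section, the data can simply be looked up when it is used. A sketch with the kubernetes Python client; the CRD group, version, and plural are assumptions about the DWS API rather than values taken from flux-coral2:

```python
import kubernetes

def fetch_breakdown(namespace, name):
    # Assumes kubernetes.config.load_kube_config() ran at startup.
    # CustomObjectsApi reads arbitrary CRDs, so nothing is cached:
    # the DirectiveBreakdown is fetched each time it is needed.
    api = kubernetes.client.CustomObjectsApi()
    return api.get_namespaced_custom_object(
        group="dws.cray.hpe.com",     # assumed DWS API group
        version="v1alpha2",           # assumed API version
        namespace=namespace,
        plural="directivebreakdowns",
        name=name,
    )
```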
Problem: the coral2-dws service saves incoming RPCs so it
can respond to them later. However, this means that if the
service crashes, it can no longer communicate with the
jobtap plugin about jobs that were in flight.

Stop saving incoming RPCs; respond to them immediately
and send fresh RPCs later instead of deferred responses.
Problem: Workflows are named predictably, but the
WorkflowInfo constructor requires a 'name' argument.
This makes it difficult to construct WorkflowInfo
instances from a jobid alone.

Make the 'name' field optional.
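
A sketch of the change; the fluxjob-<jobid> scheme below is an assumption standing in for whatever predictable naming the service actually uses:

```python
class WorkflowInfo:
    """Tracking info for the DWS workflow belonging to one job."""

    def __init__(self, jobid, name=None):
        self.jobid = jobid
        # Workflows are named predictably, so when no name is given,
        # derive it from the jobid; callers that know only the jobid
        # can now construct an instance.
        self.name = name if name is not None else f"fluxjob-{jobid}"
```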
@grondo (Contributor) left a comment

Nice! The separate commits in this PR really made it easy to review, thanks!

Just a few minor suggestions inline, mainly regarding mixing actual code changes and formatting/whitespace updates, which can make it difficult to tell what is going on. However, your breakdown was done so well I'd consider it only a minor inconvenience this time.

Also, as an aside it seems like the line breaks in your commit messages might be set too small? Usually the commit title is broken at ~52 characters and commit body at 72 - it seemed like you might be wrapping the commit bodies at around 52? Not a big deal, just thought I'd point it out in case it was an editor setting.

I didn't see anything mandatory to fix, so approving!

Inline review comments (now resolved) on src/modules/coral2_dws.py and src/job-manager/plugins/dws-jobtap.c
@cmoussa1 (Member) left a comment

LGTM too! I agree with @grondo that this was a very well structured PR, great job! I noticed just one commented-out code block (but maybe that's intended so feel free to ignore if so).

Inline review comment (now resolved) on src/job-manager/plugins/dws-jobtap.c
@garlick (Member) commented Jul 20, 2023

Just curious: would the problem with the RPC as it was before be addressed if pending RPCs were automatically sent an error response if the service disconnects? If so then maybe there is a flux-core issue to open...

@jameshcorbett (Member, Author) commented Jul 20, 2023

> Just curious: would the problem with the RPC as it was before be addressed if pending RPCs were automatically sent an error response if the service disconnects? If so then maybe there is a flux-core issue to open...

That would be an improvement over the original, but I think this PR is an improvement over that. If pending RPCs automatically got an error response when the service crashed, the jobtap plugin could use that error response to raise an exception and remove the prolog/epilog, so jobs would never be stuck. But users' jobs would then fail simply because the service crashed.

However, with this PR, the service can crash repeatedly and users' jobs will be unaffected.

Problem: the flux-coral2 service doesn't handle workflows
it finds in the k8s event stream that aren't also in its
cache. This means that after the service crashes, all
workflows in flight are stranded.

Provided a workflow is named properly, have the
service recognize and handle it.
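
A sketch of the recovery path, reusing the hypothetical WorkflowInfo and fluxjob-<jobid> naming scheme from the earlier sketch: parse the jobid back out of the workflow's name and rebuild the cache entry.

```python
import re

# Matches the assumed predictable naming scheme, e.g. "fluxjob-1234"
WORKFLOW_NAME_RE = re.compile(r"^fluxjob-(\d+)$")

def workflow_from_event(workflow_name, cache):
    """Return tracking info for a workflow seen in the event stream."""
    if workflow_name in cache:
        return cache[workflow_name]
    match = WORKFLOW_NAME_RE.match(workflow_name)
    if match is None:
        return None  # not a flux-created workflow; ignore it
    # The workflow predates the service (re)start: rebuild its entry
    # so the in-flight job is no longer stranded.
    info = WorkflowInfo(int(match.group(1)))
    cache[workflow_name] = info
    return info
```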
Problem: the coral2_dws service does not handle
workflows that it doesn't have in its cache in the
post_run RPC callback.

Add handling.
Problem: the dws jobtap plugin progresses jobs
to new states when it receives responses to RPCs
it has sent. However, that forces the original
RPC recipient to hold on to messages until it is
ready to respond. If the recipient crashes,
the job is stranded.

Add a service and handlers to the jobtap plugin.
Move the job-state logic out of RPC-response callbacks
and into the handlers for RPCs sent to the new service.
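
Seen from the service side, the change looks roughly like the sketch below; the method names are hypothetical, following the job-manager.<plugin>.<method> convention for services registered by jobtap plugins:

```python
# Each RPC advances a job past one DWS stage; the jobtap plugin acts
# on the incoming message rather than on a long-held RPC response.
DEPENDENCY_REMOVE = "job-manager.dws.dependency-remove"
PROLOG_REMOVE = "job-manager.dws.prolog-remove"
EPILOG_REMOVE = "job-manager.dws.epilog-remove"

def advance_job(handle, jobid, method):
    # A fresh exchange per stage: nothing stays pending inside
    # coral2_dws, so a crash between stages no longer strands the job.
    handle.rpc(method, {"id": jobid}).get()
```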
Problem: the coral2_dws service needs to modify the
'resources' section of a jobspec. It currently relies
on being sent the resources in the workflow creation RPC.

However, in the event that the service restarts, the
service won't have the resources section of any
outstanding jobs.

Add logic to fetch the resources section of jobs on demand.
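
A sketch of the on-demand fetch via flux-core's job-info service, which returns the raw jobspec on lookup:

```python
import json
import flux

def fetch_resources(handle, jobid):
    # Look the jobspec up via job-info instead of relying on a
    # cached copy that a service restart would have wiped out.
    resp = handle.rpc(
        "job-info.lookup", {"id": jobid, "keys": ["jobspec"], "flags": 0}
    ).get()
    return json.loads(resp["jobspec"])["resources"]

# e.g.: resources = fetch_resources(flux.Flux(), jobid)
```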
Problem: t1000-dws-dependencies.t does not verify that the
jobs it starts actually clean up.

Add cleanup checks.
Problem: the fake coral2_dws service needs to be updated
to send job-manager RPCs to advance job state. Without
the RPCs, jobs will be stuck in dependency, prolog,
or epilog.

Add the RPCs to the fake service.
Problem: the coral2_dws service does not respond to the
RPCs it receives unless an error occurs while processing
them. This keeps the RPC sender from knowing that its
message was processed successfully.

Respond with 'success: True' when the RPC has been
processed.
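
A sketch of the acknowledgement, assuming the Python bindings' respond() helper; handle_setup() is a hypothetical stand-in for the per-RPC work:

```python
def setup_cb(handle, watcher, msg, arg):
    try:
        handle_setup(msg.payload)  # hypothetical per-RPC work
    except Exception as exc:
        # Errors already produced a response; keep doing so.
        handle.respond(msg, {"success": False, "errstr": str(exc)})
        return
    # New: tell the sender its message was processed successfully.
    handle.respond(msg, {"success": True})
```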
Problem: 'flux_log_error' appends errno to log messages,
and errno is set to a spurious value when logging
the fact that an RPC returned with 'success: False'.

Remove the errno suffix by swapping 'flux_log_error'
calls for 'flux_log(..., LOG_ERR, ...)' calls.
Problem: the dws-jobtap plugin's prolog-remove RPC callback
does not update the prolog_active shared state, so the
plugin may attempt to remove a prolog multiple times.

Make the callback update the prolog_active shared state.
Problem: the dws-jobtap plugin does not track whether
a workflow was successfully created for a job. This can
cause issues when the plugin tries to tell the coral2_dws
service to delete a workflow, when in fact no workflow
was ever created.

Before sending a workflow deletion request, check that
a workflow was created to begin with, by looking for a
boolean aux set after creation.
Problem: the dws jobtap plugin treats all exceptions the same.
When any exception occurs, the plugin removes the jobtap
prolog for the job. However, that is not appropriate when
a nonzero-severity exception occurs.

Check exception severity and only take action if the
severity is zero.
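
The plugin is C, but the rule itself is small; as a Python illustration, given that a job-exception event carries an integer severity in its context and only severity zero is fatal:

```python
def should_remove_prolog(exception_context):
    # Nonzero-severity exceptions are non-fatal and the job keeps
    # running, so the prolog must stay in place for them.
    return exception_context.get("severity") == 0
```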
Problem: HPE software is slow, and sometimes tests
time out, although they would pass successfully with
longer timeouts.

Increase timeouts in the test suite.
Problem: the testsuite does not contain any tests for
behavior when the coral2_dws service crashes while
jobs/workflows are in flight.

Add a test.
Problem: a function fetches the jobspec for a job
but doesn't use it.

Remove the fetch.
@jameshcorbett (Member, Author) commented

Thanks for the reviews @grondo and @cmoussa1 ! I addressed your formatting/whitespace comments (breaking them out into a separate PR), and will set MWP.

@mergify (bot) merged commit 4f5119a into flux-framework:master on Jul 20, 2023
8 checks passed
@jameshcorbett deleted the dws-service-crashes branch July 20, 2023 19:06