DWS service crashes #81

Merged: 20 commits merged into flux-framework:master from the dws-service-crashes branch (Jul 20, 2023)
Conversation

@jameshcorbett (Member) commented Jul 20, 2023

This PR is a substantial rewrite of dws-jobtap.c and coral2_dws.py to solve the problem of handling coral2_dws.py crashes. Before this PR, the coral2_dws.py service assumes that its lifetime encompasses the lifetime of any job it sees. It can't handle rabbit jobs that existed before it started, because it keeps an internal data structure of active rabbit jobs, and jobs can only be added to that data structure when they are created.

Furthermore, the dws-jobtap.c plugin sends an RPC to coral2_dws.py when a job is ready (by flux-core standards) to transition states, and coral2_dws.py responds to that RPC only when HPE rabbit software gives the go-ahead. But HPE rabbit software can take a considerable amount of time to give the go-ahead. If the coral2_dws.py service crashes or restarts in the meantime, it won't be able to respond to the RPC it received, and the job will be stranded. In the worst case, the job is stranded in the dws epilog and can't be canceled. That is intensely obnoxious.

Instead of making the coral2_dws.py service hold on to RPCs, this PR makes it respond more or less immediately and then send a new RPC to the jobtap plugin once HPE rabbit software gives the go-ahead.
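
A minimal sketch of the new shape of the exchange, assuming the Flux Python bindings' message watcher, respond(), and rpc() helpers; the topic strings, payloads, and method names are illustrative placeholders, not the names flux-coral2 actually registers:

```python
import flux
import flux.constants

def setup_request_cb(handle, watcher, msg, arg):
    # Old design: stash `msg` and respond only when HPE rabbit software
    # eventually gives the go-ahead; a service crash strands the job.
    # New design: acknowledge right away ...
    handle.respond(msg, {"success": True})

def on_rabbit_go_ahead(handle, jobid):
    # ... and when the go-ahead arrives, send a *new* RPC to a method
    # the jobtap plugin now registers (hypothetical topic string):
    handle.rpc("job-manager.dws.prolog-remove", {"id": jobid}).get()

handle = flux.Flux()
watcher = handle.msg_watcher_create(
    setup_request_cb,
    type_mask=flux.constants.FLUX_MSGTYPE_REQUEST,
    topic_glob="dws.setup",
)
watcher.start()
# handle.reactor_run() would drive the watcher in a real service.
```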

Fixes #64

Problem: there is an unnecessary flux_plugin_set_name
call in cray_pals_port_distributor.c. The plugin
has its name set twice.

Remove the duplicate set_name call.
Problem: a public module-level function has a name
prefixed with an underscore, making it private by
convention.

Remove the underscore to follow the naming
convention for public functions.
Problem: the coral2-dws service caches the computes
section of workflows. This makes the service more
stateful than it needs to be.

Remove the caching.
Problem: the coral2-dws service caches the
directivebreakdowns associated with a workflow,
rendering the service more stateful than it needs to be.
The breakdowns are only needed once.

Fetch the breakdowns when they are needed, and don't
cache them.
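
As with the computes section, the data can simply be looked up when it is used. A sketch with the kubernetes Python client; the CRD group, version, and plural are assumptions about the DWS API rather than values taken from flux-coral2:

```python
import kubernetes

def fetch_breakdown(namespace, name):
    # Assumes kubernetes.config.load_kube_config() ran at startup.
    # CustomObjectsApi reads arbitrary CRDs, so nothing is cached:
    # the DirectiveBreakdown is fetched each time it is needed.
    api = kubernetes.client.CustomObjectsApi()
    return api.get_namespaced_custom_object(
        group="dws.cray.hpe.com",     # assumed DWS API group
        version="v1alpha2",           # assumed API version
        namespace=namespace,
        plural="directivebreakdowns",
        name=name,
    )
```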
Problem: the coral2-dws service saves incoming RPCs so it
can respond to them later. However, this means that if the
service crashes, it can no longer communicate with the
jobtap plugin about jobs that were in flight.

Stop saving incoming RPCs; respond to them immediately
and send fresh RPCs later instead of deferred responses.
Problem: Workflows are named predictably, but the
WorkflowInfo constructor requires a 'name' argument.
This makes it difficult to construct WorkflowInfo
instances from a jobid alone.

Make the 'name' field optional.
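
A sketch of the change; the fluxjob-<jobid> scheme below is an assumption standing in for whatever predictable naming the service actually uses:

```python
class WorkflowInfo:
    """Tracking info for the DWS workflow belonging to one job."""

    def __init__(self, jobid, name=None):
        self.jobid = jobid
        # Workflows are named predictably, so when no name is given,
        # derive it from the jobid; callers that know only the jobid
        # can now construct an instance.
        self.name = name if name is not None else f"fluxjob-{jobid}"
```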
@grondo (Contributor) left a comment

Nice! The separate commits in this PR really made it easy to review, thanks!

Just a few minor suggestions inline, mainly regarding mixing actual code changes and formatting/whitespace updates, which can make it difficult to tell what is going on. However, your breakdown was done so well I'd consider it only a minor inconvenience this time.

Also, as an aside it seems like the line breaks in your commit messages might be set too small? Usually the commit title is broken at ~52 characters and commit body at 72 - it seemed like you might be wrapping the commit bodies at around 52? Not a big deal, just thought I'd point it out in case it was an editor setting.

I didn't see anything mandatory to fix, so approving!

Inline review comments (now resolved) on src/modules/coral2_dws.py and src/job-manager/plugins/dws-jobtap.c
@cmoussa1 (Member) left a comment

LGTM too! I agree with @grondo that this was a very well structured PR, great job! I noticed just one commented-out code block (but maybe that's intended so feel free to ignore if so).

Inline review comment (now resolved) on src/job-manager/plugins/dws-jobtap.c
@garlick (Member) commented Jul 20, 2023

Just curious: would the problem with the RPC as it was before be addressed if pending RPCs were automatically sent an error response if the service disconnects? If so then maybe there is a flux-core issue to open...

@jameshcorbett (Member, Author) commented Jul 20, 2023

> Just curious: would the problem with the RPC as it was before be addressed if pending RPCs were automatically sent an error response if the service disconnects? If so then maybe there is a flux-core issue to open...

That would be an improvement over the original, but I think this PR is an improvement over that. If pending RPCs automatically got an error response when the service crashed, the jobtap plugin could use that error response to raise an exception and remove the prolog/epilog, so jobs would never be stuck. But users' jobs would then fail simply because the service crashed.

However, with this PR, the service can crash repeatedly and users' jobs will be unaffected.

Problem: the flux-coral2 service doesn't handle workflows
it finds in the k8s event stream that aren't also in its
cache. This means that after the service crashes, all
workflows in flight are stranded.

Provided a workflow is named properly, have the
service recognize and handle it.
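
A sketch of the recovery path, reusing the hypothetical WorkflowInfo and fluxjob-<jobid> naming scheme from the earlier sketch: parse the jobid back out of the workflow's name and rebuild the cache entry.

```python
import re

# Matches the assumed predictable naming scheme, e.g. "fluxjob-1234"
WORKFLOW_NAME_RE = re.compile(r"^fluxjob-(\d+)$")

def workflow_from_event(workflow_name, cache):
    """Return tracking info for a workflow seen in the event stream."""
    if workflow_name in cache:
        return cache[workflow_name]
    match = WORKFLOW_NAME_RE.match(workflow_name)
    if match is None:
        return None  # not a flux-created workflow; ignore it
    # The workflow predates the service (re)start: rebuild its entry
    # so the in-flight job is no longer stranded.
    info = WorkflowInfo(int(match.group(1)))
    cache[workflow_name] = info
    return info
```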
Problem: the coral2_dws service does not handle
workflows that it doesn't have in its cache in the
post_run RPC callback.

Add handling.
Problem: the dws jobtap plugin progresses jobs
to new states when it receives responses to RPCs
it has sent. However, that forces the original
RPC recipient to hold on to messages until it is
ready to respond. If the recipient crashes,
the job is stranded.

Add a service and handlers to the jobtap plugin.
Move the job-state logic out of RPC-response callbacks
and into the handlers for RPCs sent to the new service.
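
Seen from the service side, the change looks roughly like the sketch below; the method names are hypothetical, following the job-manager.<plugin>.<method> convention for services registered by jobtap plugins:

```python
# Each RPC advances a job past one DWS stage; the jobtap plugin acts
# on the incoming message rather than on a long-held RPC response.
DEPENDENCY_REMOVE = "job-manager.dws.dependency-remove"
PROLOG_REMOVE = "job-manager.dws.prolog-remove"
EPILOG_REMOVE = "job-manager.dws.epilog-remove"

def advance_job(handle, jobid, method):
    # A fresh exchange per stage: nothing stays pending inside
    # coral2_dws, so a crash between stages no longer strands the job.
    handle.rpc(method, {"id": jobid}).get()
```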
Problem: the coral2_dws service needs to modify the
'resources' section of a jobspec. It currently relies
on being sent the resources in the workflow creation RPC.

However, in the event that the service restarts, the
service won't have the resources section of any
outstanding jobs.

Add logic to fetch the resources section of jobs on demand.
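
A sketch of the on-demand fetch via flux-core's job-info service, which returns the raw jobspec on lookup:

```python
import json
import flux

def fetch_resources(handle, jobid):
    # Look the jobspec up via job-info instead of relying on a
    # cached copy that a service restart would have wiped out.
    resp = handle.rpc(
        "job-info.lookup", {"id": jobid, "keys": ["jobspec"], "flags": 0}
    ).get()
    return json.loads(resp["jobspec"])["resources"]

# e.g.: resources = fetch_resources(flux.Flux(), jobid)
```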
Problem: t1000-dws-dependencies.t does not verify that the
jobs it starts actually clean up.

Add cleanup checks.
Problem: the fake coral2_dws service needs to be updated
to send job-manager RPCs to advance job state. Without
the RPCs, jobs will be stuck in dependency, prolog,
or epilog.

Add the RPCs to the fake service.
Problem: the coral2_dws service does not respond to the
RPCs it receives unless an error occurs while processing
them. This keeps the RPC sender from knowing that its
message was processed successfully.

Respond with 'success: True' when the RPC has been
processed.
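
A sketch of the acknowledgement, assuming the Python bindings' respond() helper; handle_setup() is a hypothetical stand-in for the per-RPC work:

```python
def setup_cb(handle, watcher, msg, arg):
    try:
        handle_setup(msg.payload)  # hypothetical per-RPC work
    except Exception as exc:
        # Errors already produced a response; keep doing so.
        handle.respond(msg, {"success": False, "errstr": str(exc)})
        return
    # New: tell the sender its message was processed successfully.
    handle.respond(msg, {"success": True})
```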
Problem: 'flux_log_error' appends errno to log messages,
and errno is set to a spurious value when logging
the fact that an RPC returned with 'success: False'.

Remove the errno suffix by swapping 'flux_log_error'
calls for 'flux_log(..., LOG_ERR, ...)' calls.
Problem: the dws-jobtap plugin's prolog-remove RPC callback
does not update the prolog_active shared state, so the
plugin may attempt to remove a prolog multiple times.

Make the callback update the prolog_active shared state.
Problem: the dws-jobtap plugin does not track whether
a workflow was successfully created for a job. This can
cause issues when the plugin tries to tell the coral2_dws
service to delete a workflow, when in fact no workflow
was ever created.

Before sending a workflow deletion request, check that
a workflow was created to begin with, by looking for a
boolean aux set after creation.
Problem: the dws jobtap plugin treats all exceptions the same.
When any exception occurs, the plugin removes the jobtap
prolog for the job. However, that is not appropriate when
a nonzero-severity exception occurs.

Check exception severity and only take action if the
severity is zero.
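
The plugin is C, but the rule itself is small; as a Python illustration, given that a job-exception event carries an integer severity in its context and only severity zero is fatal:

```python
def should_remove_prolog(exception_context):
    # Nonzero-severity exceptions are non-fatal and the job keeps
    # running, so the prolog must stay in place for them.
    return exception_context.get("severity") == 0
```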
Problem: HPE software is slow, and sometimes tests
time out, although they would pass successfully with
longer timeouts.

Increase timeouts in the test suite.
Problem: the testsuite does not contain any tests for
behavior when the coral2_dws service crashes while
jobs/workflows are in flight.

Add a test.
Problem: a function fetches the jobspec for a job
but doesn't use it.

Remove the fetch.
@jameshcorbett (Member, Author) commented

Thanks for the reviews @grondo and @cmoussa1 ! I addressed your formatting/whitespace comments (breaking them out into a separate PR), and will set MWP.

@mergify (bot) merged commit 4f5119a into flux-framework:master on Jul 20, 2023
8 checks passed
@jameshcorbett deleted the dws-service-crashes branch July 20, 2023 19:06