DWS service crashes #81
Conversation
Problem: there is an unnecessary flux_plugin_set_name call in cray_pals_port_distributor.c. The plugin has its name set twice. Remove the duplicate set_name call.
Problem: a public module-level function has a name prefixed with an underscore, making it private by convention. Remove the underscore to keep with public function naming convention.
Problem: the coral2-dws service caches the computes section of workflows. This makes the service more stateful than it needs to be. Remove the caching.
Problem: the coral2-dws service caches the directivebreakdowns associated with a workflow, rendering the service more stateful than it needs to be. The breakdowns are only needed once. Fetch the breakdowns when they are needed, and don't cache them.
Problem: the coral2-dws service saves incoming RPCs so it can respond to them later. However, this means that if the service crashes, it can no longer communicate with the jobtap plugin about jobs that were in flight. Stop saving incoming RPCs and instead of responding to incoming RPCs, send new RPCs instead.
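The pattern this commit describes can be sketched as follows. All names here are illustrative stand-ins for the flux-coral2 internals (including the RPC topic string), not the actual API: the service acknowledges each incoming RPC immediately, and later sends a *new* RPC to the jobtap plugin once the go-ahead arrives, so no saved message is lost if the service crashes in between.

```python
class StubHandle:
    """Records responses and RPCs; stands in for a Flux handle."""

    def __init__(self):
        self.responses = []
        self.rpcs = []

    def respond(self, msg, payload):
        self.responses.append((msg, payload))

    def rpc(self, topic, payload):
        self.rpcs.append((topic, payload))


class Service:
    def __init__(self, handle):
        self.handle = handle

    def create_cb(self, msg):
        # Acknowledge immediately instead of stashing `msg` for later;
        # a crash before the go-ahead no longer strands the sender.
        self.handle.respond(msg, {"success": True})

    def on_workflow_ready(self, jobid):
        # Send a fresh RPC to the jobtap plugin's service (topic name
        # is made up here) instead of answering a saved request.
        self.handle.rpc("job-manager.dws.prolog-remove", {"id": jobid})
```

The key design point is that the service keeps no per-message state between the acknowledgment and the later notification.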
Problem: Workflows are named predictably, but the WorkflowInfo constructor requires a 'name' argument. This makes it difficult to construct WorkflowInfo instances from a jobid alone. Make the 'name' field optional.
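A minimal sketch of the optional-name constructor. The real class lives in the coral2_dws module; the naming scheme is assumed here to be `fluxjob-<jobid>`:

```python
class WorkflowInfo:
    """Sketch only: the 'fluxjob-<jobid>' name format is an assumption."""

    def __init__(self, jobid, name=None):
        self.jobid = jobid
        # Derive the predictable name when the caller omits it, so an
        # instance can be constructed from a jobid alone.
        self.name = name if name is not None else f"fluxjob-{jobid}"
```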
Force-pushed from 24f9d50 to ceb50c6
Nice! The separate commits in this PR really made it easy to review, thanks!
Just a few minor suggestions inline, mainly regarding mixing actual code changes and formatting/whitespace updates, which can make it difficult to tell what is going on. However, your breakdown was done so well I'd consider it only a minor inconvenience this time.
Also, as an aside it seems like the line breaks in your commit messages might be set too small? Usually the commit title is broken at ~52 characters and commit body at 72 - it seemed like you might be wrapping the commit bodies at around 52? Not a big deal, just thought I'd point it out in case it was an editor setting.
I didn't see anything mandatory to fix, so approving!
LGTM too! I agree with @grondo that this was a very well structured PR, great job! I noticed just one commented-out code block (but maybe that's intended so feel free to ignore if so).
Just curious: would the problem with the RPC as it was before be addressed if pending RPCs were automatically sent an error response when the service disconnects? If so then maybe there is a flux-core issue to open...
That would be an improvement over the original, but I think this PR is an improvement over that. If RPCs automatically sent an error response, when the service crashed the jobtap plugin could use the RPC error response to raise an exception and remove the prolog/epilog, so jobs would never be stuck. But users' jobs would fail simply because the service crashed. However, with this PR, the service can crash repeatedly and users' jobs will be unaffected.
Problem: the flux-coral2 service doesn't handle workflows it finds in the k8s event stream that aren't also in its cache. This means that after the service crashes, all workflows in flight are stranded. Provided a workflow is named properly, have the service recognize and handle it.
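Recognizing a "properly named" workflow amounts to recovering the jobid from the workflow's name. A hypothetical helper, assuming names of the form `fluxjob-<jobid>` (the helper name and format are assumptions, not the actual code):

```python
import re

# Workflows whose names match this pattern can be handled even when the
# service has no in-memory record of them (e.g. after a crash).
_NAME_RE = re.compile(r"^fluxjob-(\d+)$")


def jobid_from_workflow_name(name):
    """Return the jobid embedded in ``name``, or None if it doesn't match."""
    match = _NAME_RE.match(name)
    return int(match.group(1)) if match else None
```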
Problem: the coral2_dws service does not handle workflows that it doesn't have in its cache in the post_run RPC callback. Add handling.
Problem: the dws jobtap plugin progresses jobs to new states when it receives responses to RPCs it has sent. However, that forces the original RPC recipient to hold on to messages until it is ready to respond. If the recipient crashes, the job is stranded. Add a service and handlers to the jobtap plugin. Move job-state logic from RPC responses to incoming RPCs to the service.
Problem: the coral2_dws service needs to modify the 'resources' section of a jobspec. It currently relies on being sent the resources in the workflow creation RPC. However, in the event that the service restarts, the service won't have the resources section of any outstanding jobs. Add logic to fetch the resources section of jobs on demand.
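The on-demand lookup can be sketched with an injected fetcher. In the real service the fetcher would be a job-info/KVS query; `fetch_jobspec` here is a stand-in, not the actual API:

```python
def get_resources(jobid, fetch_jobspec):
    """Fetch the 'resources' section of a job's jobspec when it is
    needed, rather than relying on it having arrived in the workflow
    creation RPC; a restarted service can therefore still operate on
    jobs submitted before it started."""
    jobspec = fetch_jobspec(jobid)
    return jobspec["resources"]
```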
Problem: t1000-dws-dependencies.t does not verify that the jobs it starts actually clean up. Add cleanup checks.
Problem: the fake coral2_dws service needs to be updated to send job-manager RPCs to advance job state. Without the RPCs, jobs will be stuck in dependency, prolog, or epilog. Add the RPCs to the fake service.
Problem: the coral2_dws service does not respond to the RPCs it receives unless an error occurs while processing them. This keeps the RPC sender from knowing that its message was processed successfully. Respond with 'success: True' when the RPC has been processed.
Problem: 'flux_log_error' appends errno to log messages, and errno is set to a spurious value when logging the fact that an RPC returned with 'success: False'. Remove the errno appending by swapping 'flux_log_error' calls for 'flux_log(..., LOG_ERR, ...)' calls.
Problem: the dws-jobtap's prolog-remove RPC callback does not update the prolog_active shared state, so the plugin may attempt to remove a prolog multiple times. Make the callback update the prolog_active shared state.
Force-pushed from ceb50c6 to e9a3e32
Problem: the dws-jobtap plugin does not track whether a workflow was successfully created for a job. This can cause issues when the plugin tries to tell the coral2_dws service to delete a workflow, when in fact no workflow was ever created. Before sending a workflow deletion request, check that a workflow was created to begin with, by looking for a boolean aux set after creation.
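The guard can be sketched with a plain dict standing in for the jobtap plugin's per-job aux storage (the real plugin is C and uses the jobtap aux mechanism; the key name here is made up):

```python
# Maps (jobid, key) -> value; stand-in for per-job aux storage.
aux = {}


def on_workflow_created(jobid):
    # Record that a workflow really was created for this job.
    aux[(jobid, "dws_workflow_created")] = True


def maybe_request_deletion(jobid, send_delete_rpc):
    """Only ask the service to delete a workflow that actually exists."""
    if aux.get((jobid, "dws_workflow_created")):
        send_delete_rpc(jobid)
        return True
    return False
```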
Problem: the dws jobtap plugin treats all exceptions the same. When any exception occurs, the plugin removes the jobtap prolog for the job. However, that is not appropriate when a severity nonzero exception occurs. Check exception severity and only take action if the severity is zero.
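The severity check itself is small. In flux-core, job exception events carry a 'severity' field in their context, and severity 0 means fatal; a sketch of the predicate (function name is illustrative):

```python
def should_remove_prolog(exception_context):
    """Remove the prolog only for a fatal (severity-zero) exception;
    nonzero-severity exceptions leave the job running, so tearing down
    the prolog would be inappropriate."""
    return exception_context.get("severity") == 0
```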
Problem: HPE software is slow, and sometimes tests time out, although they would pass successfully with longer timeouts. Increase timeouts in the test suite.
Problem: the testsuite does not contain any tests for behavior when the coral2_dws service crashes while jobs/workflows are in flight. Add a test.
Problem: a function fetches the jobspec for a job but doesn't use it. Remove the fetch.
Force-pushed from e9a3e32 to 65c2fef
This PR is a substantial rewrite of `dws-jobtap.c` and `coral2_dws.py` to solve the problem of handling `coral2_dws.py` crashes.

Before this PR is merged, the `coral2_dws.py` service assumes that its lifetime encompasses the lifetime of any job that it sees. It can't handle rabbit jobs that existed before it started. This is because it keeps an internal data structure of active rabbit jobs, and jobs can only be added to the data structure when they are being created.

Furthermore, the `dws-jobtap.c` plugin sends an RPC to `coral2_dws.py` when a job is ready (by flux-core standards) to transition states, and `coral2_dws.py` responds to that RPC when HPE rabbit software gives the go-ahead. But HPE rabbit software can take a considerable amount of time to give the go-ahead. If the `coral2_dws.py` service crashes or restarts in the meantime, it won't be able to respond to the RPC it received, and the job will be stranded. In the worst case, the job is stranded in the dws epilog and can't be canceled. That is intensely obnoxious.

Instead of making the `coral2_dws.py` service hold on to RPCs, make it respond more or less immediately, and send a new RPC to the jobtap plugin once the go-ahead is given by HPE rabbit software.

Fixes #64