set exit-timeout=none for flux batch and flux alloc jobs #6304

Open
grondo opened this issue Sep 25, 2024 · 12 comments

@grondo
Contributor

grondo commented Sep 25, 2024

Flux instances as jobs have a certain level of resilience -- they can lose compute nodes that are leaves in the TBON without being terminated. The idea here is that a node failure within the batch/alloc job will terminate any job using that node. If the instance is running just one full-size job, then termination of that job will cause the batch script to exit and the instance will terminate. If the instance is running SCR or has many small jobs, though, it can continue to get work done.

However, it seems like exit-timeout=30s is getting in the way of this. When a node is lost, the job shell is lost too so the exit timer is started. In the first case, 30s may not be enough time for the parallel job within the instance to terminate, the batch script to finish, and the instance to exit normally. So users see a "doom: first task exited 30s ago" message that is pretty confusing after a node failure. In the second case, users of SCR, or anyone who wants their job to continue, have to remember to set -o exit-timeout=none to get the benefits.

Perhaps it would be best to just set exit-timeout=none in flux batch and flux alloc.
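
For reference, the workaround today is to pass the shell option explicitly at submission time; a minimal sketch (the node count and script name are just placeholders):

```sh
# Current workaround: disable the exit timer per job when submitting the instance
flux batch -N16 -o exit-timeout=none batch.sh
flux alloc -N16 -o exit-timeout=none
```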

@garlick
Member

garlick commented Sep 25, 2024

> When a node is lost, the job shell is lost too so the exit timer is started.

I had to go look at the code to find out why this is true, because the primary mechanism in the doom plugin is an RPC that rank > 0 shells send to the rank 0 shell when a task exits, and if the node dies that RPC wouldn't happen.

But the doom plugin also registers a shell.lost handler which is driven by lost-shell job exceptions posted by the job-exec module when it loses contact with a shell. When that handler is called, it also starts the doom timer.

If I am reading the code correctly, this exception is only raised as a severity=0 (fatal) exception on critical nodes, but the severity is ignored in the shell exception handler.

Would it make sense to pass the severity along as an argument to the shell.lost handler, and change the doom plugin to only take action on severity=0?

@garlick
Member

garlick commented Sep 25, 2024

As I reread my statement above, maybe it is correct the way it is. Tasks did "exit", after all. So disabling the doom plugin is probably the correct approach.

@grondo
Contributor Author

grondo commented Sep 25, 2024

> But the doom plugin also registers a shell.lost handler which is driven by lost-shell job exceptions posted by the job-exec module when it loses contact with a shell. When that handler is called, it also starts the doom timer.

Yes, otherwise normal jobs that lose a shell for some reason (unexpected shell exit for example) would not be able to take advantage of the exit timer.

> Would it make sense to pass the severity along as an argument to the shell.lost handler, and change the doom plugin to only take action on severity=0?

This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs? (For instance flux run OPTIONS... flux start....)
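
As a hypothetical example (the node count and initial program are made up), an instance launched this way would not pick up a default applied only by flux batch and flux alloc, but could benefit from a severity-aware doom plugin:

```sh
# Hypothetical: a Flux instance launched with flux run + flux start still
# needs the shell option passed explicitly under the batch/alloc-only approach
flux run -N16 -o exit-timeout=none flux start ./workflow.sh
```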

@grondo
Contributor Author

grondo commented Sep 25, 2024

> As I reread my statement above, maybe it is correct the way it is. tasks did "exit" after all. So disabling the doom plugin is probably the correct approach.

Ah, sorry, didn't see this comment until after I posted.

I'd actually like to explore your idea because it may be able to handle other methods of Flux instance launch, and also other jobs that may modify their critical-ranks set. I guess what it would mean is that tasks lost on non-critical ranks in general would not start the exit timer... does that still make sense?

@grondo
Contributor Author

grondo commented Sep 25, 2024

For reference, this is the commit that added the shell.lost handler to the doom plugin:

commit 59c8c3a
Author: Mark A. Grondona [email protected]
Date: Fri Mar 8 19:45:06 2024 +0000

shell: doom: support exit-on-error and timeout for lost shells

Problem: The doom plugin supports raising a job exception when one
or more tasks exit early or with an error, but the plugin does not
track lost shells. This may cause jobs that unexpectedly lose a job
shell to hang indefinitely.

Add a `shell.lost` callback to the doom plugin and treat this the same
as an abnormal task exit. Modify the exception note in this case to
indicate that a shell and not an individual task exited.

@garlick
Member

garlick commented Sep 25, 2024

Pondering :-)

I guess two concerns, not showstoppers at all:

  • inconsistent behavior if exit-timer is armed and a non-critical node dies (exit timer is NOT started) vs exit-timer is armed and a broker (not the shell) on a non-critical node dies (exit timer IS started).
  • the inconsistency of how exit-timer behaves on critical vs non-critical nodes, although maybe this would just need to be documented

@grondo
Contributor Author

grondo commented Sep 25, 2024

Good points 🤔

@garlick
Member

garlick commented Sep 25, 2024

> This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs?

Oh, I see we have a job-exec.critical-ranks RPC that allows the critical ranks to be reset at runtime. Should we just have the doom shell plugin fetch that set and unconditionally ignore non-critical tasks that exit? That would address my minor consistency concerns above.

Edit: er, I guess it would have to watch that value since it could be set/reset at any time.

@grondo
Contributor Author

grondo commented Sep 25, 2024

The job-exec module can already send notifications to the rank 0 shell; maybe it could send an RPC when the critical-ranks set is modified? The job shell could add a new callback so that multiple plugins can subscribe to the updates. (Though I can't think of other use cases right now, this would avoid the need to add a new service in the job-exec module, and possibly multiple watch RPCs from different shell plugins for every job.)

@garlick
Member

garlick commented Sep 25, 2024

That sounds good. It would need to get a starting value though.

@grondo
Contributor Author

grondo commented Sep 26, 2024

Unless I'm missing something, the starting value for critical-ranks is the set of all ranks. This is only updated if the job manually issues the critical-ranks RPC to the job-exec module, which would then notify the job shell. Though this seems slightly convoluted, it is nice in that the shell doesn't have to do anything and there are no extra RPCs for the common case (no critical ranks update).

@grondo
Contributor Author

grondo commented Sep 27, 2024

Hm, I thought of a counter-example here that suggests ignoring exit-timeout for non-critical ranks should be opt-in only.

If a batch script runs a series of full-size jobs, and a node goes down during one of them, the next job would get stuck in the SCHED state until timeout if the instance persisted by default. So perhaps instance persistence should actually require a flag and not be automatic 🤔 There may be other solutions I haven't thought of as well (for example, if there were a way to tell the scheduler that a down resource is unlikely to come back, or some other way to reject jobs in this case...).
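
To make the counter-example concrete, here is a rough sketch of such a batch script (the node count and step commands are made up):

```sh
#!/bin/sh
# Hypothetical batch script running full-size jobs back to back.  If a node
# is lost during step1 and the instance persists, step2 blocks in SCHED
# waiting for a node that is unlikely to come back.
flux run -N16 ./step1   # node failure here raises an exception on this job
flux run -N16 ./step2   # stuck in SCHED: only 15 nodes remain available
```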
