set exit-timeout=none for flux batch and flux alloc jobs #6304
Comments
I had to go look at the code to find out why this is true, because the primary mechanism in the doom plugin is an RPC that rank > 0 shells send to the rank 0 shell when a task exits, and if the node dies that wouldn't happen. But the doom plugin also registers a handler for the exception raised when a shell is lost. If I am reading the code correctly, this exception is only raised as a severity=0 (fatal) exception on critical nodes, but the severity is ignored in the shell exception handler. Would it make sense to pass the severity along as an argument to that handler?
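For concreteness, here is a minimal sketch of what a shell plugin callback that receives the severity might look like. This is not the actual doom plugin code: the topic string "shell.exception" and the "severity" argument name are assumptions for illustration, while `flux_plugin_add_handler()` and `flux_plugin_arg_unpack()` are existing flux-core plugin calls.

```c
/* Hypothetical sketch only; not the actual doom plugin.  The topic
 * string "shell.exception" and the "severity" argument are invented
 * for illustration.
 */
#include <flux/core.h>
#include <flux/shell.h>

static int exception_cb (flux_plugin_t *p,
                         const char *topic,
                         flux_plugin_arg_t *args,
                         void *arg)
{
    int severity = -1;

    /* Unpack the (hypothetical) severity passed along with the exception */
    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:i}",
                                "severity", &severity) < 0)
        return -1;

    /* Only treat a severity=0 (fatal) exception like a task exit for
     * the purposes of the exit timer; ignore non-fatal exceptions.
     */
    if (severity != 0)
        return 0;

    /* ... start or arm the exit timer here ... */
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "shell.exception", exception_cb, NULL);
}
```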
As I reread my statement above, maybe it is correct the way it is. Tasks did "exit" after all. So disabling the doom plugin is probably the correct approach.
Yes, otherwise normal jobs that lose a shell for some reason (unexpected shell exit for example) would not be able to take advantage of the exit timer.
This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs? (For instance ...)
Ah, sorry, I didn't see this comment until after I posted. I'd actually like to explore your idea because it may be able to handle other methods of Flux instance launch, and also other jobs that may modify their critical-ranks set. I guess what it would mean is that tasks lost on non-critical ranks in general would not start the exit timer... does that still make sense?
For reference, this is the commit that added that behavior: 59c8c3a
Pondering :-) I guess two concerns, not show stoppers at all: ...
Good points 🤔
Oh, I see we have a ... Edit: er, I guess it would have to watch that value since it could be set/reset at any time.
The job-exec module can already send notifications to the rank 0 shell; maybe it could send an RPC when the critical-ranks set is modified? The job shell could add a new callback so multiple plugins can subscribe to the updates. (I can't think of other use cases right now, but this would avoid the need to add a new service to the job-exec module, and possibly multiple watch RPCs from different plugins in the shell for every job.)
That sounds good. It would need to get a starting value though.
Unless I'm missing something, the starting value for critical-ranks is the set of all ranks. This is only updated if the job manually issues the critical-ranks RPC to the job-exec module, which would then notify the job shell. Though this seems slightly convoluted, it is nice in that the shell doesn't have to do anything and there are no extra RPCs for the common case (no critical-ranks update).
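As a sketch of the job-side half of that flow (a job updating its critical-ranks set), something like the following could issue the update. The topic string "job-exec.critical-ranks" and the "ranks" payload key are guesses based on the discussion above rather than a confirmed interface, and in practice the RPC would go to the enclosing instance's job-exec module rather than the job's own handle.

```c
/* Minimal sketch under the assumptions stated above; the topic string
 * and payload key are not a confirmed interface.
 */
#include <stdio.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h;
    flux_future_t *f;

    if (!(h = flux_open (NULL, 0))) {   /* simplification: own instance */
        perror ("flux_open");
        return 1;
    }

    /* Declare that only rank 0 is critical, e.g. because the job can
     * tolerate losing its other (leaf) ranks.  job-exec would then
     * notify the rank 0 shell, which could invoke the new plugin
     * callback proposed above.
     */
    if (!(f = flux_rpc_pack (h,
                             "job-exec.critical-ranks",
                             FLUX_NODEID_ANY,
                             0,
                             "{s:s}", "ranks", "0"))
        || flux_rpc_get (f, NULL) < 0) {
        fprintf (stderr, "critical-ranks update failed\n");
        flux_future_destroy (f);
        flux_close (h);
        return 1;
    }
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}
```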
Hm, I thought of a counter-example here that suggests ignoring exit-timeout for non-critical ranks should be opt-in only. If a batch script runs a series of full-size jobs, and a node goes down during one of them, the next job would get stuck in the SCHED state until timeout if the instance persisted by default. So, perhaps instance persistence should actually require a flag and not be automatic 🤔 There may be other solutions I haven't thought of as well (for example, if there was a way to tell the scheduler that a down resource was unlikely to come back, or some other way to reject jobs in this case...)
Flux instances as jobs have a certain level of resilience -- they can lose compute nodes that are leaves in the TBON and will not be terminated. The idea here is that the node failure within the batch/alloc job will terminate any job using that node. If the instance was running just one full-size job, then termination of that job will cause the batch script to exit and the instance will terminate. If the instance is running SCR or has many small jobs, though, it can continue to get work done.
However, it seems like `exit-timeout=30s` is getting in the way of this. When a node is lost, the job shell is lost too, so the exit timer is started. In the first case, 30s may not be enough time for the parallel job within the instance to terminate, the batch script to finish, and the instance to exit normally, so users see a "doom: first task exited 30s ago" message that is pretty confusing after a node failure. In the second case, users of SCR, or users who simply want their job to continue, have to remember to set `-o exit-timeout=none` to get the benefits.

Perhaps it would be best to just set `exit-timeout=none` in `flux batch` and `flux alloc`.
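In the meantime, users who want their batch/alloc instance to survive node loss can pass the shell option explicitly, for example `flux batch -N4 -o exit-timeout=none batch.sh` or `flux alloc -N4 -o exit-timeout=none`; the node counts here are just placeholders for illustration, and `-o exit-timeout=none` is the relevant part.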