Handle permanently dead Nomad jobs #612
Comments
Let's consider multiple scenarios and analyse Nomad's events for these.
Startup and Environment Creation
nomadEventDump-Startup-No-Environment-Job.txt
Learnings
nomadEventDump-Startup-Complete-Environment-Job.txt
Questions
nomadEventDump-Create-Environment.txt
Simultaneous Restart
nomadEventDump-RestartTogether-withoutAlert.txt
Questions
Learnings
nomadEventDump-RestartTogether-PrewarmigPoolAlert.txt
Sequential Restart
nomadEventDump-RestartAfterEachOther-Success1-5.txt
Learnings
nomadEventDump-RestartAfterEachOther-Success3-100.txt
With 200 runners, we can create a severe failure behavior.
Questions
I separated the many questions identified by this issue into their own issues.
In later examples, we also recognised runner jobs in this state.
We created dedicated sub-issues for the different aspects identified in this ticket. Those are tracked as #673, #674, #675, #676, #677. Until those issues have been completed, we expect the "erroneous" behavior initially reported in this issue to remain. The most promising candidate for resolving the root cause is likely #673. Since all work has been split into dedicated issues, we are closing this one for better visibility.
Related to #587
In a recent deployment, we observed that some (but not all) runners are lost when all Nomad agents restart.
Within this issue, we should identify the Nomad event that notifies Poseidon that a job is lost and will be neither restarted nor rescheduled, and handle it by requesting a new runner. [Jobs] [Allocations].
This should be fixed together with #602