Nomad Rescheduling #639

Closed
mpass99 opened this issue Aug 7, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@mpass99
Contributor

mpass99 commented Aug 7, 2024

On 2024-07-23 (02:21, agent 2) and 2024-07-24 (02:34, agent 2), we observed that Nomad did not (successfully) reschedule runners. On both days, this behavior was triggered by an unattended upgrade of docker-ce.

In the syslogs, we see:

  • Docker starting to restart
  • Nomad starting a graceful restart
  • Docker warning "ShouldRestart failed, container will not be restarted"
  • Docker ignoring an event with topic=/tasks/delete
  • Containerd warning "runc did not terminate successfully: exit status 255" (runtime=io.containerd.runc.v2)
  • Systemd remarking "Found left-over process 1680662 (nomad) in control group while starting unit. Ignoring. This usually indicates unclean termination of a previous run, or service implementation deficiencies."
  • Nomad repeatedly logging "error reading from server: EOF"
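
To check whether other agents or dates show the same symptoms, the messages listed above could be searched for in the syslogs. The following is only a rough sketch: the substrings are taken from the observations above, while the labels and the `classify`/`summarize` helpers are hypothetical names chosen for illustration. The exact log format on the hosts may differ.

```python
import re
from collections import Counter

# Patterns built from the messages quoted above.
# The label names are made up for this sketch.
SYMPTOMS = {
    "docker_should_restart_failed": re.compile(
        re.escape("ShouldRestart failed, container will not be restarted")
    ),
    "docker_ignored_task_delete": re.compile(re.escape("topic=/tasks/delete")),
    "containerd_runc_failed": re.compile(
        re.escape("runc did not terminate successfully")
    ),
    # The PID varies between runs, so it is matched as any number.
    "systemd_leftover_nomad": re.compile(
        r"Found left-over process \d+ \(nomad\) in control group"
    ),
    "nomad_server_eof": re.compile(re.escape("error reading from server: EOF")),
}


def classify(line):
    """Return the labels of all symptoms that occur in a single log line."""
    return [label for label, pattern in SYMPTOMS.items() if pattern.search(line)]


def summarize(lines):
    """Count how often each symptom occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts
```

`summarize` accepts any iterable of strings, so an open syslog file can be passed to it directly.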
@mpass99 mpass99 added the bug Something isn't working label Aug 7, 2024
@mpass99
Contributor Author

mpass99 commented Sep 12, 2024

In #612, we are currently investigating whether batch jobs restart/reschedule at all.
I suggest to

@MrSerth
Member

MrSerth commented Sep 25, 2024

Since we have two dedicated issues (#673 and #587), this issue is only about the sequential restart of Nomad agents together with the rescheduling behavior. In our past experience, restarting Nomad sequentially is more fault tolerant than a simultaneous restart and shows fewer errors. That's why we also included a rolling restart of Nomad in our Ansible pipeline.
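
The idea of a sequential (rolling) restart can be sketched as follows. This is an illustration only and not the Ansible pipeline itself: `rolling_restart`, `restart`, and `is_healthy` are hypothetical names, and how an agent is actually restarted or checked for health is left to the caller.

```python
import time


def rolling_restart(agents, restart, is_healthy,
                    timeout_s=120.0, poll_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Restart agents one at a time, waiting for each to become healthy.

    `restart` and `is_healthy` are caller-supplied callables (assumptions of
    this sketch) that take an agent name. The next agent is only touched once
    the previous one reports healthy, so at most one agent is down at a time.
    """
    restarted = []
    for agent in agents:
        restart(agent)
        deadline = clock() + timeout_s
        while not is_healthy(agent):
            if clock() >= deadline:
                # Stop here so the remaining agents keep running untouched.
                raise TimeoutError(
                    f"{agent} did not become healthy within {timeout_s}s; "
                    f"already restarted: {restarted}"
                )
            sleep(poll_s)
        restarted.append(agent)
    return restarted
```

Aborting on the first unhealthy agent is a deliberate choice in this sketch: the agents that have not been restarted yet stay available while the failure is investigated.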

Since the upstream issue created for #673 is not really about simultaneous restarts (but rather about restarting Nomad in general with the batch jobs we use), this issue currently does not provide many additional insights. To keep better visibility of pending issues, and since we expect that #673 will improve the situation anyway, we are closing this one.

@MrSerth MrSerth closed this as completed Sep 25, 2024