On 2024-07-23 (02:21, agent 2) and 2024-07-24 (02:34, agent 2), we observed that Nomad did not (successfully) reschedule runners. On both days, this behavior was triggered by an unattended upgrade of docker-ce.
In the syslogs, we see:
Docker starting to restart
Nomad starting to restart gracefully
Docker warning about ShouldRestart failed, container will not be restarted
Docker ignoring event topic=/tasks/delete
Containerd warning about runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n
Systemd remarking Found left-over process 1680662 (nomad) in control group while starting unit. Ignoring. This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Nomad throwing many times error reading from server: EOF
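To check whether a later incident matches this one, the messages listed above can be counted in a syslog file. The following is only a minimal sketch: the function name, the category labels, and the idea of matching on message fragments are assumptions made for illustration, and the regular expressions are derived solely from the excerpts quoted in this issue, so they may need adjusting to the actual log format.

```python
import re
from collections import Counter

# Message fragments taken from the syslog excerpts quoted in this issue.
# The category names on the left are hypothetical labels, not Nomad/Docker terms.
INCIDENT_PATTERNS = {
    "docker_should_restart_failed": re.compile(
        r"ShouldRestart failed, container will not be restarted"
    ),
    # Allow other fields between the message and the topic.
    "docker_ignored_task_delete": re.compile(r"ignoring event.*topic=/tasks/delete"),
    "runc_unclean_exit": re.compile(
        r"runc did not terminate successfully: exit status 255"
    ),
    "systemd_leftover_nomad": re.compile(
        r"Found left-over process \d+ \(nomad\) in control group"
    ),
    "nomad_server_eof": re.compile(r"error reading from server: EOF"),
}


def count_incident_markers(lines):
    """Count how many lines match each known incident message."""
    counts = Counter()
    for line in lines:
        for name, pattern in INCIDENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Example usage (the log path is an assumption and differs per system):
#
#   with open("/var/log/syslog", errors="replace") as log:
#       print(count_incident_markers(log))
```

A high count for the EOF message together with a left-over process warning would suggest the same pattern as on the two days described here, but the sketch only counts lines and does not establish any causal order.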
Since we have two dedicated issues for #673 and #587, this issue is only about the sequential restart of Nomad agents together with the rescheduling behavior. In our past experience, restarting Nomad sequentially is more fault-tolerant than a simultaneous restart and produces fewer errors. That's why we also included a rolling restart of Nomad in our Ansible pipeline.
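For illustration, the idea of a sequential (rolling) restart can be sketched as follows. This is not the Ansible pipeline mentioned above, which is not shown in this issue; it is a generic Python sketch in which `restart` and `is_healthy` are hypothetical callables supplied by the caller (for example wrappers around a service restart and a health check), and the timeout values are arbitrary assumptions.

```python
import time


def rolling_restart(agents, restart, is_healthy, timeout=120.0,
                    poll_interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Restart agents one at a time, waiting for each to become healthy.

    `restart(agent)` and `is_healthy(agent)` are hypothetical callables
    provided by the caller. `sleep` and `clock` are injectable for testing.
    """
    restarted = []
    for agent in agents:
        restart(agent)
        deadline = clock() + timeout
        # Do not touch the next agent until this one reports healthy.
        while not is_healthy(agent):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{agent} not healthy after {timeout}s; "
                    f"already restarted: {restarted}"
                )
            sleep(poll_interval)
        restarted.append(agent)
    return restarted
```

The point of the loop is that at most one agent is down at any time, and a failed restart stops the rollout before the remaining agents are affected.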
Since the upstream issue created for #673 is not really about simultaneous restarts (but rather about restarting Nomad in general with the batch jobs we use), this issue currently does not provide many additional insights. To keep better visibility of pending issues, and since we expect that #673 will improve the situation anyway, we are closing this one.