Nomad Rescheduling #639

mpass99 · 2024-08-07T09:56:33Z

On 2024-07-23 (02:21, agent 2) and 2024-07-24 (02:34, agent 2), we observed that Nomad did not (successfully) reschedule runners. On both days, this behavior was triggered by an unattended upgrade of docker-ce.

In the syslogs, we see:

- Docker starting to restart
- Nomad starting to restart gracefully
- Docker warning about ShouldRestart failed, container will not be restarted
- Docker ignoring event topic=/tasks/delete
- Containerd warning about runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n
- Systemd remarking Found left-over process 1680662 (nomad) in control group while starting unit. Ignoring. This usually indicates unclean termination of a previous run, or service implementation deficiencies.
- Nomad throwing many times error reading from server: EOF
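
To check whether other agents show the same sequence, the messages above could be counted in a syslog excerpt. This is only a minimal sketch: the marker substrings are copied from the observations listed here, while the function name `count_markers` and the marker keys are hypothetical and not part of any existing tooling.

```python
# Substrings copied from the syslog observations above.
# The dictionary keys are hypothetical labels chosen for this sketch.
MARKERS = {
    "docker_no_restart": "ShouldRestart failed, container will not be restarted",
    "docker_ignored_delete": "topic=/tasks/delete",
    "runc_exit_255": "runc did not terminate successfully: exit status 255",
    "systemd_leftover_process": "Found left-over process",
    "nomad_eof": "error reading from server: EOF",
}


def count_markers(lines):
    """Count, per marker, how many of the given log lines contain it."""
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    return counts
```

Passing the lines of a syslog file covering the upgrade window to `count_markers` would show at a glance whether an agent hit the same combination of Docker, containerd, systemd, and Nomad messages.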


mpass99 · 2024-09-12T09:06:04Z

In #612, we are currently investigating whether batch jobs restart/reschedule at all.
I suggest that we:

- Wait for Nomad Agent Simultaneous Restart Behavior #673
- Test for lost runners using our (sequential) Ansible deployments
  - Maybe inject a Nomad or Docker service restart
- Keep an eye on the Prewarming Pool Alert #587 and on reduced numbers of idle runners after deployments and unattended upgrades
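
The lost-runner test could be reduced to comparing the set of runner IDs before and after a deployment or an injected service restart. A minimal sketch, assuming the IDs have already been collected by some other means (for example from Nomad's job list); the function name `lost_runners` and the IDs in the usage note are hypothetical.

```python
def lost_runners(before, after):
    """Return the runner IDs that existed before the restart but are missing afterwards.

    Both arguments are iterables of runner IDs; the result is a sorted list.
    Runners that only appear in `after` (newly created ones) are ignored.
    """
    return sorted(set(before) - set(after))
```

For example, `lost_runners(["r1", "r2", "r3"], ["r1", "r3"])` reports `r2` as lost. An empty result after a rolling restart would suggest that all runners were kept or rescheduled.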

MrSerth · 2024-09-25T13:28:09Z

Since we have two dedicated issues (#673 and #587), this issue is only about the sequential restart of Nomad agents together with the rescheduling behavior. Restarting Nomad sequentially is more fault-tolerant than a simultaneous restart and, in our past experience, produces fewer errors. That's why we also included a rolling restart of Nomad in our Ansible pipeline.

Since the upstream issue created for #673 is not really about simultaneous restarts (but rather about restarting Nomad in general with the batch jobs we use), this issue currently does not provide many additional insights. To keep better visibility of pending issues, and since we expect that #673 will improve the situation anyway, we are closing this one.

mpass99 added the bug (Something isn't working) label Aug 7, 2024

MrSerth closed this as completed Sep 25, 2024
