Nomad Agents DoS on Migration #676

mpass99 · 2024-09-04T13:27:21Z

In #612, we noticed that the Nomad agents get caught in a crash loop when too many allocations are migrated at once (in response to a restart).

We find OOM kill errors (telegraf, systemd) and Nomad errors.

mpass99 · 2024-09-05T12:28:34Z

- With telegraf running and Nomad stopped, the server idles.
- Nomad starts successfully (as long as the server does not try to schedule many allocations on that agent).
- With up to 60 runners, the agent works fine.
- When requesting 80 runners, the agent crashes.
  - The server displays it as down.
  - CPU usage is at 100% and memory usage is at 96%.
  - Telegraf gets OOM killed, which changes almost nothing about the CPU and memory usage.
  - Stopping the Nomad server frees the CPU; memory usage remains at 65%.
- Disabling telegraf.
- Restarting the node.
- When starting Nomad, the agent still crashes.
  - The server no longer tries to place many allocations on the node (only 4).
  - It appears that some locally stored data makes the agent create many containers (~100; both CNI and runners).
  - Everything gets OOM killed.
- It normalizes after restarting Docker and Nomad.
- Therefore, telegraf does not seem to have a major impact on the observed behavior.

mpass99 added the bug label Sep 4, 2024

mpass99 mentioned this issue Sep 5, 2024

Fix OOM Killing Nomad Agent #681

Merged

mpass99 closed this as completed in #681 Sep 5, 2024

MrSerth mentioned this issue Sep 25, 2024

Handle permanently dead Nomad jobs #612

Closed
