
tasks fail to recover on host restart until connection to servers restored #24236

Open
Alexsandr-Random opened this issue Oct 17, 2024 · 7 comments


Nomad version

Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e

Operating system and Environment details

OS: Ubuntu 22.04.5 LTS
Kernel: Linux a784.tso.net.ua 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 2 16:16:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Issue

Background:
Because of power outages and an unstable internet connection, we decided to set heartbeat_grace = 12h in the server configuration. This allowed us to ride out unstable connectivity during power outages and shelling.
With the default heartbeat_grace = 60s and a poor connection (mobile 3G), jobs regularly went into the "lost" status, after which the servers recreated them, and we ended up in an endless loop of job restarts on the client node whenever the connection flapped.

Setting heartbeat_grace = 12h solved that problem: with an unstable connection there is no constant restarting of jobs and status churn.
However, it gave rise to a strange problem.
When the client is restarted (it doesn't matter how, manually or by a power cut), the CNI plugins are not invoked: the nomad interface is not created and the iptables rules and chains that should be created by the CNI plugins are missing, while the Docker iptables rules are created, as are the containers. The problem is at the boundary between Nomad and the CNI plugins.
It can only be fixed with a node drain/undrain.
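As a rough sketch of that workaround (the -self and -yes flags are assumptions about running the commands directly on the affected client with sufficient permissions):

    # Drain the rebooted client so its allocations are stopped and rescheduled elsewhere...
    nomad node drain -enable -yes -self
    # ...then lift the drain so fresh allocations (with their CNI hooks re-run) can be placed
    nomad node drain -disable -yes -self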

It turns out that the agent does not track its own state and relies completely on tracking by the servers, which is exactly what we changed in order to solve the problem mentioned above.

Reproduction steps

Set heartbeat_grace = 12h in all server configs, like this:

server {
  heartbeat_grace = "12h"
}

Then reboot one of the client nodes that runs jobs which require CNI plugins and use Docker.
After the reboot, log in and check (see the sketch below):

  1. Does an interface named "nomad" exist on the agent node (ip a)?
  2. Were the CNI rules in iptables created after the reboot, once the agent started the jobs?
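
A rough sketch of how those two checks could be run on the rebooted client (the CNI- chain prefix is an assumption based on the reference CNI plugins; adjust to your setup):

    # Check 1: does the "nomad" bridge interface exist?
    ip addr show nomad
    # Check 2: were any CNI-managed iptables chains and rules created?
    sudo iptables -S | grep -i 'CNI'
    sudo iptables -t nat -S | grep -i 'CNI'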

Expected Result

Better integration with CNI plugins: the CNI iptables rules and chains should be created when jobs are started after an agent restart, just as the Docker ones are. I expect the CNI rules to be created like the Docker ones whenever a job requires CNI to run.

Actual Result

CNI plugins are not invoked: the nomad interface is not created, and the iptables rules and chains that should be created by the CNI plugins are missing.

Nomad Client logs (if appropriate)

Here are the logs of the Nomad agent with heartbeat_grace = 12h set:

Oct 17 11:23:04 client systemd[1]: Stopping Nomad...
Oct 17 11:23:04 client nomad[1947]: ==> Caught signal: interrupt
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.573+0300 [INFO]  agent: requesting shutdown
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.574+0300 [INFO]  client: shutting down
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.589+0300 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.589+0300 [INFO]  client.plugin: plugin manager finished: plugin-type=device
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.589+0300 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.603+0300 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.603+0300 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.603+0300 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Oct 17 11:23:04 client consul[1073]: 2024-10-17T11:23:04.715+0300 [INFO]  agent: Synced service: service=_nomad-task-f8ebc7da-cf91-f55d-a6fd-3cb0437d40e4-group-traefik-traefik-ingress-http
Oct 17 11:23:04 client consul[1073]: agent: Synced service: service=_nomad-task-f8ebc7da-cf91-f55d-a6fd-3cb0437d40e4-group-traefik-traefik-ingress-http
Oct 17 11:23:04 client consul[1073]: 2024-10-17T11:23:04.882+0300 [INFO]  agent: Deregistered service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:04 client consul[1073]: agent: Deregistered service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.883+0300 [INFO]  agent: shutdown complete
Oct 17 11:23:04 client systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Oct 17 11:23:04 client systemd[1]: nomad.service: Failed with result 'exit-code'.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2274 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2275 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2285 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2293 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2307 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2308 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2309 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2310 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2702 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2841 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2918 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3148 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3519 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3622 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3695 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3768 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: Stopped Nomad.
Oct 17 11:23:04 client systemd[1]: nomad.service: Consumed 15.103s CPU time.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2274 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2275 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2285 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2293 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2307 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2308 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2309 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2310 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2702 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2841 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2918 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3148 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3519 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3622 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3695 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3768 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: Starting Nomad...
Oct 17 11:23:04 client nomad[5669]: ==> Loaded configuration from /usr/local/etc/nomad.d/base.hcl, /usr/local/etc/nomad.d/client.hcl, /usr/local/etc/nomad.d/mtls.hcl
Oct 17 11:23:04 client nomad[5669]: ==> Starting Nomad agent...
Oct 17 11:23:12 client consul[1073]: 2024-10-17T11:23:12.971+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:12 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:12 client consul[1073]: 2024-10-17T11:23:12.974+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:12 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:14 client consul[1073]: 2024-10-17T11:23:14.361+0300 [WARN]  agent: Check is now critical: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:14 client consul[1073]: agent: Check is now critical: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:14 client consul[1073]: 2024-10-17T11:23:14.456+0300 [INFO]  agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:14 client consul[1073]: agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.412+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.419+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.429+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.435+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.445+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.456+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.463+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.472+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: ==> Nomad agent configuration:
Oct 17 11:23:15 client nomad[5669]:        Advertise Addrs: HTTP: 10.100.20.37:4646
Oct 17 11:23:15 client nomad[5669]:             Bind Addrs: HTTP: [0.0.0.0:4646]
Oct 17 11:23:15 client nomad[5669]:                 Client: true
Oct 17 11:23:15 client nomad[5669]:              Log Level: INFO
Oct 17 11:23:15 client nomad[5669]:                 Region: test (DC: test)
Oct 17 11:23:15 client nomad[5669]:                 Server: false
Oct 17 11:23:15 client nomad[5669]:                Version: 1.8.4
Oct 17 11:23:15 client nomad[5669]: ==> Nomad agent started! Log data will stream in below:
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.935+0300 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/local/nomad/plugins
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  client: using state directory: state_dir=/var/local/nomad/client
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  client: using alloc directory: alloc_dir=/var/local/nomad/alloc
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.942+0300 [INFO]  client.fingerprint_mgr.consul: consul agent is available: cluster=default
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.948+0300 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:05.386+0300 [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.389+0300 [INFO]  client.proclib.cg2: initializing nomad cgroups: cores=0-11
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.390+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.390+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.390+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=device
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.402+0300 [WARN]  client.server_mgr: no servers available
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.405+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=53ccf457-ddce-2d3b-3f56-48b5a384f6cf task=node-red type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.405+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=53ccf457-ddce-2d3b-3f56-48b5a384f6cf task=promtail type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.422+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=594865ac-39a6-8b24-013a-10266c3eed7c task=node-web type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.422+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=594865ac-39a6-8b24-013a-10266c3eed7c task=promtail type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.438+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=6cd39838-d4d4-0e1a-ff3d-f7cc5f9887d5 task=nginx-helper type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.449+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b4d43287-27a2-218a-864c-b21efdd8752a task=redis type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.449+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b4d43287-27a2-218a-864c-b21efdd8752a task=redis-commander type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.465+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8ebc7da-cf91-f55d-a6fd-3cb0437d40e4 task=traefik type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.474+0300 [INFO]  client: started client: node_id=9716bd4b-95b6-53fb-645f-030f2f0fbdea
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.474+0300 [WARN]  client.server_mgr: no servers available
Oct 17 11:23:15 client systemd[1]: Started Nomad.
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.485+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.486+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.487+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.514+0300 [WARN]  client.server_mgr: no servers available
Oct 17 11:23:15 client consul[1073]: 2024-10-17T11:23:15.550+0300 [INFO]  agent: Synced service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:15 client consul[1073]: agent: Synced service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.554+0300 [INFO]  client.consul: discovered following servers: servers=[10.100.10.11:4647, 10.100.10.12:4647, 10.100.10.13:4647]
Oct 17 11:23:15 client consul[1073]: 2024-10-17T11:23:15.616+0300 [INFO]  agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:15 client consul[1073]: agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.717+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.717+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.717+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.795+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.795+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.795+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.863+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.863+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.863+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.888+0300 [INFO]  client: node registration complete
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.924+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.924+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.924+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.016+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.017+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.017+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.078+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.078+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.078+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.170+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.170+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.170+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:22 client consul[1073]: 2024-10-17T11:23:22.972+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:22 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:22 client consul[1073]: 2024-10-17T11:23:22.975+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:22 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:23 client nomad[5669]:     2024-10-17T11:23:23.319+0300 [INFO]  client: node registration complete
Oct 17 11:23:24 client consul[1073]: 2024-10-17T11:23:24.331+0300 [INFO]  agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:24 client consul[1073]: agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:24 client consul[1073]: 2024-10-17T11:23:24.454+0300 [INFO]  agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:24 client consul[1073]: agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:32 client consul[1073]: 2024-10-17T11:23:32.974+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:32 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:32 client consul[1073]: 2024-10-17T11:23:32.975+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:32 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:42 client consul[1073]: 2024-10-17T11:23:42.974+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:42 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:42 client consul[1073]: 2024-10-17T11:23:42.976+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:42 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:45 client nomad[5669]: ==> Newer Nomad version available: 1.9.0 (currently running: 1.8.4)
Oct 17 11:23:52 client consul[1073]: 2024-10-17T11:23:52.976+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:52 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:52 client consul[1073]: 2024-10-17T11:23:52.977+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:52 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:24:02 client consul[1073]: 2024-10-17T11:24:02.977+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
....
tgross changed the title Nomad does not use CNI plugins on restart → Nomad does not use CNI plugins on host restart Oct 17, 2024
tgross (Member) commented Oct 17, 2024

@Alexsandr-Random creating the bridge and iptables chains at startup has been something we've wanted to do for a while. See #6618.

But in the case you're talking about here, the tasks are all dead when the client comes up so it needs to restart the tasks. But because the host has power-cycled, all the allocation-level hooks (like CNI plugins) will have lost local state as well. So what we need to cover this case is for the client to not just restart the tasks but restart the allocation.

The tricky thing is that we've done this intentionally -- the client has been offline so it can't know whether the server has rescheduled the workload (or maybe the user has stopped it!). So if the tasks are gone we fail the "restore" and then we wait until the client contacts the server to attempt to create a new allocation.

You may want to look at the disconnect block to get closer to the behavior you're looking for here.

Alexsandr-Random (Author) commented:

Hi!
Thanks for the quick reply!
What we changed (all servers in the cluster restarted and configs updated):

server {
  heartbeat_grace = "60s"
}

and then we set up the following policy in the job spec:

    restart {
      interval = "24h"
      delay    = "3m"
      mode     = "delay"
    }
    disconnect {
        lost_after = "12h"
        replace = false
        reconcile = "keep_original"
    }

So we are trying to have the job marked as lost only after 12h; before that, it should be set to unknown.
Then we ran some tests.

  1. Just 5-10 minutes of simulated network downtime for the host: the cluster marks the node as disconnected, the job state goes to unknown, and then it recovers properly.
  2. Simulating a reboot (the internet connection persists after the host comes up), and here we hit some problems:
    2.1 The nomad interface is created and the general CNI rules (chains) are created, but
    we can't retrieve Vault secrets from storage for the first few seconds (Nomad starts after Consul via systemd), so the task fails and is then restarted; meanwhile I could see a significant delay (~2 minutes) while Nomad tried to recover the state of the allocs. I also think that if the host reboots and the internet connection is lost, the job will not be recovered with that policy (without it, Nomad could do it).
    That's very strange because all templates with secrets have change_mode = "noop".
    Here are some logs:
Oct 18 12:44:59 a784 nomad[1771]:     2024-10-18T12:44:49.329+0300 [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.DeriveVaultToken server=10.100.10.11:4647
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.DeriveVaultToken server=10.100.10.11:4647
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.vault: error making derive token RPC: error="rpc error: EOF"
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.vault: error deriving token: name=default error="DeriveVaultToken RPC failed: rpc error: EOF" alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task_names=["node-web"]
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=node-web error="DeriveVaultToken RPC failed: rpc error: EOF" recoverable=false
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.476+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=node-web type=Killing msg="Vault: failed to derive vault token: DeriveVaultToken RPC failed: rpc error: EOF" failed=true
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.881+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:00 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.881+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:00 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.184+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.184+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.747+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.777+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:02 a784 nomad[1771]:     2024-10-18T12:45:02.811+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:02 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:02 a784 nomad[1771]:     2024-10-18T12:45:02.865+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:02 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:03 a784 nomad[1771]:     2024-10-18T12:45:03.480+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=node-web type=Template msg="Missing: vault.read(test/data/registry-auth)" failed=false
Oct 18 12:45:03 a784 nomad[1771]:     2024-10-18T12:45:03.503+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=promtail type=Template msg="Missing: vault.read(test/data/loki-auth)" failed=false

2.2 Also, with that policy, even after the alloc restarts from the failed state, the CNI rules for this job are not created, and we end up in a situation where the alloc is not listening on the port it should be (dynamic port mapping).

tgross (Member) commented Oct 18, 2024

tl;dr The root cause of the problem here is that the task's Vault token is written to the secrets dir which is a tmpfs, which gets destroyed on client reboot. Read on for more... 😀

> Simulate reboot

Can you describe this? There's a good bit of difference between a host reboot and restarting the Nomad agent, so I want to make sure that's well-covered.

But assuming that's ok, I'm pretty sure I understand why you're seeing the sequence of events you've described.

If a task dies while a client is stopped, what happens is the following:

  1. The client starts and attempts to restore the allocation (client.go#L1286-L1294).
  2. The allocrunner attempts to restore the task (alloc_runner.go#L485-L487).
  3. The taskrunner fails to restore the handle to the task (task_runner.go#L1247-L1271).
  • This should be writing an event for the task: "failed to restore task; will not run until server is contacted". Are you seeing that?
  • This does not return an error back up to ar.Restore(), so the client will then attempt to run the allocrunner.
  4. The allocrunner runs (firing the CNI hooks) and completes the "prestart" phase.
  5. The taskrunner runs, and hits the vault_hook.
  6. The vault_hook finds the Vault token in the NOMAD_SECRETS_DIR (a tmpfs) and restores that into its own state.
  7. Any template blocks that read from Vault use this token.

Where things go wrong in the reboot case is step (6). That's because the host has been rebooted, so the tmpfs has been cleared! So we need to get the Vault token again, and with the Legacy Workflow that happens by contacting the Nomad server.
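
A quick way to see this on a client is to list the tmpfs mounts under the alloc directory (path taken from the alloc_dir in the logs above; the exact layout is an assumption):

    # Each task's secrets/ directory is a small tmpfs mount, so it does not survive a reboot
    findmnt -t tmpfs | grep /var/local/nomad/alloc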

We should not be seeing subsequent requests to Vault for the template blocks, though. We might have a bug in how we handle the failure case. At vault_hook.go#L384-L392 we write a task event and kill the task:

// Check if this is a server side error
if structs.IsServerSide(err) {
    h.logger.Error("failed to derive Vault token", "error", err, "server_side", true)
    h.lifecycle.Kill(h.ctx,
        structs.NewTaskEvent(structs.TaskKilling).
            SetFailsTask().
            SetDisplayMessage(fmt.Sprintf("Vault: server failed to derive vault token: %v", err)))
    return "", true
}

But for some reason we're seeing subsequent Vault API requests from what looks like a template block. Unfortunately the logs you've provided are at INFO-level so I'm not able to dig in much further than that as to why that is.

However, as noted in the tl;dr, the real problem you're facing is actually about not being able to restore from a tmpfs! The only way to fix this would be to write the Vault token (and Workload Identity token, if using that workflow) to durable storage. This would be less secure on the client but maybe useful for folks who anticipate having clients completely disconnected for long periods of time.

Let me bring that back to the team for discussion and see if there's a way we could be comfortable with that. It probably doesn't help with template blocks though, as those still need to get data from their data sources to render (or even check if anything has changed while the client was down!)

tgross changed the title Nomad does not use CNI plugins on host restart → tasks fail to recover on host restart until connection to servers restored Oct 18, 2024
tgross (Member) commented Oct 18, 2024

One more thing: are you getting the task event "failed to restore task; will not run until server is contacted" for these failed tasks when we try to restore them?

Alexsandr-Random (Author) commented:

> Can you describe this? There's a good bit of difference between a host reboot and restarting the Nomad agent, so I want to make sure that's well-covered.

I mean a reboot of the host machine via the reboot command, which simulates a loss of power.

> The only way to fix this would be to write the Vault token (and Workload Identity token, if using that workflow) to durable storage.

Maybe that's a good option to start with. Can we do it via Nomad config files?

> One more thing: are you getting the task event "failed to restore task; will not run until server is contacted" for these failed tasks when we try to restore them?

I searched for these keywords and phrases in the logs (not only the syslog files) and did not find anything.

tgross (Member) commented Oct 21, 2024

> Maybe that's a good option to start with. Can we do it via Nomad config files?

Not currently, so that's something we need to discuss as a team: whether it can be done in a way that won't break our security model.

> I searched for these keywords and phrases in the logs (not only the syslog files) and did not find anything.

It should be in the Task Events (visible via nomad alloc status $allocid).
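
For example (with a placeholder alloc ID):

    # Look for the restore-failure event in the task's recent events
    nomad alloc status <alloc-id> | grep -i 'failed to restore'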

Alexsandr-Random (Author) commented:

@tgross
For now we are back to the original scheme, without any manipulation of heartbeat_grace and without the disconnect block, but I still have the results of the tests with the disconnect block and heartbeat_grace = 60s.
This is what the Task Events look like when viewed in the Nomad GUI:
[screenshot: Task Events table from the Nomad GUI]
