Handle permanently dead Nomad jobs #612

Closed
mpass99 opened this issue Jun 12, 2024 · 3 comments

Labels
bug Something isn't working

Comments


mpass99 commented Jun 12, 2024

Related to #587

In a recent deployment, we have observed that some (but not all) runners are lost when all Nomad agents restart.

Within this issue, we should identify the Nomad event that notifies Poseidon that a job is lost and will neither be restarted nor rescheduled, and handle it by requesting a new runner. [Jobs] [Allocations].

This should be fixed together with #602
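
A minimal sketch of how such an event watcher could look, based on the official Nomad Go API (github.com/hashicorp/nomad/api); treating the "dead" job status as "permanently gone" and the requestNewRunner callback are assumptions for illustration, not Poseidon's actual implementation:

```go
package recovery

import (
	"context"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

// watchDeadJobs subscribes to Job events and calls requestNewRunner whenever a
// job reaches the "dead" status, i.e. it will neither be restarted nor rescheduled.
func watchDeadJobs(ctx context.Context, client *nomadApi.Client, requestNewRunner func(jobID string)) error {
	topics := map[nomadApi.Topic][]string{nomadApi.TopicJob: {"*"}}
	eventCh, err := client.EventStream().Stream(ctx, topics, 0, &nomadApi.QueryOptions{})
	if err != nil {
		return err
	}
	for events := range eventCh {
		if events.Err != nil {
			return events.Err
		}
		for _, event := range events.Events {
			job, err := event.Job()
			if err != nil || job == nil || job.Status == nil || job.ID == nil {
				continue
			}
			if *job.Status == "dead" {
				log.Printf("job %s is dead, requesting a replacement runner", *job.ID)
				requestNewRunner(*job.ID)
			}
		}
	}
	return nil
}
```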


mpass99 commented Sep 4, 2024

Let's consider multiple scenarios and analyse Nomad's events for these.

Startup and Environment Creation

nomadEventDump-Startup-No-Environment-Job.txt
The most basic case is a Poseidon startup without an existing environment in Nomad.
We observe that we do not handle events on Poseidon startup. If runner jobs exist without a respective environment job, these runner jobs are ignored.
When creating an environment, we receive Nomad events corresponding to the template job and the runner jobs (in this case just one).

Learnings

  • Existing runner jobs without an environment job are not recognised until a recovery run after the environment has been recreated (see the sketch below).
  • The Job event JobRegistered is sent before the allocation/task has started completely.
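
A rough sketch of a startup recovery pass that would flag such orphaned runner jobs. It uses the Nomad Go API's job listing and assumes a hypothetical naming scheme (template jobs prefixed with "template-", runner jobs prefixed with their environment ID); the helper functions are illustrative only:

```go
package recovery

import (
	"strings"

	nomadApi "github.com/hashicorp/nomad/api"
)

// Assumed naming scheme (illustrative only): environment/template jobs are named
// "template-<environmentID>", runner jobs "<environmentID>-<runnerID>".
func isEnvironmentJob(jobID string) bool { return strings.HasPrefix(jobID, "template-") }

func environmentIDOf(jobID string) string {
	if isEnvironmentJob(jobID) {
		return strings.TrimPrefix(jobID, "template-")
	}
	return strings.SplitN(jobID, "-", 2)[0]
}

// findOrphanedRunners lists all Nomad jobs and returns the runner jobs whose
// environment (template) job no longer exists. Such jobs would otherwise be
// ignored until the environment is recreated and a recovery runs.
func findOrphanedRunners(client *nomadApi.Client) ([]string, error) {
	jobs, _, err := client.Jobs().List(nil)
	if err != nil {
		return nil, err
	}
	environments := map[string]bool{}
	for _, job := range jobs {
		if isEnvironmentJob(job.ID) {
			environments[environmentIDOf(job.ID)] = true
		}
	}
	var orphaned []string
	for _, job := range jobs {
		if !isEnvironmentJob(job.ID) && !environments[environmentIDOf(job.ID)] {
			orphaned = append(orphaned, job.ID)
		}
	}
	return orphaned, nil
}
```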

nomadEventDump-Startup-Complete-Environment-Job.txt
Next, we can observe a special state of environment jobs: the Nomad job is complete but still exists.
On startup recovery, such an environment is ignored with a special log message.
When creating the environment again, the environment job remains in that complete state. Only by deleting the environment can we escape that state and create a running environment.
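
To narrow down what this state actually is, one could query the job directly; a minimal sketch using the Nomad Go API, where interpreting Status together with Stop is our own heuristic rather than an official definition:

```go
package recovery

import nomadApi "github.com/hashicorp/nomad/api"

// describeJobState fetches a job and reports whether it is in the observed
// "complete but still existing" state: Nomad marks it dead, it was not
// explicitly stopped, and it has not yet been garbage-collected.
func describeJobState(client *nomadApi.Client, jobID string) (string, error) {
	job, _, err := client.Jobs().Info(jobID, nil)
	if err != nil {
		// A 404 here would mean the job has already been garbage-collected.
		return "", err
	}
	status := ""
	if job.Status != nil {
		status = *job.Status
	}
	stopped := job.Stop != nil && *job.Stop
	if status == "dead" && !stopped {
		return "complete (dead) but still existing", nil
	}
	return status, nil
}
```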

Questions

  • What is this state of a complete but still existing Nomad job?
  • Why do we only see environment jobs but not runner jobs in this state?

nomadEventDump-Create-Environment.txt
Here, we have an environment creation with five started runners.

Simultaneous Restart

nomadEventDump-RestartTogether-withoutAlert.txt
When restarting all Nomad agents while one environment and five runners are running, we observe that the environment job goes into the previously described complete state, two runner jobs survive the restart, and three are removed completely.

Questions

  • Are we handling such lost runner jobs correctly?

Learnings

  • Job events do not reliably notify us about stopped jobs (on failed migrations).
  • Allocation events do not clearly state that a runner has stopped completely. All we receive is one event stating that the allocation is about to stop (see the sketch below).
  • Do not restart all agents at the same time.
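
A sketch of how such an Allocation event could be interpreted, assuming the Nomad Go API; treating an empty NextAllocation and FollowupEvalID as "gone for good" is our own heuristic, not documented Nomad semantics:

```go
package recovery

import nomadApi "github.com/hashicorp/nomad/api"

// classifyAllocationEvent inspects an Allocation event and guesses whether the
// allocation is merely about to stop, will be replaced, or is gone for good.
func classifyAllocationEvent(event nomadApi.Event) (string, error) {
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return "", err
	}
	switch {
	case alloc.NextAllocation != "" || alloc.FollowupEvalID != "":
		// A replacement allocation or a follow-up evaluation is planned.
		return "stopping, replacement planned", nil
	case alloc.DesiredStatus == nomadApi.AllocDesiredStatusStop:
		// The single "about to stop" event without any planned replacement.
		return "stopped permanently", nil
	default:
		return alloc.ClientStatus, nil
	}
}
```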

nomadEventDump-RestartTogether-PrewarmigPoolAlert.txt
This scenario shows that our Prewarming Pool Alert would fix cases with lost runners and recreate them.

Sequential Restart

nomadEventDump-RestartAfterEachOther-Success1-5.txt
nomadEventDump-RestartAfterEachOther-Success2-5.txt
We have two examples with one environment job and five runner jobs where all jobs are rescheduled correctly.

Learnings

  • We only receive Allocation events for reschedulings, no Job events (see the sketch below).
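
A short sketch of how a rescheduling could be recognised from Allocation events alone, based on the Nomad Go API; relying on PreviousAllocation and RescheduleTracker is our assumption about how replacement allocations are linked:

```go
package recovery

import nomadApi "github.com/hashicorp/nomad/api"

// isRescheduledAllocation reports whether an Allocation event describes a
// replacement allocation created by a rescheduling rather than a fresh job.
func isRescheduledAllocation(event nomadApi.Event) (bool, error) {
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return false, err
	}
	// PreviousAllocation links the replacement to the allocation it replaces;
	// RescheduleTracker records earlier reschedule attempts.
	rescheduled := alloc.PreviousAllocation != "" ||
		(alloc.RescheduleTracker != nil && len(alloc.RescheduleTracker.Events) > 0)
	return rescheduled, nil
}
```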

nomadEventDump-RestartAfterEachOther-Success3-100.txt
With 100 runner jobs, the rescheduling also succeeds, but we receive the warning Failed to allocate directory watch: Too many open files in the SSH connection of Agent 3.

With 200 runners, we can provoke a severe failure behavior.
We receive the Failed to allocate directory watch: Too many open files error on all agents.
In case of a restart, the rescheduling of the allocations leads to a denial-of-service-like situation lasting more than 25 minutes: an agent starts and becomes ready, is then flooded with allocations, and crashes again.

Questions

  • How severe is the Failed to allocate directory watch: Too many open files error?
  • In the syslogs, we find many systemd warnings: nomad.service: Unit process 3620 (nomad) remains running after unit stopped.
    • How severe are they?
    • After stopping, two further warnings appear on the next start:
      • nomad.service: Found left-over process 3620 (nomad) in control group while starting unit. Ignoring.
      • nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
    • Why is the Nomad service restarted directly (after 6 s) instead of waiting for the graceful shutdown period?
  • We find many Nomad errors: [ERROR] client: error restoring alloc: error="failed to decode alloc: key not found: alloc".
  • After 2 min 46 s, the kernel invokes the OOM killer: containerd invoked oom-killer, Out of memory: Killed process 8758 (systemd), telegraf invoked oom-killer, Killed process 725 (telegraf).
  • While the Nomad systemd service is running, Nomad is not marked as ready and no longer responds to CLI requests.


mpass99 commented Sep 4, 2024

I separated the many questions identified in this issue into their own issues.

Why do we only see environment jobs but not runner jobs in this state?

In later examples, we also recognised runner jobs in this state.


MrSerth commented Sep 25, 2024

We created dedicated sub issues for the different aspects identified in this ticket. Those are tracked as #673, #674, #675, #676, #677.

Until those issues have been completed, we expect that the same "erroneous" behavior initially reported for this issue remains. The most promising candidate to resolve the root cause is likely #673.

Since all work has been split into dedicated issues, we are closing this one for better visibility.

MrSerth closed this as completed Sep 25, 2024