Handle permanently dead Nomad jobs #612

Closed
mpass99 opened this issue Jun 12, 2024 · 3 comments

Labels
bug Something isn't working

Comments


mpass99 commented Jun 12, 2024

Related to #587

In a recent deployment, we have observed that some (but not all) runners are lost when all Nomad agents restart.

Within this issue, we should identify the Nomad event that notifies Poseidon that a job is lost and will neither be restarted nor rescheduled, and handle it by requesting a new runner. [Jobs] [Allocations].

This should be fixed together with #602
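
A minimal sketch of how such an event watcher could look, based on the official Nomad Go API (github.com/hashicorp/nomad/api); treating the "dead" job status as "permanently gone" and the requestNewRunner callback are assumptions for illustration, not Poseidon's actual implementation:

```go
package recovery

import (
	"context"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

// watchDeadJobs subscribes to Job events and calls requestNewRunner whenever a
// job reaches the "dead" status, i.e. it will neither be restarted nor rescheduled.
func watchDeadJobs(ctx context.Context, client *nomadApi.Client, requestNewRunner func(jobID string)) error {
	topics := map[nomadApi.Topic][]string{nomadApi.TopicJob: {"*"}}
	eventCh, err := client.EventStream().Stream(ctx, topics, 0, &nomadApi.QueryOptions{})
	if err != nil {
		return err
	}
	for events := range eventCh {
		if events.Err != nil {
			return events.Err
		}
		for _, event := range events.Events {
			job, err := event.Job()
			if err != nil || job == nil || job.Status == nil || job.ID == nil {
				continue
			}
			if *job.Status == "dead" {
				log.Printf("job %s is dead, requesting a replacement runner", *job.ID)
				requestNewRunner(*job.ID)
			}
		}
	}
	return nil
}
```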


mpass99 commented Sep 4, 2024

Let's consider multiple scenarios and analyse Nomad's events for these.

Startup and Environment Creation

nomadEventDump-Startup-No-Environment-Job.txt
The most basic case is a Poseidon startup without an existing environment in Nomad.
We observe that we do not handle events on Poseidon startup. If runner jobs exist without a respective environment job, these runner jobs are ignored.
When creating an environment, we receive Nomad events corresponding to the template job and the runner jobs (in this case just one).

Learnings

  • Existing runner jobs without an environment job are not recognised until a recovery run after the environment has been recreated (see the sketch below).
  • The Job event JobRegistered is sent before the allocation/task has started completely.
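
A rough sketch of a startup recovery pass that would flag such orphaned runner jobs. It uses the Nomad Go API's job listing and assumes a hypothetical naming scheme (template jobs prefixed with "template-", runner jobs prefixed with their environment ID); the helper functions are illustrative only:

```go
package recovery

import (
	"strings"

	nomadApi "github.com/hashicorp/nomad/api"
)

// Assumed naming scheme (illustrative only): environment/template jobs are named
// "template-<environmentID>", runner jobs "<environmentID>-<runnerID>".
func isEnvironmentJob(jobID string) bool { return strings.HasPrefix(jobID, "template-") }

func environmentIDOf(jobID string) string {
	if isEnvironmentJob(jobID) {
		return strings.TrimPrefix(jobID, "template-")
	}
	return strings.SplitN(jobID, "-", 2)[0]
}

// findOrphanedRunners lists all Nomad jobs and returns the runner jobs whose
// environment (template) job no longer exists. Such jobs would otherwise be
// ignored until the environment is recreated and a recovery runs.
func findOrphanedRunners(client *nomadApi.Client) ([]string, error) {
	jobs, _, err := client.Jobs().List(nil)
	if err != nil {
		return nil, err
	}
	environments := map[string]bool{}
	for _, job := range jobs {
		if isEnvironmentJob(job.ID) {
			environments[environmentIDOf(job.ID)] = true
		}
	}
	var orphaned []string
	for _, job := range jobs {
		if !isEnvironmentJob(job.ID) && !environments[environmentIDOf(job.ID)] {
			orphaned = append(orphaned, job.ID)
		}
	}
	return orphaned, nil
}
```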

nomadEventDump-Startup-Complete-Environment-Job.txt
Next, we can observe a special state of environment jobs: the Nomad job is complete but still exists.
On startup recovery, such an environment is ignored with a special log message.
When creating the environment again, the environment job remains in that complete state. Only by deleting the environment can we escape that state and create a running environment.
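
To narrow down what this state actually is, one could query the job directly; a minimal sketch using the Nomad Go API, where interpreting Status together with Stop is our own heuristic rather than an official definition:

```go
package recovery

import nomadApi "github.com/hashicorp/nomad/api"

// describeJobState fetches a job and reports whether it is in the observed
// "complete but still existing" state: Nomad marks it dead, it was not
// explicitly stopped, and it has not yet been garbage-collected.
func describeJobState(client *nomadApi.Client, jobID string) (string, error) {
	job, _, err := client.Jobs().Info(jobID, nil)
	if err != nil {
		// A 404 here would mean the job has already been garbage-collected.
		return "", err
	}
	status := ""
	if job.Status != nil {
		status = *job.Status
	}
	stopped := job.Stop != nil && *job.Stop
	if status == "dead" && !stopped {
		return "complete (dead) but still existing", nil
	}
	return status, nil
}
```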

Questions

  • What is this state of a complete but still existing Nomad job?
  • Why do we only see environment jobs but not runner jobs in this state?

nomadEventDump-Create-Environment.txt
Here, we have an environment creation with five started runners.

Simultaneous Restart

nomadEventDump-RestartTogether-withoutAlert.txt
When restarting all Nomad agents while one environment and five runners are running, we observe that the environment job goes into the previously described complete state, two runner jobs survive the restart, and three are removed completely.

Questions

  • Are we handling such lost runner jobs correctly?

Learnings

  • Job events do not reliably notify us about stopped jobs (on failed migrations).
  • Allocation events do not clearly state that a runner has stopped completely. All we receive is one event stating that the allocation is about to stop (see the sketch below).
  • Do not restart all agents at the same time.
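
A sketch of how such an Allocation event could be interpreted, assuming the Nomad Go API; treating an empty NextAllocation and FollowupEvalID as "gone for good" is our own heuristic, not documented Nomad semantics:

```go
package recovery

import nomadApi "github.com/hashicorp/nomad/api"

// classifyAllocationEvent inspects an Allocation event and guesses whether the
// allocation is merely about to stop, will be replaced, or is gone for good.
func classifyAllocationEvent(event nomadApi.Event) (string, error) {
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return "", err
	}
	switch {
	case alloc.NextAllocation != "" || alloc.FollowupEvalID != "":
		// A replacement allocation or a follow-up evaluation is planned.
		return "stopping, replacement planned", nil
	case alloc.DesiredStatus == nomadApi.AllocDesiredStatusStop:
		// The single "about to stop" event without any planned replacement.
		return "stopped permanently", nil
	default:
		return alloc.ClientStatus, nil
	}
}
```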

nomadEventDump-RestartTogether-PrewarmigPoolAlert.txt
This scenario shows that our Prewarming Pool Alert would fix cases with lost runners and recreate them.

Sequential Restart

nomadEventDump-RestartAfterEachOther-Success1-5.txt
nomadEventDump-RestartAfterEachOther-Success2-5.txt
We have two examples with one environment job and five runner jobs where all jobs are rescheduled correctly.

Learnings

  • We only receive Allocation events for reschedulings, no Job events (see the sketch below).
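
A short sketch of how a rescheduling could be recognised from Allocation events alone, based on the Nomad Go API; relying on PreviousAllocation and RescheduleTracker is our assumption about how replacement allocations are linked:

```go
package recovery

import nomadApi "github.com/hashicorp/nomad/api"

// isRescheduledAllocation reports whether an Allocation event describes a
// replacement allocation created by a rescheduling rather than a fresh job.
func isRescheduledAllocation(event nomadApi.Event) (bool, error) {
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return false, err
	}
	// PreviousAllocation links the replacement to the allocation it replaces;
	// RescheduleTracker records earlier reschedule attempts.
	rescheduled := alloc.PreviousAllocation != "" ||
		(alloc.RescheduleTracker != nil && len(alloc.RescheduleTracker.Events) > 0)
	return rescheduled, nil
}
```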

nomadEventDump-RestartAfterEachOther-Success3-100.txt
With 100 runner jobs, the rescheduling also succeeds, but we receive the warning Failed to allocate directory watch: Too many open files in the SSH connection of Agent 3.

With 200 runners, we can provoke a severe failure behavior.
We receive the Failed to allocate directory watch: Too many open files error on all agents.
In case of a restart, the rescheduling of the allocations leads to a denial-of-service-like situation lasting more than 25 minutes: an agent starts and becomes ready, is then flooded with allocations, and crashes again.

Questions

  • How severe is the Failed to allocate directory watch: Too many open files error?
  • In the syslogs, we find many systemd warnings: nomad.service: Unit process 3620 (nomad) remains running after unit stopped.
    • How severe are they?
    • After stopping, two further warnings appear on the next start:
      • nomad.service: Found left-over process 3620 (nomad) in control group while starting unit. Ignoring.
      • nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
    • Why is the Nomad service restarted directly (after 6 s) instead of waiting for the graceful shutdown period?
  • We find many Nomad errors: [ERROR] client: error restoring alloc: error="failed to decode alloc: key not found: alloc".
  • After 2 min 46 s, the kernel invokes the OOM killer: containerd invoked oom-killer, Out of memory: Killed process 8758 (systemd), telegraf invoked oom-killer, Killed process 725 (telegraf).
  • While the Nomad systemd service is running, Nomad is not marked as ready and no longer responds to CLI requests.


mpass99 commented Sep 4, 2024

I separated the many questions identified in this issue into their own issues.

Why do we only see environment jobs but not runner jobs in this state?

In later examples, we also recognised runner jobs in this state.


MrSerth commented Sep 25, 2024

We created dedicated sub issues for the different aspects identified in this ticket. Those are tracked as #673, #674, #675, #676, #677.

Until those issues have been completed, we expect that the same "erroneous" behavior initially reported for this issue remains. The most promising candidate to resolve the root cause is likely #673.

Since all work has been split into dedicated issues, we are closing this one for better visibility.

MrSerth closed this as completed Sep 25, 2024