Sentry: Could not perform the requested updateFileSystem. #406

Closed
sentry-io bot opened this issue Jul 14, 2023 · 13 comments
Labels
bug Something isn't working

Comments

@sentry-io

sentry-io bot commented Jul 14, 2023

Sentry Issue: POSEIDON-3N (former: POSEIDON-G)

Common error details are that the task cannot be found or the task is not running.
We should investigate whether the task is not running because it was OOM-killed during the file system update, or what else led to this situation.

file copy failed: stderr output '' and stdout output 'task 170473f7d28fa6f3dc5fbf2f832c2bf0cc3b64113ff9bbd7a3dfcb1cb2eb5ebb not found: not found

communication with executor failed: nomad error during file copy: error executing command in job 29-847a8d10-213d-11ee-b98b-fa163e079f19: error executing command in allocation: task "default-task" is not running.

@MrSerth
Member

MrSerth commented Jul 14, 2023

If an instance is OOM killed, CodeOcean still tries to read the file system. This behavior is fixed with openHPI/codeocean#1766.

@mpass99
Contributor

mpass99 commented Jul 14, 2023

This does not apply to this issue as we have separate error messages for listFileSystem and updateFileSystem. This issue deals just with the updateFileSystem.

@MrSerth
Member

MrSerth commented Jul 14, 2023

Ah, right, sorry. I missed that...

@mpass99
Contributor

mpass99 commented Aug 21, 2023

As we do not have that many error events regarding this issue, let's discuss them individually:

  • Aug 15, 2023 12:59:50 PM UTC Error: communication with executor failed: [...] unexpected EOF
    • Docker, Nomad, and Poseidon got restarted due to a deployment. Therefore, the file copy connection got interrupted
  • Aug 09, 2023 1:42:19 PM UTC Error: file copy failed: stderr output '' and stdout output ''
    • A second file copy request arrived at the very moment the runner got destroyed.
    • ToDo: The log statements are not 100% clear (if the allocation got OOMKilled, why its job got registered twice, or why it has been destroyed). See Fix/#406 missing log #422
    • ToDo: Why are we removing allocations only when they are done stopping and not when they start stopping? Unrelated
  • Aug 9, 2023 1:40:04 PM UTC Error: file copy failed: stderr output '' and stdout output ''
    • The client closed the WebSocket connection. We received a (successful) delete runner request, but a file copy request had started a few ms earlier. This request failed.
    • See Merge Context #426

This was referenced Aug 21, 2023
@mpass99
Contributor

mpass99 commented Oct 6, 2023

All events within the last 14 days seem to be caused by a deployment. Therefore, I suggest tracking this issue further in the context of #465.

@MrSerth MrSerth closed this as completed Oct 6, 2023
@mpass99
Contributor

mpass99 commented May 15, 2024

Still, this issue is one of our most frequent Sentry issues. We can differentiate the events into two kinds:

  • communication with executor failed: nomad error during file copy: error executing command in job 29-[...]: error executing command in allocation:unexpected EOF [1]
    • In our last evaluation above, we declared it as deployment specific and ignored it
    • However, we might verify if a network issue caused this or if this runner got stopped in the copy process
    • Then, we might apply specific measures
  • communication with executor failed: nomad error during file copy: error executing command in job 10-[...]:no allocation found [1] [2] [3] [4] [5]
    • What happened to this runner before? Why was this request still made?

@mpass99 mpass99 reopened this May 15, 2024
@MrSerth MrSerth added the bug Something isn't working label May 29, 2024
@mpass99
Contributor

mpass99 commented Jun 13, 2024

On the 12th we had 39 updateFileSystem errors that were caused by OpenStack network/migration issues.
We can classify the errors into four root causes:

  1. error executing command in allocation: unexpected EOF
    • This error happens when Poseidon loses its connection to the Nomad Server or the Nomad Server loses its connection to the Nomad Agent.
    • However, this error might also happen when we delete the runner while copying files. We should check this.
    • In the case of the previous comment [1], the Allocation got restarted (and in response destroyed). Docker returned "Docker container exited with non-zero exit code: 2" as the reason for the restart. In this case, it might be interesting to see what the submission was, as tar's exit code 2 might be due to a failing read. However, this is the exit code of the PID 1 process, not of the tar copy connection.
    • What is our action point here? This error is caused by an interruption of the connection to the allocation, which in turn might be caused by the copy payload, a deployment, or migration/network issues.
  2. error executing command in allocation: node down
    • We expect this error to happen when Nomad still tracks the Job but has no connection to any Nomad agent.
    • We might extract this root cause into a separate Sentry issue.
  3. no allocation found
    • We expect this error to happen when a Job is permanently lost because Nomad cannot schedule it (e.g. due to a deployment restart or network issues).
    • In the case of the previous comment [1-5], the only hint we receive from Nomad is an event on the Job topic stating JobDeregistered, without any further description or notification on the Allocation topic.
    • We should identify the Nomad event that tells us about the job being lost permanently and handle it. We should fix this in the context of Handle permanently dead Nomad jobs #612
  4. error executing command in allocation: Unknown allocation
    • This seems to be just a temporary error, as the same request succeeded five seconds later:
    • time="2024-06-12T11:00:56.292654Z" level=debug code=500 duration=2.015312539s method=PATCH path=/api/v1/runners/29-33eaa850-28a3-11ef-920d-fa163efe023e/files user_agent="Faraday v2.9.0"
    • time="2024-06-12T11:01:01.543767Z" level=debug code=204 duration=161.572382ms method=PATCH path=/api/v1/runners/29-33eaa850-28a3-11ef-920d-fa163efe023e/files user_agent="Faraday v2.9.0"
    • Maybe a Nomad event got lost when Poseidon (and Nomad) crashed during a migration. See Investigate leaking allocation storage data #615

@MrSerth
Member

MrSerth commented Jun 13, 2024

Thanks for checking the various issues and grouping these. For the third (and fourth) group, I feel that the ticket linked and action identified are fine. For the first two, I am currently not sure how to continue.

  1. error executing command in allocation: unexpected EOF
    • It really seems that this issue is mostly network-related (and thus cannot be "resolved" permanently).
    • If possible, I would further extract the "delete during copy" issue, since this might be something we could avoid.
    • The submission looks normal to me.
  2. error executing command in allocation: node down
    • To me, this sounds like something we could try to handle better.

For those requests where Nomad is not reachable (or all nodes are down), we should probably make sure to return a correct error through the API, handle this one in CodeOcean gracefully, and otherwise reduce our logging. When something is permanently broken, we should (hopefully?) notice that through the other monitoring systems.

@mpass99
Contributor

mpass99 commented Jun 21, 2024

Thank you for your input. From it, I derived the need for issue #619.

From the current perspective, groups 1 and 2 occur very rarely. If considered necessary, we might introduce a lookup that adds the error description from the Nomad event stream to the error logging.

@mpass99
Contributor

mpass99 commented Aug 7, 2024

We track the main error, unexpected EOF, in a separate (Sentry) issue, #641, and handle the other error causes in separate Sentry issues. Therefore, all current causes for this high-level error are mitigated.

@mpass99 mpass99 closed this as completed Aug 7, 2024
@MrSerth
Member

MrSerth commented Aug 15, 2024

This issue just reappeared on production three times. Here's the latest occurrence, which doesn't seem related to unexpected EOF but to no allocation found.

Is there something we missed? This should not be related to the deployment I did an hour earlier, or is it?

@MrSerth MrSerth reopened this Aug 15, 2024
@mpass99
Contributor

mpass99 commented Aug 16, 2024

I opened a separate issue for it: #649.
We can't rule out the possibility that the deployments are related.

@MrSerth
Member

MrSerth commented Aug 16, 2024

Thanks for opening another issue. As mentioned there, I forgot one of the deployments in-between 🙈. Since we have a follow-up issue, I am closing this one again.

@MrSerth MrSerth closed this as completed Aug 16, 2024