Sentry: Could not perform the requested updateFileSystem. #406

sentry-io · 2023-07-14T12:47:48Z

Sentry Issue: POSEIDON-3N (former: POSEIDON-G)

Common error details are that the task cannot be found or the task is not running.
We should investigate whether the task is not running because it was OOM-killed during the file system update, or what else led to this situation.

file copy failed: stderr output '' and stdout output 'task 170473f7d28fa6f3dc5fbf2f832c2bf0cc3b64113ff9bbd7a3dfcb1cb2eb5ebb not found: not found

communication with executor failed: nomad error during file copy: error executing command in job 29-847a8d10-213d-11ee-b98b-fa163e079f19: error executing command in allocation: task "default-task" is not running.

MrSerth · 2023-07-14T16:16:06Z

If an instance is OOM killed, CodeOcean still tries to read the file system. This behavior is fixed with openHPI/codeocean#1766.

mpass99 · 2023-07-14T16:54:13Z

This does not apply to this issue as we have separate error messages for listFileSystem and updateFileSystem. This issue deals just with the updateFileSystem.

MrSerth · 2023-07-14T17:01:29Z

Ah, right, sorry. I missed that...

mpass99 · 2023-08-21T10:21:29Z

As we do not have that many error events regarding this issue, let's discuss them individually:

Aug 15, 2023 12:59:50 PM UTC Error: communication with executor failed: [...] unexpected EOF
- Docker, Nomad, and Poseidon got restarted due to a deployment. Therefore, the file copy connection got interrupted
Aug 09, 2023 1:42:19 PM UTC Error: file copy failed: stderr output '' and stdout output ''
- A second file copy request arrived at the very moment the runner got destroyed.
- ToDo: The log statements are not 100% clear (whether the allocation got OOM-killed, why its job got registered twice, or why it has been destroyed). See Fix/#406 missing log #422
- ~~ToDo: Why are we removing allocations only when they are done stopping and not when they start stopping?~~ Unrelated
Aug 9, 2023 1:40:04 PM UTC Error: file copy failed: stderr output '' and stdout output ''
- The client closed the WebSocket connection. We received a (successful) delete runner request, but a file copy request had also started a few ms earlier. That file copy request failed.
- See Merge Context #426

mpass99 · 2023-10-06T11:33:43Z

All events within the last 14 days seem to be caused by a deployment. Therefore, I suggest tracking this issue further in the context of #465.

mpass99 · 2024-05-15T10:33:15Z

Still, this issue is one of our most frequent Sentry issues. We can differentiate the events into two kinds:

communication with executor failed: nomad error during file copy: error executing command in job 29-[...]: error executing command in allocation:unexpected EOF [1]
- In our last evaluation above, we declared it deployment-specific and ignored it.
- However, we might verify whether a network issue caused this or whether the runner got stopped during the copy process.
- Then, we might apply specific measures
communication with executor failed: nomad error during file copy: error executing command in job 10-[...]:no allocation found [1] [2] [3] [4] [5]
- What happened to this runner before? Why was this request still made?

mpass99 · 2024-06-13T13:14:02Z

On the 12th we had 39 updateFileSystem errors that were caused by OpenStack network/migration issues.
We can classify the errors into four root causes:

error executing command in allocation: unexpected EOF

This error happens when Poseidon loses connection to the Nomad Server or the Nomad Server loses connection to the Nomad Agent.
However, this error might also happen when we delete the runner while copying files. We should check this.
In the case of the previous comment [1], the Allocation got restarted (and, in response, destroyed). Docker returned "Docker container exited with non-zero exit code: 2" as the reason for the restart. ~~In this case, it might be interesting what the submission was, as tar's exit code 2 might be due to a failing read.~~ This is the exit code of the PID 1 process, not of the tar copy connection.
What is our action point here? This error is caused by an interruption of the connection to the allocation; however, the interruption itself might be caused by the copy payload, a deployment, or migration/network issues.

error executing command in allocation: node down

We expect this error to happen when Nomad still tracks the Job but has no connection to any Nomad agent.
We might extract this root cause into a separate Sentry issue.

no allocation found

We expect this error to happen when a Job is permanently lost because Nomad cannot schedule it (e.g. due to a deployment restart or network issues).
In the case of the previous comment [1-5], the only hint we receive from Nomad is a JobDeregistered event on the Job topic, without any further description and without any notification on the Allocation topic.
We should identify the Nomad event that tells us the job has been lost permanently and handle it. We should fix this in the context of Handle permanently dead Nomad jobs #612

error executing command in allocation: Unknown allocation

This seems to be just a temporary error, as the same request succeeded five seconds later:
time="2024-06-12T11:00:56.292654Z" level=debug code=500 duration=2.015312539s method=PATCH path=/api/v1/runners/29-33eaa850-28a3-11ef-920d-fa163efe023e/files user_agent="Faraday v2.9.0"
time="2024-06-12T11:01:01.543767Z" level=debug code=204 duration=161.572382ms method=PATCH path=/api/v1/runners/29-33eaa850-28a3-11ef-920d-fa163efe023e/files user_agent="Faraday v2.9.0"
Maybe a Nomad event got lost when Poseidon (and Nomad) crashed during a migration. See Investigate leaking allocation storage data #615
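
The groups above differ only in the trailing part of the error message, so events could be told apart by matching on it. The following is a minimal sketch of such a grouping, assuming plain substring matching on the messages quoted in this thread. The names `RootCause` and `classify` are hypothetical and not part of Poseidon, and Python is used only for illustration.

```python
from enum import Enum


class RootCause(Enum):
    """Root-cause groups named in this thread; values are the message fragments."""
    UNEXPECTED_EOF = "unexpected EOF"
    NODE_DOWN = "node down"
    NO_ALLOCATION_FOUND = "no allocation found"
    UNKNOWN_ALLOCATION = "Unknown allocation"
    OTHER = "other"  # fallback label, never used for matching


def classify(message: str) -> RootCause:
    """Return the first root cause whose fragment occurs in the error message."""
    for cause in RootCause:
        if cause is not RootCause.OTHER and cause.value in message:
            return cause
    return RootCause.OTHER


if __name__ == "__main__":
    sample = ("communication with executor failed: nomad error during file copy: "
              "error executing command in job 10-[...]:no allocation found")
    print(classify(sample).name)
```

Messages that match none of the fragments, such as the original `task "default-task" is not running`, fall into `OTHER` and would still need manual inspection.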

MrSerth · 2024-06-13T20:54:55Z

Thanks for checking the various issues and grouping these. For the third (and fourth) group, I feel that the ticket linked and the action identified are fine. For the first two, I am currently not sure how to continue.

error executing command in allocation: unexpected EOF
- It really seems that this issue is mostly network-related (and thus cannot be "resolved" permanently).
- If possible, I would further extract the "delete during copy" issue, since this might be something we could avoid.
- The submission looks normal to me.
error executing command in allocation: node down
- To me, this sounds like something we could try to handle better.

For those requests where Nomad is not reachable (or all nodes are down), we should probably ensure that the API returns a correct error, handle that error gracefully in CodeOcean, and otherwise reduce our logging. When something is permanently broken, we should (hopefully?) notice that through the other monitoring systems.
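
One possible shape of that idea, as a rough sketch only: map the "node down" case to a distinct HTTP status and a lower log level, and keep everything else as it is today. The choice of 503, the helper name `to_api_response`, and the marker tuple are assumptions for illustration and do not describe the actual Poseidon API; Python is used only for illustration.

```python
import logging

# Message fragments treated as "Nomad or its nodes are unavailable" (assumption:
# only the "node down" message quoted in this thread is covered here).
UNAVAILABLE_MARKERS = ("node down",)


def to_api_response(error_message: str) -> tuple[int, int]:
    """Return a (HTTP status, log level) pair for a failed file copy.

    Unavailability is reported as 503 and logged as a warning, so that a
    client can handle it gracefully and error tracking stays quieter.
    Every other failure remains a 500 logged as an error.
    """
    if any(marker in error_message for marker in UNAVAILABLE_MARKERS):
        return 503, logging.WARNING
    return 500, logging.ERROR


if __name__ == "__main__":
    status, level = to_api_response("error executing command in allocation: node down")
    logging.basicConfig(level=logging.DEBUG)
    logging.log(level, "updateFileSystem failed with status %d", status)
```

A client such as CodeOcean could then treat the dedicated status as "try again later" instead of reporting it as an unexpected failure.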

mpass99 · 2024-06-21T12:38:47Z

Thank you for your input. From it, I took the need for issue #619.

From the current perspective, groups 1 and 2 occur very rarely. If we consider it necessary, we might introduce a lookup that includes the error description from the Nomad event stream in the error logging.

mpass99 · 2024-08-07T14:16:20Z

We track the main error, unexpected EOF, in a separate (Sentry) issue, #641, and handle the other error causes in separate Sentry issues. Therefore, all current causes for this high-level error are mitigated.

MrSerth · 2024-08-15T16:56:12Z

This issue just reappeared on production three times. Here's the latest occurrence, which doesn't seem related to unexpected EOF but to no allocation found.

Is there something we missed? This should not be related to the deployment I did an hour earlier, or is it?

mpass99 · 2024-08-16T10:49:33Z

I opened a separate issue for it: #649.
We can't rule out the possibility that the deployments are related.

MrSerth · 2024-08-16T14:05:31Z

Thanks for opening another issue. As mentioned there, I forgot one of the deployments in-between 🙈. Since we have a follow-up issue, I am closing this one again.

mpass99 mentioned this issue Jul 14, 2023

Check Sentry issues #135

Open

This was referenced Aug 21, 2023

Fix/#406 missing log #422

Merged

Merge Context #426

Merged

MrSerth closed this as completed Oct 6, 2023

mpass99 reopened this May 15, 2024

MrSerth added the bug Something isn't working label May 29, 2024

mpass99 mentioned this issue Jun 13, 2024

Investigate leaking allocation storage data #615

Closed

mpass99 closed this as completed Aug 7, 2024

MrSerth reopened this Aug 15, 2024

MrSerth closed this as completed Aug 16, 2024
