bail on broken pipe when reading udp streams #509
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: arlyon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Tests failed. @containers/packit-build please check.
@@ -87,6 +87,7 @@ impl CoreDns {
            },
            v = receiver.next() => {
                let msg_received = match v {
+                   Some(Err(err)) if err.kind() == std::io::ErrorKind::BrokenPipe => {break},
This doesn't look right: you break the loop for tcp and udp, and there is no logic to "retry" (open the socket again). At the very least the entry would need to be removed from the thread_handles map so that a new container spawn would allow us to add it again.
And we should likely check what kinds of errors we can get here, and which of those are fatal and which are recoverable in general when reading from the socket. EPIPE just doesn't make any sense to me; it is not documented for recvmsg, which should be the underlying syscall (or does the tokio abstraction do anything weird?).
Can you attach strace to it and provide the output from when this happens?
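To illustrate the kind of fatal/recoverable split being asked for here, a minimal sketch follows. The function name and the exact set of error kinds are illustrative assumptions, not what aardvark-dns actually implements:

```rust
use std::io::ErrorKind;

// Hypothetical classification: which read errors should tear down the
// listener loop versus be logged and retried. The chosen variants are
// assumptions for illustration only.
fn is_fatal_read_error(kind: ErrorKind) -> bool {
    matches!(
        kind,
        // The socket is effectively gone; continuing to poll it would
        // just spin the loop (the 100% CPU symptom described below).
        ErrorKind::BrokenPipe | ErrorKind::NotConnected
    )
}

fn main() {
    assert!(is_fatal_read_error(ErrorKind::BrokenPipe));
    // A malformed datagram should not kill the whole listener.
    assert!(!is_fatal_read_error(ErrorKind::InvalidData));
    println!("classification ok");
}
```

With a helper like this, the match arm in the diff above could branch on the classification rather than hard-coding BrokenPipe.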
I will attempt to trigger it again and see what I can do. I have attached a test as well that demonstrates the hang when a broken pipe occurs using just a simple facade.
I agree that simply breaking the loop is not ideal but it at least stops the server from completely hanging.
(also thanks for the quick reply!)
Without understanding why this happens, it doesn't make much sense to just merge "random" fixes and hope for the best. I don't see what generates EPIPE here, so I would like to see at least an strace to understand which syscall is returning it. I agree there may be fatal errors that should cause us to abort, but I presume that is not specific to this one error, and not specific to udp either?
And if we do that, then we must update the thread_handles correctly and, I guess, try to bind the socket again.
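A minimal sketch of the recovery path described here, using plain std sockets: drop the dead listener's entry from the handle map, then bind a fresh socket. The map type, the string key, and the loopback address are assumptions for illustration; the real code keeps task handles, not sockets, in thread_handles:

```rust
use std::collections::HashMap;
use std::net::UdpSocket;

// Hypothetical recovery helper: remove the dead socket's entry so a
// later container spawn can re-add it, then bind a replacement.
fn rebind(
    handles: &mut HashMap<String, UdpSocket>,
    key: &str,
    addr: &str,
) -> std::io::Result<()> {
    // Forget the broken listener first, so the map never holds a
    // socket we know is dead.
    handles.remove(key);
    // Port 0 lets the OS pick a free port in this sketch.
    let sock = UdpSocket::bind(addr)?;
    handles.insert(key.to_string(), sock);
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut handles = HashMap::new();
    handles.insert("podman0".to_string(), UdpSocket::bind("127.0.0.1:0")?);
    rebind(&mut handles, "podman0", "127.0.0.1:0")?;
    assert!(handles.contains_key("podman0"));
    println!("rebound");
    Ok(())
}
```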
I attached strace on my coreOS instance using a rootful toolbox container and have managed to trigger the issue again but rather stupidly did not trace child threads / processes. Trying again now.
Force-pushed from a91adc4 to 49dd0b8.
Force-pushed from 49dd0b8 to c6a2fe3.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
FYI, I looked around a bit and there seem to be some errors hickory considers hard errors when reading from the socket, but they do not include EPIPE: hickory-dns/hickory-dns@d2e64d8
So I would still like to understand how you can get EPIPE here.
}

// we need 2 threads or tokio::spawn will block since it never yields
#[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 2))]
What is the point of test_log? It seems this can just be dropped, so we do not need an extra dependency.
I have an issue with netavark + aardvark-dns + podman where under certain conditions (in my case a container running tailscale as an exit node) aardvark-dns will end up stuck burning 100% cpu (and taking down container dns resolution in the process).
When it happens, aardvark-dns continually prints log lines along the lines of "Error parsing dns message: broken pipe".
I can't see anywhere that we bail as a result, so this PR attempts to resolve that. I am hoping that if we break the loop and the socket is closed, it will just reconnect and continue to work as usual. I am still figuring out how to get this into my version of coreOS (since it is quite deeply integrated) so that I can try to repro, but I wanted to open this PR in case there was any insight here (or to find out whether this line of reasoning is sound).
Additionally, some advice on writing a test for this would be very helpful. I am trying to set up a basic rust test but it is very involved.
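As a starting point for such a test, here is a self-contained UDP round trip using only std sockets, no tokio. The addresses, payloads, and timeout are assumptions for illustration, not the actual aardvark-dns test setup; a read timeout is what keeps a test like this from hanging forever, which is the very failure mode the PR describes:

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // OS-assigned loopback ports keep the test hermetic.
    let server = UdpSocket::bind("127.0.0.1:0")?;
    let client = UdpSocket::bind("127.0.0.1:0")?;
    client.send_to(b"query", server.local_addr()?)?;

    // Without a timeout, a lost or swallowed datagram would make
    // recv_from block forever and hang the whole test.
    server.set_read_timeout(Some(Duration::from_secs(1)))?;
    let mut buf = [0u8; 512];
    let (n, peer) = server.recv_from(&mut buf)?;
    assert_eq!(&buf[..n], &b"query"[..]);
    server.send_to(b"reply", peer)?;
    println!("round trip ok");
    Ok(())
}
```

A hang test could build on this by closing one end mid-exchange and asserting that the reader returns an error within the timeout instead of blocking.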