Race condition/bug on waiting for and killing process #80
I'm glad you filed the issue. I'll try to brainstorm with you here, and maybe one of the cases I know of will sound like it could fit your situation.

Thinking about long-lasting, uninterruptible system calls

I don't know of any way […]. That said, this scenario wouldn't fit with the detail of the child process being already gone in the process monitor. But there's a lot that confuses me about that detail. How can the waiting thread […]? Are you certain the child process is gone? That's a very important detail if so, and it strongly suggests some deadlock to do with duct's internal condvar (immediately below). But if there's any chance your process monitor might be e.g. hiding zombie processes or something, that could send us down the wrong track.

Thinking about deadlocks in duct

We do have a couple of mutexes that we acquire in different places. It's possible there's a deadlock there. But we should always be taking the […]

macOS specific issues

One platform issue we have on macOS is that it doesn't implement […]

Windows specific issues

Windows had an issue similar to the above, but it was a bug in […]

grandchildren issues

I have a longer discussion about grandchildren in the gotchas file. If your child processes are spawning further child processes of their own, you could be running into an issue with grandchildren.
Ah, it looks like you're using duct v0.12 in most places. In that version, I wonder if these lines could be the issue:
The idea there is that we've already seen that the child has exited, so calling […]
Thanks for the very thorough brainstorming. Regarding the long-lasting, uninterruptible system call: our subprocess here is OpenVPN. It does not do anything related to FUSE to my knowledge. I'm not sure exactly what other system calls it makes, but as you say, that's probably irrelevant since the process was gone from the process table by the time I got to inspect the computer.
I did […]
I think we only have one thread waiting for it. But the kill might come in from different threads. And the way we implement killing the process nicely is by first sending a SIGTERM and then wait/polling the process with […]
I don't think OpenVPN uses any child processes in the setup we use.
We use […]
Out of curiosity, how do you send SIGTERM? Duct hasn't exposed a way to get the child PID, so you must be using some side channel? (As an aside, I really should expose such a thing.)
Here's a test program that tries to discover any deadlocks in […]:

```toml
[package]
name = "example"
version = "0.1.0"
edition = "2018"

[dependencies]
shared_child = "0.3.4"
libc = "0.2.65"
os_pipe = "0.9.1"
anyhow = "1.0.18"
crossbeam-utils = "0.6.6"
```

```rust
use anyhow::Result;
use shared_child::unix::SharedChildExt;
use shared_child::SharedChild;
use std::os::unix::prelude::*;
use std::process::{Command, ExitStatus};
use std::sync::Barrier;

fn main() -> Result<()> {
    // Loop forever spawning a child and racing to kill it with both SIGTERM
    // and SIGKILL, while waiting with both wait and try_wait.
    loop {
        let handle = SharedChild::spawn(Command::new("/usr/bin/sleep").arg("1000"))?;
        let barrier = Barrier::new(4);
        crossbeam_utils::thread::scope(|s| -> Result<()> {
            let sigterm_thread = {
                s.spawn(|_| -> Result<()> {
                    barrier.wait();
                    handle.send_signal(libc::SIGTERM)?;
                    Ok(())
                })
            };
            let sigkill_thread = {
                s.spawn(|_| -> Result<()> {
                    barrier.wait();
                    handle.kill()?;
                    Ok(())
                })
            };
            let try_wait_thread = {
                s.spawn(|_| -> Result<Option<ExitStatus>> {
                    barrier.wait();
                    let maybe_status = handle.try_wait()?;
                    Ok(maybe_status)
                })
            };
            barrier.wait();
            let status = handle.wait()?;
            let maybe_status = match try_wait_thread.join().unwrap()? {
                Some(status) => status.signal().unwrap().to_string(),
                None => "_".to_string(),
            };
            sigterm_thread.join().unwrap()?;
            sigkill_thread.join().unwrap()?;
            eprint!("{}/{} ", status.signal().unwrap(), maybe_status);
            Ok(())
        })
        .unwrap()?;
    }
}
```
Sorry, my bad. I thought we did, and we probably did, or tried to, at some point. But since that does not exist on Windows, and we need to make OpenVPN close gracefully on all platforms, we went with another solution. We have our own patched version of OpenVPN that tries to read from stdin, and whenever the stdin pipe is closed it treats that the same way it would treat a Ctrl-C/SIGTERM. So the way we gracefully close it is by attaching a pipe to stdin and then just closing that pipe.
Yes! Just giving access to the PID is important and a nice thing to have. Otherwise people can't work around limitations in the library. Maybe I want to read the […]
Just uploaded v0.13.3 with a new […]
You probably know this, but if you pass any pipes to […]
I did not. But we don't store the Expression anywhere; it goes out of scope just after spawning the child anyway. This works fine 99.999% of the time, so that's not the issue here.
This is going to be a very vague issue report, sorry for that. But we have seen this for quite some time now, and I figured better a vague report than nothing. It could be that we do something wrong on our end, but we think there might be a bug in `duct`. This only shows up very rarely and we don't have any reliable way to reproduce it, which is why we have not been able to debug it properly. But we are hit with what we think is a race condition or other type of bug related to waiting for and killing subprocesses in https://github.com/mullvad/mullvadvpn-app/

We have one thread blocked on `handle.wait()` to detect if the subprocess dies. And sometimes we send a `SIGTERM` from another thread, wait a few seconds, and then try to force-kill the process iff it's still alive. This works 99.999% of the time. But sometimes our logs report that the handle thinks the process is alive after the SIGTERM and it goes on to kill it, only to get stuck in `kill()`, forever. The thread waiting in `wait()` never returns either. When looking at the active processes it seems the subprocess is gone already. So both threads are stuck on `wait()`/`kill()` for a PID that no longer exists.

This seems most common on macOS, but I think we have seen issues that could be the same thing on Linux and Windows as well. But not sure.