Fix hangs due to tasks being stuck inside of local queues #104
Conversation
I've confirmed that these tests pass on …
Bevy was failing because we were ticking the executor on another thread that wasn't spawning new tasks, as a way to send and execute futures on a single target thread. All of these tests seem to either block on or locally tick the executor in one way or another.
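For context, here is a minimal sketch of that pattern (not Bevy's actual code) using async-executor's public API: a dedicated thread does nothing but tick the shared executor, while other threads send work to it by spawning.

    // A minimal sketch, not Bevy's actual setup: work is sent to one target
    // thread by spawning onto a shared Executor that only that thread ticks.
    use std::sync::Arc;
    use async_executor::Executor;

    fn main() {
        let ex: Arc<Executor<'static>> = Arc::new(Executor::new());

        // Dedicated "target" thread that only ticks the executor instead of
        // calling Executor::run, so it never participates as a runner.
        let ticker = ex.clone();
        std::thread::spawn(move || {
            futures_lite::future::block_on(async {
                loop {
                    // tick() waits for a scheduled task, polls it once, and returns.
                    ticker.tick().await;
                }
            })
        });

        // Any other thread can push work onto that thread by spawning here.
        let task = ex.spawn(async { 1 + 1 });
        assert_eq!(futures_lite::future::block_on(task), 2);
    }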
use std::time::Duration;

fn do_run<Fut: Future<Output = ()>>(mut f: impl FnMut(Arc<Executor<'static>>) -> Fut) {
    // This should not run for longer than two minutes.
Might want to split this into multiple tests to figure out which specific ones are deadlocking and which are working.
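A hedged sketch of what that split might look like, built around a timeout-guarded helper in the spirit of the do_run snippet above; the helper body and test names here are illustrative, not the PR's actual code.

    use std::sync::{mpsc, Arc};
    use std::time::Duration;
    use async_executor::Executor;

    // Illustrative helper: run one scenario on a background thread and fail
    // loudly if it doesn't finish within two minutes, instead of hanging CI.
    fn run_with_timeout(f: impl FnOnce(Arc<Executor<'static>>) + Send + 'static) {
        let (tx, rx) = mpsc::channel();
        std::thread::spawn(move || {
            f(Arc::new(Executor::new()));
            let _ = tx.send(());
        });
        rx.recv_timeout(Duration::from_secs(120))
            .expect("test deadlocked: did not finish within two minutes");
    }

    // One scenario per #[test], so a hang points at the exact case that deadlocks.
    #[test]
    fn ticker_only_drives_tasks() {
        run_with_timeout(|ex| {
            futures_lite::future::block_on(async {
                let task = ex.spawn(async { 1 + 1 });
                // Drive the executor purely by ticking, the case that used to hang.
                while !task.is_finished() {
                    ex.tick().await;
                }
                assert_eq!(task.await, 2);
            });
        });
    }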
This has a worrying effect on benchmarks. Specifically, it significantly impacts the time needed to spawn futures.
Still, removing bugs takes priority over speed here.
A cursory test on Bevy's end seems to confirm that this fixes the issue; however, unlike #102, this fix causes upwards of a 50% regression in frame time, with our scheduler tasks taking up to 10x longer on average. It does seem like all of the threads are actively running tasks now, which tackles #100, but the contention on them seems to negatively impact all of our running tasks. From the instrumentation on the scheduler tasks on a common Bevy benchmark:
src/lib.rs
    return false;
}
// Try for both again.
if let Some(runnable) = local_ticker(true).or_else(|| self.queue.pop().ok()) {
Do we usually try to pull from self.queue.pop twice here?
No, local_ticker only reads from local queues.
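For illustration only (this is not the crate's actual internals, and the types and helper names below are stand-ins), the fallback order the snippet expresses looks roughly like this: local queues are drained first, and the shared global queue is popped at most once per pass.

    use concurrent_queue::ConcurrentQueue;

    struct MiniState {
        // Shared, global queue of runnables (unit placeholders here).
        queue: ConcurrentQueue<()>,
        // Per-runner local queues.
        local_queues: Vec<ConcurrentQueue<()>>,
    }

    impl MiniState {
        // Stand-in for local_ticker: reads *only* from the local queues.
        fn local_ticker(&self) -> Option<()> {
            self.local_queues.iter().find_map(|q| q.pop().ok())
        }

        fn next_runnable(&self) -> Option<()> {
            // Local queues first; the global queue is popped at most once here.
            self.local_ticker().or_else(|| self.queue.pop().ok())
        }
    }

    fn main() {
        let state = MiniState {
            queue: ConcurrentQueue::unbounded(),
            local_queues: vec![ConcurrentQueue::unbounded()],
        };
        state.local_queues[0].push(()).unwrap();
        // Comes from the local queue; the global queue is untouched.
        assert!(state.next_runnable().is_some());
        assert!(state.queue.is_empty());
    }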
This is fixable, but it'll have to be fixed in a later release. Since …
We're using …
Why does this change interact with the way tasks are spawned (in benchmarks)?
Because spawning a task immediately schedules it.
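That is visible in the async-task API that async-executor builds on: creating a task hands back a Runnable, and scheduling it pushes onto a queue right away, so every spawn pays for at least one queue operation before anything is polled. A rough sketch, where the queue below is only a stand-in for the executor's internal one:

    use std::sync::Arc;
    use async_task::Runnable;
    use concurrent_queue::ConcurrentQueue;

    fn main() {
        let queue: Arc<ConcurrentQueue<Runnable>> = Arc::new(ConcurrentQueue::unbounded());

        // The schedule callback is invoked as soon as the task is spawned (and
        // on every subsequent wake-up): it pushes the Runnable onto the queue.
        let q = queue.clone();
        let schedule = move |runnable| q.push(runnable).unwrap();
        let (runnable, task) = async_task::spawn(async { 1 + 1 }, schedule);

        // Spawning hands the caller a Runnable that gets scheduled immediately,
        // so spawn cost tracks contention on the queue it is pushed onto.
        runnable.schedule();

        // A runner would normally pop and run this; done inline here.
        while let Ok(runnable) = queue.pop() {
            runnable.run();
        }
        assert_eq!(futures_lite::future::block_on(task), 2);
    }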
This should catch the errors from earlier.
Signed-off-by: John Nunley <[email protected]>
It turns out that with the current strategy it is possible for tasks to be stuck in the local queue without any hope of being picked back up. In practice this seems to happen when the only entities polling the system are tickers, as opposed to runners. Since tickers don't steal tasks, it is possible for tasks to be left over in the local queue that don't filter out.

One possible solution is to make it so tickers steal tasks, but this kind of defeats the point of tickers. So I've instead elected to replace the current strategy with one that accounts for the corner cases with local queues.

The main difference is that I replace the Sleepers struct with two event_listener::Event instances: one that handles tickers subscribed to the global queue and one that handles tickers subscribed to the local queue. The other main difference is that each local queue now has a reference counter. If this count reaches zero, no tasks will be pushed to this queue. Only runners increment or decrement this counter.

This makes the previously instituted tests pass, so hopefully this works for most use cases.

Signed-off-by: John Nunley <[email protected]>
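A rough sketch of that strategy as described in the commit message; the type and field names below are mine, not the crate's, and the real implementation involves more (stealing, waking policy, and so on).

    use std::sync::atomic::{AtomicUsize, Ordering};
    use concurrent_queue::ConcurrentQueue;
    use event_listener::Event;

    struct LocalQueue<T> {
        queue: ConcurrentQueue<T>,
        // Number of live runners on this queue; only runners touch this.
        runners: AtomicUsize,
    }

    struct SketchState<T> {
        global: ConcurrentQueue<T>,
        locals: Vec<LocalQueue<T>>,
        // Wakes tickers waiting on the global queue.
        global_event: Event,
        // Wakes waiters subscribed to the local queues.
        local_event: Event,
    }

    impl<T> SketchState<T> {
        fn push_local(&self, index: usize, task: T) -> Result<(), T> {
            let local = &self.locals[index];
            // If no runner holds this queue, refuse to push: the task could
            // otherwise sit there forever, which is exactly the hang being fixed.
            if local.runners.load(Ordering::Acquire) == 0 {
                return Err(task);
            }
            local.queue.push(task).map_err(|e| e.into_inner())?;
            // Wake one waiter subscribed to local queues.
            self.local_event.notify(1);
            Ok(())
        }

        fn push_global(&self, task: T) {
            let _ = self.global.push(task);
            self.global_event.notify(1);
        }
    }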