partial syncScope livelock #121
Unfortunately this is not fully fixed: https://travis-ci.com/github/mratsim/weave/jobs/323526172#L1854
There is a more problematic root cause, merging for now as it still helps a lot.
Perf note when trying to use both The FSM seems to have slightly higher overhead. 20 ms slower on Fib(40) on both eager (387 ms runtime) and lazy flowvars (204 ms runtime). This is acceptable for now but hopefully we find the real bug.
…titasks in queue (followup #121)
* model checking - 1st try to fix MPSC queue (the model checker crashes with not enough memory :/)
* Give the thread the opportunity to not deadlock on sleep on Mac/with Clang
* whoopsie
* Add impl of Weave MPSC channel in C++ for CDSChecker model checking + comment out fences
* Comment out GEMM tests for syncRoot + Pledges: #97
* don't use sleep, it's can deadlock in the CI ...
* Try get epoch time to avoid mac bugs
* use `getTime` and hope that it's properly implemented on Mac
* State-machine, return to CheckTask to avoid leaving task spawning multitasks in queue (followup #121)
* Don't spinlock for testing, deadlocks ARM and OSX
* Could it be non-mono clock jitter?
* Add some log for MacOS debug
* Race condition between spawning the thread and entering the spinlock in the `isReady` test
* a,d obviously I mess up the function call
Try to address #119
From the previous state machine
And some traces of a stuck process:
What seems to happen is:
* `recvElseSteal`, which is the prologue of the `SB_Steal` state.
Analysis (barring other bugs)
* `SB_Steal`, because all direct child tasks were processed (note that the code assumes that all child tasks are at the beginning of the deque in `popFirstIfChild`).
Conclusion and fix
It is stuck for one of two reasons: either the remaining task was not a direct child but at least a grandchild task, or the ordering assumptions are wrong and an unrelated task that couldn't be popped sits in front of the child.
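To make the two failure cases concrete, here is a toy model in Python (not Weave's Nim code). `Task`, `pop_first_if_child` and `drain_children` are hypothetical stand-ins; the only behavior assumed is the one described above, namely that only the front of the deque is inspected and a task is popped only if it is a direct child.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    parent: Optional[str]  # name of the task that spawned this one


def pop_first_if_child(dq, parent):
    """Pop the front task only if it is a direct child of `parent` (toy model)."""
    if dq and dq[0].parent == parent:
        return dq.popleft()
    return None


def drain_children(dq, parent):
    """Pop direct children from the front until the front is not a child."""
    ran = []
    while (task := pop_first_if_child(dq, parent)) is not None:
        ran.append(task.name)
    return ran


# Assumed layout: all direct children of "root" sit at the front of the deque.
ordered = deque([Task("c1", "root"), Task("c2", "root"), Task("other", "x")])
ran_ordered = drain_children(ordered, "root")

# Problem layout: a grandchild (spawned by "c1") sits in front of a direct child.
blocked = deque([Task("g1", "c1"), Task("c2", "root")])
ran_blocked = drain_children(blocked, "root")
left_blocked = [t.name for t in blocked]

print(ran_ordered, ran_blocked, left_blocked)
```

In the second layout nothing is popped and both tasks stay in the deque, so a worker that then only waits for steal requests never runs them unless another worker takes them.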
weave/weave/state_machines/sync_scope.nim
Lines 123 to 134 in 943d04a
This can happen if all threads are idle.
2 solutions are possible:
* In `SB_Steal`, don't only answer steal requests but also work-sharing requests from idle workers