partial syncScope livelock #121

Merged: 3 commits merged into master from fix-sync-scope-livelock on May 4, 2020

Conversation

mratsim commented Apr 26, 2020

Try to address #119

From the previous state machine:

[syncScope state machine diagram]

And some traces of a stuck process:

[screenshots of the stuck worker traces]

What seems to happen is:

  • The root thread is the only thread left; all the others have backed off.
  • The root thread gets stuck looping in recvElseSteal, which is the prologue of the SB_Steal state.

Analysis (barring other bugs)

  • Hypothesis: All other threads are sleeping because they have no tasks left.
  • Hypothesis: The root thread didn't exit the state machine because it is still waiting on a descendant task.
  • Hypothesis: The root thread is in SB_Steal because all direct child tasks were processed. (Note that the code assumes all child tasks are at the beginning of the deque in popFirstIfChild; a toy sketch of that ordering assumption follows this list.)
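
A minimal, self-contained Nim sketch of that ordering assumption (the Task type, the seq-based deque and this popFirstIfChild are hypothetical illustrations, not Weave's actual data structures): if anything other than a direct child sits at the front of the deque, the loop exits and the child task stays stranded on the worker.

```nim
# Hypothetical toy model: Weave's real deque, task layout and
# popFirstIfChild differ; this only illustrates the ordering assumption.
type Task = object
  id, parent: int

proc popFirstIfChild(dq: var seq[Task]; me: int): bool =
  ## Pop the front task only if it is a *direct* child of `me`.
  if dq.len > 0 and dq[0].parent == me:
    dq = dq[1 .. ^1]   # drop the front task
    return true
  return false

when isMainModule:
  # A grandchild task (parent = 2) sits in front of a direct child
  # (parent = 1): the loop exits immediately and the direct child
  # is never processed locally.
  var dq = @[Task(id: 3, parent: 2), Task(id: 2, parent: 1)]
  while dq.popFirstIfChild(me = 1):
    discard
  doAssert dq.len == 2
```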

Conclusion and fix

  • At least one of the descendant tasks is stuck on the root thread.
    It is stuck either because it is not a direct child but (at least) a grandchild task,
    or because the ordering assumption is wrong and an unrelated task that couldn't be popped sits in front of the child.
  • The root thread didn't receive any steal request that would let it dispatch the stuck tasks:

        behavior(syncScopeFSA):
          steady: SB_Steal
          transition:
            # We might inadvertently remove our own steal request in
            # dispatchElseDecline so resteal
            profile_stop(idle)
            trySteal(isOutOfTasks = false)
            # If someone wants our non-child tasks, let's oblige
            var req: StealRequest
            while recv(req):
              dispatchElseDecline(req)
            profile_start(idle)

    This can happen if all threads are idle.

Two solutions are possible:

  1. Drain the whole task queue before switching to SB_Steal.
  2. Or, in SB_Steal, answer not only steal requests but also work-sharing requests from idle workers (a toy sketch of this control flow follows the list).
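
A minimal, self-contained sketch of the control flow behind solution 2. The names (WorkerSim, thieves, idleChildren, serveWhileStealing) are hypothetical stand-ins, not Weave's actual steal-request or worksharing API; the point is only that surplus tasks get dispatched even when no thief ever asks for them.

```nim
# Hypothetical toy model of solution 2, not Weave's implementation.
type WorkerSim = object
  tasks: seq[string]        # non-child tasks still parked on this worker
  thieves: seq[int]         # workers that sent us a steal request
  idleChildren: seq[int]    # children that backed off to sleep

proc serveWhileStealing(w: var WorkerSim) =
  ## While spinning in the steal phase, answer thieves *and* push surplus
  ## tasks to sleeping children, so nothing stays stranded on this worker
  ## when every other thread is idle.
  while w.thieves.len > 0 and w.tasks.len > 0:
    echo "send ", w.tasks.pop(), " to thief ", w.thieves.pop()
  while w.idleChildren.len > 0 and w.tasks.len > 0:
    echo "wake child ", w.idleChildren.pop(), " with ", w.tasks.pop()

when isMainModule:
  # No thief asks for work (everyone is asleep), yet the stranded task
  # still gets dispatched to an idle child instead of livelocking.
  var w = WorkerSim(tasks: @["grandchild-task"], idleChildren: @[2, 3])
  w.serveWhileStealing()
  doAssert w.tasks.len == 0
```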

mratsim commented Apr 26, 2020

We use solution 2.

No impact on overhead, as measured by Fibonacci with lazy Flowvars (so as not to measure memory overhead), staying under 200 ms:

[benchmark screenshot]

And with normal Flowvars, staying under 400 ms:

[benchmark screenshot]

Load distribution seems to be the same.

What may have changed: during the steal phase of sync and syncScope, the worker now sends its non-direct-child tasks first to its children, which may be sleeping, instead of to its thief. If the task was short, we could have saved energy by sending it only to the thief.
Conversely, the load distribution might be better, since the runtime now gets the opportunity to wake up sleeping threads; otherwise, sleeping threads are only woken up on a successful theft, even when the current workers have extra tasks. In other words, the change is greedier and therefore closer to asymptotically optimal.
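
A small, hypothetical sketch of the two dispatch policies compared above (the enum and proc are illustrative only; the real runtime makes this choice inside its dispatch logic):

```nim
# Hypothetical comparison of the old (thief-first) and new (children-first)
# dispatch order for a surplus task.
type DispatchOrder = enum
  ThiefFirst      # old behaviour: only a thief that asked gets the task
  ChildrenFirst   # new behaviour: sleeping children are offered work first

proc pickTarget(order: DispatchOrder; hasThief, hasSleepingChild: bool): string =
  ## Where a surplus task goes under each policy.
  case order
  of ThiefFirst:
    if hasThief:
      result = "thief"
    else:
      result = "keep task"        # sleepers stay asleep, saving energy
  of ChildrenFirst:
    if hasSleepingChild:
      result = "sleeping child"   # greedier: wakes a worker eagerly
    elif hasThief:
      result = "thief"
    else:
      result = "keep task"

when isMainModule:
  # Short task, one thief available: the old order avoids a wake-up...
  doAssert pickTarget(ThiefFirst, hasThief = true, hasSleepingChild = true) == "thief"
  # ...while the new order wakes a sleeper, trading energy for parallelism.
  doAssert pickTarget(ChildrenFirst, hasThief = true, hasSleepingChild = true) == "sleeping child"
```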

mratsim commented Apr 26, 2020

Unfortunately, this is not fully fixed: https://travis-ci.com/github/mratsim/weave/jobs/323526172#L1854

mratsim commented Apr 26, 2020

After trying to mix both solutions, we still have the bug (now rarer):

[screenshots of the stuck traces, 2020-04-26]

mratsim changed the title from "Fix syncScope livelock" to "partial syncScope livelock" on May 4, 2020

mratsim commented May 4, 2020

There is a more problematic root cause; merging for now, as this still helps a lot.

mratsim merged commit 6dbf0e5 into master on May 4, 2020

mratsim commented May 5, 2020

Perf note when trying to use both solutions:

The FSM seems to have slightly higher overhead: about 20 ms slower on Fib(40) with both eager Flowvars (387 ms runtime) and lazy Flowvars (204 ms runtime).

This is acceptable for now but hopefully we find the real bug.

mratsim added a commit that referenced this pull request on May 9, 2020:
* model checking - 1st try to fix MPSC queue (the model checker crashes with not enough memory :/)

* Give the thread the opportunity to not deadlock on sleep on Mac/with Clang

* whoopsie

* Add impl of Weave MPSC channel in C++ for CDSChecker model checking + comment out fences

* Comment out GEMM tests for syncRoot + Pledges: #97

* don't use sleep, it can deadlock in the CI ...

* Try get epoch time to avoid mac bugs

* use `getTime` and hope that it's properly implemented on Mac

* State-machine, return to CheckTask to avoid leaving task spawning multitasks in queue (followup #121)

* Don't spinlock for testing, deadlocks ARM and OSX

* Could it be non-mono clock jitter?

* Add some log for MacOS debug

* Race condition between spawning the thread and entering the spinlock in the `isReady` test

* and obviously I messed up the function call
mratsim deleted the fix-sync-scope-livelock branch on May 17, 2020.