partial syncScope livelock #121

Merged: 3 commits merged into master from fix-sync-scope-livelock on May 4, 2020

Conversation

mratsim commented Apr 26, 2020

Try to address #119

From the previous state machine:

[syncScope state machine diagram]

And some traces of a stuck process:

[screenshots of the stuck worker traces]

What seems to happen is:

  • The root thread is the only thread left; all the others have backed off.
  • The root thread gets stuck looping in recvElseSteal, which is the prologue of the SB_Steal state.

Analysis (barring other bugs)

  • Hypothesis: All other threads are sleeping because they have no tasks left.
  • Hypothesis: The root thread didn't exit the state machine because it is still waiting on a descendant task.
  • Hypothesis: The root thread is in SB_Steal because all direct child tasks were processed. (Note that the code assumes all child tasks are at the beginning of the deque in popFirstIfChild; a toy sketch of that ordering assumption follows this list.)
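
A minimal, self-contained Nim sketch of that ordering assumption (the Task type, the seq-based deque and this popFirstIfChild are hypothetical illustrations, not Weave's actual data structures): if anything other than a direct child sits at the front of the deque, the loop exits and the child task stays stranded on the worker.

```nim
# Hypothetical toy model: Weave's real deque, task layout and
# popFirstIfChild differ; this only illustrates the ordering assumption.
type Task = object
  id, parent: int

proc popFirstIfChild(dq: var seq[Task]; me: int): bool =
  ## Pop the front task only if it is a *direct* child of `me`.
  if dq.len > 0 and dq[0].parent == me:
    dq = dq[1 .. ^1]   # drop the front task
    return true
  return false

when isMainModule:
  # A grandchild task (parent = 2) sits in front of a direct child
  # (parent = 1): the loop exits immediately and the direct child
  # is never processed locally.
  var dq = @[Task(id: 3, parent: 2), Task(id: 2, parent: 1)]
  while dq.popFirstIfChild(me = 1):
    discard
  doAssert dq.len == 2
```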

Conclusion and fix

  • At least one of the descendant tasks is stuck on the root thread.
    It is stuck either because it is not a direct child but (at least) a grandchild task,
    or because the ordering assumption is wrong and an unrelated task that couldn't be popped sits in front of the child.
  • The root thread didn't receive any steal request that would let it dispatch the stuck tasks:

        behavior(syncScopeFSA):
          steady: SB_Steal
          transition:
            # We might inadvertently remove our own steal request in
            # dispatchElseDecline so resteal
            profile_stop(idle)
            trySteal(isOutOfTasks = false)
            # If someone wants our non-child tasks, let's oblige
            var req: StealRequest
            while recv(req):
              dispatchElseDecline(req)
            profile_start(idle)

    This can happen if all threads are idle.

Two solutions are possible:

  1. Drain the whole task queue before switching to SB_Steal.
  2. Or, in SB_Steal, answer not only steal requests but also work-sharing requests from idle workers (a toy sketch of this control flow follows the list).
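
A minimal, self-contained sketch of the control flow behind solution 2. The names (WorkerSim, thieves, idleChildren, serveWhileStealing) are hypothetical stand-ins, not Weave's actual steal-request or worksharing API; the point is only that surplus tasks get dispatched even when no thief ever asks for them.

```nim
# Hypothetical toy model of solution 2, not Weave's implementation.
type WorkerSim = object
  tasks: seq[string]        # non-child tasks still parked on this worker
  thieves: seq[int]         # workers that sent us a steal request
  idleChildren: seq[int]    # children that backed off to sleep

proc serveWhileStealing(w: var WorkerSim) =
  ## While spinning in the steal phase, answer thieves *and* push surplus
  ## tasks to sleeping children, so nothing stays stranded on this worker
  ## when every other thread is idle.
  while w.thieves.len > 0 and w.tasks.len > 0:
    echo "send ", w.tasks.pop(), " to thief ", w.thieves.pop()
  while w.idleChildren.len > 0 and w.tasks.len > 0:
    echo "wake child ", w.idleChildren.pop(), " with ", w.tasks.pop()

when isMainModule:
  # No thief asks for work (everyone is asleep), yet the stranded task
  # still gets dispatched to an idle child instead of livelocking.
  var w = WorkerSim(tasks: @["grandchild-task"], idleChildren: @[2, 3])
  w.serveWhileStealing()
  doAssert w.tasks.len == 0
```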

mratsim commented Apr 26, 2020

We use solution 2.

No impact on overhead, as measured by Fibonacci with lazy Flowvars (so as not to measure memory overhead), staying under 200 ms:

[benchmark screenshot]

And with normal Flowvars, staying under 400 ms:

[benchmark screenshot]

Load distribution seems to be the same.

What may have changed: during the steal phase of sync and syncScope, the worker now sends its non-direct-child tasks first to its children, which may be sleeping, instead of to its thief. If the task was short, we could have saved energy by sending it only to the thief.
Conversely, the load distribution might be better, since the runtime now gets the opportunity to wake up sleeping threads; otherwise, sleeping threads are only woken up on a successful theft, even when the current workers have extra tasks. In other words, the change is greedier and therefore closer to asymptotically optimal.
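
A small, hypothetical sketch of the two dispatch policies compared above (the enum and proc are illustrative only; the real runtime makes this choice inside its dispatch logic):

```nim
# Hypothetical comparison of the old (thief-first) and new (children-first)
# dispatch order for a surplus task.
type DispatchOrder = enum
  ThiefFirst      # old behaviour: only a thief that asked gets the task
  ChildrenFirst   # new behaviour: sleeping children are offered work first

proc pickTarget(order: DispatchOrder; hasThief, hasSleepingChild: bool): string =
  ## Where a surplus task goes under each policy.
  case order
  of ThiefFirst:
    if hasThief:
      result = "thief"
    else:
      result = "keep task"        # sleepers stay asleep, saving energy
  of ChildrenFirst:
    if hasSleepingChild:
      result = "sleeping child"   # greedier: wakes a worker eagerly
    elif hasThief:
      result = "thief"
    else:
      result = "keep task"

when isMainModule:
  # Short task, one thief available: the old order avoids a wake-up...
  doAssert pickTarget(ThiefFirst, hasThief = true, hasSleepingChild = true) == "thief"
  # ...while the new order wakes a sleeper, trading energy for parallelism.
  doAssert pickTarget(ChildrenFirst, hasThief = true, hasSleepingChild = true) == "sleeping child"
```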

mratsim commented Apr 26, 2020

Unfortunately, this is not fully fixed: https://travis-ci.com/github/mratsim/weave/jobs/323526172#L1854

mratsim commented Apr 26, 2020

After trying to mix both solutions, we still have the bug (now rarer):

[screenshots of the stuck traces, 2020-04-26]

mratsim changed the title from "Fix syncScope livelock" to "partial syncScope livelock" on May 4, 2020

mratsim commented May 4, 2020

There is a more problematic root cause; merging for now, as this still helps a lot.

mratsim merged commit 6dbf0e5 into master on May 4, 2020

mratsim commented May 5, 2020

Perf note when trying to use both solutions:

The FSM seems to have slightly higher overhead: about 20 ms slower on Fib(40) with both eager Flowvars (387 ms runtime) and lazy Flowvars (204 ms runtime).

This is acceptable for now but hopefully we find the real bug.

mratsim added a commit that referenced this pull request on May 9, 2020:
* model checking - 1st try to fix MPSC queue (the model checker crashes with not enough memory :/)

* Give the thread the opportunity to not deadlock on sleep on Mac/with Clang

* whoopsie

* Add impl of Weave MPSC channel in C++ for CDSChecker model checking + comment out fences

* Comment out GEMM tests for syncRoot + Pledges: #97

* don't use sleep, it can deadlock in the CI ...

* Try get epoch time to avoid mac bugs

* use `getTime` and hope that it's properly implemented on Mac

* State-machine, return to CheckTask to avoid leaving task spawning multitasks in queue (followup #121)

* Don't spinlock for testing, deadlocks ARM and OSX

* Could it be non-mono clock jitter?

* Add some log for MacOS debug

* Race condition between spawning the thread and entering the spinlock in the `isReady` test

* and obviously I messed up the function call
mratsim deleted the fix-sync-scope-livelock branch on May 17, 2020.