Livelock in syncScope #119

Closed
mratsim opened this issue Apr 26, 2020 · 2 comments
Comments

Owner

mratsim commented Apr 26, 2020

In PR #118, the Azure tests are passing, but 6 of the 8 Travis tests are failing with "no output received in the past 10 min":

https://travis-ci.com/github/mratsim/weave/builds/162064695

Example

========================================================================================
Running [ c -d:danger ] benchmarks/matmul_gemm_blas/test_gemm_output.nim
========================================================================================
Test [2x2] * [2x2] -> [2x2]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x3] -> [2x3]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x9] -> [2x9]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x37] -> [2x37]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x129] -> [2x129]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x700] -> [2x700]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x2] -> [2x2]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x3] -> [2x3]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x9] -> [2x9]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x37] -> [2x37]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x129] -> [2x129]

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated
Owner Author

mratsim commented Apr 26, 2020

I reproduced this at random on my machine and captured a GDB backtrace:

[image]

(gdb) bt
#0  0x0000558879f3c30b in findVictim__5OLJbJhLYuFgICzMiLhy5Q ()
#1  0x0000558879f3c708 in trySteal__9agQ9cSofpr3mmBNymbHAfPA ()
#2  0x0000558879f3d6c0 in recvElseSteal__gpjNZOOOBMUGVrsaD0EE1Q ()
#3  0x0000558879f3f7df in wait__OvJxRK5afaM0uqnaxQ4veA ()
#4  0x0000558879f5d61c in gemm_impl__hBgdOPbRKz85JKQLrSn0uw ()
#5  0x0000558879f6063d in gemm_strided_nestable__g3VhDY0FncuQ6N3TvHnwfg ()
#6  0x0000558879f60cbb in testVsReference__9b4XSi2lcCNz75qje9cE9aQQw ()
#7  0x0000558879f6100c in NimMainInner ()
#8  0x0000558879f61190 in NimMain ()
#9  0x0000558879f272bd in main ()

mratsim added a commit that referenced this issue Apr 26, 2020
mratsim changed the title from "Livelock or Deadlock (?)" to "Livelock in syncScope" Apr 26, 2020
mratsim added a commit that referenced this issue May 4, 2020
* Fix syncScope livelock #119

* Alternative fix by ensuring dispatch to idle workers

* stash - this seems hopeless
Owner Author

mratsim commented May 9, 2020

The state machine rework should prevent the root thread from being left alone with tasks created in its own queue that it cannot process: c31e45f

See the previous state machine:
[image]

And the new one:
[image]

Returning to CheckTask ensures that a worker exhausts its queue and doesn't leave any task there. Previously it would only run the stolen task, which might spawn new non-awaited tasks or enqueue delayed tasks.
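
To make the difference concrete, here is a toy Python model of a single worker waiting at a sync scope. It is only a sketch under stated assumptions and not Weave's actual code: the function `run_sync_scope`, the `drain_own_queue` flag, the task names and the bounded retry count are all hypothetical, and every other worker is assumed to be idle with nothing more to offer after the first successful steal.

```python
from collections import deque


def run_sync_scope(drain_own_queue, max_steal_attempts=1000):
    """Toy model of one worker waiting at a sync scope.

    Returns True if every pending task completes, False if the worker
    is still spinning after max_steal_attempts (a stand-in for livelock).
    """
    own_queue = deque()            # tasks this worker spawned itself
    stealable = deque(["parent"])  # one task offered by a hypothetical victim
    pending = 1                    # tasks the scope is still waiting for

    def run(task):
        nonlocal pending
        if task == "parent":
            # The stolen task spawns two non-awaited children locally.
            own_queue.extend(["child-1", "child-2"])
            pending += 2
        pending -= 1

    for _ in range(max_steal_attempts):
        if pending == 0:
            return True
        if drain_own_queue:
            # Assumed "CheckTask" step: exhaust the local queue
            # before attempting another steal.
            while own_queue:
                run(own_queue.popleft())
            if pending == 0:
                return True
        if stealable:
            run(stealable.popleft())
        # Otherwise the steal failed; loop and try again.
    return False


old_behaviour = run_sync_scope(drain_own_queue=False)
new_behaviour = run_sync_scope(drain_own_queue=True)
```

In this model the old loop runs the stolen task once and then keeps retrying failed steals while the two children sit in its own queue, so `old_behaviour` is `False`. With the drain step, the children run on the next pass and `new_behaviour` is `True`.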

If any stalls remain, they are probably related to parallel for: #130

@mratsim mratsim closed this as completed May 9, 2020