[Bug]: pessimistic mode:pipeline hung #10882
Comments
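The BVT for pessimistic transactions fails on CI with very high probability. Every failure is a hang in one of two places. After analysis, the main cause is that the ctx has been cancelled, so the receive from the channel never gets a value and blocks forever.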
I'm still working on other S-1 issues, so I've asked qx to take a first look at this one. No need to change the owner for now.
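It's hard to say this is only a pipeline hang. In https://github.com/jensenojs/matrixone/actions/runs/5711750669/job/15473912910 the mo-tester side also reports "no return result in 12000 ms", but that insert failure did not cause a chain of failures in the following cases. The failing cases are currently concentrated in unique_index.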
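The hang at (1), caused by missing the context cancellation, has been fixed by #10946. However, the fix PR still occasionally has other cases fail, and the hang at (2) still shows up.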
I'll take a look at (2).
Today isn't over yet, but there probably won't be any progress. Who knows... sad.
Sigh. All I can say is that I'll try to pin down the problem tomorrow.
2023/08/07 17:32:39.354906 +0800 ERROR lockop/lock_op.go:268 error: txn need retry in rc mode {"span": {"trace_id": "124985b8-3505-11ee-be66-46f6e41a5fe4", "kind": "statement"}}
2023/08/07 17:32:39.355233 +0800 ERROR types/tuple.go:709 error: internal error: unable to decode tuple element with unknown typecode 35
2023/08/07 17:32:39.355275 +0800 ERROR function/list_builtIn.go:4846 error: Duplicate entry '50' for key 'c' {"span": {"trace_id": "124985b8-3505-11ee-be66-46f6e41a5fe4", "kind": "statement"}}
2023/08/07 17:32:39.355318 +0800 ERROR frontend/util.go:552 Duplicate entry '50' for key 'c' {"session_info": "connectionId 116|127.0.0.1:53902|{account sys:dump:moadmin -- 0:1:0}|goRoutineId 48869|123923e4-3505-11ee-be66-46f6e41a5fe4", "session_id": "116", "statement_id": "\u0012I\ufffd\ufffd5\u0005\u0011\ufffd\ufffdfF\ufffd\ufffd\u001a_\ufffd"}
2023/08/07 17:32:39.355386 +0800 ERROR frontend/util.go:552 query trace status {"connection_id": 116, "statement": "Insert into test_11 values(50,50)", "status": "fail", "error": "Duplicate entry '50' for key 'c'", "span": {"trace_id": "124985b8-3505-11ee-be66-46f6e41a5fe4", "kind": "statement"}, "session_info": "connectionId 116|127.0.0.1:53902|{account sys:dump:moadmin -- 0:1:0}|goRoutineId 48869|123923e4-3505-11ee-be66-46f6e41a5fe4", "session_id": "116", "statement_id": "\u0012I\ufffd\ufffd5\u0005\u0011\ufffd\ufffdfF\ufffd\ufffd\u001a_\ufffd"}
2023/08/07 17:32:39.355630 +0800 ERROR compile/scope.go:139 [cms] start to receive pre, sql is , p is 0x14000f23998, len(pre) = 1, len(remote) = 0

The hang may be caused by a retry that happens after the duplicate-key error is reported. Not entirely sure.
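The problem was caused by incorrect handling of NULL values during the unique-constraint duplicate check. It has been fixed in the PR. I ran it 3 times and the problem did not recur.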
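The comment does not say exactly how NULL was mishandled, so the following is only an illustrative sketch under an assumption: that the duplicate check should follow the usual MySQL-style rule where NULL keys never collide with each other in a unique index. The names firstDuplicate and ptr are hypothetical, and a nil pointer stands in for SQL NULL; none of this is the actual matrixone implementation.

```go
package main

import "fmt"

// firstDuplicate (hypothetical) returns the first non-NULL key that
// appears more than once. A nil entry models SQL NULL and is skipped,
// on the assumption that NULLs never violate a unique constraint.
func firstDuplicate(keys []*int64) (int64, bool) {
	seen := make(map[int64]struct{}, len(keys))
	for _, k := range keys {
		if k == nil {
			continue // NULL is not compared against anything
		}
		if _, dup := seen[*k]; dup {
			return *k, true
		}
		seen[*k] = struct{}{}
	}
	return 0, false
}

// ptr is a small helper to build *int64 values for the example.
func ptr(v int64) *int64 { return &v }

func main() {
	// Two NULLs and one 50: no duplicate under the assumed rule.
	_, dup := firstDuplicate([]*int64{ptr(50), nil, nil})
	fmt.Println(dup)

	// Two 50s: this is a real duplicate.
	v, dup := firstDuplicate([]*int64{ptr(50), nil, ptr(50)})
	fmt.Println(v, dup)
}
```

If NULL were instead treated as an ordinary value, the first call would report a spurious duplicate, which is the kind of error that could surface as "Duplicate entry" and trigger a retry.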
The root cause still needs to be located!
Fixed, ready for verification. PR 11099 is unrelated to this issue and will be filed separately later.
repro: @m-schen, please continue to fix this one.
That's probably not the cause?
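This is where I actually located the hang. When it hangs, only the receiver is left waiting for data; all goroutines of the sending pipeline have already finished. Either controlling this with close missed something, or cancel was not called everywhere before cleaning up the pipeline to stop new data from being produced. I can't be fully sure yet. Using close for control is logically sound, so it feels like a problem in the concrete implementation. Needs checking.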
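The situation described here (senders have all exited, the receiver is still waiting) is what happens when a sender returns on some path without closing its channel. A minimal Go sketch of the close-based pattern follows; produce, drain, and the failAt parameter are hypothetical names used only for illustration, not matrixone code. Placing close in a defer means every exit path of the sender, including an early return, wakes the receiver.

```go
package main

import "fmt"

// produce (hypothetical) sends values on out and always closes out when it
// returns, including on the early-return path, so a receiver is never left
// waiting on a channel that no one will write to again.
func produce(values []int, failAt int, out chan<- int) {
	defer close(out)
	for i, v := range values {
		if i == failAt {
			return // simulated error: the sender stops early
		}
		out <- v
	}
}

// drain receives until the channel is closed and returns what it collected.
func drain(in <-chan int) []int {
	var got []int
	for v := range in {
		got = append(got, v)
	}
	return got
}

func main() {
	ch := make(chan int)
	go produce([]int{1, 2, 3, 4}, 2, ch)
	fmt.Println(drain(ch)) // [1 2]
}
```

If the close were only on the normal completion path, the early return at index 2 would leave drain blocked forever, with no sender goroutine alive.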
My mistake: this bug hasn't been reproduced since #11137, so this issue can be downgraded while we continue to observe it for a while.
Is there an existing issue for the same bug?
Environment
Actual Behavior
Cannot reproduce locally.
https://github.com/matrixorigin/matrixone/actions/runs/5667925051/job/15357619317?pr=10877
Failing case path: matrixone/matrixone/test/distributed/cases/dml/update/update_index.test [row:26]
https://github.com/matrixorigin/matrixone/actions/runs/5668123547/job/15358200827?pr=10881
Failing case path: matrixone/matrixone/test/distributed/cases/dml/delete/delete_index.test [row:117]
https://github.com/matrixorigin/matrixone/actions/runs/5688079541/job/15417430315?pr=10894 [row:312]
https://github.com/matrixorigin/matrixone/actions/runs/5717238289/job/15490561744?pr=10937 [row:156]
https://github.com/matrixorigin/matrixone/actions/runs/5725316420/job/15513581631?pr=10917
Expected Behavior
Steps to Reproduce
No response
Additional information
No response