[Bug]: pessimistic mode:pipeline hung #10882

Closed
1 task done
Ariznawlll opened this issue Jul 27, 2023 · 21 comments
Assignees
Labels
kind/bug Something isn't working severity/s1 High impact: Logical errors or data errors that must occur
Milestone

Comments

@Ariznawlll
Contributor

Ariznawlll commented Jul 27, 2023

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): bd7286d622d9be1b84a64822e2a143dc74c4e0f3
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Cannot reproduce locally.

https://github.com/matrixorigin/matrixone/actions/runs/5667925051/job/15357619317?pr=10877
Failing case path: matrixone/matrixone/test/distributed/cases/dml/update/update_index.test [row:26]

https://github.com/matrixorigin/matrixone/actions/runs/5668123547/job/15358200827?pr=10881
Failing case path: matrixone/matrixone/test/distributed/cases/dml/delete/delete_index.test [row:117]

https://github.com/matrixorigin/matrixone/actions/runs/5688079541/job/15417430315?pr=10894 [row:312]

https://github.com/matrixorigin/matrixone/actions/runs/5717238289/job/15490561744?pr=10937 [row:156]

https://github.com/matrixorigin/matrixone/actions/runs/5725316420/job/15513581631?pr=10917

Expected Behavior

[five images in the original issue]

Steps to Reproduce

No response

Additional information

No response

@Ariznawlll Ariznawlll added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Jul 27, 2023
@Ariznawlll Ariznawlll added this to the 1.0.0 milestone Jul 27, 2023
@Ariznawlll Ariznawlll mentioned this issue Jul 27, 2023
7 tasks
@zhangxu19830126
Contributor

zhangxu19830126 commented Jul 30, 2023

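The pessimistic-transaction BVT fails on CI with very high probability. Every run I checked hung in one of two places. After analysis, the main cause is that the ctx was canceled, so the receive from the channel can never get a value and the goroutine is stuck.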

  1. https://github.com/matrixorigin/matrixone/blob/main/pkg/sql/colexec/receiver_operator.go#L142
  2. https://github.com/matrixorigin/matrixone/blob/main/pkg/sql/colexec/receiver_operator.go#L82

@zhangxu19830126 zhangxu19830126 changed the title [Bug]: pessimistic mode:run bvt on ci wating for lock [Bug]: pessimistic mode:pipeline hung Jul 30, 2023
@matrix-meow matrix-meow added severity/s-1 and removed severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Jul 31, 2023
@m-schen
Contributor

m-schen commented Jul 31, 2023

I'm still working on other s-1 issues, so let qx take a first pass at this one. No need to change the owner for now.

@jensenojs
Contributor

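It's hard to say this is only a pipeline-hang problem; see https://github.com/jensenojs/matrixone/actions/runs/5711750669/job/15473912910

In this run the mo-tester side also reports "no return result in 12000 ms".

[image]

However, this insert failure did not cause a chain of failures in the subsequent cases.

At the moment the failing cases are concentrated in unique_index.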

@dongdongyang33
Contributor

dongdongyang33 commented Aug 1, 2023

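The pessimistic-transaction BVT fails on CI with very high probability. Every run I checked hung in one of two places. After analysis, the main cause is that the ctx was canceled, so the receive from the channel can never get a value and the goroutine is stuck.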

  1. https://github.com/matrixorigin/matrixone/blob/main/pkg/sql/colexec/receiver_operator.go#L142
  2. https://github.com/matrixorigin/matrixone/blob/main/pkg/sql/colexec/receiver_operator.go#L82

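The hang at (1), caused by missing the context cancellation, has been fixed by #10946. However, the PR carrying that fix still occasionally sees other cases fail, and those hang at (2).
A hang at (2) means the problem is not an uncaught context done: some child node of this pipeline really has not finished, so the pipeline may have a logical deadlock.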

@m-schen
Contributor

m-schen commented Aug 2, 2023

I'll look into (2).

@m-schen m-schen assigned m-schen and unassigned dongdongyang33 Aug 2, 2023
@sukki37
Contributor

sukki37 commented Aug 2, 2023

@m-schen
Contributor

m-schen commented Aug 3, 2023

The day isn't over yet, but there probably won't be any progress. Who knows... sad.

@sukki37
Contributor

sukki37 commented Aug 4, 2023

@m-schen
Contributor

m-schen commented Aug 4, 2023

Sigh. All I can say is that I'll try to pin down the problem tomorrow.

@Ariznawlll
Contributor Author

@Ariznawlll
Contributor Author

image https://github.com/matrixorigin/matrixone/actions/runs/5769081834/job/15640741121?pr=11047

@m-schen
Copy link
Contributor

m-schen commented Aug 7, 2023

2023/08/07 17:32:39.354906 +0800 ERROR lockop/lock_op.go:268 error: txn need retry in rc mode {"span": {"trace_id": "124985b8-3505-11ee-be66-46f6e41a5fe4", "kind": "statement"}}

2023/08/07 17:32:39.355233 +0800 ERROR types/tuple.go:709 error: internal error: unable to decode tuple element with unknown typecode 35

2023/08/07 17:32:39.355275 +0800 ERROR function/list_builtIn.go:4846 error: Duplicate entry '50' for key 'c' {"span": {"trace_id": "124985b8-3505-11ee-be66-46f6e41a5fe4", "kind": "statement"}}

2023/08/07 17:32:39.355318 +0800 ERROR frontend/util.go:552 Duplicate entry '50' for key 'c' {"session_info": "connectionId 116|127.0.0.1:53902|{account sys:dump:moadmin -- 0:1:0}|goRoutineId 48869|123923e4-3505-11ee-be66-46f6e41a5fe4", "session_id": "116", "statement_id": "\u0012I\ufffd\ufffd5\u0005\u0011\ufffd\ufffdfF\ufffd\ufffd\u001a_\ufffd"}

2023/08/07 17:32:39.355386 +0800 ERROR frontend/util.go:552 query trace status {"connection_id": 116, "statement": "Insert into test_11 values(50,50)", "status": "fail", "error": "Duplicate entry '50' for key 'c'", "span": {"trace_id": "124985b8-3505-11ee-be66-46f6e41a5fe4", "kind": "statement"}, "session_info": "connectionId 116|127.0.0.1:53902|{account sys:dump:moadmin -- 0:1:0}|goRoutineId 48869|123923e4-3505-11ee-be66-46f6e41a5fe4", "session_id": "116", "statement_id": "\u0012I\ufffd\ufffd5\u0005\u0011\ufffd\ufffdfF\ufffd\ufffd\u001a_\ufffd"}

2023/08/07 17:32:39.355630 +0800 ERROR compile/scope.go:139 [cms] start to receive pre, sql is , p is 0x14000f23998, len(pre) = 1, len(remote) = 0

The hang may be caused by a duplicate-key error being reported and the statement then being retried. I'm not very sure.

@m-schen
Contributor

m-schen commented Aug 8, 2023

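The problem was triggered by incorrect handling of null values in the unique-constraint duplicate check; this is fixed in the PR. I ran it 3 times and the problem did not appear.
However, I still haven't worked out what sequence of steps leads to the hang, so the PR will not be merged for now.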

@m-schen
Contributor

m-schen commented Aug 9, 2023

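This still needs investigation!
It doesn't seem closely related to yesterday's PR; that change just happened to make the problem less likely to reproduce.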

@m-schen
Contributor

m-schen commented Aug 10, 2023

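Fixed, ready for verification.
The cause of the hang:
When the pipeline is A -> B and B (the receiver) ends the pipeline, the cleanup of B keeps trying to receive data from A until it receives nil (the end marker).
Because A's context is inherited from B, it ends at the same time, so A does not necessarily send the nil data (in the select between case done and case send, the first case was taken). That causes the hang (it depends on goroutine scheduling, which is why it is intermittent).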

PR 11099 is unrelated to this; it will be submitted separately later.

@m-schen m-schen assigned Ariznawlll and unassigned m-schen Aug 10, 2023
@sukki37
Contributor

sukki37 commented Aug 10, 2023

repro:
https://github.com/matrixorigin/matrixone/actions/runs/5818682422/job/15775588094
image

@m-schen, please continue to fix this one.

@sukki37 sukki37 assigned m-schen and unassigned Ariznawlll Aug 10, 2023
@dongdongyang33
Contributor

dongdongyang33 commented Aug 10, 2023

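Fixed, ready for verification. The cause of the hang: when the pipeline is A -> B and B (the receiver) ends the pipeline, the cleanup of B keeps trying to receive data from A until it receives nil (the end marker). Because A's context is inherited from B, it ends at the same time, so A does not necessarily send the nil data (in the select between case done and case send, the first case was taken). That causes the hang (it depends on goroutine scheduling, which is why it is intermittent).

PR 11099 is unrelated to this; it will be submitted separately later.

That is probably not the cause?
The sender may indeed take the path that does not send nil (by design, this is the error-shutdown path), but in the end it always closes the channel. The cleanup can exit either on receiving nil or on finding the channel closed, so it should not hang.
However, it may produce many false reports of "children pipeline closed unexpectedly".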

@m-schen
Contributor

m-schen commented Aug 11, 2023

[image] This PR did not include the fix, which is why it hung.

@sukki37 sukki37 assigned sukki37 and unassigned m-schen Aug 11, 2023
@m-schen
Contributor

m-schen commented Aug 11, 2023

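Fixed, ready for verification. The cause of the hang: when the pipeline is A -> B and B (the receiver) ends the pipeline, the cleanup of B keeps trying to receive data from A until it receives nil (the end marker). Because A's context is inherited from B, it ends at the same time, so A does not necessarily send the nil data (in the select between case done and case send, the first case was taken). That causes the hang (it depends on goroutine scheduling, which is why it is intermittent).
PR 11099 is unrelated to this; it will be submitted separately later.

That is probably not the cause? The sender may indeed take the path that does not send nil (by design, this is the error-shutdown path), but in the end it always closes the channel. The cleanup can exit either on receiving nil or on finding the channel closed, so it should not hang. However, it may produce many false reports of "children pipeline closed unexpectedly".

This is where I actually located the hang. When it hung, only the receiver was still waiting for data; all goroutines of the sender's pipeline had already finished. Controlling this with close may have missed something, or it may be that cancel was not called everywhere before cleaning the pipeline to stop new data from being produced. I can't be fully sure yet. Using close for control is logically fine; I suspect the problem is in the concrete implementation. Need to check.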

@sukki37
Contributor

sukki37 commented Aug 11, 2023

repro: https://github.com/matrixorigin/matrixone/actions/runs/5818682422/job/15775588094 image

@m-schen, please continue to fix this one.

My mistake: this bug has not reproduced since #11137, so this issue can be downgraded and observed for a while longer.

@sukki37 sukki37 added severity/s1 High impact: Logical errors or data errors that must occur and removed severity/s-1 labels Aug 11, 2023
@sukki37 sukki37 closed this as completed Aug 17, 2023
7 participants