Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: bvt failed due to insert into t select * from t timeout(reported not return result in 120000 ms) #9447

Closed
1 task done
heni02 opened this issue May 15, 2023 · 35 comments
Assignees
Labels
kind/bug Something isn't working severity/s-1
Milestone

Comments

@heni02
Copy link
Contributor

heni02 commented May 15, 2023

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):nightly-90f59ec
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/4972764075/jobs/8898201825
image

2023-05-14 14:16:47 ERROR Executor:175 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:205][insert into t select * from t;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:177 - [EXPECT RESULT]:
null
2023-05-14 14:16:47 ERROR Executor:178 - [ACTUAL RESULT]:
MO does not return result in 120000 ms.
2023-05-14 14:16:47 WARN Executor:181 - The mo-tester tries to re-connect to mo, con[id=0, user=dump, *** db=delete], please wait.....
2023-05-14 14:16:47 INFO ConnectionManager:67 - New conneciton from mo with[user=dump, *** has been initialized.
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:206][delete from t where a > 1;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
null
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
no such table delete.t
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:207][select count() from t;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
count(
)
512
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
SQL parser error: table "t" does not exist
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:209][select count() from t;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
count(
)
2097152
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
SQL parser error: table "t" does not exist
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:211][insert into t select * from t;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
null
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
no such table delete.t
Previous DML conflicts with existing constraints or data format. This transaction has to be aborted
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:212][delete from t where a > 1;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
null
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
no such table delete.t
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:213][select count() from t;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
count(
)
512
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
SQL parser error: table "t" does not exist
2023-05-14 14:16:47 ERROR Executor:212 - [/runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/test/distributed/cases/dml/delete/delete.test][row:215][select count() from t;] was executed failed, con[id=0, user=dump, ***
2023-05-14 14:16:47 ERROR Executor:214 - [EXPECT RESULT]:
count(
)
512
2023-05-14 14:16:47 ERROR Executor:215 - [ACTUAL RESULT]:
SQL parser error: table "t" does not exist

mo log path: s3://heni-test/mo-data-push/etl_20230514143207

Expected Behavior

No response

Steps to Reproduce

No response

Additional information

No response

@heni02 heni02 added this to the V0.8.0 milestone May 15, 2023
@heni02
Copy link
Contributor Author

heni02 commented May 15, 2023

@JackTan25 plz confirm this issue

@LeftHandCold
Copy link
Contributor

This case is not delete timeout, but insert 4 million rows of data timeout

@heni02 heni02 changed the title [Bug]: eks bvt test: delete case error [Bug]: eks bvt test: delete.test run error May 15, 2023
@heni02 heni02 changed the title [Bug]: eks bvt test: delete.test run error [Bug]: eks bvt test: delete.test execute error May 15, 2023
@LeftHandCold
Copy link
Contributor

{"level":"ERROR","time":"2023/05/14 14:16:47.336467 +0000","caller":"engine/entire_engine.go:73","msg":"error: internal error: temporary engine not init yet","span":{"trace_id":"f61f15d8-f261-11ed-90e5-92c62a464ff4","kind":"statement"},"stacktrace":"github.com/matrixorigin/matrixone/pkg/vm/engine.(*EntireEngine).Database\n\t/go/src/github.com/matrixorigin/matrixone/pkg/vm/engine/entire_engine.go:73\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*TxnCompilerContext).getTmpRelation\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/compiler_context.go:219\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*TxnCompilerContext).getRelation\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/compiler_context.go:202\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*TxnCompilerContext).Resolve\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/compiler_context.go:263\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.setTableExprToDmlTableInfo\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/build_constraint_util.go:240\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.getDmlTableInfo\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/build_constraint_util.go:335\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.buildDelete\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/build_delete.go:27\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.BuildPlan\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/build.go:71\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.(*BaseOptimizer).Optimize\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/optimize.go:56\ngithub.com/matrixorigin/matrixone/pkg/frontend.buildPlan\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:1514\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*TxnComputationWrapper).Compile\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/computation_wrapper.go:198\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:2809\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3397\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine.go:194\ngithub.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler\n\t/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine_manager.go:432\ngithub.com/fagongzi/goetty/v2.(*server).doConnection\n\t/go/pkg/mod/github.com/matrixorigin/goetty/[email protected]/application.go:381\ngithub.com/fagongzi/goetty/v2.(*server).doStart.func2.1\n\t/go/pkg/mod/github.com/matrixorigin/goetty/[email protected]/application.go:344"}

@LeftHandCold
Copy link
Contributor

LeftHandCold commented May 15, 2023

After executing the insert into statement, CN cannot get any table relationship after two minutes

@LeftHandCold
Copy link
Contributor

@LeftHandCold
Copy link
Contributor

image

@JackTan25
Copy link
Contributor

in progress

@JackTan25
Copy link
Contributor

we need to wait pessimistic transaction to solve this. The hidden pk's conflict is slow. So sometimes it will result this over time occationally.

@JackTan25
Copy link
Contributor

I think we change the s3 size threshold to lift the system performace

@JackTan25
Copy link
Contributor

local insert is fast, eks environment is strange.

@sukki37 sukki37 changed the title [Bug]: eks bvt test: delete.test execute error [Bug]: eks bvt test: delete.test execute error(reported not return result in 120000 ms) May 19, 2023
@LeftHandCold
Copy link
Contributor

image

@triump2020
Copy link
Contributor

Exec.Run spend about 3 minutes before txn committing.

@sukki37 sukki37 changed the title [Bug]: eks bvt test: delete.test execute error(reported not return result in 120000 ms) [Bug]: bvt failed due to insert into t select * from t timeout(reported not return result in 120000 ms) May 22, 2023
@JackTan25
Copy link
Contributor

JackTan25 commented May 23, 2023

对比了昨天和今天的日志,基本上都存在这样的现象,就是跑insert的文件还没有结束,链接断开了,后面又开了新的链接去跑下面的语句,对于后面的t找不到原因是这样的,我写的测试文件本身是用的test这个数据库,然后第一次session挂了后,重新建立新的session,bvt执行默认的就是用的自己的database也就是文件名delete,
image
显示的是找不到delete.t,而不是test.t,所以这样就合理解释了

@JackTan25
Copy link
Contributor

JackTan25 commented May 23, 2023

1684815593116
日志可以看到,对于insert的执行过程当中有很多这种http的错误

@JackTan25
Copy link
Contributor

image
这是最开始的session

@JackTan25
Copy link
Contributor

我把对应的日志都截取到这里

@JackTan25
Copy link
Contributor

JackTan25 commented May 23, 2023

result.zip
这是这条语句执行时对应的日志

@fengttt
Copy link
Contributor

fengttt commented May 23, 2023

Related, #9646 and #9647

@nnsgmsone
Copy link
Contributor

@zhangxu19830126 旭哥先确定一下是dn没收到commit请求还是dn commit慢。

@zhangxu19830126
Copy link
Contributor

测试发现DN写log service超过rpc大小。先尝试修复这个后,在看。

@zhangxu19830126
Copy link
Contributor

@volgariver6 修复rpc的问题。

@zhangxu19830126
Copy link
Contributor

@volgariver6 修复 rpc 的问题在看

@zhangxu19830126
Copy link
Contributor

还没有找到问题

1 similar comment
@zhangxu19830126
Copy link
Contributor

还没有找到问题

@volgariver6
Copy link
Contributor

have fixed rpc message max size limit locally, but still have other bug. Need more investigation.

@volgariver6
Copy link
Contributor

对dn发送 message做了拆分之后,运行一些测试还有问题,因为logtail没有同步到cn导致事务执行失败:

2023/06/01 12:14:05.791414 +0800 ERROR disttae/logtailPushModel.go:138 error: transaction error: new txn failed. please retry {"span": {"trace_id": "97569c62-0032-11ee-aaec-f644820d18b1", "kind": "statement"}}

@cnutshell 正在帮忙分析

@volgariver6
Copy link
Contributor

修复了一些问题之后,还存在问题:
在事务commit的时候,dn 的 txn server 发送 response 失败:

2023/06/02 21:24:56.336178 +0800 ERROR dn-service morpc/server.go:361 write response failed {"uuid": "dd4dccb4-4d3c-41f8-b482-5251dc7a41bf", "name": "txn-server", "request-id": 68, "error": "write tcp4 127.0.0.1:22000->127.0.0.1:65435: use of closed network connection"}

这个问题还没找到原因

@volgariver6
Copy link
Contributor

delete 失败的问题

事务commit的时候,dn 的 txn server 发送 response 失败的原因找到了。原因是commit事务的时间很长,导致connection长时间没有数据发送,rpc框架中将其认定为idle connection,并close。
调整idle判定超时时间,又遇到了另外两个超时,一个是logtail消费时间长,导致事务执行失败;另一个是事务提交5分钟超时。
总之,根本原因是delete执行时间太长。

@XuPeng-SH @triump2020 沟通后,会对delete做一些优化看看效果。

该问题是 #9647

insert 超时

dn在执行insert时,并没有拉取s3数据并检查冲突的过程,因此理论上insert时间不会太长,该issue在本地也没有复现出来。
继续跟进该问题。

@volgariver6
Copy link
Contributor

insert 超时问题没有太大进展。

@volgariver6
Copy link
Contributor

no progress.

@volgariver6
Copy link
Contributor

还是没有办法重现

@volgariver6
Copy link
Contributor

在eks上跑了一下午,没有出现insert超时或者session挂掉的情况

@heni02
Copy link
Contributor Author

heni02 commented Jun 12, 2023

eks regression do not appear this issue , turn it off first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working severity/s-1
Projects
None yet
Development

No branches or pull requests