[Bug]: lots of txn context deadline exceeded error when running tpcc 10w test for about 2 hours #8042
Comments
I'll fix this in 0.8. |
I will deal with the memory issue first |
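For the context issue, let's test again after Peng Zheng (彭正) finishes the refactoring; if the problem is still there, we'll look into it then. |
@daviszhen please track this issue. |
@taofengliu please try to reproduce this. |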
I am not working on it |
I'll discuss the plan with Tao Feng (陶峰) tomorrow. |
working on it soon |
work on it soon |
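No progress. |
Found a way to reproduce it today. Reproduced it once. |
Identified the preliminary cause; there are two contributing factors. I'll post a detailed analysis tomorrow. |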
About the context deadline exceeded error reported in log-service task/task_scheduler.go: the cause is the 2-second timeout in the async scheduling framework, not a context timeout in the frontend. There are two kinds of context timeout.

Timeout case 1: the SQL that MO runs for the task query takes more than 2 seconds. A typical case is excerpted here.

1. The task query starts.
2023/05/09 15:55:01.945763 +0800 INFO log-service task/task_scheduler.go:125 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 1, "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:886ns farawayToDeadline:1.999988112s"}

2. The request is sent to MO.
// MO accepts the new connection

2.1 Execute use mo_task;
// send use mo_task
// MO receives use mo_task
// use mo_task succeeds.
2023/05/09 15:55:02.358917 +0800 INFO frontend/util.go:500 query trace status {"connection_id": 1099, "statement": "use mo_task", "status": "success", "span": {"trace_id": "cd7bf59a-ee3e-11ed-8f62-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1099 client 127.0.0.1:55132"}

2.2 Execute show tables;
// send show tables
// MO receives show tables
// succeeds; 1.08 seconds remain before the deadline
2023/05/09 15:55:02.863098 +0800 INFO log-service taskservice/mysql_task_storage.go:438 mysqlTaskStorage.Query getDB ok {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:917.31114ms farawayToDeadline:1.082677773s"}

3. Execute the select that queries the tasks.
2023/05/09 15:55:02.956391 +0800 INFO log-service taskservice/mysql_task_storage.go:469 mysqlTaskStorage.Query send {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:1.010597402s farawayToDeadline:989.391427ms", "deadline": "2023-05-09 15:55:03.945745 +0800 CST m=+324.288625093", "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id"}
// send the select query
[mysql] 2023/05/09 15:55:03 connection.go:499: QueryContext127.0.0.1:55132 127.0.0.1:6001select
// MO receives the select query
2023/05/09 15:55:03.132243 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1099, "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id", "session_info": "connectionId 1099 client 127.0.0.1:55132"}
// the deadline has already passed.
2023/05/09 15:55:03.945936 +0800 INFO log-service taskservice/mysql_task_storage.go:476 mysqlTaskStorage.Query get resp error {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "time": "835.672943ms", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:2.000116795s farawayToDeadline:-128.234µs", "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id", "error": "context deadline exceeded"}
// building the select query took 3.47 seconds
2023/05/09 15:55:06.612266 +0800 INFO frontend/util.go:515 time of Exec.Build : 3.47020101s connectionId 1099 client 127.0.0.1:55132
// sending the result to the client fails, because the client timed out and closed the connection.
2023/05/09 15:55:06.763881 +0800 ERROR frontend/util.go:510 query trace status {"connection_id": 1099, "statement": "select task_id, task_metadata_id, task_metadata_executor, task_metadata_context, task_metadata_option, task_parent_id, task_status, task_runner, task_epoch, last_heartbeat, result_code, error_msg, create_at, end_at from mo_task.sys_async_task where task_status = 0 order by task_id", "status": "fail", "error": "write tcp4 127.0.0.1:6001->127.0.0.1:55132: write: broken pipe", "span": {"trace_id": "ce180f0c-ee3e-11ed-8f64-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1099 client 127.0.0.1:55132"}
2023/05/09 15:55:10.506287 +0800 INFO log-service taskservice/task_service_holder.go:243 refreshableTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:8.56033546s farawayToDeadline:-6.560346595s"}
// the task query fails.
2023/05/09 15:55:10.506345 +0800 ERROR log-service task/task_scheduler.go:133 failed to query tasks {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:8.560397367s farawayToDeadline:-6.560408592s", "status": "Created", "error": "context deadline exceeded"} |
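Timeout case 2: establishing the connection + use mo_task + show tables exceeds the deadline. A typical case: 1. The task query starts. 2. The request is sent to MO. // MO receives use mo_task // MO receives show tables // show tables responds normally, but by then the deadline has already passed // the context has timed out |
In addition, the following case could not be reproduced. 2023/02/15 15:13:27.352729 +0800 ERROR frontend/session.go:1704 GetTxn. error:context deadline exceeded |
Labeling this issue as s-1 is not reasonable. |
To improve this, either MO's query performance has to improve, or the timeout threshold of the async task scheduler has to be increased. |
I experimented with changing the timeout to 10s. The context deadline exceeded error still occurs, but far less frequently. A typical case: 1. The task query starts. 2023/05/10 11:15:40.481549 +0800 INFO log-service task/task_scheduler.go:125 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 1, "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:1.218µs farawayToDeadline:9.999983462s"} 2. Connect to MO. 3. Send use mo_task. // MO receives use mo_task 4. Send show tables. 5. MO finishes executing use mo_task; it took 2s. // MO receives show tables // building the plan for show tables took 7.4s // show tables finishes; total time 7.5s // after use mo_task and show tables, 516ms remain before the 10s timeout // send the select query // the deadline has already passed. |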
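1. During the TPC-C run, MO's select execution time grows, which causes the 2s timeout. For the case analysis see: #8042 (comment) 2. Improving select execution efficiency takes time and is difficult in the short term, so the timeout of the task scheduling framework is first increased to 10s. Tests show a clear drop in the frequency of timeouts, but timeouts cannot be eliminated entirely. For the analysis of such cases see: #8042 (comment) 3. SQL execution data from about 30 minutes of the TPC-C run. Overall, SQL does not get steadily slower; it slows down during certain periods. For the data see: [eb_t.csv](https://github.com/matrixorigin/matrixone/files/11439569/eb_t.csv). The first column is the timestamp. The second column is the kind of measurement: build time (Exec.Build) or SQL query execution time (Exec.Run). The third column is the corresponding duration. Approved by: @zhangxu19830126, @w-zr, @nnsgmsone, @reusee
Increasing the timeout reduced the frequency of the error, but it does not fix the problem completely. |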
testing |
closed |
Is there an existing issue for the same bug?
Environment
Actual Behavior
error log:
2023/02/15 13:15:56.913432 +0800 ERROR log-service task/task_scheduler.go:118 failed to query tasks {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "status": "Running", "error": "context deadline exceeded"}
github.com/matrixorigin/matrixone/pkg/hakeeper/task.(*scheduler).queryTasks
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/hakeeper/task/task_scheduler.go:118
github.com/matrixorigin/matrixone/pkg/hakeeper/task.(*scheduler).Schedule
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/hakeeper/task/task_scheduler.go:53
github.com/matrixorigin/matrixone/pkg/logservice.(*store).taskSchedule
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store_hakeeper_check.go:199
github.com/matrixorigin/matrixone/pkg/logservice.(*store).hakeeperCheck
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store_hakeeper_check.go:143
github.com/matrixorigin/matrixone/pkg/logservice.(*store).ticker
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store.go:671
github.com/matrixorigin/matrixone/pkg/logservice.(*store).startHAKeeperReplica.func1
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store.go:224
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/stopper/stopper.go:259
2023/02/15 13:15:56.913491 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1002, "query": "deallocate prepare __mo_stmt_id_64", "session_info": "connectionId 1002"}
2023/02/15 15:13:27.352729 +0800 ERROR frontend/session.go:1704 GetTxn. error:context deadline exceeded
github.com/matrixorigin/matrixone/pkg/frontend.(*TxnHandler).GetTxn
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/session.go:1704
github.com/matrixorigin/matrixone/pkg/frontend.doUse
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:1041
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).handleChangeDB
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:1059
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:3583
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:4375
github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/routine.go:175
github.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/routine_manager.go:320
github.com/fagongzi/goetty/v2.(*server).doConnection
/home/go/pkg/mod/github.com/fagongzi/goetty/[email protected]/application.go:381
Expected Behavior
No response
Steps to Reproduce
No response
Additional information
No response