Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: lots of txn context deadline exceeded error when running tpcc 10w test for about 2 hours #8042

Closed
1 task done
aressu1985 opened this issue Feb 15, 2023 · 30 comments
Closed
1 task done
Assignees
Labels
kind/bug Something isn't working needs-triage severity/s1 High impact: Logical errors or data errors that must occur size/S Denotes a PR that changes [10,99] lines
Milestone

Comments

@aressu1985
Copy link
Contributor

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):6ca6dc7bed252a2ef72d97f57e37342fccfeabf3
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

error log:

2023/02/15 13:15:56.913432 +0800 ERROR log-service task/task_scheduler.go:118 failed to query tasks {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "status": "Running", "error": "context deadline exceeded"}
github.com/matrixorigin/matrixone/pkg/hakeeper/task.(*scheduler).queryTasks
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/hakeeper/task/task_scheduler.go:118
github.com/matrixorigin/matrixone/pkg/hakeeper/task.(*scheduler).Schedule
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/hakeeper/task/task_scheduler.go:53
github.com/matrixorigin/matrixone/pkg/logservice.(*store).taskSchedule
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store_hakeeper_check.go:199
github.com/matrixorigin/matrixone/pkg/logservice.(*store).hakeeperCheck
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store_hakeeper_check.go:143
github.com/matrixorigin/matrixone/pkg/logservice.(*store).ticker
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store.go:671
github.com/matrixorigin/matrixone/pkg/logservice.(*store).startHAKeeperReplica.func1
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/logservice/store.go:224
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/stopper/stopper.go:259
2023/02/15 13:15:56.913491 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1002, "query": "deallocate prepare __mo_stmt_id_64", "session_info": "connectionId 1002"}

2023/02/15 15:13:27.352729 +0800 ERROR frontend/session.go:1704 GetTxn. error:context deadline exceeded
github.com/matrixorigin/matrixone/pkg/frontend.(*TxnHandler).GetTxn
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/session.go:1704
github.com/matrixorigin/matrixone/pkg/frontend.doUse
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:1041
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).handleChangeDB
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:1059
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:3583
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:4375
github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/routine.go:175
github.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/routine_manager.go:320
github.com/fagongzi/goetty/v2.(*server).doConnection
/home/go/pkg/mod/github.com/fagongzi/goetty/[email protected]/application.go:381

Expected Behavior

No response

Steps to Reproduce

No response

Additional information

No response

@aressu1985 aressu1985 added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Feb 15, 2023
@aressu1985 aressu1985 added this to the V0.8.0 milestone Feb 15, 2023
@matrix-meow matrix-meow modified the milestones: V0.8.0, v0.7.0 Feb 15, 2023
@aressu1985 aressu1985 modified the milestones: v0.7.0, V0.8.0 Feb 16, 2023
@nnsgmsone
Copy link
Contributor

I'll be at 0.8 fix.

@aressu1985
Copy link
Contributor Author

#8327

@nnsgmsone
Copy link
Contributor

I will deal with the memory issue first

@nnsgmsone
Copy link
Contributor

context的问题等彭正重构后测试,如果还有这个问题再看

@aressu1985 aressu1985 assigned daviszhen and unassigned nnsgmsone Mar 23, 2023
@aressu1985
Copy link
Contributor Author

@daviszhen pls tracking this issue

@daviszhen
Copy link
Contributor

@taofengliu 复现一下看看

@daviszhen
Copy link
Contributor

I am not working on it

2 similar comments
@daviszhen
Copy link
Contributor

I am not working on it

@daviszhen
Copy link
Contributor

I am not working on it

@daviszhen
Copy link
Contributor

明天会跟陶峰讨论下计划,

@daviszhen daviszhen assigned taofengliu and unassigned daviszhen and taofengliu Apr 3, 2023
@taofengliu
Copy link
Contributor

working on it soon

@taofengliu
Copy link
Contributor

work on it soon

1 similar comment
@taofengliu
Copy link
Contributor

work on it soon

@taofengliu
Copy link
Contributor

work on it soon

1 similar comment
@taofengliu
Copy link
Contributor

work on it soon

@fengttt fengttt added size/S Denotes a PR that changes [10,99] lines severity/s-1 and removed severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels May 4, 2023
@daviszhen daviszhen assigned daviszhen and unassigned taofengliu May 5, 2023
@daviszhen
Copy link
Contributor

没进展

@daviszhen
Copy link
Contributor

今天找到复现方法。复现了一次。

@daviszhen
Copy link
Contributor

明确了初步原因,是有两方面的原因。明天给出详细的分析。

@daviszhen
Copy link
Contributor

daviszhen commented May 9, 2023

log-service task/task_scheduler.go中报错的context deadline exceeded 问题。

原因异步调度框架2秒超时。不是前端的context超时。

有2种context超时情况。

第一种超时情况:mo执行查询任务的sql超过2秒。

这里截取典型的case。

1,查询任务开始

2023/05/09 15:55:01.945763 +0800 INFO log-service task/task_scheduler.go:125 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 1, "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:886ns farawayToDeadline:1.999988112s"}
2023/05/09 15:55:01.945799 +0800 INFO log-service taskservice/task_service_holder.go:241 refreshableTaskStorage.Query begin {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:38.229µs farawayToDeadline:1.999950605s"}
2023/05/09 15:55:01.945834 +0800 INFO log-service taskservice/mysql_task_storage.go:429 mysqlTaskStorage.Query begin {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:73.537µs farawayToDeadline:1.999915298s"}

2,发请求给mo。

// mo接受新连接
2023/05/09 15:55:01.952036 +0800 INFO frontend/util.go:515 check password succeeded connectionId 1099 client 127.0.0.1:55132
// 发送ping
[mysql] 2023/05/09 15:55:01 connection.go:459: ping127.0.0.1:55132 127.0.0.1:6001

2.1 执行use mo_task;

//发送use mo_task
[mysql] 2023/05/09 15:55:02 connection.go:519: ExecContext127.0.0.1:55132 127.0.0.1:6001use mo_task
[mysql] 2023/05/09 15:55:02 connection.go:298: Exec127.0.0.1:55132 127.0.0.1:6001use mo_task

//mo接收use mo_task
2023/05/09 15:55:02.118249 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1099, "query": "use mo_task", "session_info": "connectionId 1099 client 127.0.0.1:55132"}

// use mo_task执行成功。

2023/05/09 15:55:02.358917 +0800 INFO frontend/util.go:500 query trace status {"connection_id": 1099, "statement": "use mo_task", "status": "success", "span": {"trace_id": "cd7bf59a-ee3e-11ed-8f62-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1099 client 127.0.0.1:55132"}

2.2 执行show tables;

//发送 show tables
[mysql] 2023/05/09 15:55:02 connection.go:499: QueryContext127.0.0.1:55132 127.0.0.1:6001show tables

//mo接收show tables
2023/05/09 15:55:02.382682 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1099, "query": "show tables", "session_info": "connectionId 1099 client 127.0.0.1:55132"}
2023/05/09 15:55:02.639175 +0800 INFO frontend/util.go:515 time of Exec.Build : 95.883049ms connectionId 1099 client 127.0.0.1:55132
2023/05/09 15:55:02.769169 +0800 INFO frontend/util.go:515 time of Exec.Run : 12.683438ms connectionId 1099 client 127.0.0.1:55132

// 执行成功,距离deadline还有1.08秒

2023/05/09 15:55:02.863098 +0800 INFO log-service taskservice/mysql_task_storage.go:438 mysqlTaskStorage.Query getDB ok {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:917.31114ms farawayToDeadline:1.082677773s"}
2023/05/09 15:55:02.863121 +0800 INFO frontend/util.go:500 query trace status {"connection_id": 1099, "statement": "show tables", "status": "success", "span": {"trace_id": "cdbcb4d6-ee3e-11ed-8f64-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1099 client 127.0.0.1:55132"}
2023/05/09 15:55:02.870806 +0800 INFO log-service taskservice/mysql_task_storage.go:448 mysqlTaskStorage.Query Conn ok {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:925.018865ms farawayToDeadline:1.074970076s"}

3,执行select 查询task

2023/05/09 15:55:02.956391 +0800 INFO log-service taskservice/mysql_task_storage.go:469 mysqlTaskStorage.Query send {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:1.010597402s farawayToDeadline:989.391427ms", "deadline": "2023-05-09 15:55:03.945745 +0800 CST m=+324.288625093", "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id"}

// 发送查询select

[mysql] 2023/05/09 15:55:03 connection.go:499: QueryContext127.0.0.1:55132 127.0.0.1:6001select

// mo接收查询select

2023/05/09 15:55:03.132243 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1099, "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id", "session_info": "connectionId 1099 client 127.0.0.1:55132"}

// 已经超时。

2023/05/09 15:55:03.945936 +0800 INFO log-service taskservice/mysql_task_storage.go:476 mysqlTaskStorage.Query get resp error {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "time": "835.672943ms", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:2.000116795s farawayToDeadline:-128.234µs", "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id", "error": "context deadline exceeded"}
2023/05/09 15:55:03.966193 +0800 INFO log-service taskservice/mysql_task_storage.go:431 mysqlTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:2.020384097s farawayToDeadline:-20.395188ms"}

// 查询select的构建时间有3.47秒

2023/05/09 15:55:06.612266 +0800 INFO frontend/util.go:515 time of Exec.Build : 3.47020101s connectionId 1099 client 127.0.0.1:55132

// 发送给client失败。因为client超时并把连接关了。

2023/05/09 15:55:06.763881 +0800 ERROR frontend/util.go:510 query trace status {"connection_id": 1099, "statement": "select task_id, task_metadata_id, task_metadata_executor, task_metadata_context, task_metadata_option, task_parent_id, task_status, task_runner, task_epoch, last_heartbeat, result_code, error_msg, create_at, end_at from mo_task.sys_async_task where task_status = 0 order by task_id", "status": "fail", "error": "write tcp4 127.0.0.1:6001->127.0.0.1:55132: write: broken pipe", "span": {"trace_id": "ce180f0c-ee3e-11ed-8f64-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1099 client 127.0.0.1:55132"}

2023/05/09 15:55:10.506287 +0800 INFO log-service taskservice/task_service_holder.go:243 refreshableTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:8.56033546s farawayToDeadline:-6.560346595s"}
2023/05/09 15:55:10.506316 +0800 INFO log-service task/task_scheduler.go:129 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 2, "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:8.560368847s farawayToDeadline:-6.560379975s"}

// 查询任务失败。

2023/05/09 15:55:10.506345 +0800 ERROR log-service task/task_scheduler.go:133 failed to query tasks {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: 2b21f82f-a44e-4cea-a84b-eb31e19bf53a, Start: 2023-05-09 15:55:01.945756 +0800 CST m=+322.288636012 fromStart:8.560397367s farawayToDeadline:-6.560408592s", "status": "Created", "error": "context deadline exceeded"}

@daviszhen
Copy link
Contributor

第二种超时情况:建立连接+use mo_task+show tables 超时

这里给出典型的case。

1,查询任务开始
2023/05/09 15:56:01.705104 +0800 INFO log-service task/task_scheduler.go:125 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 1, "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:987ns farawayToDeadline:1.999979905s"}
2023/05/09 15:56:01.705137 +0800 INFO log-service taskservice/task_service_holder.go:241 refreshableTaskStorage.Query begin {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:35.438µs farawayToDeadline:1.999945388s"}
2023/05/09 15:56:01.705169 +0800 INFO log-service taskservice/mysql_task_storage.go:429 mysqlTaskStorage.Query begin {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:67.404µs farawayToDeadline:1.999913402s"}

2,发请求给mo。
[mysql] 2023/05/09 15:56:01 connection.go:459: ping127.0.0.1:55381 127.0.0.1:6001
[mysql] 2023/05/09 15:56:01 connection.go:519: ExecContext127.0.0.1:55381 127.0.0.1:6001use mo_task
[mysql] 2023/05/09 15:56:01 connection.go:298: Exec127.0.0.1:55381 127.0.0.1:6001use mo_task

//mo收到use mo_task
2023/05/09 15:56:01.734215 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1113, "query": "use mo_task", "session_info": "connectionId 1113 client 127.0.0.1:55381"}
[mysql] 2023/05/09 15:56:02 connection.go:499: QueryContext127.0.0.1:55381 127.0.0.1:6001show tables
2023/05/09 15:56:02.578054 +0800 INFO frontend/util.go:500 query trace status {"connection_id": 1113, "statement": "use mo_task", "status": "success", "span": {"trace_id": "f1048cde-ee3e-11ed-8fe6-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1113 client 127.0.0.1:55381"}

//mo接收show tables
2023/05/09 15:56:02.658053 +0800 INFO frontend/util.go:500 query trace {"connection_id": 1113, "query": "show tables", "session_info": "connectionId 1113 client 127.0.0.1:55381"}
2023/05/09 15:56:03.633887 +0800 INFO frontend/util.go:515 time of Exec.Build : 975.755577ms connectionId 1113 client 127.0.0.1:55381
2023/05/09 15:56:03.880185 +0800 INFO frontend/util.go:515 time of Exec.Run : 35.942242ms connectionId 1113 client 127.0.0.1:55381
2023/05/09 15:56:03.961038 +0800 INFO frontend/util.go:500 query trace status {"connection_id": 1113, "statement": "show tables", "status": "success", "span": {"trace_id": "f1918012-ee3e-11ed-8fe7-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 1113 client 127.0.0.1:55381"}

//show tables正常响应。但此时已经超过deadline
2023/05/09 15:56:03.961611 +0800 INFO log-service taskservice/mysql_task_storage.go:438 mysqlTaskStorage.Query getDB ok {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:2.256450036s farawayToDeadline:-256.469221ms"}

//context已经超时
2023/05/09 15:56:04.098665 +0800 INFO log-service taskservice/mysql_task_storage.go:445 mysqlTaskStorage.Query Conn error {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:2.393475443s farawayToDeadline:-393.494877ms", "error": "context deadline exceeded"}
2023/05/09 15:56:04.137924 +0800 INFO log-service taskservice/mysql_task_storage.go:431 mysqlTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:2.432763892s farawayToDeadline:-432.783081ms"}
2023/05/09 15:56:05.142754 +0800 INFO log-service taskservice/task_service_holder.go:243 refreshableTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:3.437574548s farawayToDeadline:-1.437593875s"}
2023/05/09 15:56:05.142787 +0800 INFO log-service task/task_scheduler.go:129 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 2, "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:3.437608965s farawayToDeadline:-1.43762823s"}
2023/05/09 15:56:05.142822 +0800 ERROR log-service task/task_scheduler.go:133 failed to query tasks {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: d1da973c-60a1-4ba9-ab13-f07ea4e44436, Start: 2023-05-09 15:56:01.705097 +0800 CST m=+382.046677302 fromStart:3.437642934s farawayToDeadline:-1.437662276s", "status": "Created", "error": "context deadline exceeded"}

@daviszhen
Copy link
Contributor

此外,下面这种情况,没有复现出来。

2023/02/15 15:13:27.352729 +0800 ERROR frontend/session.go:1704 GetTxn. error:context deadline exceeded
github.com/matrixorigin/matrixone/pkg/frontend.(*TxnHandler).GetTxn
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/session.go:1704
github.com/matrixorigin/matrixone/pkg/frontend.doUse
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:1041
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).handleChangeDB
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:1059
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:3583
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/mysql_cmd_executor.go:4375
github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/routine.go:175
github.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/frontend/routine_manager.go:320
github.com/fagongzi/goetty/v2.(*server).doConnection
/home/go/pkg/mod/github.com/fagongzi/goetty/[email protected]/application.go:381

@daviszhen
Copy link
Contributor

这个问题指定s-1不合理。

@daviszhen
Copy link
Contributor

改进方法,要么mo查询性能提升,要么增大异步调度任务的超时阈值。

@daviszhen
Copy link
Contributor

daviszhen commented May 10, 2023

我实验了将超时时间改为10s,依然会有超时错误context deadline exceeded。但是超时的频率大幅下降。

典型的case:

1,查询任务开始

2023/05/10 11:15:40.481549 +0800 INFO log-service task/task_scheduler.go:125 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 1, "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:1.218µs farawayToDeadline:9.999983462s"}
2023/05/10 11:15:40.481595 +0800 INFO log-service taskservice/task_service_holder.go:241 refreshableTaskStorage.Query begin {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:48.866µs farawayToDeadline:9.999935745s"}
2023/05/10 11:15:40.481675 +0800 INFO log-service taskservice/mysql_task_storage.go:434 mysqlTaskStorage.Query begin {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:123.305µs farawayToDeadline:9.999861222s"}

2,连接mo
[mysql] 2023/05/10 11:15:40 connection.go:459: ping127.0.0.1:53120 127.0.0.1:6001

3,发送use_task
[mysql] 2023/05/10 11:15:40 connection.go:519: ExecContext127.0.0.1:53120 127.0.0.1:6001use mo_task
[mysql] 2023/05/10 11:15:40 connection.go:298: Exec127.0.0.1:53120 127.0.0.1:6001use mo_task

//mo 收到use mo_task
2023/05/10 11:15:40.495404 +0800 INFO frontend/util.go:403 query trace {"connection_id": 70, "query": "use mo_task", "session_info": "connectionId 70|127.0.0.1:53120|{account sys:1b828b80-f7ec-4479-a72b-acb90dc4583b:moadmin -- 0:4294967295:0}|e4aeb822-eee0-11ed-88c8-aa665a28570d"}

4,发送show tables
[mysql] 2023/05/10 11:15:42 connection.go:499: QueryContext127.0.0.1:53120 127.0.0.1:6001show tables

5,mo use_task 执行完成。耗时2s
2023/05/10 11:15:42.369732 +0800 INFO frontend/util.go:403 query trace status {"connection_id": 70, "statement": "use mo_task", "status": "success", "span": {"trace_id": "f130f114-eee0-11ed-88d4-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 70|127.0.0.1:53120|{account sys:1b828b80-f7ec-4479-a72b-acb90dc4583b:moadmin -- 0:4294967295:0}|e4aeb822-eee0-11ed-88c8-aa665a28570d"}

// mo 收到show tables
2023/05/10 11:15:42.409551 +0800 INFO frontend/util.go:403 query trace {"connection_id": 70, "query": "show tables", "session_info": "connectionId 70|127.0.0.1:53120|{account sys:1b828b80-f7ec-4479-a72b-acb90dc4583b:moadmin -- 0:4294967295:0}|e4aeb822-eee0-11ed-88c8-aa665a28570d"}

//show tables 构建计划耗时7.4s
2023/05/10 11:15:49.909764 +0800 INFO frontend/util.go:418 time of Exec.Build : 7.440988448s connectionId 70|127.0.0.1:53120|{account sys:1b828b80-f7ec-4479-a72b-acb90dc4583b:moadmin -- 0:4294967295:0}|e4aeb822-eee0-11ed-88c8-aa665a28570d

//show tables执行完成。总耗时7.5s
2023/05/10 11:15:49.965462 +0800 INFO frontend/util.go:403 query trace status {"connection_id": 70, "statement": "show tables", "status": "success", "span": {"trace_id": "f25e014e-eee0-11ed-88d5-aa665a28570d", "kind": "statement"}, "session_info": "connectionId 70|127.0.0.1:53120|{account sys:1b828b80-f7ec-4479-a72b-acb90dc4583b:moadmin -- 0:4294967295:0}|e4aeb822-eee0-11ed-88c8-aa665a28570d"}

//use mo_task; show tables 距离10s超时时间还有516ms。
2023/05/10 11:15:49.965569 +0800 INFO log-service taskservice/mysql_task_storage.go:443 mysqlTaskStorage.Query getDB ok {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:9.483726108s farawayToDeadline:516.258155ms"}
2023/05/10 11:15:49.967655 +0800 INFO log-service taskservice/mysql_task_storage.go:453 mysqlTaskStorage.Query Conn ok {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:9.485817603s farawayToDeadline:514.16685ms"}

//发送查询select
2023/05/10 11:15:49.967726 +0800 INFO log-service taskservice/mysql_task_storage.go:474 mysqlTaskStorage.Query send {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:9.485891218s farawayToDeadline:514.093249ms", "deadline": "2023-05-10 11:15:50.481526 +0800 CST m=+212.480818234", "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id"}

//已经超时。
2023/05/10 11:15:50.482152 +0800 INFO log-service taskservice/mysql_task_storage.go:481 mysqlTaskStorage.Query get resp error {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "time": "514.302526ms", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:10.000293423s farawayToDeadline:-309.311µs", "query": "select \n \t\t\t\t\t\ttask_id,\n\t\t\t\t\t\t\ttask_metadata_id,\n\t\t\t\t\t\t\ttask_metadata_executor,\n\t\t\t\t\t\t\ttask_metadata_context,\n\t\t\t\t\t\t\ttask_metadata_option,\n\t\t\t\t\t\t\ttask_parent_id,\n\t\t\t\t\t\t\ttask_status,\n\t\t\t\t\t\t\ttask_runner,\n\t\t\t\t\t\t\ttask_epoch,\n\t\t\t\t\t\t\tlast_heartbeat,\n\t\t\t\t\t\t\tresult_code,\n\t\t\t\t\t\t\terror_msg,\n\t\t\t\t\t\t\tcreate_at,\n\t\t\t\t\t\t\tend_at \n\t\t\t\t\t\tfrom mo_task.sys_async_task where task_status=0 order by task_id", "error": "context deadline exceeded"}
2023/05/10 11:15:50.491168 +0800 INFO log-service taskservice/mysql_task_storage.go:436 mysqlTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:10.009309683s farawayToDeadline:-9.325217ms"}
2023/05/10 11:15:51.853399 +0800 INFO log-service taskservice/task_service_holder.go:243 refreshableTaskStorage.Query end {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:11.371501927s farawayToDeadline:-1.371517737s"}
2023/05/10 11:15:51.875096 +0800 INFO log-service task/task_scheduler.go:129 ts.QueryTask {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "scheduler.queryTasks": 2, "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:11.39320107s farawayToDeadline:-1.393216525s"}
2023/05/10 11:15:51.916502 +0800 ERROR log-service task/task_scheduler.go:133 failed to query tasks {"uuid": "7c4dccb4-4d3c-41f8-b482-5251dc7a41bf", "ctxinfo": "from: scheduler, fromUuid: e05f826c-ee58-46ca-b728-251f6f4d6b00, Start: 2023-05-10 11:15:40.481541 +0800 CST m=+202.480833412 fromStart:11.434599542s farawayToDeadline:-1.43461503s", "status": "Created", "error": "context deadline exceeded"}

@daviszhen daviszhen mentioned this issue May 10, 2023
7 tasks
@domingozhang domingozhang added severity/s0 Extreme impact: Cause the application to break down and seriously affect the use severity/s1 High impact: Logical errors or data errors that must occur and removed severity/s-1 severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels May 10, 2023
mergify bot pushed a commit that referenced this issue May 10, 2023
1,在tpcc执行过程中,mo 执行select时间变长,导致2s超时。具体案例分析见:#8042 (comment)

2,select执行效率的提升需要时间。短时间内比较困难。
因此先增大任务调度框架的超时时间到10s。测试显示超时的频率降低明显。
但是无法杜绝超时情况的出现。这类案例分析见:#8042 (comment)

3,tpcc 过程中30 分钟左右的sql执行数据。整体看sql不会越来越慢。而是有些时间段内会变慢。数据见:[eb_t.csv](https://github.com/matrixorigin/matrixone/files/11439569/eb_t.csv)。第一列是时间戳。第二列是构建时间(Exec.Build),sql 查询执行时间(Exec.Run)。第三列是对应的时间。

Approved by: @zhangxu19830126, @w-zr, @nnsgmsone, @reusee
@daviszhen
Copy link
Contributor

通过增加了超时时间。降低了出现频率。但是改不彻底。
主要是查询太慢了。

@daviszhen daviszhen assigned aressu1985 and unassigned daviszhen May 19, 2023
@aressu1985
Copy link
Contributor Author

testing

daviszhen added a commit to daviszhen/matrixone that referenced this issue May 22, 2023
1,在tpcc执行过程中,mo 执行select时间变长,导致2s超时。具体案例分析见:matrixorigin#8042 (comment)

2,select执行效率的提升需要时间。短时间内比较困难。
因此先增大任务调度框架的超时时间到10s。测试显示超时的频率降低明显。
但是无法杜绝超时情况的出现。这类案例分析见:matrixorigin#8042 (comment)

3,tpcc 过程中30 分钟左右的sql执行数据。整体看sql不会越来越慢。而是有些时间段内会变慢。数据见:[eb_t.csv](https://github.com/matrixorigin/matrixone/files/11439569/eb_t.csv)。第一列是时间戳。第二列是构建时间(Exec.Build),sql 查询执行时间(Exec.Run)。第三列是对应的时间。

Approved by: @zhangxu19830126, @w-zr, @nnsgmsone, @reusee
@aressu1985
Copy link
Contributor Author

closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working needs-triage severity/s1 High impact: Logical errors or data errors that must occur size/S Denotes a PR that changes [10,99] lines
Projects
None yet
Development

No branches or pull requests

7 participants