Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: serveral tpcc query timeout during stability test on distributed mode #18723

Closed
1 task done
aressu1985 opened this issue Sep 12, 2024 · 20 comments
Closed
1 task done
Assignees
Labels
kind/bug Something isn't working severity/s0 Extreme impact: Cause the application to break down and seriously affect the use to-next-release
Milestone

Comments

@aressu1985
Copy link
Contributor

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

main

Commit ID

c4b2445

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

During stability test, there were several queries timeout in 60s.

2024-09-11 21:22:25 ERROR jTPCCConnection:327 - The connection[652] has not been valid.
2024-09-11 21:22:25 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[652] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,208 milliseconds ago. The last packet sent successfully to the server was 60,251 milliseconds ago.
2024-09-12 02:51:51 ERROR jTPCCConnection:327 - The connection[648] has not been valid.
2024-09-12 02:51:51 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[648] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,508 milliseconds ago. The last packet sent successfully to the server was 60,508 milliseconds ago.
2024-09-12 06:08:36 ERROR jTPCCConnection:327 - The connection[654] has not been valid.
2024-09-12 06:08:36 FATAL jTPCCTerminal:214 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[654] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,095 milliseconds ago. The last packet sent successfully to the server was 60,165 milliseconds ago.
2024-09-12 06:32:34 ERROR jTPCCConnection:327 - The connection[650] has not been valid.
2024-09-12 06:32:34 FATAL jTPCCTerminal:214 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[650] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,067 milliseconds ago. The last packet sent successfully to the server was 60,115 milliseconds ago.
2024-09-12 06:50:39 ERROR jTPCCConnection:327 - The connection[72304936] has not been valid.
2024-09-12 06:50:39 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[72304936] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,096 milliseconds ago. The last packet sent successfully to the server was 60,096 milliseconds ago.
2024-09-12 07:29:47 ERROR jTPCCConnection:327 - The connection[64795267] has not been valid.
2024-09-12 07:29:47 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[64795267] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,061 milliseconds ago. The last packet sent successfully to the server was 60,061 milliseconds ago.
2024-09-12 07:46:22 ERROR jTPCCConnection:327 - The connection[653] has not been valid.
2024-09-12 07:46:22 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[653] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,061 milliseconds ago. The last packet sent successfully to the server was 60,061 milliseconds ago.
2024-09-12 08:55:16 ERROR jTPCCConnection:327 - The connection[69671902] has not been valid.
2024-09-12 08:55:16 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[69671902] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,063 milliseconds ago. The last packet sent successfully to the server was 60,063 milliseconds ago.
2024-09-12 10:01:52 ERROR jTPCCConnection:327 - The connection[72035695] has not been valid.
2024-09-12 10:01:52 FATAL jTPCCTerminal:214 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[72035695] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,061 milliseconds ago. The last packet sent successfully to the server was 60,061 milliseconds ago.
2024-09-12 10:01:55 ERROR jTPCCConnection:327 - The connection[649] has not been valid.
2024-09-12 10:01:55 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[649] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,060 milliseconds ago. The last packet sent successfully to the server was 60,060 milliseconds ago.
2024-09-12 10:21:32 ERROR jTPCCConnection:327 - The connection[582] has not been valid.
2024-09-12 10:21:32 FATAL jTPCCTerminal:237 - [UNEXPECTED][STOCK_LEVEL][CONNECTION] The connection[582] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,060 milliseconds ago. The last packet sent successfully to the server was 60,060 milliseconds ago.
2024-09-12 10:21:44 ERROR jTPCCConnection:327 - The connection[651] has not been valid.
2024-09-12 10:21:44 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[651] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,061 milliseconds ago. The last packet sent successfully to the server was 60,952 milliseconds ago.
2024-09-12 10:25:48 ERROR jTPCCConnection:327 - The connection[74482530] has not been valid.
2024-09-12 10:25:48 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[74482530] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,039 milliseconds ago. The last packet sent successfully to the server was 60,039 milliseconds ago.
2024-09-12 10:27:22 ERROR jTPCCConnection:327 - The connection[74664736] has not been valid.
2024-09-12 10:27:22 FATAL jTPCCTerminal:237 - [UNEXPECTED][STOCK_LEVEL][CONNECTION] The connection[74664736] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,061 milliseconds ago. The last packet sent successfully to the server was 60,061 milliseconds ago.
2024-09-12 10:27:23 ERROR jTPCCConnection:327 - The connection[581] has not been valid.
2024-09-12 10:27:23 FATAL jTPCCTerminal:237 - [UNEXPECTED][STOCK_LEVEL][CONNECTION] The connection[581] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,059 milliseconds ago. The last packet sent successfully to the server was 60,059 milliseconds ago.
2024-09-12 10:27:23 ERROR jTPCCConnection:327 - The connection[72503991] has not been valid.
2024-09-12 10:27:23 FATAL jTPCCTerminal:328 - [UNEXPECTED][TT_NEW_ORDER][CONNECTION] The connection[72503991] has not been valid, caused by: Communications link failure

The last packet successfully received from the server was 60,062 milliseconds ago. The last packet sent successfully to the server was 60,126 milliseconds ago.
2024-09-12 10:27:32 ERROR ConsistencyCheck:152 - Communications link failure

There were no 'long running ' or 'leak' in mo log, and the queries were only executed very slowly for these times.

statement_info:
timeout.txt

mo-log;
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22GYP%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-c4b2445-202409102229%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221726060945000%22,%22to%22:%221726108052000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. run a mo cluster with config in this issue
2. run tpch 10G loop test processes in one independant tenant
3. run tpcc 10 warehouse and 10 ternimals longrunnig test processes in one independant tenant, prepare mode
4. run sysbench mixed cases(insert/delete/update/select) longrunnig test processes with 75 terminals in one independant tenant,non-prepare mode
5. run another sysbench mixed cases(insert/delete/update/select) longrunnig test processe with  75 terminals in one independant tenant,non-prepare mode

Additional information

No response

@aressu1985 aressu1985 added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Sep 12, 2024
@aressu1985 aressu1985 added this to the 1.3.0 milestone Sep 12, 2024
@badboynt1 badboynt1 assigned ouyuanning and unassigned matrix-meow Sep 12, 2024
@ouyuanning ouyuanning assigned volgariver6 and unassigned ouyuanning Sep 13, 2024
@ouyuanning
Copy link
Contributor

image 辛苦刘博先帮忙定位一下

@volgariver6
Copy link
Contributor

{"level":"INFO","time":"2024/09/11 18:52:08.325431 +0000","name":"cn-service","caller":"frontend/result_row_stmt.go:71","msg":"time of Exec.Run : 1m17.690626612s","service":"61373436-3363-3130-6637-356361633638","uuid":"61373436-3363-3130-6637-356361633638","session_info":"connectionId 648||account tpcc_test:admin|goRoutineId 0|migrate-goRoutineId 0|0191dc5d-5994-7c66-83ad-1422331870a5","role":"accountadmin","session_id":"0191dc5d-5994-7c66-83ad-1422331870a5","statement_id":"0191e26c-8d8a-7573-b2ed-b03e4725861d","txn_id":"c1b234c842e5f7f917f3e86cf235a71f","span":{"trace_id":"bbb7168c-b648-e355-deaa-7b8474be6897","span_id":"34eec943378f5ae4"}}

明天和 @ouyuanning 一起看一下

@daviszhen
Copy link
Contributor

未投入

@daviszhen
Copy link
Contributor

goroutine id 为0的问题

@daviszhen
Copy link
Contributor

未投入

1 similar comment
@daviszhen
Copy link
Contributor

未投入

@sukki37
Copy link
Contributor

sukki37 commented Oct 11, 2024

@sukki37
Copy link
Contributor

sukki37 commented Oct 12, 2024

@sukki37 sukki37 assigned iamlinjunhong and unassigned daviszhen Oct 12, 2024
@sukki37
Copy link
Contributor

sukki37 commented Oct 12, 2024

repro:https://github.com/matrixorigin/matrixone/actions/runs/11297350987/job/31425994909

timeout(1m)txn_id:51b551fd1d8b10ae17fd79a5008b5a24

exec_plan: 51b551fd1d8b10ae17fd79a5008b5a24_execplan.json

log:https://grafana.ci.matrixorigin.cn/goto/lGqARmiNg?orgId=1

holder txn_id: 44960eac9774aab117fd79a7273ecdc1

sql history:
sql_history_44960eac9774aab117fd79a7273ecdc1.txt

@sukki37
Copy link
Contributor

sukki37 commented Oct 13, 2024

@iamlinjunhong
Copy link
Contributor

repro:https://github.com/matrixorigin/matrixone/actions/runs/11297350987/job/31425994909

timeout(1m)txn_id:51b551fd1d8b10ae17fd79a5008b5a24

exec_plan: 51b551fd1d8b10ae17fd79a5008b5a24_execplan.json

log:https://grafana.ci.matrixorigin.cn/goto/lGqARmiNg?orgId=1

holder txn_id: 44960eac9774aab117fd79a7273ecdc1

sql history: sql_history_44960eac9774aab117fd79a7273ecdc1.txt

2024-10-12 03:18:53.337 {"level":"WARN","time":"2024/10/11 19:18:53.337488 +0000","name":"cn-service.lockservice","caller":"lockservice/waiter_events.go:252","msg":"wait too long","service":"66313462-3635-3434-3262-353837636662","uuid":"66313462-3635-3434-3262-353837636662","wait":"1m1.570324992s","key":"46013a15093a150100","bind":"312821-172867158155610327666313462-3635-3434-3262-353837636662-1728671406305098400","lock":"Exclusive(row): holder(44960eac9774aab117fd79a7273d832c), waiters(51b551fd1d8b10ae17fd79a5008e9c3d-0xc051d81080(4))","txn":"51b551fd1d8b10ae17fd79a5008b5a24"}

51b551fd1d8b10ae17fd79a5008b5a24 超过一分钟和等锁没有关系,check orphan 和 lock 是并发执行,51b551fd1d8b10ae17fd79a5008b5a24 没有在 waiters 中,表示事务已经完成了。且 51b551fd1d8b10ae17fd79a5008b5a24 在 2024-10-12 03:18:44.941 已经拿到锁了,比 44960eac9774aab117fd79a7273d832c,51b551fd1d8b10ae17fd79a5008e9c3d 早。

@sukki37
Copy link
Contributor

sukki37 commented Oct 18, 2024

@sukki37
Copy link
Contributor

sukki37 commented Oct 18, 2024

repro:https://github.com/matrixorigin/matrixone/actions/runs/11390904060/job/31696189874

info about timeout connection:

  • connId: 2567
  • goroutineId: 51970
  • txnId: 8f46ebea8f1b5fbd17ff525a39128792

log link: https://grafana.ci.matrixorigin.cn/goto/_HkndJiNR?orgId=1

@iamlinjunhong
Copy link
Contributor

事务等待队列过长

@iamlinjunhong
Copy link
Contributor

暂未处理

@sukki37 sukki37 modified the milestones: 2.0.1, 2.1.0 Nov 4, 2024
@iamlinjunhong
Copy link
Contributor

暂未处理

@sukki37
Copy link
Contributor

sukki37 commented Nov 6, 2024

The merge run was executed with 10 warehouses and 100 concurrency, which increases the cross-warehouse load, leading to a higher probability of multiple transactions locking the same (warehouse, district). From the logs and trace, it appears that the timed-out transactions indeed experienced increased lock wait times due to this reason, and the lock queue on the same (warehouse, district) is very long. We will temporarily close this issue and change the merge run to 50 warehouses with 50 concurrency to observe. If the issue persists, a new issue will be opened.

https://doc.weixin.qq.com/doc/w3_AEkAzgYFABUkPznumb3QLa5P00ohz?scode=AJsA6gc3AA80HXWdeIAEkAzgYFABU

@sukki37 sukki37 closed this as completed Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working severity/s0 Extreme impact: Cause the application to break down and seriously affect the use to-next-release
Projects
None yet
Development

No branches or pull requests

7 participants