
[Bug]: The summary problem and status related to stability test #8327

Closed
1 task done
aressu1985 opened this issue Mar 9, 2023 · 27 comments
Assignees
Labels
kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Milestone

Comments

@aressu1985
Contributor

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Status update on 2023-03-09:
1. point select, 100 terminals: ran successfully for 3*24 hours.
2. oltp insert, 100 terminals: mo was killed by OOM after running for about 66 hours.

Hardware parameters: 16C 64G
MO topo: CN-DN standalone

Expected Behavior

No response

Steps to Reproduce

No response

Additional information

No response

@nnsgmsone
Contributor

ok

@nnsgmsone
Contributor

I will deal with the memory issue first

@nnsgmsone
Contributor

nnsgmsone commented Mar 24, 2023

@aressu1985
Contributor Author

Status update on 2023-03-28:
commit: 98a464d

oltp mixed case: 20-point-select 20-insert-delete 20-update
running time: 5000 min

The mo ran stably during the test.
image

2023-03-26 09:37:18 INFO MOPerfTest:217 - write total time = 16998139,read total time = 16998139 |

case              RT_MAX      RT_MIN      RT_AVG      TPS         QPS         SUCCESS     ERROR
----------------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
point_select      10469       11          891.11      22          22          6734904     10
insert_delete     12308       33          1050.94     19          38          5710751     0
update_pk         10618       45          1314.92     15          15          4550602     1872
----------------- ----------- ----------- ----------- ----------- ----------- ----------- -----------

2023-03-26 09:37:20 INFO ResultProcessor:101 -
[point_select]
START : 2023-03-22 22:15:52
END : 2023-03-26 09:37:20
VUSER : 20
RT_MAX : 10469
RT_MIN : 11
RT_AVG : 891.11
TPS : 22
QPS : 22
SUCCESS : 6734904
ERROR : 10

[insert_delete]
START : 2023-03-22 22:15:52
END : 2023-03-26 09:37:20
VUSER : 20
RT_MAX : 12308
RT_MIN : 33
RT_AVG : 1050.94
TPS : 19
QPS : 38
SUCCESS : 5710751
ERROR : 0

[update_pk]
START : 2023-03-22 22:15:52
END : 2023-03-26 09:37:20
VUSER : 20
RT_MAX : 10618
RT_MIN : 45
RT_AVG : 1314.92
TPS : 15
QPS : 15
SUCCESS : 4550602
ERROR : 1872

2023-03-26 09:37:20 INFO MOPerfTest:233 - Program is shutting down,now will release the resources...
2023-03-26 09:37:21 INFO MOPerfTest:243 - Program exit completely.

@aressu1985
Contributor Author

update on April 4:

TPCH 10G query loop ran for 77 hours; the mo was killed by OOM.

commit: ec2032d

@aressu1985
Contributor Author

update on April 6:
ran TPCC, 10 warehouses, 10 terminals, for 800 minutes; the mo ran stably during the test.
But after the tpcc test finished, memory usage stayed at 30+%.
image

the heap is as following:
tpcc-heap

And during this test, another functional issue was found, tracked by #8871.

@nnsgmsone
Contributor

Related issue: #7891


@aressu1985
Contributor Author

update on April 12:
commit: 701b6f1

ran the tpcc 10w 10 terminals test in standalone mode for about 8 hours; the mo crashed.

9-f9774fbc-d930-11ed-8db4-525400777c3a-702-0:0-0>]"}
2023/04/12 20:59:13.063601 +0800 INFO frontend/util.go:415 time of Exec.Run : 283.045851ms connectionId 1066|127.0.0.1:52030|{account sys:dump:moadmin -- 0:1:0}|6795cf61-d8f7-11ed-8da4-525400777c3a
2023/04/12 20:59:13.063673 +0800 INFO frontend/util.go:415 time of SendResponse 120ns connectionId 1066|127.0.0.1:52030|{account sys:dump:moadmin -- 0:1:0}|6795cf61-d8f7-11ed-8da4-525400777c3a
2023/04/12 20:59:13.063760 +0800 INFO frontend/util.go:400 query trace status {"connection_id": 1066, "statement": "execute __mo_stmt_id_8 using @__mo_stmt_var_0,@__mo_stmt_var_1,@__mo_stmt_var_2,@__mo_stmt_var_3,@__mo_stmt_var_4", "status": "success", "span": {"trace_id": "d2914f14-d931-11ed-8db4-525400777c3a", "kind": "statement"}, "session_info": "connectionId 1066|127.0.0.1:52030|{account sys:dump:moadmin -- 0:1:0}|6795cf61-d8f7-11ed-8da4-525400777c3a"}
2023/04/12 20:59:13.063879 +0800 INFO frontend/util.go:400 query trace {"connection_id": 1066, "query": "execute __mo_stmt_id_8 using @__mo_stmt_var_0,@__mo_stmt_var_1,@__mo_stmt_var_2,@__mo_stmt_var_3,@__mo_stmt_var_4", "vars": "13 , 9 , 0 , 6 , 50131", "session_info": "connectionId 1066|127.0.0.1:52030|{account sys:dump:moadmin -- 0:1:0}|6795cf61-d8f7-11ed-8da4-525400777c3a"}
2023/04/12 20:59:13.064256 +0800 INFO frontend/session.go:369 session uuid: 6795cf61-d8f7-11ed-8da4-525400777c3a -> background session uuid: d2bdf447-d931-11ed-8db4-525400777c3a
2023/04/12 20:59:13.064647 +0800 INFO blockio/writer.go:147 [WriteEnd] {"operation": "name: 85f0f3f9-d931-11ed-8db4-525400777c3a_66, block count: 2, size: 266671"}
2023/04/12 20:59:13.065652 +0800 ERROR mpool/mpool.go:511 error: error: out of memory
github.com/matrixorigin/matrixone/pkg/common/mpool.(*MPool).Alloc
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/common/mpool/mpool.go:511
github.com/matrixorigin/matrixone/pkg/common/mpool.(*MPool).Realloc
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/common/mpool/mpool.go:592
github.com/matrixorigin/matrixone/pkg/common/mpool.(*MPool).Grow
/mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/common/mpool/mpool.go:658

@aressu1985
Contributor Author

update on April 17:
commit: 8fe4a7f

oltp mixed case: 20-point-select 20-insert-delete 20-update

The mo was killed by OOM after running the test for about 38 hours.

[Sun Apr 16 00:06:43 2023] [4008087] 1000 4008087 2450 173 61440 0 500 bash
[Sun Apr 16 00:06:43 2023] [4008088] 1000 4008088 131 2 40960 0 500 grep
[Sun Apr 16 00:06:43 2023] [4008089] 1000 4008089 2450 74 69632 0 500 bash
[Sun Apr 16 00:06:43 2023] [4008090] 1000 4008090 2450 151 69632 0 500 bash
[Sun Apr 16 00:06:43 2023] [4008091] 1000 4008091 2450 76 69632 0 500 bash
[Sun Apr 16 00:06:43 2023] [4008092] 1000 4008092 2450 204 69632 0 500 bash
[Sun Apr 16 00:06:43 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/sshd.service,task=mo-service,pid=719739,uid=1000
[Sun Apr 16 00:06:43 2023] Out of memory: Killed process 719739 (mo-service) total-vm:65387340kB, anon-rss:59680816kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:118016kB oom_score_adj:0
[Sun Apr 16 00:06:47 2023] oom_reaper: reaped process 719739 (mo-service), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@nnsgmsone
Contributor

This issue will be significantly improved after PR 9069 lands.

@nnsgmsone
Contributor

PR 9069 has been submitted and should bring a lot of improvement. @aressu1985

@nnsgmsone
Contributor

Maybe another round of testing is required.

@aressu1985
Contributor Author

update on April 26
commit: 92da442
workload: tpch 10G loop run with 1 terminal

The test lasted 72 hours successfully, and the mo ran stably.
image

the tpch test tools log:
tpch_10.tar.gz

the mo resource usage in this duration:
resr.tar.gz

the last 5 prof(interval 1 hour):
prof.tar.gz

@nnsgmsone
Contributor

continue

@aressu1985
Contributor Author

update on April 30
commit: f4f9c4c
workload: tpcc 10w 10 terminals 5000 minutes
The test lasted 5000 minutes successfully, and the mo ran stably.
The used memory is about 52 GB.
image

the tpcc test tool's log:
tpcc_10_10.tar.gz

the mo resource usage in this duration:
resr.txt.zip

the last 6 prof(interval 1 hour):
profile_tpcc.tar.gz

@nnsgmsone
Contributor

Found that prepare did not work well; continuing to track.

@aressu1985
Contributor Author

update on May 3
commit:
workload: Sysbench mixed case, 300 threads (point_select 100, insert-delete 100, update 100)
The test lasted 5000 minutes successfully, and the mo ran stably.
The used memory is about 58 GB.
image

the test tools log:
mixed_5000.tar.gz

the mo resource usage in this duration:
resr.txt.zip

the last 5 prof(interval 1 hour):
mix_500_pro.tar.gz

@aressu1985
Contributor Author

update on May 9
commit: 65bdd570cb0580e82ef997e72c411ae7a1cd1b6c
workload:
1. Sysbench mixed case, 60 threads (point_select 20, insert-delete 20, update 20)
2. TPCH 10G loop
3. TPCC 10w 10 terminals

The test lasted 12 hours before memory was exhausted.
image

profile is:
mixed_profile.tar.gz

@nnsgmsone
Contributor

Will deal with this issue after the plan refactoring.

@aressu1985
Contributor Author

close
