TestStoreWatch is flaky #8692

Open
okJiang opened this issue Oct 11, 2024 · 4 comments · May be fixed by #8724
Labels
type/ci The issue is related to CI.

Comments

@okJiang
Member

okJiang commented Oct 11, 2024

Flaky Test

Which jobs are failing

        	Error Trace:	/home/runner/work/pd/pd/pkg/utils/testutil/testutil.go:67
        	            				/home/runner/work/pd/pd/tests/integrations/mcs/scheduling/meta_test.go:90
        	Error:      	Condition never satisfied
        	Test:       	TestMeta/TestStoreWatch

CI link

https://github.com/tikv/pd/actions/runs/11287727792/job/31394142710?pr=8685

Reason for failure (if possible)

Anything else

@okJiang okJiang added the type/ci The issue is related to CI. label Oct 11, 2024
@okJiang
Member Author

okJiang commented Oct 14, 2024

@lhy1024
Contributor

lhy1024 commented Oct 16, 2024

@okJiang
Member Author

okJiang commented Oct 16, 2024

@okJiang
Member Author

okJiang commented Oct 16, 2024

[2024/10/16 07:51:11.195 +00:00] [WARN] [cluster.go:1352] ["store has been offline"] [store-id=2] [store-address=mock-2] [physically-destroyed=false]
[2024/10/16 07:51:12.839 +00:00] [INFO] [cluster.go:1556] ["store has changed to serving"] [store-id=1] [store-address=mock-1]
[2024/10/16 07:51:13.316 +00:00] [INFO] [lease.go:150] ["keep alive lease too slow"] [timeout-duration=3s] [actual-expire=2024/10/16 07:51:15.313 +00:00] [purpose="leader election"]
[2024/10/16 07:51:13.316 +00:00] [INFO] [lease.go:151] ["lease keep alive stopped"] [purpose="leader election"]
[2024/10/16 07:51:13.316 +00:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/10/16 07:51:12.314 +00:00] [error="context canceled"]
[2024/10/16 07:51:13.316 +00:00] [INFO] [lease.go:202] ["stop lease keep alive worker"] [purpose="leader election"]
[2024/10/16 07:51:13.316 +00:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/10/16 07:51:11.313 +00:00] [error="context canceled"]
[2024/10/16 07:51:13.317 +00:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/10/16 07:51:13.313 +00:00] [error="context canceled"]
[2024/10/16 07:51:15.330 +00:00] [INFO] [server.go:1799] ["no longer a leader because lease has expired, pd leader will step down"]
[2024/10/16 07:51:15.667 +00:00] [WARN] [etcd_kv.go:178] ["txn runs too slow"] [response="{"header":{"cluster_id":176441633097661835,"member_id":17433166340816059334,"revision":25,"raft_term":2},"succeeded":true,"responses":[{"Response":{"response_put":{"header":{"revision":25}}}}]}"] [cost=4.472183291s] []
[2024/10/16 07:51:15.697 +00:00] [WARN] [etcd_kv.go:178] ["txn runs too slow"] [response="{"header":{"cluster_id":176441633097661835,"member_id":17433166340816059334,"revision":26,"raft_term":2},"succeeded":true,"responses":[{"Response":{"response_put":{"header":{"revision":26}}}}]}"] [cost=2.857764914s] []
[2024/10/16 07:51:15.698 +00:00] [WARN] [cluster.go:1441] ["store has been Tombstone"] [store-id=2] [store-address=mock-2] [state=Offline] [physically-destroyed=false]
[2024/10/16 07:51:15.698 +00:00] [WARN] [etcdutil.go:159] ["kv gets too slow"] [request-key=/ms/7426277929569067104/scheduling/registry/] [cost=3.738725909s] []
[2024/10/16 07:51:15.719 +00:00] [INFO] [cluster.go:2291] ["store limit changed"] [store-id=2] [type=remove-peer] [rate-per-min=100000000]

We can see that the etcd requests are very slow (the two txns cost 4.47s and 2.86s).

[2024/10/16 07:51:13.316 +00:00] [INFO] [lease.go:150] ["keep alive lease too slow"] [timeout-duration=3s] [actual-expire=2024/10/16 07:51:15.313 +00:00] [purpose="leader election"]
[2024/10/16 07:51:13.316 +00:00] [INFO] [lease.go:151] ["lease keep alive stopped"] [purpose="leader election"]
[2024/10/16 07:51:13.316 +00:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/10/16 07:51:12.314 +00:00] [error="context canceled"]
[2024/10/16 07:51:13.316 +00:00] [INFO] [lease.go:202] ["stop lease keep alive worker"] [purpose="leader election"]
[2024/10/16 07:51:13.316 +00:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/10/16 07:51:11.313 +00:00] [error="context canceled"]
[2024/10/16 07:51:13.317 +00:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/10/16 07:51:13.313 +00:00] [error="context canceled"]
[2024/10/16 07:51:15.330 +00:00] [INFO] [server.go:1799] ["no longer a leader because lease has expired, pd leader will step down"]
[2024/10/16 07:51:15.667 +00:00] [WARN] [etcd_kv.go:178] ["txn runs too slow"] [response="{"header":{"cluster_id":176441633097661835,"member_id":17433166340816059334,"revision":25,"raft_term":2},"succeeded":true,"responses":[{"Response":{"response_put":{"header":{"revision":25}}}}]}"] [cost=4.472183291s] []
[2024/10/16 07:51:15.697 +00:00] [WARN] [etcd_kv.go:178] ["txn runs too slow"] [response="{"header":{"cluster_id":176441633097661835,"member_id":17433166340816059334,"revision":26,"raft_term":2},"succeeded":true,"responses":[{"Response":{"response_put":{"header":{"revision":26}}}}]}"] [cost=2.857764914s] []

So the watch is probably slow too. If the offline store (2) is converted to tombstone before the watch catches up, the store (2) info in basicCluster will probably skip the offline state and become tombstone directly.
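A rough sketch of why a polling check can miss the offline state. This is a simplified model, not PD's watch code: the `change` type, `stateAt`, `pollSees`, and all tick values are made up for illustration. It assumes the test samples the mirrored state at a fixed interval, and that a slow watch makes the offline and tombstone updates arrive close together.

```go
package main

import "fmt"

// StoreState is an illustrative stand-in for the store lifecycle in this issue.
type StoreState int

const (
	Up StoreState = iota
	Offline
	Tombstone
)

// change records the tick at which a new state becomes visible to the test.
// (Hypothetical type; ticks are arbitrary units.)
type change struct {
	tick  int
	state StoreState
}

// stateAt returns the state visible at tick t: the last change with
// tick <= t, or Up if nothing has arrived yet. The timeline must be
// sorted by tick.
func stateAt(timeline []change, t int) StoreState {
	s := Up
	for _, c := range timeline {
		if c.tick <= t {
			s = c.state
		}
	}
	return s
}

// pollSees samples the visible state every interval ticks until deadline
// and reports whether any sample equals want.
func pollSees(timeline []change, interval, deadline int, want StoreState) bool {
	for t := 0; t <= deadline; t += interval {
		if stateAt(timeline, t) == want {
			return true
		}
	}
	return false
}

func main() {
	// Healthy case: Offline stays visible long enough to be sampled.
	healthy := []change{{10, Offline}, {500, Tombstone}}
	// Delayed case: both updates land between two consecutive samples.
	delayed := []change{{430, Offline}, {460, Tombstone}}

	fmt.Println(pollSees(healthy, 100, 1000, Offline)) // prints true
	fmt.Println(pollSees(delayed, 100, 1000, Offline)) // prints false
}
```

In the delayed case every sample sees either Up or Tombstone, which matches the "Condition never satisfied" symptom.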

We can enable the failpoint (doNotBuryStore) to make sure the store is not converted to tombstone during the check.
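A minimal sketch of the idea, with a plain bool standing in for the failpoint. The `Cluster` type, `RemoveStore`, `checkStore`, and `eventuallyOffline` are hypothetical names for illustration only. In PD the switch would be toggled through the failpoint mechanism rather than a struct field, and the worst case is assumed here: the background check always runs before the test's poll.

```go
package main

import "fmt"

// StoreState is an illustrative stand-in for the store lifecycle in this issue.
type StoreState int

const (
	Up StoreState = iota
	Offline
	Tombstone
)

// Cluster is a toy model. doNotBuryStore plays the role of the failpoint
// named in this issue (assumption: a simple flag is enough to show the idea).
type Cluster struct {
	state          StoreState
	doNotBuryStore bool
}

// RemoveStore marks the store as offline.
func (c *Cluster) RemoveStore() { c.state = Offline }

// checkStore models one pass of the background check that buries an
// offline store, unless the guard is enabled.
func (c *Cluster) checkStore() {
	if c.state != Offline || c.doNotBuryStore {
		return
	}
	c.state = Tombstone
}

// eventuallyOffline polls up to polls times. The background check runs
// before every poll, modelling the case where it always wins the race.
func eventuallyOffline(c *Cluster, polls int) bool {
	for i := 0; i < polls; i++ {
		c.checkStore()
		if c.state == Offline {
			return true
		}
	}
	return false
}

func main() {
	racy := &Cluster{}
	racy.RemoveStore()
	fmt.Println(eventuallyOffline(racy, 10)) // prints false

	guarded := &Cluster{doNotBuryStore: true}
	guarded.RemoveStore()
	fmt.Println(eventuallyOffline(guarded, 10)) // prints true

	// Once the offline check has passed, lift the guard and let burying proceed.
	guarded.doNotBuryStore = false
	guarded.checkStore()
	fmt.Println(guarded.state == Tombstone) // prints true
}
```

With the guard on, the offline state stays observable for as long as the test needs, regardless of how slow the watch is.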

@okJiang okJiang linked a pull request Oct 16, 2024 that will close this issue