client watch doesn't work after restarting etcd server #1209

Closed
bhagyalakshmi1218 opened this issue Aug 24, 2023 · 14 comments

Comments

@bhagyalakshmi1218

Versions

  • etcd: 3.5.1
  • jetcd: 0.7.5
  • java: 17

Describe the bug
etcd watches are not notified after the etcd server is restarted.

To Reproduce

  • Start a single etcd instance
  • Start a Java process that uses the etcd client and watches keys, for example key = service/abc, value = 1
  • Kill the etcd instance
  • Start the etcd instance again
  • Modify the service/abc value to 2. The watch in the service receives no notification

Expected behavior
Changes are notified to the client after the etcd server is restarted

@lburgazzoli
Collaborator

We have a test that does something similar: can you check it out and provide a PR with a failing test that reproduces your use case?

@bhagyalakshmi1218
Author

This test is passing.
But can you try doing the same manually? Instead of changing the value using the same client before the restart, can you try updating the value from a different client, like etcdkeeper?

@lburgazzoli
Collaborator

I'm very sorry, but I don't have much time to do manual tests, so to investigate further I need a reproducer in the form of an integration test.
Otherwise it would help if you could debug the code a little bit.

@bhagyalakshmi1218
Author

Try a sleep of 1 min between stop and start in EtcdCluster::restart: the test starts to fail with a timeout (the WatchResumeTest timeout is changed from 30 to 100 sec).
The test passes with a sleep of 30 or 40 sec. It starts failing from 50 sec.

@lburgazzoli
Collaborator

It seems the system reaches the max retry attempts/timeout. Have you set up an error handler on the watch listener?

@bhagyalakshmi1218
Author

Yes, it just prints the stack trace. I get the error below regardless of whether the test passes or fails:
io.etcd.jetcd.common.exception.EtcdException: Network closed for unknown reason
at io.etcd.jetcd.common.exception.EtcdExceptionFactory.newEtcdException(EtcdExceptionFactory.java:34)
at io.etcd.jetcd.common.exception.EtcdExceptionFactory.fromStatus(EtcdExceptionFactory.java:82)
at io.etcd.jetcd.common.exception.EtcdExceptionFactory.toEtcdException(EtcdExceptionFactory.java:78)
at io.etcd.jetcd.common.exception.EtcdExceptionFactory.toEtcdException(EtcdExceptionFactory.java:73)
at io.etcd.jetcd.impl.WatchImpl$WatcherImpl.onError(WatchImpl.java:318)
at io.vertx.grpc.stub.StreamObserverReadStream.onError(StreamObserverReadStream.java:44)

The test fails on this line when the sleep between stop and start is 1 min:
kvClient2.put(key, value).get();

I tried creating another client (before the restart) and used it to put the key with a new value. With this, the test succeeds in putting the value, but fails in the assertion:
Caused by: java.lang.AssertionError:
Expecting actual not to be null
at io.etcd.jetcd.impl.WatchResumeTest.lambda$testWatchOnPut$1(WatchResumeTest.java:63)
at org.awaitility.core.AssertionCondition.lambda$new$0(AssertionCondition.java:53)
at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:248)
at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:235)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

@lburgazzoli
Collaborator

Can you please at least share your code?

@bhagyalakshmi1218
Author

Sure, I'm reusing the WatchResumeTest.java you shared, with minor changes.
Source Project: https://github.com/etcd-io/jetcd/tree/main
Changed Files: https://github.com/bhagyalakshmi1218/etcd-tests.git

Do you need logs as well?

@lburgazzoli
Collaborator

@bhagyalakshmi1218 I added some tests in #1210 with increasing timeouts, and all the tests are passing.

So I guess we are not testing the same thing, or there are some differences in the set-up.

Can you please provide a reproducer - in the form of an integration test - that I can run as part of the test suite?

@bhagyalakshmi1218
Author

@lburgazzoli
Collaborator

lburgazzoli commented Aug 28, 2023

@bhagyalakshmi1218 no, but as I said, I'm really sorry: I'm the only maintainer left and I'm doing this in my spare time, so I don't have much time for a trial-and-error approach.

I really need a test that fails so I can have a look at it.

@bhagyalakshmi1218
Author

bhagyalakshmi1218 commented Aug 28, 2023

I understand, but this is an issue we are facing in production. I'm trying to help.

Single client: same code as yours; the only change is a test timeout of 100 sec instead of 180 sec. The test passes with a test timeout > 100 sec. (Left is your code, right is my changes.)
Code Changes:
image

Test Failure:
image

Two clients: same code as yours; the only change is that a different client is used for the put than for the watch.
Code Changes:
image
Test Failure:
image

@lburgazzoli
Collaborator

lburgazzoli commented Aug 28, 2023

I think I know what's happening, but I don't have a clear solution at this stage. The failure occurs because the reconnect backoff policy implemented in grpc-java kicks in, and at some point the delay becomes very long, so if you increase the timeout of your test it should work.

I tried setting up jetcd as a watcher and then using etcdctl to put data on the cluster, and yes, it takes a while.
It would be nice if you could help figure out how to configure the internal grpc-java reconnect mechanism.
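
To illustrate why a longer outage can push the next reconnect attempt well past the restart, here is a minimal sketch of an exponential backoff schedule. It is not grpc-java code. The constants (1 s initial delay, 1.6 multiplier, 120 s cap) are an assumption taken from the gRPC connection backoff document and are not confirmed in this thread; jitter and connect timeouts are ignored, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffSketch {
    // Assumed defaults (gRPC connection backoff document); not confirmed in this thread.
    static final long INITIAL_BACKOFF_MS = 1_000;
    static final long MAX_BACKOFF_MS = 120_000;
    static final double MULTIPLIER = 1.6;

    /** Returns the first n delays between reconnect attempts, without jitter. */
    public static List<Long> delaysMillis(int n) {
        List<Long> delays = new ArrayList<>();
        double current = INITIAL_BACKOFF_MS;
        for (int i = 0; i < n; i++) {
            delays.add(Math.round(current));
            // Grow the delay geometrically, but never beyond the cap.
            current = Math.min(current * MULTIPLIER, MAX_BACKOFF_MS);
        }
        return delays;
    }

    /**
     * Returns the time (ms since the connection was lost) of the first attempt
     * that happens at or after the end of an outage of the given length.
     * Simplified model: attempts occur at the running sum of the delays.
     */
    public static long firstAttemptAfterMillis(long outageMs) {
        double elapsed = 0;
        double current = INITIAL_BACKOFF_MS;
        while (true) {
            elapsed += current;
            if (elapsed >= outageMs) {
                return Math.round(elapsed);
            }
            current = Math.min(current * MULTIPLIER, MAX_BACKOFF_MS);
        }
    }

    public static void main(String[] args) {
        System.out.println("delays (ms): " + delaysMillis(12));
        for (long outage : new long[] {30_000, 40_000, 50_000, 60_000}) {
            System.out.println("outage " + outage + " ms -> next attempt at about "
                    + firstAttemptAfterMillis(outage) + " ms");
        }
    }
}
```

Under this simplified model the gaps between attempts keep growing, so an outage that ends just after a failed attempt is followed by a wait that is longer than any previous one. That would be consistent with the test passing for short sleeps and timing out for longer ones, but the real schedule also depends on jitter and on how long each connection attempt takes.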

Note that there are a number of issues related to reconnect behavior and

@lburgazzoli
Collaborator

@bhagyalakshmi1218 can you please confirm the behavior? Unfortunately, there's not much I can do on the jetcd side, as the reconnection logic is hard-coded in the grpc-java library, so I think we should close this issue.
