When the connection to etcd is broken, dfuse search fails to update its connection #165

matthewdarwin · 2020-10-06T16:26:28Z

When the connection to etcd is broken and etcd is replaced by a different instance, dfuse search does not re-establish its connection and stays broken. The health check also keeps reporting "healthy", which makes this situation hard to catch through monitoring.

One possible solution:

Add a mechanism that detects that the gRPC connection to etcd is broken, then simply exit and wait to be restarted by k8s, systemd, or whatever supervises the process.

The scenario is probably something like this:

1. Archive A tells etcd that it serves blocks 1000->2000 (BUT THAT ETCD IS GONE, REPLACED BY A NEW REBUILT CLUSTER !!!)
2. The router checks etcd, reads this, and sends archive A a query that goes down to block 1000 (BUT THAT ETCD IS GONE, SO NO UPDATES !!!)
3. Archive A says: hey, I don't have block 1000, my lowest block is 1100 ("I TRIED TO TELL YOU VIA ETCD BUT MY UPDATE IS STALLED")
4. Manually restart the router and the archives
5. They connect to the new etcd and everything is fine again
