What happened?
Operator 1.28.1, FDB 7.1.43
Using service for public IP, not using DNS
To recover a failing FDB cluster, we did the following:
Turned off the operator
Excluded storage nodes using fdbcli
Deleted the PVCs, pods and services once exclusion was complete
Turned the operator back on
The operator recreates the process groups that are marked for exclusion but, as far as the operator knows, not yet fully excluded (the exclusion was done manually).
The newly created services get the same instance ID as before, but a new public IP
Each process group now has two IPs, the old one and the new one
Before the operator starts excluding the storage nodes, some data is rebalanced onto the new nodes
The operator excludes the storage nodes and then immediately deletes them, before exclusion is complete.
Logs show that the excludes were issued against the wrong IP. FDB responded that those nodes were not part of the cluster, so the operator concluded it could proceed immediately.
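A minimal Go sketch of this failure mode, as I understand it from the logs. Everything here is hypothetical and for illustration only: `clusterView`, `exclusionDone` and the addresses are made up and are not the operator's actual types or code.

```go
package main

import "fmt"

// clusterView is a hypothetical, simplified stand-in for what FDB reports:
// for each process address in the cluster, whether it still holds data.
type clusterView map[string]bool

// exclusionDone mimics the problematic reading of the result: an address
// that is not part of the cluster is treated as "nothing left to exclude".
func exclusionDone(view clusterView, addr string) bool {
	holdsData, known := view[addr]
	if !known {
		// Unknown address: FDB says it is not part of the cluster.
		return true
	}
	return !holdsData
}

func main() {
	// State after the operator recreated the process group: the old IP is
	// gone from the cluster, the new IP has already received rebalanced data.
	oldAddr, newAddr := "10.0.0.7:4501", "10.0.0.42:4501"
	view := clusterView{newAddr: true}

	fmt.Println("check against old IP:", exclusionDone(view, oldAddr))
	fmt.Println("check against new IP:", exclusionDone(view, newAddr))
}
```

In this model the check against the old IP reports the exclusion as done, while the same check against the new IP does not, which matches the behaviour described above.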
What did you expect to happen?
I might have expected the operator to not recreate process groups if they are already fully excluded. This could have been checked before recreation.
It is also a bit surprising that the operator will try to recreate the PVC of a process group that is marked for deletion, as I don't know if that will ever help recover any data.
When the nodes have been recreated with a new IP, I would expect the operator to verify that exclusion has completed for all of a process group's IP addresses.
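The checks I have in mind could look roughly like the Go sketch below. It is only a sketch under stated assumptions: `processGroup`, `holdsData`, `fullyExcluded` and `shouldRecreate` are hypothetical names invented for this example, not part of the operator's API.

```go
package main

import "fmt"

// processGroup is a hypothetical model, not the operator's actual type.
type processGroup struct {
	ID               string
	Addresses        []string // every IP the group has had, old and new
	MarkedForRemoval bool
	PodExists        bool
}

// holdsData is a hypothetical lookup: address -> still holds data.
// Addresses missing from the map are not part of the cluster.
type holdsData map[string]bool

// fullyExcluded reports true only if none of the group's known addresses
// still holds data. A group with no addresses is treated as excluded.
func fullyExcluded(pg processGroup, data holdsData) bool {
	for _, addr := range pg.Addresses {
		if data[addr] {
			return false
		}
	}
	return true
}

// shouldRecreate skips recreation for groups that are marked for removal
// and already fully excluded; otherwise it recreates missing pods.
func shouldRecreate(pg processGroup, data holdsData) bool {
	if pg.PodExists {
		return false
	}
	return !(pg.MarkedForRemoval && fullyExcluded(pg, data))
}

func main() {
	data := holdsData{"10.0.0.42:4501": true}

	// Manually excluded and deleted group: nothing to recreate.
	gone := processGroup{ID: "storage-1", Addresses: []string{"10.0.0.7:4501"}, MarkedForRemoval: true}
	fmt.Println("recreate:", shouldRecreate(gone, data))

	// Recreated group with old and new IP: not safe to delete yet.
	recreated := processGroup{ID: "storage-1", Addresses: []string{"10.0.0.7:4501", "10.0.0.42:4501"}, MarkedForRemoval: true, PodExists: true}
	fmt.Println("safe to delete:", fullyExcluded(recreated, data))
}
```

The point of the design is that deletion is gated on every address the process group has ever had, so a "not part of the cluster" answer for the old IP cannot mask data that still lives on the new IP.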
How can we reproduce it (as minimally and precisely as possible)?
Create a cluster that uses service IPs as public IPs and does not use locality-based exclusion
Fill with enough data to make exclusion take some time
Start an exclusion
Disable the operator
Finish the exclusion manually and delete the PVCs, pods and services of the excluded process groups
Re-enable operator
The bug causes data loss: some data is rebalanced back onto the recreated process groups, and those groups are then deleted before that data has been moved off them.
Anything else we need to know?
This was also reported in this forum post: https://forums.foundationdb.org/t/incomplete-exclusion-in-fdb-operator/4301
FDB Kubernetes operator
Kubernetes version
Cloud provider