Incomplete exclusion in FDB operator #1912

larshagencognite · 2024-01-03T12:38:13Z

What happened?

Operator 1.28.1, FDB 7.1.43
Using a Service for the public IP, not DNS

To recover a failing FDB cluster, we did the following:

Turned off the operator
Excluded the storage nodes using fdbcli
Deleted the PVCs, Pods, and Services once the exclusion was complete
Turned the operator back on
The operator recreates the process groups that are marked for exclusion but, as far as the operator knows, not completely excluded (because the exclusion was done manually).
The newly created services get the same instance ID as before, but a new public IP
The process groups now have two IPs, the old and the new
Before the operator starts excluding the storage nodes, some data is rebalanced to the new nodes
The operator excludes the storage nodes and then immediately deletes them before exclusion is complete.
Logs show that the excludes were issued against the wrong IP, so FDB responded that the nodes were not part of the cluster, and the operator concluded that it could move forward immediately.
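
The failure mode in the last two steps can be sketched as follows. This is a simplified model and not the operator's code: the `processState` type, the `naiveCanRemove` function, and the IP addresses are hypothetical. It only illustrates why a "not part of the cluster" answer for the old address says nothing about data held under the new address.

```go
package main

import "fmt"

// processState is a hypothetical, simplified view of what the cluster
// reports for one process address.
type processState struct {
	Excluded bool // the address is in the exclusion list
	HasData  bool // the process still holds data
}

// naiveCanRemove models the problematic behavior: an address that the
// cluster does not know about is treated as safe to remove.
func naiveCanRemove(cluster map[string]processState, address string) bool {
	state, known := cluster[address]
	if !known {
		return true // "not part of the cluster" is read as "done"
	}
	return state.Excluded && !state.HasData
}

func main() {
	// The recreated Pod runs under the new IP and has already received data.
	// The old IP is no longer part of the cluster at all.
	cluster := map[string]processState{
		"10.0.0.2": {Excluded: false, HasData: true},
	}
	oldIP, newIP := "10.0.0.1", "10.0.0.2"

	fmt.Println("old IP removable:", naiveCanRemove(cluster, oldIP))
	fmt.Println("new IP removable:", naiveCanRemove(cluster, newIP))
}
```

In this model the check against the old IP returns true even though the process under the new IP still holds data, which matches the behavior seen in the logs.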

What did you expect to happen?

I might have expected the operator to not recreate process groups if they are already fully excluded. This could have been checked before recreation.

It is also a bit surprising that the operator will try to recreate the PVC of a process group that is marked for deletion, as I don't know if that will ever help recover any data.

When the nodes have been recreated with a new IP, I would expect the operator to verify that exclusion was completed for all IP addresses.
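
A minimal sketch of the check I have in mind, assuming the operator knows every address a process group has had. The function name `canRemoveGroup` and its inputs are hypothetical and are not taken from the operator's code base.

```go
package main

import "fmt"

// canRemoveGroup reports whether a process group may be removed.
// addresses lists every address the group has had (hypothetical input).
// fullyExcluded maps each address still present in the cluster to whether
// its exclusion has completed. An address missing from the map is no longer
// part of the cluster.
func canRemoveGroup(addresses []string, fullyExcluded map[string]bool) bool {
	if len(addresses) == 0 {
		return false // nothing was verified, so do not remove
	}
	for _, addr := range addresses {
		done, inCluster := fullyExcluded[addr]
		if inCluster && !done {
			return false // this address may still hold data
		}
	}
	return true
}

func main() {
	addresses := []string{"10.0.0.1", "10.0.0.2"} // old and new IP

	// Only the new IP is in the cluster, and its exclusion is still running.
	fmt.Println(canRemoveGroup(addresses, map[string]bool{"10.0.0.2": false}))

	// The new IP has finished its exclusion as well.
	fmt.Println(canRemoveGroup(addresses, map[string]bool{"10.0.0.2": true}))
}
```

The first call prints `false` and the second prints `true`: removal is blocked as long as any address that is still part of the cluster has not finished its exclusion.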

How can we reproduce it (as minimally and precisely as possible)?

Create a cluster that uses service IPs and does not use locality-based exclusion
Fill with enough data to make exclusion take some time
Start an exclusion
Disable the operator
Finish the exclusion manually and delete the PVCs, Pods, and Services for the excluded groups
Re-enable operator
The bug will cause data loss, because some data is rebalanced back to the recreated groups before those groups are deleted.

Anything else we need to know?

This was also reported in this forum post: https://forums.foundationdb.org/t/incomplete-exclusion-in-fdb-operator/4301

FDB Kubernetes operator

$ kubectl fdb version
1.28.1

Kubernetes version

$ kubectl version
1.26.7-gke.500

Cloud provider

GCP

johscheuer · 2024-03-14T10:36:26Z

Sorry for the delay. The issue is/was the order in which the operator runs the different reconcilers:

subReconcilers := []clusterSubReconciler{
    updateStatus{},
    updateLockConfiguration{},
    updateConfigMap{},
    checkClientCompatibility{},
    deletePodsForBuggification{},
    replaceMisconfiguredProcessGroups{},
    replaceFailedProcessGroups{},
    addProcessGroups{},
    addServices{},
    addPVCs{},
    addPods{}, // --> Creates the Pod
    generateInitialClusterFile{},
    removeIncompatibleProcesses{},
    updateSidecarVersions{},
    updatePodConfig{},
    updateMetadata{},
    updateDatabaseConfiguration{},
    chooseRemovals{},
    excludeProcesses{},
    changeCoordinators{},
    bounceProcesses{},
    maintenanceModeChecker{},
    updatePods{},
    removeProcessGroups{}, // --> Checks the exclusion state
    removeServices{},
    updateStatus{},
}

Because the check for the exclusion state happens at a later step, the operator creates the Pods first. The operator does check whether all addresses are excluded: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/remove_process_groups.go#L355-L361. We can try to recreate this scenario in an e2e test case to see if we still see the issue.
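
The effect of the ordering can be illustrated with a small model. This is only a sketch: the `group` type, the `reconcile` function, and the `checkBeforeCreate` flag are hypothetical and do not exist in the operator. The sketch also ignores that a recreated Pod comes up with a new address.

```go
package main

import "fmt"

// group is a hypothetical, minimal stand-in for a process group.
type group struct {
	ID               string
	MarkedForRemoval bool
	FullyExcluded    bool
	HasPod           bool
}

// reconcile runs two stages in a fixed order and returns the actions taken.
// With checkBeforeCreate set, the early stage skips groups that are already
// marked for removal and fully excluded.
func reconcile(g *group, checkBeforeCreate bool) []string {
	var actions []string

	// Early stage, comparable to addPods in the list above.
	skip := checkBeforeCreate && g.MarkedForRemoval && g.FullyExcluded
	if !g.HasPod && !skip {
		g.HasPod = true
		actions = append(actions, "create Pod")
	}

	// Late stage, comparable to removeProcessGroups.
	if g.MarkedForRemoval && g.FullyExcluded {
		g.HasPod = false
		actions = append(actions, "remove process group")
	}
	return actions
}

func main() {
	current := &group{ID: "storage-1", MarkedForRemoval: true, FullyExcluded: true}
	fmt.Println(reconcile(current, false)) // Pod is created, then removed

	guarded := &group{ID: "storage-1", MarkedForRemoval: true, FullyExcluded: true}
	fmt.Println(reconcile(guarded, true)) // Pod is never created
}
```

Without the early check the model creates a Pod for a group that is removed in the same pass. With the check, the group is removed without a Pod ever being recreated.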

Do you think you have the time/capacity to work on this?

johscheuer · 2024-06-18T16:09:27Z

@larshagencognite have you seen this issue again or are you able to write an e2e test for it?

larshagencognite added the "bug" (Something isn't working) label on Jan 3, 2024
