Incomplete exclusion in FDB operator #1912

Open
larshagencognite opened this issue Jan 3, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@larshagencognite
Collaborator

What happened?

Operator 1.28.1, FDB 7.1.43
Using service for public IP, not using DNS

To recover a failing FDB cluster, we did the following (steps 1 to 4 are our actions, steps 5 to 10 are what happened afterwards):

  1. Turned off the operator.
  2. Excluded storage nodes using fdbcli.
  3. Deleted the PVCs, Pods and services once the exclusion was complete.
  4. Turned the operator back on.
  5. The operator recreated the process groups that were marked for exclusion but, as far as the operator knew, not completely excluded (the exclusion had been done manually).
  6. The newly created services got the same instance ID as before, but a new public IP.
  7. The process groups now had two IPs, the old one and the new one.
  8. Before the operator started excluding the storage nodes, some data was rebalanced onto the new nodes.
  9. The operator excluded the storage nodes and then immediately deleted them, before the exclusion was complete.
  10. The logs show that the excludes were issued against the wrong IP. FDB responded that the nodes were not part of the cluster, so the operator concluded it could move forward immediately.

What did you expect to happen?

I might have expected the operator to not recreate process groups if they are already fully excluded. This could have been checked before recreation.

It is also a bit surprising that the operator will try to recreate the PVC of a process group that is marked for deletion, as I don't know if that will ever help recover any data.

When the nodes have been recreated with a new IP, I would expect the operator to verify that exclusion was completed for all IP addresses.
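To make the expectation concrete, here is a minimal Go sketch of an "all addresses excluded" check. It is only an illustration: the `processGroup` type, the `missingExclusions` helper and the IP addresses are hypothetical and are not taken from the operator's code base.

```go
package main

import "fmt"

// processGroup is a hypothetical, simplified stand-in for a process group
// and the addresses it has been seen with. It is not the operator's type.
type processGroup struct {
	ID        string
	Addresses []string // every IP recorded for this process group
}

// missingExclusions returns the addresses of pg that are not present in the
// excluded set. An empty result means every known address is excluded.
func missingExclusions(pg processGroup, excluded map[string]bool) []string {
	var missing []string
	for _, addr := range pg.Addresses {
		if !excluded[addr] {
			missing = append(missing, addr)
		}
	}
	return missing
}

func main() {
	// Illustrative values: the old IP was excluded manually, the new one was not.
	pg := processGroup{ID: "storage-1", Addresses: []string{"10.0.0.5", "10.0.0.9"}}
	excluded := map[string]bool{"10.0.0.5": true}

	if missing := missingExclusions(pg, excluded); len(missing) > 0 {
		fmt.Println("not safe to remove, still not excluded:", missing)
	} else {
		fmt.Println("all addresses excluded")
	}
}
```

With a check of this shape, a process group that has both an old and a new IP would only count as excluded once both addresses are covered.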

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a cluster with service IP and not using locality-based exclusion
  2. Fill with enough data to make exclusion take some time
  3. Start an exclusion
  4. Disable the operator
  5. Finish exclusion manually and delete PVCs, Pods and services for excluded groups
  6. Re-enable operator
  7. The bug will cause data loss as some data is rebalanced back to recreated groups, before those groups are swiftly deleted.

Anything else we need to know?

This was also reported in this forum post: https://forums.foundationdb.org/t/incomplete-exclusion-in-fdb-operator/4301

FDB Kubernetes operator

$ kubectl fdb version
1.28.1

Kubernetes version

$ kubectl version
1.26.7-gke.500

Cloud provider

GCP
@larshagencognite larshagencognite added the bug Something isn't working label Jan 3, 2024
@johscheuer
Member

Sorry for the delay. The issue is/was how the operator runs the different reconcilers:

subReconcilers := []clusterSubReconciler{
    updateStatus{},
    updateLockConfiguration{},
    updateConfigMap{},
    checkClientCompatibility{},
    deletePodsForBuggification{},
    replaceMisconfiguredProcessGroups{},
    replaceFailedProcessGroups{},
    addProcessGroups{},
    addServices{},
    addPVCs{},
    addPods{}, // --> Creates the Pod
    generateInitialClusterFile{},
    removeIncompatibleProcesses{},
    updateSidecarVersions{},
    updatePodConfig{},
    updateMetadata{},
    updateDatabaseConfiguration{},
    chooseRemovals{},
    excludeProcesses{},
    changeCoordinators{},
    bounceProcesses{},
    maintenanceModeChecker{},
    updatePods{},
    removeProcessGroups{}, // --> Checks the exclusion state
    removeServices{},
    updateStatus{},
}

Because the exclusion state is only checked in a later step (removeProcessGroups), the operator will already have created the Pods (addPods) by then. The operator does check that all addresses are excluded: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/remove_process_groups.go#L355-L361. We can try to recreate this scenario in an e2e test case to see whether the issue still occurs.
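The ordering effect can be illustrated with a small, self-contained Go model. This is a sketch under simplifying assumptions, not the operator's implementation: the `state` and `step` types and the `reconcile` and `orderedSteps` functions are hypothetical, and only two of the sub-reconcilers are modelled.

```go
package main

import "fmt"

// state is a hypothetical, heavily simplified model of one process group.
type state struct {
	markedForRemoval bool
	fullyExcluded    bool
	podExists        bool
}

// step is a simplified stand-in for one sub-reconciler. run returns a short
// message when it changed something and an empty string otherwise.
type step struct {
	name string
	run  func(s *state) string
}

// reconcile runs the steps strictly in order and collects what each one did.
func reconcile(s *state, steps []step) []string {
	var log []string
	for _, st := range steps {
		if msg := st.run(s); msg != "" {
			log = append(log, st.name+": "+msg)
		}
	}
	return log
}

// orderedSteps models only the relative order of the two steps discussed
// here: Pod creation comes before the exclusion check.
func orderedSteps() []step {
	return []step{
		{"addPods", func(s *state) string {
			if !s.podExists {
				s.podExists = true // no exclusion check happens at this point
				return "created pod"
			}
			return ""
		}},
		{"removeProcessGroups", func(s *state) string {
			if s.markedForRemoval && s.fullyExcluded {
				s.podExists = false
				return "removed pod"
			}
			return ""
		}},
	}
}

func main() {
	// A process group that was excluded manually and whose Pod was deleted.
	s := &state{markedForRemoval: true, fullyExcluded: true}
	for _, line := range reconcile(s, orderedSteps()) {
		fmt.Println(line)
	}
}
```

In this model the Pod is created first and removed only afterwards, even though the process group was already fully excluded when the loop started.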

Do you think you have the time/capacity to work on this?

@johscheuer
Member

@larshagencognite have you seen this issue again or are you able to write an e2e test for it?
