Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RHOAIENG-2974: Restrict DSCI deletion before DSC #1094

Conversation

Sara4994
Copy link
Contributor

@Sara4994 Sara4994 commented Jul 2, 2024

Description

This PR fixes a race condition that happens while uninstalling the odh-operator. The issue occurs while uninstalling, at times DSCI instance gets removed first and DSC instance hangs over with error status expecting DSCI instance to be created. Manual deletion also does not remove the hanging DSC instance. This PR address the issue by blocking the deletion of DSCI before DSC through webhooks.

JIRA issue: https://issues.redhat.com/browse/RHOAIENG-2974

How Has This Been Tested?

Issue was reported as a race condition. To reproduce the race condition manually, followed the below steps

  • Built and deployed the operator with this code changes.
  • Created DSCI and DSC instances and checked pod logs if they are fine without errors.
  • Deleted DSCI instance first and checked the status of DSC to be in error status
  • Manually tried to delete the DSC instance and was able to remove the instance.

Screenshot or short clip

Screenshot 2024-07-03 at 1 31 21 AM

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Member

@zdtsw zdtsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need test case updates in the webhook_suite_test.go

controllers/webhook/webhook.go Outdated Show resolved Hide resolved
controllers/webhook/webhook.go Outdated Show resolved Hide resolved
@Sara4994 Sara4994 force-pushed the uninstall-racecondition-fix branch from 7016ad6 to b83b9cb Compare July 3, 2024 16:53
@Sara4994 Sara4994 force-pushed the uninstall-racecondition-fix branch from b83b9cb to c6e8f45 Compare July 3, 2024 17:16
Copy link
Member

@zdtsw zdtsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two small comments, otherwise i am fine with the change

controllers/webhook/webhook_suite_test.go Show resolved Hide resolved
@@ -101,12 +103,24 @@ func (w *OpenDataHubWebhook) checkDupCreation(ctx context.Context, req admission
fmt.Sprintf("Only one instance of %s object is allowed", req.Kind.Kind))
}

func (w *OpenDataHubWebhook) checkDeletion(ctx context.Context, req admission.Request) admission.Response {
if req.Kind.Kind != "DSCInitialization" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to have a check here, we can do a if kind == DSC ,be more clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Sara4994
Copy link
Contributor Author

Sara4994 commented Jul 4, 2024

/retest

@zdtsw
Copy link
Member

zdtsw commented Jul 4, 2024

there are some failure in "linter" @Sara4994 i will approve once that is fixed.

@Sara4994
Copy link
Contributor Author

Sara4994 commented Jul 4, 2024

there are some failure in "linter" @Sara4994 i will approve once that is fixed.

@zdtsw fixed now :)

@openshift-ci openshift-ci bot added the lgtm label Jul 4, 2024
Copy link

openshift-ci bot commented Jul 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jul 4, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 2439104 into opendatahub-io:incubation Jul 4, 2024
8 checks passed
@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Jul 7, 2024

I might be missing something fundamental, but what I instantly thought when seeing this PR is - why not cascade deletion using owner ref and finalizers? DSCI would be an owner of DSC so deleting the former (DSCI) would result in always deleting the latter (DSC). With the approach brought by this PR, we force the user to trigger those two deletions in order manually.

Copy link

@adelton adelton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as @bartoszmajsak in #1094 (comment) -- why are we not using owner ref and finalizers?

@zdtsw
Copy link
Member

zdtsw commented Jul 8, 2024

I think one reason to do this is: if user mistakenly deleted DSCI (as origin issue described) this cascading would automatically delete DSC as well which in turn all components' resource etc.
the effect of this ^ would similar to uninstall operator.
do we want this? if so, we need someone document this change to the user.
A missing/broken DSCI can somehow mean operator is nonfunctional, but still enabled components can run as-is.

@adelton
Copy link

adelton commented Jul 9, 2024

Conceptually, what is the reason for DSC and DSCI being separate entities?

If deleting the DSCI CR produces a broken cluster, aren't the two actually linked so tightly together that it makes sense to treat them as very tightly connected? If the user/customer has the Operator nonfunctional but the components (seem to) work, is that a supported configuration?

And yes, I understand that implications of this would need to be figured out / investigated. So the webhook approach might be a reasonable stop gap.

@zdtsw
Copy link
Member

zdtsw commented Jul 9, 2024

Conceptually, what is the reason for DSC and DSCI being separate entities?

If deleting the DSCI CR produces a broken cluster, aren't the two actually linked so tightly together that it makes sense to treat them as very tightly connected? If the user/customer has the Operator nonfunctional but the components (seem to) work, is that a supported configuration?

And yes, I understand that implications of this would need to be figured out / investigated. So the webhook approach might be a reasonable stop gap.

The very old idea (i was not involved) is to have one DSCI for one cluster, but we could have multiple DSC:s in one cluster (I can only guess it is prep for the multi-tenant case) . But later on, this had been changed which makes it only supports one DSC and one DSCI in one cluster.
I do not know if any plan to continue with multiple DSC in the (near) future or we continue with 1:1:1 map, but yes, use webhook is only for "DSC deletion not hanging if no DSCI exist" case , till we re-define the relation between DSC and DSCI, e.g using Ownerreference or not.
cc @VaishnaviHire / @etirelli

ykaliuta pushed a commit to ykaliuta/opendatahub-operator that referenced this pull request Sep 17, 2024
* RHOAIENG-2974: Restrict DSCI deletion before DSC

* Added tests and minor enhancements

* Admission allow DSC

* lint fix

(cherry picked from commit 2439104)

Add DELETE operation for webhook in CSV, upstream

60a44c2 ("update: cleanup remove kfdef during uninstallation (opendatahub-io#1100)")
ykaliuta pushed a commit to ykaliuta/opendatahub-operator that referenced this pull request Sep 17, 2024
* RHOAIENG-2974: Restrict DSCI deletion before DSC

* Added tests and minor enhancements

* Admission allow DSC

* lint fix

(cherry picked from commit 2439104)

Add DELETE operation for webhook in CSV, upstream

60a44c2 ("update: cleanup remove kfdef during uninstallation (opendatahub-io#1100)")

Amend client -> Client due to already applied

03c1abc ("upgrade: controller-runtime and code change accordingly (opendatahub-io#1189)")
openshift-merge-bot bot pushed a commit to red-hat-data-services/rhods-operator that referenced this pull request Sep 18, 2024
* RHOAIENG-2974: Restrict DSCI deletion before DSC

* Added tests and minor enhancements

* Admission allow DSC

* lint fix

(cherry picked from commit 2439104)

Add DELETE operation for webhook in CSV, upstream

60a44c2 ("update: cleanup remove kfdef during uninstallation (opendatahub-io#1100)")

Amend client -> Client due to already applied

03c1abc ("upgrade: controller-runtime and code change accordingly (opendatahub-io#1189)")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants