Rewrote ISM Policy reconciler #846

rkthtrifork · 2024-06-18T11:28:30Z

Description

The ISM Policy reconciler was constantly trying to update the ISM Policy and it was not handling reconciliation requeue in some cases. There were possibly other issues as well. Below I have described what caused the different issues I encountered

The ISM Policy request was different from the response, but they were both made with the same struct. This caused the reconciler to always see the existing ISM Policy and the ISM Policy from the CR as different and try to update it. I have created a separate struct model for each to separate the logic and in the code I now compare the existing policy with the policy from the CR by comparing both the Policy IDs and the policy spec
There were some very complex cases in the code that were very difficult to understand so I have attempted to make the code more concise and easy to read and understand
I have added reconciliation requeuing to all cases so the operator doesn't just stop reconciling the ISM Policy in some cases

One thing I am wondering is that I am not sure why we would want to create a CR without specifying the cluster ID and then the operator automatically links it to that cluster ID so it breaks if the OpenSearch CR is deleted. Is this intended and why? I'm talking about the section with the comment "Check cluster ref has not changed"

Tested cases:

A new ISM Policy is created through a CR and the operator creates it in the OpenSearch Cluster
The CR for an ISM Policy that is created by the operator is removed and the operator removes it in the OpenSearch Cluster
An ISM Policy that already exists in the OpenSearch Cluster is created through a CR and the operator ignores it and marks it as existing
The CR for an ISM Policy that was pre-existing and therefore was not created by the operator is removed and the operator does not remove the ISM Policy from the OpenSearch Cluster
An ISM Policy that already exists in the OpenSearch Cluster is created through a CR and the operator ignores it and marks it as existing. The ISM Policy is then manually removed from the OpenSearch Cluster and the operator now applies the ISM Policy from the CR

The test for ISM Policies is currently failing miserably, but I decided to create the PR to get feedback before I dive into fixing it.

Issues Resolved

#833
#732
Possibly other issues

Check List

Commits are signed per the DCO using --signoff
Unittest added for the new/changed functionality and all unit tests are successful
Customer-visible features documented
No linter warnings (make lint)

If CRDs are changed:

CRD YAMLs updated (make manifests) and also copied into the helm chart
Changes to CRDs documented

Please refer to the PR guidelines before submitting this pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rkthtrifork · 2024-06-21T10:50:06Z

I'm taking a second look at my code to see if there's anything I'm not happy about and I see naked returns are sometimes used and sometimes not. I see that the first reconciler (cluster reconciler) does not use naked return values and it seems somewhat random when we use naked returns and when we use explicit return values. I personally find the naked returns quite difficult to read and I have to backtrack in the code to see where the return value comes from so I would like to use explicit return values everywhere for the ISM Policy. Is this something you would have an issue with for this PR? @swoehrl-mw (I'm mentioning you since I've seen you the most here)

rkthtrifork · 2024-06-21T11:29:59Z

I also don't understand why there are so many cases where the reconciliation is not requeued. Is this intended? And is it correctly understood that r.updateStatus is a boolean representing whether the operator should update the CR or is it something else?

rkthtrifork · 2024-06-21T12:13:22Z

And a last question. Is there a reason why we check that the cluster ref has not been changed? Would we not expect the CR to be applied to any OpenSearch Cluster with the given name? Especially since the name has to be unique for each namespace

cthtrifork · 2024-08-05T06:53:32Z

@swoehrl-mw Is there a possibility of a community call or similar to discuss this? We would like to help improve this, but we need to align on the approach.

swoehrl-mw · 2024-08-07T11:21:04Z

Hi @cthtrifork. There currently is no community call for the operator. I'm quite swamped with work so don't really have the time for a 1:1 call, the best I can offer is an async discussion either here or you can hit me up in the opensearch slack.

cc @prudhvigodithi

swoehrl-mw · 2024-08-07T11:35:24Z

Hi @rkthtrifork. Sorry for not getting back to you, not much time and your comments got lost among all my messages.

I'm taking a second look at my code to see if there's anything I'm not happy about and I see naked returns are sometimes used and sometimes not. I see that the first reconciler (cluster reconciler) does not use naked return values and it seems somewhat random when we use naked returns and when we use explicit return values. I personally find the naked returns quite difficult to read and I have to backtrack in the code to see where the return value comes from so I would like to use explicit return values everywhere for the ISM Policy. Is this something you would have an issue with for this PR?

This boils down to basically personal style. I'm fine with using explicit returns.

I also don't understand why there are so many cases where the reconciliation is not requeued. Is this intended? And is it correctly understood that r.updateStatus is a boolean representing whether the operator should update the CR or is it something else?

The r.updateStatus is just a little helper for testing to make writing tests a bit easier. For reconciles the basic rule is: Each single reconciler should reconcile if it ran into an error or if its waiting on something. If there was no problem or nothing to do the default should be reconcile after 30s. It could be a bit confusing because for all reconcilers running under the cluster banner the top-level controller does this 30s default reconcile requeue. For the others there might be inconsistencies. If you find any, cleaning them up would be dope.

And a last question. Is there a reason why we check that the cluster ref has not been changed? Would we not expect the CR to be applied to any OpenSearch Cluster with the given name? Especially since the name has to be unique for each namespace

That was an early design decision to avoid inconsistencies and make it safer for the operator (like what should it do if the cluster changes, should it try to delete the role from the old one).

rkthtrifork · 2024-08-07T12:15:43Z

I'm taking a second look at my code to see if there's anything I'm not happy about and I see naked returns are sometimes used and sometimes not. I see that the first reconciler (cluster reconciler) does not use naked return values and it seems somewhat random when we use naked returns and when we use explicit return values. I personally find the naked returns quite difficult to read and I have to backtrack in the code to see where the return value comes from so I would like to use explicit return values everywhere for the ISM Policy. Is this something you would have an issue with for this PR?

This boils down to basically personal style. I'm fine with using explicit returns.

Got it

I also don't understand why there are so many cases where the reconciliation is not requeued. Is this intended? And is it correctly understood that r.updateStatus is a boolean representing whether the operator should update the CR or is it something else?

The r.updateStatus is just a little helper for testing to make writing tests a bit easier. For reconciles the basic rule is: Each single reconciler should reconcile if it ran into an error or if its waiting on something. If there was no problem or nothing to do the default should be reconcile after 30s. It could be a bit confusing because for all reconcilers running under the cluster banner the top-level controller does this 30s default reconcile requeue. For the others there might be inconsistencies. If you find any, cleaning them up would be dope.

Got it

And a last question. Is there a reason why we check that the cluster ref has not been changed? Would we not expect the CR to be applied to any OpenSearch Cluster with the given name? Especially since the name has to be unique for each namespace

That was an early design decision to avoid inconsistencies and make it safer for the operator (like what should it do if the cluster changes, should it try to delete the role from the old one).

We have been talking about why it does this since we are running a test cluster where we sometimes kill OpenSearch completely during downgrades (when we retry an upgrade) and when we have tested some stuff that didn't go as planned. When we kill OpenSearch completely, we have to delete all the custom resources as well and we were unsure why it was made this way. I have no issue whatsoever with adding it back, but I wanted to raise the question first

rkthtrifork · 2024-08-12T13:19:40Z

@swoehrl-mw Should I add back the cluster ref check and fix the test and then you can take a look at the PR?

swoehrl-mw · 2024-08-13T07:05:38Z

Should I add back the cluster ref check and fix the test and then you can take a look at the PR?

@rkthtrifork Yes, please add back the ref check. That way all reconcilers are consistent.

Signed-off-by: rkthtrifork <[email protected]>

rkthtrifork · 2024-08-22T12:13:09Z

@swoehrl-mw I fixed up the unit tests and added back the cluster ref check. Would you or someone else take a look?

swoehrl-mw

LGTM. Thanks @rkthtrifork for your contribution.

prudhvigodithi · 2024-08-23T01:01:03Z

@swoehrl-mw Is there a possibility of a community call or similar to discuss this? We would like to help improve this, but we need to align on the approach.

Hey @cthtrifork I'm open for setting up a community call, we even have a slack channel #k8s-operator https://opensearch.slack.com/archives/C06QRV1RLD7, please join.
Thanks
@getsaurabh02

rkthtrifork · 2024-08-23T09:01:37Z

@swoehrl-mw Is there a possibility of a community call or similar to discuss this? We would like to help improve this, but we need to align on the approach.

Hey @cthtrifork I'm open for setting up a community call, we even have a slack channel #k8s-operator https://opensearch.slack.com/archives/C06QRV1RLD7, please join. Thanks @getsaurabh02

I joined the Slack channel as well. Idk if there's a reason for a community meeting in regards to discussing this now that it's already merged, but I still do think there are few inconsistencies with how the reconcilers handle different cases and I think I also saw some cases where some of the reconcilers didn't requeue when I would expect it to. Getting those aligned would be cool. I would have to check up on it before I have some concrete places in the code to mention

rkthtrifork marked this pull request as ready for review June 18, 2024 11:31

rkthtrifork requested review from idanl21, dbason, swoehrl-mw, prudhvigodithi, jochenkressin, pchmielnik and salyh as code owners June 18, 2024 11:31

prudhvigodithi mentioned this pull request Jun 20, 2024

[BUG] ISM policies and index templates are constantly updated #833

Open

rkthtrifork marked this pull request as draft August 5, 2024 06:54

rkthtrifork force-pushed the patch/fix-ismpolicy-needs-update branch 4 times, most recently from 3cca2d0 to a3b8382 Compare August 21, 2024 12:34

rkthtrifork marked this pull request as ready for review August 21, 2024 12:51

rkthtrifork force-pushed the patch/fix-ismpolicy-needs-update branch 2 times, most recently from cc6b2c8 to 623268a Compare August 22, 2024 12:08

Rewrote ISM Policy reconciler

c0cc5bb

Signed-off-by: rkthtrifork <[email protected]>

rkthtrifork force-pushed the patch/fix-ismpolicy-needs-update branch from 623268a to c0cc5bb Compare August 22, 2024 12:09

swoehrl-mw approved these changes Aug 22, 2024

View reviewed changes

swoehrl-mw merged commit ec6428d into opensearch-project:main Aug 22, 2024
8 of 9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrote ISM Policy reconciler #846

Rewrote ISM Policy reconciler #846

rkthtrifork commented Jun 18, 2024 •

edited

Loading

rkthtrifork commented Jun 21, 2024 •

edited

Loading

rkthtrifork commented Jun 21, 2024

rkthtrifork commented Jun 21, 2024 •

edited

Loading

cthtrifork commented Aug 5, 2024

swoehrl-mw commented Aug 7, 2024

swoehrl-mw commented Aug 7, 2024

rkthtrifork commented Aug 7, 2024

rkthtrifork commented Aug 12, 2024

swoehrl-mw commented Aug 13, 2024

rkthtrifork commented Aug 22, 2024

swoehrl-mw left a comment

prudhvigodithi commented Aug 23, 2024

rkthtrifork commented Aug 23, 2024

Rewrote ISM Policy reconciler #846

Rewrote ISM Policy reconciler #846

Conversation

rkthtrifork commented Jun 18, 2024 • edited Loading

Description

Issues Resolved

Check List

rkthtrifork commented Jun 21, 2024 • edited Loading

rkthtrifork commented Jun 21, 2024

rkthtrifork commented Jun 21, 2024 • edited Loading

cthtrifork commented Aug 5, 2024

swoehrl-mw commented Aug 7, 2024

swoehrl-mw commented Aug 7, 2024

rkthtrifork commented Aug 7, 2024

rkthtrifork commented Aug 12, 2024

swoehrl-mw commented Aug 13, 2024

rkthtrifork commented Aug 22, 2024

swoehrl-mw left a comment

Choose a reason for hiding this comment

prudhvigodithi commented Aug 23, 2024

rkthtrifork commented Aug 23, 2024

rkthtrifork commented Jun 18, 2024 •

edited

Loading

rkthtrifork commented Jun 21, 2024 •

edited

Loading

rkthtrifork commented Jun 21, 2024 •

edited

Loading