
Handle HSM for HFC #351

Open

iurygregory wants to merge 1 commit into master from hfc-handleAvailable

Conversation

iurygregory

No description provided.

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iurygregory
Once this PR has been reviewed and has the lgtm label, please assign dtantsur for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@honza (Member) commented May 8, 2024:

Is this a downstream-only change?

@iurygregory (Author):

@honza nope, it's just to test downstream first in the setup Jad has (I have a thread in Slack about it).

@honza (Member) commented May 8, 2024:

/hold

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 8, 2024
@rhjanders:

/retest

Review thread on the saveHostFirmwareComponents hunk:

```go
hfc.Status.Updates = hfc.Spec.Updates
t := metav1.Now()
hfc.Status.LastUpdated = &t
return nil
```
Member:

Without calling Update() this doesn't have any effect.

@iurygregory (Author):

You are talking about having the reconciler call Status().Update(), right? As in r.Status().Update(info.ctx, info.hfc).
Any ideas on how to access the HostFirmwareComponentsReconciler? Or is there another way to call it?

@hroyrh:

Since this function is in the baremetalhost controller, you can either define it as a method on the BMH reconciler or call Status().Update() in the caller function.
By the way, at line 1750, are we sure that all the updates have been applied by this point?
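
A minimal sketch of the second option, assuming the caller sits in the baremetalhost controller where `r` is the BareMetalHostReconciler and `info` carries the context and the HostFirmwareComponents object (as in the snippet quoted above); this is an illustration of the suggestion, not the PR's actual code:

```go
// Sketch only: persist the status fields that saveHostFirmwareComponents
// sets (Updates, LastUpdated) by having the caller issue the status write.
saveHostFirmwareComponents(hfc, info)
if err := r.Status().Update(info.ctx, info.hfc); err != nil {
	return actionError{errors.Wrap(err, "could not update hostfirmwarecomponents status")}
}
```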

@iurygregory (Author):

@hroyrh ohhh, that makes some sense!
Regarding line 1750: since this is in saveHostFirmwareComponents, I should probably rethink where I'm calling it in actionPreparing (L1174 at the moment). Maybe I should move it to L118, so we would be sure that there is nothing else left to do and we would return actionComplete right after it.

Review thread on the actionPreparing hunk:

```diff
@@ -1168,6 +1169,10 @@ func (r *BareMetalHostReconciler) actionPreparing(prov provisioner.Provisioner,
 	if err != nil {
 		return actionError{errors.Wrap(err, "could not save the host provisioning settings")}
 	}
+	if hfc != nil {
+		info.log.Info("saving hostfirmwarecomponents updates into status")
+		saveHostFirmwareComponents(hfc, info)
+	}
```
Member:

This updates the status when manual cleaning starts, but what if it fails?

I suspect we only want to update this after cleaning has succeeded, but we also have to think about what happens if updating the status fails.
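
A hypothetical sketch of the ordering the reviewer describes, using the conventions visible in the hunk above (provResult, actionContinue, and actionComplete are the controller's existing result types; the exact placement is an assumption):

```go
// Only record the firmware-component updates once manual cleaning has
// actually finished; while the provisioner still reports work in progress,
// leave the status untouched and requeue.
if provResult.Dirty {
	return actionContinue{provResult.RequeueAfter}
}
if hfc != nil {
	info.log.Info("saving hostfirmwarecomponents updates into status")
	saveHostFirmwareComponents(hfc, info)
}
return actionComplete{}
```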

@iurygregory (Author):

Right! Yeah, I agree we only want to update if it succeeded.
Do you mean updating the status in case an error happened during cleaning, or something else?


@iurygregory (Author):

Thanks @hroyrh

Member:

I mean if the call to Update() fails, do we get the chance to try again? Or will we restart the whole manual cleaning process or something weird like that?

@hroyrh:

But we don't have a way to remember that the Update failed in the last Reconcile, right? So do we have to check whether the config changes were already applied, by fetching the current Ironic node, and if yes, simply run Update again rather than the whole manual cleaning process, as you mentioned?

@iurygregory (Author):

One way could be what @hroyrh mentioned, I think...
We can check whether the information Ironic provides about Components is different (if there was a firmware update, something would be different in the Ironic DB).
Does it make sense?
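
A hypothetical sketch of that check; GetFirmwareComponents is an assumed provisioner helper (name and signature are assumptions), while the reflect.DeepEqual comparison and the Status().Update call come from the discussion above:

```go
// Fetch what Ironic currently reports for the components and retry only the
// status write when it differs from what the status already records,
// instead of restarting the whole manual cleaning process.
components, err := prov.GetFirmwareComponents() // assumed helper
if err != nil {
	return actionError{errors.Wrap(err, "could not get firmware components from ironic")}
}
if !reflect.DeepEqual(info.hfc.Status.Components, components) {
	info.hfc.Status.Components = components
	if err := r.Status().Update(info.ctx, info.hfc); err != nil {
		return actionError{errors.Wrap(err, "could not update hostfirmwarecomponents status")}
	}
}
```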

Review thread on the updateHostFirmware hunk:

```diff
@@ -223,9 +223,7 @@ func (r *HostFirmwareComponentsReconciler) updateHostFirmware(info *rhfcInfo, co

 	// Update Status if has changed
 	if dirty {
-		info.log.Info("Status for HostFirmwareComponents changed")
-		info.hfc.Status = *newStatus.DeepCopy()
```
Member:

Without this you're no longer writing the components read from ironic.

@iurygregory (Author):

Ok

@iurygregory force-pushed the hfc-handleAvailable branch 2 times, most recently from 94b0778 to 84f1818, on May 16, 2024 at 03:34
@openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 17, 2024
@openshift-merge-robot removed the needs-rebase label on Jun 21, 2024
@iurygregory force-pushed the hfc-handleAvailable branch 10 times, most recently from 57db30c to a903ba9, on June 28, 2024 at 01:57
Review thread on the newStatus.Components hunk:

```go
newStatus.Components = make([]metal3api.FirmwareComponentStatus, len(components))
for i := range info.hfc.Status.Components {
	components[i].DeepCopyInto(&newStatus.Components[i])
}
```
@iurygregory (Author):

I've changed to this approach to see if it would help, but maybe it's missing a guard condition before doing this, checking if !reflect.DeepEqual(info.hfc.Status.Components, components)?
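
A minimal sketch of that guard, assuming it sits where the hunk above copies the components. Note that ranging over components, rather than info.hfc.Status.Components, keeps the loop bounds consistent with the slice being copied:

```go
// Rebuild newStatus.Components only when the components read from Ironic
// actually differ from what the status already holds.
if !reflect.DeepEqual(info.hfc.Status.Components, components) {
	newStatus.Components = make([]metal3api.FirmwareComponentStatus, len(components))
	for i := range components {
		components[i].DeepCopyInto(&newStatus.Components[i])
	}
	dirty = true
}
```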

@iurygregory force-pushed the hfc-handleAvailable branch 4 times, most recently from cba0e20 to 6e930a0, on July 2, 2024 at 16:21
@iurygregory (Author):

I've updated this PR to address some of the upstream comments from 1793 and 1821. I'm still struggling to figure out how to properly update the Components information with the newer information Ironic has about the firmware.

@iurygregory force-pushed the hfc-handleAvailable branch 2 times, most recently from 89a6138 to 934b7a3, on July 25, 2024 at 18:13
@iurygregory (Author):

@dtantsur @zaneb
This approach works!
In Scenario 1 at least 😅: when you add a BMH + HostFirmwareComponents CRD, the firmware is updated in preparing. When the host reaches available, the newer info for the component is not yet in status.components; we need to scale up the machine set to make the BMH go to provisioning, and after some time during this phase the information gets updated.

I'm going to test Scenario 2, where we have a provisioned BMH and we need to scale down and scale up.

- Fixed the comments provided in the upstream review
- Investigated how to update the newer firmware information in Status after the update
openshift-ci bot commented Jul 26, 2024:

@iurygregory: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-serial-ipv4 | dcb99c2 | link | true | /test e2e-metal-ipi-serial-ipv4 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
7 participants