Graduate downwardMetrics to feature KV lifecycle #12650

Closed
wants to merge 4 commits

Conversation

jcanocan
Contributor

@jcanocan jcanocan commented Aug 22, 2024

What this PR does

It deprecates the `DownwardMetrics` feature gate in favor of a configurable field: `spec.downwardMetrics: {}`. The field is disabled by default and gives the user the option to enable the feature cluster-wide if desired. Moreover, it adds checks that block a VM from starting if the feature configuration is out of sync, and runtime checks that inform the user when it is out of sync.

Fixes CNV-43919

Special notes for your reviewer

You can use the following steps to try it yourself:

  1. Create a cluster that allows VM live migration:
export KUBEVIRT_STORAGE=rook-ceph-default
export KUBEVIRT_MEMORY_SIZE=12000M
export KUBEVIRT_NUM_NODES=2
make cluster-up
  2. Create the following VM:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    special: vmi-fedora-1
  name: vmi-fedora-1
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/domain: fedora-1
        kubevirt.io/vm: fedora-1
      annotations:
        descheduler.alpha.kubernetes.io/evict: "true"
    spec:      
      domain:
        devices:
          downwardMetrics: {}
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
            - name: default
              masquerade: {}
        machine:
          type: ""
        resources:
          requests:
            memory: 1024M
      terminationGracePeriodSeconds: 180
      networks:
        - name: default
          pod: {}
      volumes:
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            chpasswd:
              expire: false
            password: fedora
            user: fedora
        name: cloudinitdisk
      - name: containerdisk
        containerDisk:
          image: kubevirt/fedora-cloud-container-disk-demo:latest
  3. Start the VM. The VMI object will show an error stating that the feature has not been enabled, and the VMI will remain in the Pending state.
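You can start it, for example, with virtctl (assuming the stock start subcommand):
$ ./cluster-up/virtctl.sh start vmi-fedora-1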
  4. Enable the feature. The VM will eventually start.
$ ./cluster-up/kubectl.sh patch kubevirt kubevirt -n kubevirt --type json -p '[{"op":"add", "path":"/spec/downwardMetrics", "value": {}}]'
  5. Disable the feature. The VM will continue running, and you will still be able to fetch the downward metrics inside the VM as normal.
$ ./cluster-up/kubectl.sh patch kubevirt kubevirt -n kubevirt --type json -p '[{"op":"remove", "path":"/spec/downwardMetrics"}]'
  6. Live migrate the VM. The migration will complete successfully.
$ ./cluster-up/virtctl.sh migrate vmi-fedora-1

Checklist

This checklist is not enforced, but it is a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Deprecated the `DownwardMetrics` feature gate and introduced the `spec.downwardMetrics: {}` field to enable the feature (disabled by default).

@kubevirt-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/compute labels Aug 22, 2024
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alicefr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2024
@kubevirt-bot kubevirt-bot added sig/observability Denotes an issue or PR that relates to observability. size/XL and removed size/L labels Sep 20, 2024
@jcanocan jcanocan changed the title WIP: Adjust downward metrics to feature KV lifecycle Graduate downwardMetrics to feature KV lifecycle Sep 20, 2024
@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 20, 2024
@jcanocan jcanocan force-pushed the adjust-dwm-feature-lifecycle branch 2 times, most recently from 6faa6bd to f3cbf37 Compare September 20, 2024 10:29
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2024
@jcanocan jcanocan marked this pull request as ready for review September 20, 2024 10:30
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 20, 2024
@dosubot dosubot bot added area/api-server kind/deprecation Indicates the PR/issue deprecates a feature that will be removed in a subsequent release. labels Sep 20, 2024
@jcanocan jcanocan force-pushed the adjust-dwm-feature-lifecycle branch 2 times, most recently from f5868b7 to 065ce62 Compare September 25, 2024 14:40
@jcanocan
Contributor Author

/retest-required

@jcanocan
Contributor Author

Could @EdDev take a look at this, please?

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 28, 2024
@jcanocan jcanocan force-pushed the adjust-dwm-feature-lifecycle branch from 7824733 to 305209e Compare October 7, 2024 10:11
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2024
@jcanocan jcanocan force-pushed the adjust-dwm-feature-lifecycle branch 2 times, most recently from 7a699a9 to d7c0ff3 Compare October 8, 2024 07:14
@jcanocan
Contributor Author

jcanocan commented Oct 8, 2024

v2: The VM does not start, i.e., it does not create a VMI object, if an invalid configuration is detected.

pkg/controller/controller.go Outdated Show resolved Hide resolved
@@ -107,6 +107,10 @@ func IsS390X(arch string) bool {
return arch == "s390x"
}

func (c *ClusterConfig) IsDownwardMetricsEnabled() bool {
Member

IsDownwardMetricsEnabled and DownwardMetricsEnabled are very similar names that can easily be confused. Can you give it another name?

Contributor Author

Changed it to IsDownwardMetricsFeatureEnabled.
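
For readers following along, a minimal sketch of what the renamed helper could look like, assuming it only needs to report whether the new cluster-wide field is set (the PR implements this as a method on ClusterConfig; the standalone form and the kv parameter here are illustrative only):

package virtconfig

import v1 "kubevirt.io/api/core/v1"

// IsDownwardMetricsFeatureEnabled reports whether the cluster-wide
// spec.downwardMetrics field is set on the KubeVirt CR.
// Illustrative sketch only; in the PR the helper lives on ClusterConfig.
func IsDownwardMetricsFeatureEnabled(kv *v1.KubeVirt) bool {
	return kv != nil && kv.Spec.DownwardMetrics != nil
}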

@@ -1351,6 +1352,13 @@ func (c *Controller) startVMI(vm *virtv1.VirtualMachine) (*virtv1.VirtualMachine
return vm, fmt.Errorf("failed create validation: %v", validateErr)
}

if downwardmetrics.IsDownwardMetricsConfigurationInvalid(c.clusterConfig, &vmi.Spec) {
err = fmt.Errorf("DownwardMetrics feature is not enabled")
Member

errors.New is sufficient.

Contributor Author

Done

pkg/virt-controller/watch/vm/vm.go Outdated Show resolved Hide resolved
@@ -3255,6 +3263,12 @@ func (c *Controller) sync(vm *virtv1.VirtualMachine, vmi *virtv1.VirtualMachineI
vm = vmCopy
}

if vmi == nil {
if downwardmetrics.IsDownwardMetricsConfigurationInvalid(c.clusterConfig, &vm.Spec.Template.Spec) {
return vm, common.NewSyncError(fmt.Errorf("DownwardMetrics feature is not enabled"), controller.FeatureNotEnabled), nil
Member

fmt.Errorf("DownwardMetrics feature is not enabled")

Make this a const or a custom error in the downwardmetrics pkg?

Contributor Author

Done. I've also replaced all occurrences of this string with the custom error variable.
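
For orientation, a rough sketch of the resulting pieces in the downwardmetrics package. The error variable matches a later diff in this PR; the lowercase function name, the bool parameter, and the body are assumptions — the real helper takes the ClusterConfig:

package downwardmetrics

import (
	"errors"

	v1 "kubevirt.io/api/core/v1"
)

// Matches the variable shown in a later diff of this PR.
var DownwardMetricsNotEnabledError = errors.New("DownwardMetrics feature is not enabled")

// Hypothetical sketch: the configuration is invalid when the spec requests
// downward metrics (device or volume) while the feature is disabled cluster-wide.
func isConfigurationInvalid(featureEnabled bool, spec *v1.VirtualMachineInstanceSpec) bool {
	if featureEnabled {
		return false
	}
	if spec.Domain.Devices.DownwardMetrics != nil {
		return true
	}
	for _, volume := range spec.Volumes {
		if volume.DownwardMetrics != nil {
			return true
		}
	}
	return false
}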

})

AfterEach(func() {
tests.EnableDownwardMetrics(virtClient)
Member

Should be reset to original value, i.e. reset the KV object after? I think we have helpers for that.

Contributor Author

What would be the benefit of resetting the entire KV object? IMHO, if we are only changing one specific value, it makes sense to just reset that value back.

tests/migration/migration.go Outdated Show resolved Hide resolved
@@ -75,6 +75,11 @@ func AdjustKubeVirtResource() {
},
}}

// Add the DownwardMetrics configuration; it avoids making some tests run serially
if kv.Spec.DownwardMetrics == nil {
kv.Spec.DownwardMetrics = &v1.DownwardMetricsConfiguration{}
Member

This enables the feature which should be off by default for everyone else? I don't think this is good practice.

Contributor Author

If we don't enable it by default, the live migration downwardMetrics tests will need to run serially because we are updating the KubeVirt CR. In the past, in order to allow these tests to run in parallel, we enabled the downwardMetrics FG by default in the tests. This is why I've included this.

Member

The test is marked as Serial already? I'd rather not enable something globally to avoid Serial just for a single test.

Contributor Author

This is the particular test I'm referring to: https://github.com/kubevirt/kubevirt/pull/12650/files#diff-5034f7c82b377dd518dd147554ad7df560b0cc667f1f0aa08e7a632d1436237aR471
It's just two tests, but if we disable the feature globally, we need to make them Serial. I'm not against it; whatever we consider best.

Member

If the tests require it to be enabled but it is not by default, then they should be Serial.
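
For context, a hedged sketch of what marking those two specs Serial could look like with Ginkgo v2's decorator (the spec names and body here are placeholders):

package migration

import . "github.com/onsi/ginkgo/v2"

// Serial-decorated specs never run in parallel with other specs, so a
// cluster-wide toggle can be flipped inside them without racing other tests.
var _ = Describe("DownwardMetrics with live migration", Serial, func() {
	It("keeps serving metrics after the feature is disabled", func() {
		// flip kv.Spec.DownwardMetrics here, then migrate the VMI
	})
})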

kv := libkubevirt.GetCurrentKv(client)
kv.Spec.DownwardMetrics = &v1.DownwardMetricsConfiguration{}

updateKubevirtSpec(kv)
Member

Use libkubevirt.UpdateKubeVirtConfigValueAndWait and drop updateKubevirtSpec.

Contributor Author

As far as I was able to understand, UpdateKubeVirtConfigValueAndWait will just update kv.Spec.Configuration, while the downwardMetrics configurable lives directly under kv.Spec. Therefore, if I use UpdateKubeVirtConfigValueAndWait, the change won't be reflected.

Member

I see, can libkubevirt be updated to do what we want?

Contributor Author

We are updating kv.Spec; that's what updateKubevirtSpec is already doing. Sorry, maybe I'm missing something here. Could you please elaborate a bit more?

Member

Still, can libkubevirt be used instead? And the Disable/Enable helpers should live closer to where they are used. We should not grow tests/utils.go at all.
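
To make the distinction concrete: UpdateKubeVirtConfigValueAndWait only touches kv.Spec.Configuration, while the new field lives one level up, directly under kv.Spec. A hedged sketch of the kind of helper being discussed (the function name is illustrative, and the generated client's Update signature is assumed):

package tests

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	v1 "kubevirt.io/api/core/v1"
	"kubevirt.io/client-go/kubecli"
)

// enableDownwardMetrics persists a change to kv.Spec itself (not
// kv.Spec.Configuration), which is why UpdateKubeVirtConfigValueAndWait
// cannot be reused as-is.
func enableDownwardMetrics(virtClient kubecli.KubevirtClient, kv *v1.KubeVirt) (*v1.KubeVirt, error) {
	kv.Spec.DownwardMetrics = &v1.DownwardMetricsConfiguration{}
	return virtClient.KubeVirt(kv.Namespace).Update(context.Background(), kv, metav1.UpdateOptions{})
}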

@@ -661,6 +662,9 @@ func (c *Controller) updateStatus(vmi *virtv1.VirtualMachineInstance, pod *k8sv1
c.syncVolumesUpdate(vmiCopy)
}

// Requires some action in the VM to trigger this check; is this ok?
Member

That is a good question.

@xpivarc @fossedihelm Can you help us shed some light on this? Will the VMs/VMIs be resynced at some point?

@0xFelix
Member

0xFelix commented Oct 8, 2024

Also, can you please highlight a bit more that this is a breaking change?

The feature moves from being enabled to being disabled by default.

@0xFelix
Member

0xFelix commented Oct 8, 2024

Yet another piece of feedback: Could you please split deprecation of the FG and adding the new configurable into separate PRs?

It deprecates the `DownwardMetrics` feature gate, since this feature has
already reached GA. The feature gate is no longer required to use the
downwardMetrics feature.

Signed-off-by: Javier Cano Cano <[email protected]>
Currently, this function only accepts VMI objects to determine whether a
downwardMetrics volume is being used. This commit generalizes the function
to accept a list of Volumes instead.

Signed-off-by: Javier Cano Cano <[email protected]>
It adds a new configurable field to enable/disable the downwardMetrics
feature cluster-wide.

Signed-off-by: Javier Cano Cano <[email protected]>
@jcanocan jcanocan force-pushed the adjust-dwm-feature-lifecycle branch from d7c0ff3 to 2da9406 Compare October 8, 2024 16:52
@jcanocan
Contributor Author

jcanocan commented Oct 8, 2024

Also, can you please highlight a bit more that this is a breaking change?

The feature moves from being enabled to being disabled by default.

Currently, the feature is disabled by default. If the user does not enable the FG, the feature stays disabled. However, it is true that users who already enabled the feature will need some manual action; otherwise, if they reboot a VM, it will fail to start.

@jcanocan
Contributor Author

jcanocan commented Oct 8, 2024

Yet another piece of feedback: Could you please split deprecation of the FG and adding the new configurable into separate PRs?

I would like to keep it in the same PR if possible. If we deprecate the FG and then add the new configurable in a separate PR, it would create a gap in which the feature would be enabled by default. I would like to avoid this scenario by all means, because having this feature enabled by default could be considered a security issue.

Enables the field `spec.downwardMetrics: {}`. The configurable is disabled
by default and gives the user the option to enable it cluster-wide if
desired. Moreover, it adds checks that block a VM from starting if the
feature configuration is out of sync, and runtime checks that inform the
user when it is out of sync.

Signed-off-by: Javier Cano Cano <[email protected]>
@jcanocan jcanocan force-pushed the adjust-dwm-feature-lifecycle branch from 2da9406 to bceced8 Compare October 8, 2024 17:41
@kubevirt-bot
Contributor

kubevirt-bot commented Oct 8, 2024

@jcanocan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-fuzz bceced8 link false /test pull-kubevirt-fuzz
pull-kubevirt-e2e-k8s-1.31-sig-compute bceced8 link true /test pull-kubevirt-e2e-k8s-1.31-sig-compute
pull-kubevirt-e2e-k8s-1.31-sig-storage bceced8 link true /test pull-kubevirt-e2e-k8s-1.31-sig-storage
pull-kubevirt-check-tests-for-flakes bceced8 link false /test pull-kubevirt-check-tests-for-flakes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2024
@kubevirt-bot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -1026,6 +1030,10 @@ func (c *Controller) sync(vmi *virtv1.VirtualMachineInstance, pod *k8sv1.Pod, da
if validateErr := errors.Join(validateErrors...); validateErrors != nil {
return common.NewSyncError(fmt.Errorf("failed create validation: %v", validateErr), "FailedCreateValidation")
}
if downwardmetrics.IsDownwardMetricsConfigurationInvalid(c.clusterConfig, &vmi.Spec) {
c.recorder.Eventf(vmi, k8sv1.EventTypeWarning, controller.FeatureNotEnabled, downwardmetrics.DownwardMetricsNotEnabledError.Error())
return common.NewSyncError(fmt.Errorf("virtual machine is requesting a disabled feature: %s", "DownwardMetrics"), controller.FeatureNotEnabled)
Member

Use `downwardmetrics.DownwardMetricsNotEnabledError` here too?

The use of fmt.Errorf with static strings looks weird.
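
Something like the following, mirroring the reviewer's suggestion (a sketch, not a verified diff):

if downwardmetrics.IsDownwardMetricsConfigurationInvalid(c.clusterConfig, &vmi.Spec) {
	c.recorder.Eventf(vmi, k8sv1.EventTypeWarning, controller.FeatureNotEnabled, downwardmetrics.DownwardMetricsNotEnabledError.Error())
	// Reuse the shared error value instead of rebuilding the message with fmt.Errorf.
	return common.NewSyncError(downwardmetrics.DownwardMetricsNotEnabledError, controller.FeatureNotEnabled)
}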

@@ -17,6 +19,8 @@ const (
DownwardMetricsChannelSocket = DownwardMetricsChannelDir + "/downwardmetrics.sock"
)

var DownwardMetricsNotEnabledError = errors.New("DownwardMetrics feature is not enabled")
Member

Suggested change
var DownwardMetricsNotEnabledError = errors.New("DownwardMetrics feature is not enabled")
var NotEnabledError = errors.New("DownwardMetrics feature is not enabled")

When used as `downwardmetrics.NotEnabledError`, this avoids repetition.

@@ -580,6 +580,9 @@ const (

// Indicates whether the VMI is live migratable
VirtualMachineInstanceIsStorageLiveMigratable VirtualMachineInstanceConditionType = "StorageLiveMigratable"

// Indicates that the VMI has a configuration out of sync with the cluster-wide configuration
VirtualMachineInstanceConfigurationOutOfSync VirtualMachineInstanceConditionType = "FeatureConfigurationOutOfSync"
Member

Suggested change
VirtualMachineInstanceConfigurationOutOfSync VirtualMachineInstanceConditionType = "FeatureConfigurationOutOfSync"
VirtualMachineInstanceConfigurationOutOfSync VirtualMachineInstanceConditionType = "ConfigurationOutOfSync"

BeforeEach(func() {
virtClient = kubevirt.Client()
tests.EnableDownwardMetrics(virtClient)
Member

I see. Still, it looks strange to me that we enable DownwardMetrics globally in the test setup and then again in the BeforeEach of these tests.

Comment on lines +234 to +237
Consistently(func() bool {
_, err := virtClient.VirtualMachineInstance(vm.Namespace).Get(context.Background(), vm.Name, metav1.GetOptions{})
return errors.IsNotFound(err)
}, 60*time.Second, 5*time.Second).Should(BeTrue())
Member

Assert the condition on the VM instead, and try once to get the VMI expecting an IsNotFound error, so as not to prolong the test unnecessarily?
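
A hedged sketch of the suggested shape (the matcher helpers and the exact VM condition asserted here are assumptions):

// Wait once for the VM to surface the failure condition...
Eventually(matcher.ThisVM(vm), 60*time.Second, time.Second).Should(matcher.HaveConditionTrue(v1.VirtualMachineFailure))
// ...then a single lookup instead of a 60-second Consistently poll.
_, err := virtClient.VirtualMachineInstance(vm.Namespace).Get(context.Background(), vm.Name, metav1.GetOptions{})
Expect(errors.IsNotFound(err)).To(BeTrue())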

var vmi *v1.VirtualMachineInstance

BeforeEach(func() {
tests.EnableDownwardMetrics(virtClient)
Member

Same as above, it looks strange to me that we call this in BeforeEach/AfterEach and in the global test setup.

@jcanocan
Contributor Author

jcanocan commented Oct 9, 2024

Given the potential concerns about the breaking change this implies, let's split the graduation across multiple releases/PRs.

@jcanocan
Contributor Author

/close

@kubevirt-bot
Contributor

@jcanocan: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
area/api-server dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/deprecation Indicates the PR/issue deprecates a feature that will be removed in a subsequent release. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api Denotes an issue or PR that relates to changes in api. sig/compute sig/observability Denotes an issue or PR that relates to observability. sig/virtualization size/XL