design-proposal: Feature configurables #316

jcanocan · 2024-08-14T15:21:41Z

What this PR does / why we need it:
This design document states how features that require a mechanism to change their state, e.g., enabled/disabled, should be implemented in KubeVirt.

Special notes for your reviewer:

This is the current list of GA'd features present in KubeVirt/KubeVirt which are still using feature gates and are shown as
configurables in HCO:

- DownwardMetrics
- Root (not sure about this one)
- DisableMDEVConfiguration
- PersistentReservation
- AutoResourceLimitsGate
- AlignCPUs

This is the current list of GA'd features present in KubeVirt/KubeVirt which are still using feature gates and are always
enabled by HCO:

- CPUManager
- Snapshot
- HotplugVolumes
- GPU
- HostDevices
- NUMA
- VMExport
- DisableCustomSELinuxPolicy
- KubevirtSeccompProfile
- HotplugNICs
- VMPersistentState
- NetworkBindingPlugins
- VMLiveUpdateFeatures

Please note that only feature gates included in KubeVirt/KubeVirt are listed here.

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Design: A design document was considered and is present (link) or not required
PR: The PR description is expressive enough and will help future contributors
Code: Write code that humans can understand and Keep it simple
Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
Upgrade: Impact of this change on upgrade flows was considered and addressed if required
Testing: New code requires new unit tests. New features and bug fixes require at least one e2e test
Documentation: A user-guide update was considered and is present (link) or not required. You want a user-guide update if it's a user facing feature / API change.
Community: Announcement to kubevirt-dev was considered

Release note:

Added proposal to introduce how configurable features should be implemented.

jcanocan · 2024-08-14T16:51:11Z

/cc @0xFelix @lyarwood

dankenigsberg · 2024-08-15T07:13:46Z

design-proposals/configurable-features.md

+# Design
+If a developer wants to make a feature configurable, he needs to do so by adding new fields to the KubeVirt CR under `spec.configuration`.
+
+> **NOTE:** The inclusion of these new KubeVirt API fields should be carefully considered and justified. The feature configurables should be avoided as much as possible.


Correct. There should be very good reasons to complicate our API with new fields.

I think that this proposal would benefit if it included concrete examples of features that really require it. In fact, I'd love to see a comprehensive list of all GAed features that need a cluster-wide configuration.

I tried my best to add a list.
I've created two separate lists of feature gates:

- Those that can be tuned in HCO, and that the downstream documentation asks the user to switch in order to get a specific feature. These should have a configurable.
- Those that are always enabled in HCO by default, for which, IMHO, we should either remove the feature gate entirely or create a configurable.

Could you please confirm if the features listed here are GA'd?

> Could you please confirm if the features listed here are GA'd?

I cannot, but we must know the answer, even if it means reaching out to every developer that introduced each feature gate.

Perfect. I agree, maybe we can create an issue for each feature gate and ping the last 2 or 3 contributors to each of them.

Makes sense. I think that contacting them should be part of this proposal, as they are the stakeholders who would need to do the work prescribed here.

> Could you please confirm if the features listed here are GA'd?

FYI @stu-gott @acardace, perhaps we can collaborate to list the different FGs and their graduation state.

dankenigsberg · 2024-08-19T16:36:59Z

design-proposals/configurable-features.md

+- AlignCPUs
+
+This is the current list of GA'd features present in KubeVirt/KubeVirt which are still using feature gates and are always
+enabled by [HCO](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/controllers/operands/kubevirt.go#L125-L142):


Does this mean that you recommend that the feature flag is dropped and the feature is enabled on all clusters?

I've assumed that those features are GA since in downstream they are enabled in a hardcoded way, i.e., the user can't disable them by any means. Therefore, IMHO, they should drop the feature flag.

Those are Kubevirt FeatureGates that are always set by HCO as part of its opinionated deployment.
Those are not exposed to the cluster admin by HCO and they cannot be enabled/disabled.

Yes. Correct. Thanks for the double check.

dankenigsberg · 2024-08-19T16:43:42Z

design-proposals/configurable-features.md

+  Pending state.
+- Feature status checks should only be performed during the scheduling process, not at runtime. Therefore, the feature
+  status changes in the KubeVirt CR should not affect running VMis. Moreover, the VMi should still be able to live
+  migrate, preserving its original feature state.


This is a very difficult requirement. If a feature is disabled, I don't expect things like new nodes to support it (e.g., they would not have the needed daemonset running on them) and I would not expect the VMI to migrate to them.

If a feature is disabled, VMs that use it may die. I don't think that we should actively kill them, but we should not promise anything about them. I think that we should just alert that a VM is using a feature that is no longer supported by the cluster.

> If a feature is disabled, I don't expect things like new nodes to support it (e.g., they would not have the needed daemonset running on them) and I would not expect the VMI to migrate to them.

Ack.

> I don't think that we should actively kill them,

Agree.

> but we should not promise anything about them
>
> If a feature is disabled, VMs that use it may die.

I do not agree. What reason do you see for not promising the same as we promised at the time the VM was started?

> I think that we should just alert that VM is using a feature that is no longer supported by the cluster.

Let's check if this is implementable using a reasonable amount of effort at runtime.

> > but we should not promise anything about them
> > If a feature is disabled, VMs that use it may die.
>
> I do not agree. What reason do you see for not promising the same as we promised at the time the VM was started?

The moment that I, the cluster-admin, decided to turn off the feature, I already broke my "promise". For example, a daily VM-bound job would fail to start. Or a VM with runStrategy=Always would fail to start if the node crashes. I see no reason to add (and no way to promise) a special case for when a node is turned off and VMs are migrated away. When I disable a feature (say, because I realize that downwardMetrics exposes too much information to the guest), I want VMs to stop using it.

Another way to think of this is with eventual consistency in mind. Assume that a VM started just as the admin disabled the feature. We should not promise eternal life to the VM if it sneaked in a microsecond earlier. If the feature is disabled, then eventually, a VM using it should not be running.

> > I think that we should just alert that VM is using a feature that is no longer supported by the cluster.
>
> Let's check if this is implementable using a reasonable amount of effort at runtime.

Sure. I hope it would not be hard to add a condition reporting that the VM uses a feature in state A but the cluster has it in state B.
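A minimal Go sketch of such a condition might look like the following. The `FeatureState` type, the `FeatureStateMismatch` condition type, and the reason string are all hypothetical names chosen for illustration; none of them are existing KubeVirt API.

```go
package main

import "fmt"

// FeatureState is a hypothetical enum used only for this illustration.
type FeatureState string

const (
	Enabled  FeatureState = "Enabled"
	Disabled FeatureState = "Disabled"
)

// Condition mirrors the general shape of a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// featureMismatchCondition returns a condition when the state a VM uses
// differs from the state configured in the cluster, and nil otherwise.
func featureMismatchCondition(feature string, vmState, clusterState FeatureState) *Condition {
	if vmState == clusterState {
		return nil
	}
	return &Condition{
		Type:   "FeatureStateMismatch",        // hypothetical condition type
		Status: "True",
		Reason: "ClusterConfigurationChanged", // hypothetical reason
		Message: fmt.Sprintf("VM uses %s in state %s but the cluster has it in state %s",
			feature, vmState, clusterState),
	}
}

func main() {
	if c := featureMismatchCondition("downwardMetrics", Enabled, Disabled); c != nil {
		fmt.Println(c.Type+":", c.Message)
	}
}
```

The condition only reports the mismatch; it does not stop or block the VM, which matches the "alert, don't kill" position above.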

dankenigsberg · 2024-08-19T16:46:38Z

design-proposals/configurable-features.md

+  status changes in the KubeVirt CR should not affect running VMis. Moreover, the VMi should still be able to live
+  migrate, preserving its original feature state.
+- Optionally, It could enable the possibility to reject the KubeVirt CR change request if running VMis are using the
+  feature in a given state. However, by the default the request should be accepted.


This option cannot be implemented safely, as it is racy. Besides, it gives a single VM of a single user the ability to block the cluster-admin from making a change to their cluster.

I suggested this use case and agree that as written it can easily race if a creation request from a user comes in while the admin is attempting to disable a given feature.

That said I think that at a high level the use case is still valid and that some cluster admins aren't going to want to disable a feature cluster wide that's being used by running VMs.

> That said I think that at a high level the use case is still valid and that some cluster admins aren't going to want to disable a feature cluster wide that's being used by running VMs.

Right. I think that cluster admins would want to know if the functionality that they are disabling is currently being used. They may also want to know by whom (as not all VMs are created equal). But a mere VM owner should not be in the position to block the cluster admin from expressing an intent that a feature is to be reconfigured.

I'm just a bit concerned about the time and resources needed to fetch all VMIs using the feature.

> > That said I think that at a high level the use case is still valid and that some cluster admins aren't going to want to disable a feature cluster wide that's being used by running VMs.
>
> Right. I think that cluster admins would want to know if the functionality that they are disabling is currently being used. They may also want to know by whom (as not all VMs are created equal). But a mere VM owner should not be in the position to block the cluster admin from expressing an intent that a feature is to be reconfigured.

+1
I'm not sure we can make a simplistic rule here; it seems possible that in some cases it's useful to change the configuration for running VMs and in others it isn't. Perhaps it's useful to start off with more concrete examples and have a discussion regarding them.

Correct. And by "warning" I mean "raise an alert, event, or condition" on any affected VMs.

Ah, I see, so it should be the VM itself that triggers the alert, event, warning, etc. Did I understand correctly?

Good. But we should not have a non-default attempt to block modification of the spec. It cannot be done safely (it is racy), so we should not try.

Works for me.

> it should be the VM itself that triggers the alert, event, warning, etc.

As a VM owner, I see value in knowing that my VM is using a feature that is being removed from the cluster. So yes, a VM condition would be useful.

All right. Would that also work for the cluster admin changing configurations? I.e., as a cluster admin, I'm changing configurables and want to know if there are any affected VMs. Or should the cluster admin be careful enough to run a command like the one you posted before, `kubectl get vm -A -o yaml | grep someRequestedFeature`, to know it in advance?

> should the cluster admin be careful enough to run a ... `kubectl get vm -A -o yaml | grep someRequestedFeature` to know it in advance?

I don't see any other way. Note that this is also racy. Someone may create a VM with someRequestedFeature a microsecond later. That's the nature of our platform - the cluster admin cannot "know" at any point. They can only declare what is desired, and eventually the cluster would get there.

All right. Thanks for the input. I will drop the sentence or replace it with something like "the KubeVirt CR change can't be rejected; the cluster admin is responsible for knowing in advance whether the feature configurable change will affect running VMIs".
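The advance check discussed here can be sketched in Go as a simple filter over a snapshot of VMs. `VMSummary` and its `Features` field are hypothetical simplifications for illustration, not KubeVirt types; a real tool would inspect the actual VM spec.

```go
package main

import "fmt"

// VMSummary is a hypothetical, simplified view of a VM.
type VMSummary struct {
	Namespace string
	Name      string
	Features  []string // features requested in the VM spec (assumed representation)
}

// affectedVMs returns "namespace/name" for every VM in the snapshot that
// requests the given feature. The result can be stale as soon as it is
// returned, because new VMs may be created at any time.
func affectedVMs(vms []VMSummary, feature string) []string {
	var out []string
	for _, vm := range vms {
		for _, f := range vm.Features {
			if f == feature {
				out = append(out, vm.Namespace+"/"+vm.Name)
				break
			}
		}
	}
	return out
}

func main() {
	snapshot := []VMSummary{
		{Namespace: "team-a", Name: "db", Features: []string{"downwardMetrics"}},
		{Namespace: "team-b", Name: "web"},
	}
	fmt.Println(affectedVMs(snapshot, "downwardMetrics"))
}
```

Because the list is only a snapshot, it can inform the cluster admin but cannot safely gate the change, which is the point made above about races.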

lyarwood · 2024-08-20T16:41:38Z

design-proposals/configurable-features.md

+  status changes in the KubeVirt CR should not affect running VMis. Moreover, the VMi should still be able to live
+  migrate, preserving its original feature state.
+- Optionally, It could enable the possibility to reject the KubeVirt CR change request if running VMis are using the
+  feature in a given state. However, by the default the request should be accepted.


I suggested this use case and agree that as written it can easily race if a creation request from a user comes in while the admin is attempting to disable a given feature.

That said I think that at a high level the use case is still valid and that some cluster admins aren't going to want to disable a feature cluster wide that's being used by running VMs.

lyarwood · 2024-08-20T16:43:46Z

design-proposals/configurable-features.md

+fields, this change is acceptable, but it should be marked as a breaking change and documented. Moreover, all feature
+gates should be evaluated to determine if they need to be dropped and transitioned to configurables.
+
+## About implementing the checking logic in the VM controller


VMI controller?

My intention here is to reflect that we can implement some logic in the VM controller itself to get early feedback, i.e., before starting the VM, about the feature status in the VM conditions.

It may be worth (if possible; I'm not entirely sure yet whether it is possible at all, independent of the location) still consolidating runtime checks at the VMI level and propagating them from the VMI status to the VM status.

PoC of design proposal [feature configurables](kubevirt/community#316) using the downward metrics feature as an example. It deprecates the `DownwardMetrics` feature gate in favor of a configurable spec `spec.configuration.downwardMetrics: {}`. Signed-off-by: Javier Cano Cano <[email protected]>

dankenigsberg · 2024-08-20T15:10:11Z

design-proposals/configurable-features.md

+spec:
+  certificateRotateStrategy: {}
+  configuration:


How would this look for the downward metrics? How would I enable or disable it?

It would be enabled with:

    apiVersion: kubevirt.io/v1
    kind: KubeVirt
    [...]
    spec:
      certificateRotateStrategy: {}
      configuration:
        downwardMetrics: {}
    [...]

And disabled by removing `spec.configuration.downwardMetrics: {}`.

You can find a working PoC in: kubevirt/kubevirt#12650
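As a rough illustration of the "present means enabled, absent means disabled" semantics, here is a Go sketch using a pointer field. `Spec` is a hypothetical, trimmed-down stand-in (it omits the `configuration` nesting), and whether the PoC models the field exactly this way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DownwardMetricsConfig is intentionally empty; tunables could be added later.
type DownwardMetricsConfig struct{}

// Spec is a hypothetical stand-in for the relevant part of the KubeVirt CR spec.
type Spec struct {
	DownwardMetrics *DownwardMetricsConfig `json:"downwardMetrics,omitempty"`
}

// downwardMetricsEnabled parses a spec document and reports whether the
// downwardMetrics field is present (non-nil after decoding).
func downwardMetricsEnabled(doc string) (bool, error) {
	var s Spec
	if err := json.Unmarshal([]byte(doc), &s); err != nil {
		return false, err
	}
	return s.DownwardMetrics != nil, nil
}

func main() {
	for _, doc := range []string{`{"downwardMetrics": {}}`, `{}`} {
		on, err := downwardMetricsEnabled(doc)
		fmt.Println(doc, on, err)
	}
}
```

With a pointer and `omitempty`, the empty object decodes to a non-nil value and an omitted field stays nil, so presence alone carries the on/off meaning.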

What is the benefit of having the configuration subelement over

    apiVersion: kubevirt.io/v1
    kind: KubeVirt
    [...]
    spec:
      certificateRotateStrategy: {}
      downwardMetrics: {}
    [...]

?

And how about an even simpler downwardMetrics: true or downwardMetrics: false with false as the default?

Using an empty struct with the meaning of true and its absence with the implicit meaning of false sounds a bit odd to me.
I think that in general we should aim to have such kind of configuration values always explicitly represented in our APIs, rather than asserting that "unspecified fields get the default behavior".
Something like `featureXxxx: true/false` would give us the ability to always show the expected configuration explicitly, while still letting us choose whether to default to true or false via the default property in the OpenAPI v3 schema; see: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#defaulting

Just to throw in other possibilities which may be worth considering: instead of true/false, values that potentially carry more meaning are possible as well:

    spec:
      featureGates:
        DownwardMetrics: Enabled
        FeatureB: Enabled
        FeatureC: Disabled

The empty struct here means "use an emptyDir volume source"

Sorry if my question was unclear. I'm asking if in the history of emptyDir it was ever defined as an empty dictionary with no field at all?

Hey @dankenigsberg!
Sorry for not being clear.

Yes indeed. As can be seen here, emptyDir used to be an empty struct. Only later was the Medium field added, and later still the Size field.

Looks like there is agreement on placing new configurables not under `spec.configuration` but directly under `spec`. I've adjusted the document to reflect this.

In theory I support that.
But as said above, we already have fields living under configuration and we need to mind backward compatibility.

Here are some options that come to mind:

Duplicate configuration fields into .spec, deprecate the ones under .spec.configuration, and basically duplicate them until we advance Kubevirt CR to v2.

Document that in the future we want to move these fields, but keep them as-is for now.

IMHO, you might consider splitting these two efforts into two distinct PRs: this PR to determine the feature toggles policy, and a different PR to plan deprecating .spec.configuration.

I agree. My point is: new feature configurable specs like downwardMetrics should be placed in `spec` in the KubeVirt CR, e.g.,

    spec:
      downwardMetrics: {}

Regarding existing specs under .spec.configuration, do you think we should reflect this somehow in this design document?

@jcanocan can you follow up on the empty-map-means-true discussion?

Sure. IMHO, the arguments provided by @iholder101 and @0xFelix sound good enough.

> new feature configurable specs like downwardMetrics, should be placed in spec in the Kubevirt CR, e.g.,
>
> Regarding existing specs under .spec.configuration, do you think we should reflect this somehow in this design document?

IMO it would be easier to discuss this in a different proposal

All right. Thanks for the clarification.

dankenigsberg · 2024-08-21T15:25:33Z

design-proposals/configurable-features.md

+  Pending state.
+- Feature status checks should only be performed during the scheduling process, not at runtime. Therefore, the feature
+  status changes in the KubeVirt CR should not affect running VMis. Moreover, the VMi should still be able to live
+  migrate, preserving its original feature state.


> > but we should not promise anything about them
> > If a feature is disabled, VMs that use it may die.
>
> I do not agree. What reason do you see for not promising the same as we promised at the time the VM was started?

The moment that I, the cluster-admin, decided to turn off the feature, I already broke my "promise". For example, a daily VM-bound job would fail to start. Or a VM with runStrategy=Always would fail to start if the node crashes. I see no reason to add (and no way to promise) a special case for when a node is turned off and VMs are migrated away. When I disable a feature (say, because I realize that downwardMetrics exposes too much information to the guest), I want VMs to stop using it.

Another way to think of this is with eventual consistency in mind. Assume that a VM started just as the admin disabled the feature. We should not promise eternal life to the VM if it sneaked in a microsecond earlier. If the feature is disabled, then eventually, a VM using it should not be running.

> > I think that we should just alert that VM is using a feature that is no longer supported by the cluster.
>
> Let's check if this is implementable using a reasonable amount of effort at runtime.

Sure. I hope it would not be hard to add a condition reporting that the VM uses a feature in state A but the cluster has it in state B.

rmohr

Great proposal!

rmohr · 2024-08-22T16:14:49Z

design-proposals/configurable-features.md

+requiring a feature in a state different from what was configured in the KubeVirt CR, or what should happen if the
+configuration of a feature in use is changed. (see matrix below).
+
+## Goals


I think it would be great if the goals could be extended to "I want to have a clear understanding which features are enabled and disabled".

One approach we recently took in another project was to express feature gates like this:

Spec:

    spec:
      featureGates:
        DownwardMetrics: Enabled
        FeatureB: Enabled
        FeatureC: Disabled

Status:

    status:
      featureGates:
        DownwardMetrics: Enabled # explicitly enabled
        FeatureB: Enabled # explicitly enabled
        FeatureD: Enabled # enabled by default
        FeatureC: Disabled # explicitly disabled
        FeatureE: Disabled # disabled by default

The status section ultimately makes it very clear and discoverable what is effectively enabled.

The example here is specifically about feature gates and not about toggling features after they have GAed, but I think a similar approach may make sense to make visible what is effectively enabled in the end. Especially if we add a second layer of configurables independent of feature gates, it may be even more valuable, since it gets harder to understand what is truly enabled.
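One way to picture how such a status could be derived is to overlay the explicit spec on top of the built-in defaults. The Go sketch below assumes a flat name-to-state map; the `State` type, the function name, and the feature names are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical enum for a feature's effective state.
type State string

const (
	Enabled  State = "Enabled"
	Disabled State = "Disabled"
)

// effectiveGates overlays explicitly specified states on top of defaults,
// yielding what a status section could report as effectively configured.
func effectiveGates(defaults, spec map[string]State) map[string]State {
	out := make(map[string]State, len(defaults)+len(spec))
	for name, state := range defaults {
		out[name] = state
	}
	for name, state := range spec {
		out[name] = state // explicit spec wins over the default
	}
	return out
}

func main() {
	defaults := map[string]State{"FeatureC": Enabled, "FeatureD": Enabled, "FeatureE": Disabled}
	spec := map[string]State{"DownwardMetrics": Enabled, "FeatureC": Disabled}

	status := effectiveGates(defaults, spec)
	names := make([]string, 0, len(status))
	for name := range status {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order for readability
	for _, name := range names {
		fmt.Printf("%s: %s\n", name, status[name])
	}
}
```

Reporting the merged result in status lets an admin see defaults and overrides in one place, without having to know the defaults of each release.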

> I think it would be great if the goals could be extended to "I want to have a clear understanding which features are enabled and disabled".

I agree.

> The status section makes it eventually very clear and discoverable what's effectively enabled.

I'm not sure. If a user tries to enable a feature that is still behind a feature gate, and that feature gate is disabled, the system should not allow them to enable the feature using the configurable.

@rmohr just wanted to point out that feature gates are usually booleans, i.e. either enabled or disabled, while feature configuration might be more complex than that including more tunables that aren't necessarily booleans.

That being said, I like the approach of having the status outline what's effectively configured.

Regarding the status: I've included this in the goals, along with an example of what it might look like.

rmohr · 2024-08-22T16:16:24Z

design-proposals/configurable-features.md

+spec:
+  certificateRotateStrategy: {}
+  configuration:


Just to throw in other possibilities which may be worth considering: instead of true/false, values that potentially carry more meaning are possible as well:

    spec:
      featureGates:
        DownwardMetrics: Enabled
        FeatureB: Enabled
        FeatureC: Disabled

rmohr · 2024-08-22T16:21:50Z

design-proposals/configurable-features.md

+- If the feature is set to state A in the KubeVirt CR and the VMI is requesting the feature in state B, the VMIs must
+  stay in Pending state. The VMI status should be updated, showing a status message, highlighting the reason(s) for the
+  Pending state.
+- Feature status checks should only be performed during the VMI reconciliation process, not at runtime. Therefore, the


We should probably go a little bit more into detail here. There can be cases where feature enablement has no API visibility on the VMI and may only happen at virt-handler side, or where virt-operator has to redeploy components in different configurations. If you want to do this in the virt-controller reconciliation stage, some things may need additional hints in the vmi status to kind of snapshot the current configuration. Just to ensure we think about such cases before an agreement is found.

rmohr · 2024-08-22T16:23:18Z

design-proposals/configurable-features.md

+fields, this change is acceptable, but it should be marked as a breaking change and documented. Moreover, all feature
+gates should be evaluated to determine if they need to be dropped and transitioned to configurables.
+
+## About implementing the checking logic in the VM controller


It may be worth (if possible; I'm not entirely sure yet whether it is possible at all, independent of the location) still consolidating runtime checks at the VMI level and propagating them from the VMI status to the VM status.

rmohr · 2024-08-22T16:25:25Z

design-proposals/configurable-features.md

+policy, features reaching General Availability (GA) need to drop their use of feature gates. This applies also to
+configurable features that we may want to disable.
+
+## Motivation


Is the idea of the proposal to outline a general way of handling kubevirt configurables, or is it specifically for VMI features?

This proposal targets VM/VMI features, or more precisely, every configurable that needs to be taken into account when handling VMs/VMIs.

iholder101

Thank you for this proposal, this is a very important topic. Happy to see this discussion!

iholder101 · 2024-08-27T12:55:26Z

design-proposals/configurable-features.md

+requiring a feature in a state different from what was configured in the KubeVirt CR, or what should happen if the
+configuration of a feature in use is changed. (see matrix below).
+
+## Goals


@rmohr just wanted to point out that feature gates are usually booleans, i.e. either enabled or disabled, while feature configuration might be more complex than that including more tunables that aren't necessarily booleans.

That being said, I like the approach of having the status outline what's effectively configured.

iholder101 · 2024-08-27T13:18:07Z

design-proposals/configurable-features.md

+This is current list of GA'd features present in KubeVirt/KubeVirt which are still using feature gates and are shown as
+configurables in [HCO](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/controllers/operands/kubevirt.go#L166-L174):


While these lists are valuable for discussion's sake, I'm not sure they belong in the proposal itself. WDYT about moving these into the PR description?

I agree. Done.

I find the concrete list very helpful to understand what this proposal is about. I would prefer to understand what we plan for each of them.

Okay. I've reintroduced the FG list.

Thanks. I'd like to understand how the proposal is going to affect many/all of them.

iholder101 · 2024-08-27T13:21:17Z

design-proposals/configurable-features.md

+# Design
+
+In order to make a feature configurable, it must be done by adding new fields to the KubeVirt CR under
+`spec.configuration`.
+
+
+> **NOTE:** The inclusion of these new KubeVirt API fields should be carefully considered and justified. The feature
+> configurables should be avoided as much as possible.


Apart from the list of feature gates, this is the only thing written under the "Design" section, although many questions are left unanswered, like the questions you've listed under "Goals".

Here I'd expect the document to generally outline the approach we're going to commit to.

I've extended the section to include more details and to make the Goal Section a bit easier to follow. Hope it is better now.

iholder101 · 2024-08-27T13:33:18Z

design-proposals/configurable-features.md

+spec:
+  certificateRotateStrategy: {}
+  configuration:


I'm with @0xFelix on this one.
Using booleans as APIs is, in general, a smell since they are not extensible, and this is especially true when designing APIs for feature configuration.

Would you suggest an example that would be awkward? If one day we find out that we need to control the "color" of downwardMetrics I would consider this quite readable:

    apiVersion: kubevirt.io/v1
    kind: KubeVirt
    [...]
    spec:
      certificateRotateStrategy: {}
      downwardMetrics: true
      downwardMetricsColor: purple
    [...]

@dankenigsberg To me this is much less readable, but that's pretty fine when you have only a single unstructured configuration. Please consider the following example (which is, obviously, entirely made up):

    kind: KubeVirt
    [...]
    spec:
      downwardMetrics:
        hostInformation:
        - MACAdress
        - IP
        - ListOfUsers
        style:
          color: White
          textSize: 17
    [...]

I think you can agree that it looks much better than the following (the order is messed up on purpose; bear in mind that YAML fields are usually sorted alphabetically):

    kind: KubeVirt
    [...]
    spec:
      downwardMetrics: true
      downwardMetricsColor: purple
      downwardMetricsIncludeHostIP: true
      downwardMetricsTextSize: 17
      downwardMetricsIncludeHostMacAddress: true
      downwardMetricsIncludeHostListOfUsers: true
    [...]

My point is: this is unscalable and will quickly turn into a complete mess. We should take advantage of the fact that YAML/JSON already supports structured data.

In addition, it's common in k8s to omitempty and leave only whatever is relevant / enabled, so I'm not sure how it's odd.

iholder101 · 2024-08-27T13:38:45Z

design-proposals/configurable-features.md

+- If the feature is set to state A in the KubeVirt CR and the VMI is requesting the feature in state B, the VMIs must
+  stay in Pending state. The VMI status should be updated, showing a status message, highlighting the reason(s) for the
+  Pending state.
+- Feature status checks should only be performed during the VMI reconciliation process, not at runtime. Therefore, the


iholder101 · 2024-08-27T13:43:35Z

design-proposals/configurable-features.md

+- Get a clear understanding about the feature status.
+- Establish how the features status swapping should work.
+- Describe how the system should react in these scenarios:
+    - A feature in KubeVirt is set to state A and a VMI requests the feature to be in state B.


You assume that all of the features are granular to the VM/VMI level, but I don't think that's necessarily true. Things like UseEmulation, MinimumClusterTSCFrequency, ImageRegistry and so on are cluster-wide by nature and don't even have APIs at the VM/VMI level.

I've narrowed the goal definition to target only those features that expose a VM/VMI API.

Why? I think that we should eliminate all feature gates.

> Why? I think that we should eliminate all feature gates.

Sorry, I'm not sure what you mean. Could you please elaborate?

> Sorry, I'm not sure what you mean. Could you please elaborate?

We have many feature gates that do not comply with https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md . Some of them expose a VM/VMI API, some do not. But imo that's not the main issue. All feature gates must comply. They all must either graduate or be reverted, and those that require a configurable in the KubeVirt CR must define it - not only those with VM/VMI APIs.

I see. I've included the goal "-Graduate features status swapping from features gates to configurables." with that intention. Maybe it's not clear enough. WDYT of changing it to "Drop feature gates that do not require a configurable and graduate feature gates that require one to feature configurables."?

I would avoid the term "status swapping": we are talking here about configuring or specifying how a feature behaves; it is not about the current status or about swapping two things. How about
"Graduate features by dropping their gates and (optionally) adding spec options for them"?

Sounds good!
All right, I will replace any occurrence of "status swapping".
Many thanks for the suggestion. Much appreciated.

iholder101 · 2024-08-27T13:45:02Z

design-proposals/configurable-features.md

+  stay in Pending state. The VMI status should be updated, showing a status message, highlighting the reason(s) for the
+  Pending state.
+- Feature status checks should only be performed during the VMI reconciliation process, not at runtime. Therefore, the
+  feature status changes in the KubeVirt CR should not affect running VMIs. Moreover, the VMI should still be able to


What about things like EvictionStrategy?
Are we sure that we don't want to ever impact running VMs? To me it sounds valuable from an admin's perspective to change the eviction strategy before an upgrade for example.

Maybe we can allow impacting running VMs if the change does not require rebooting the VM. WDYT?

I don't like to give this promise.

Rebooting VMs is not fun, we should try to avoid it. But if as a cluster-admin, I no longer want downwardMetrics, I would like to be able to express this - even if eventually this means that all VMs must be restarted.

Typically, any change to kubevirt cr is going to affect how currently-running or future VMs are going to behave. We can warn the admin, but not block them.

Should we allow live migration in the first place if, for instance, the feature is disabled and the VMI is requesting it?

> Should we allow live migration in the first place if, for instance, the feature is disabled and the VMI is requesting it?

I would expect migration to fail if the destination host does not support the requested feature. Otherwise, we should not block migration (e.g., to a host that still has this feature for some reason).

Understood. I've updated the text, trying to better reflect what we have discussed here.
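The migration rule described in this thread could be sketched in Go roughly as follows. The function name and the representation of node capabilities as a set of feature names are assumptions made for illustration; the proposal does not prescribe them.

```go
package main

import "fmt"

// checkMigrationTarget returns an error if the destination node lacks any
// feature the VMI requests. It deliberately ignores the cluster-wide
// configuration: a node that still supports the feature remains a valid target.
func checkMigrationTarget(requested []string, nodeSupports map[string]bool) error {
	for _, feature := range requested {
		if !nodeSupports[feature] {
			return fmt.Errorf("destination node does not support requested feature %q", feature)
		}
	}
	return nil
}

func main() {
	requested := []string{"downwardMetrics"}

	oldNode := map[string]bool{"downwardMetrics": true}
	newNode := map[string]bool{}

	fmt.Println("old node:", checkMigrationTarget(requested, oldNode))
	fmt.Println("new node:", checkMigrationTarget(requested, newNode))
}
```

Under this reading, disabling a feature does not block migration by itself; migration fails only when the chosen destination cannot provide what the VMI asks for.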

iholder101 · 2024-08-27T13:49:45Z

design-proposals/configurable-features.md

+  status changes in the KubeVirt CR should not affect running VMis. Moreover, the VMi should still be able to live
+  migrate, preserving its original feature state.
+- Optionally, It could enable the possibility to reject the KubeVirt CR change request if running VMis are using the
+  feature in a given state. However, by the default the request should be accepted.


That said I think that at a high level the use case is still valid and that some cluster admins aren't going to want to disable a feature cluster wide that's being used by running VMs.

Right. I think that cluster admins would want to know if the functionality that they are disabling is currently being used. They may also want to know by whom (as not all VMs are created equal). But a mere VM owner should not be in the position to block the cluster admin from expressing an intent that a feature is to be reconfigured.

+1
I'm not sure we can make a simplistic rule here. It seems possible that in some cases it's useful to change the configuration for running VMs and in others it isn't. Perhaps it would be useful to start off with more concrete examples and have a discussion about them.

dankenigsberg · 2024-09-01T09:08:56Z

design-proposals/configurable-features.md

+- Get a clear understanding about the features status.
+- Establish how the features status swapping should work.


I find the terminology here confusing. Is "status" the request of the cluster admin that can be "swapped"? Typically it is the actual condition reported by the cluster. Maybe you can explain the goal here in terms of a user story? What does the cluster admin want to do/understand?

I've adjusted the goal. The idea is to reflect how the features can be tuned, e.g., enabled or disabled.

I still find the term "feature configuration status" confusing. "config" or "spec" is something that a user controls. "status" is how the system is known to be working. I don't know what putting them together means.

Maybe "state" would be more accurate?

Updated. I kept "state" only in those situations where I believe it is clear what we are referring to. Please take a look, and if it is still confusing, I will change it.

dankenigsberg · 2024-09-01T09:13:08Z

design-proposals/configurable-features.md

+  status changes in the KubeVirt CR should not affect running VMis. Moreover, the VMi should still be able to live
+  migrate, preserving its original feature state.
+- Optionally, It could enable the possibility to reject the KubeVirt CR change request if running VMis are using the
+  feature in a given state. However, by the default the request should be accepted.


I'm just a bit concerned about the time and resources needed to fetch all VMI using the feature.

Would you elaborate, possibly with a specific example? Would it be much harder than kubectl get vm -A -o yaml|grep someRequestedFeature?

+1 for considering several concrete examples.
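
For a rough idea of what such a lookup involves, the kubectl one-liner above can be approximated in Python over the parsed output of kubectl get vm -A -o json. This is a sketch under assumptions: the helper names are hypothetical, and matching on a bare key name (here downwardMetrics) is as coarse as the grep it imitates.

```python
def contains_key(obj, key):
    """Recursively check whether `key` appears anywhere in a parsed JSON value."""
    if isinstance(obj, dict):
        return key in obj or any(contains_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_key(v, key) for v in obj)
    return False


def vms_requesting(vm_list, key):
    """Return "namespace/name" for every VM whose spec mentions `key`.

    `vm_list` is the dictionary obtained by parsing a Kubernetes List object.
    """
    return [
        f"{vm['metadata']['namespace']}/{vm['metadata']['name']}"
        for vm in vm_list.get("items", [])
        if contains_key(vm.get("spec", {}), key)
    ]
```

The scan visits each object once, so its cost grows linearly with the number of VMs and the size of their specs.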

dankenigsberg · 2024-09-01T09:28:56Z

design-proposals/configurable-features.md

+- Get a clear understanding about the feature status.
+- Establish how the features status swapping should work.
+- Describe how the system should react in these scenarios:
+    - A feature in KubeVirt is set to state A and a VMI requests the feature to be in state B.


Why? I think that we should eliminate all feature gates.

dankenigsberg · 2024-09-03T07:24:27Z

design-proposals/configurable-features.md

+apiVersion: kubevirt.io/v1
+kind: KubeVirt
+[...]
+status:


This is too abstract for me. What would this look like for existing features?

For instance, let's suppose that we have the downwardMetrics and maxAllowedCPUsPerVM features. maxAllowedCPUsPerVM controls the maximum number of CPUs that a given VM can get. We enable downwardMetrics and we want to restrict every VM to at most 2 CPUs. The KubeVirt CR will look like:

    apiVersion: kubevirt.io/v1
    kind: KubeVirt
    [...]
    spec:
      downwardMetrics: {} # this means "enabled"
      maxCPUsPerVM: 2
    [...]
    status:
      featureStatus:
        downwardMetrics:
          status: Enabled
        maxCPUsPerVM:
          status: 2 CPUs

WDYT?

I don't see the value of featureStatus here. Would it ever be different from the relevant spec elements?

No, it shouldn't differ. This idea was proposed here: #316 (comment)

In that context it made sense, as the default of a feature was not clear. If we use the empty dictionary to mean enabled (which I find very hard to consume), the default is always disabled.

All right. I will drop it.
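
The convention debated here can be sketched in a few lines of Python. Assumptions: the spec is treated as a plain dictionary, a field that is absent (or null) means disabled, and any present value, including the empty map, means enabled. feature_enabled and derive_feature_status are hypothetical helpers, not KubeVirt APIs.

```python
def feature_enabled(spec, feature):
    """A feature counts as enabled when its field is present and not null.

    The empty map {} is falsy in Python, so the test must be `is not None`
    rather than a plain truthiness check.
    """
    return spec.get(feature) is not None


def derive_feature_status(spec, known_features):
    """Compute a per-feature status purely from the spec."""
    return {
        name: "Enabled" if feature_enabled(spec, name) else "Disabled"
        for name in known_features
    }
```

Because the status is computed from the spec alone, it can never differ from it, which is the redundancy pointed out above.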

dankenigsberg · 2024-09-03T07:25:56Z

design-proposals/configurable-features.md

+spec:
+  certificateRotateStrategy: {}
+  configuration:


@jcanocan can you follow up on empty-map-means-true discussion?

dankenigsberg · 2024-09-03T07:32:36Z

design-proposals/configurable-features.md

+- Get a clear understanding about the feature status.
+- Establish how the features status swapping should work.
+- Describe how the system should react in these scenarios:
+    - A feature in KubeVirt is set to state A and a VMI requests the feature to be in state B.


Sorry, I'm not sure what you mean. Could you please elaborate?

We have many feature gates that do not comply with https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md . Some of them expose a VM/VMI API, some do not. But imo that's not the main issue. All feature gates must comply. They all must either graduate or be reverted, and those that require a configurable in the KubeVirt CR must define it, not only those with VM/VMI APIs.

sradco · 2024-09-11T10:46:46Z

Hi,

I would address this with a metric; I'm not sure an alert is needed.
We try to keep alerts for when they are truly needed and actionable.

We can show metrics like this in the dashboards (I'm in the process of building them for ACM and they will also be available in-cluster),

and I think we can also add this information to the VM page and show it as a warning.

We can add the features as labels, with the value enabled/disabled, either to the kubevirt_vmi_info metric or to a new kubevirt_vmi_labels metric.

Wdyt?

jcanocan · 2024-09-11T14:37:02Z

Yes, it sounds amazing!
Sorry, I'm not sure which option would be more convenient. Could you please elaborate on the pros and cons of each option?
Thanks for the input. Much appreciated.

sradco · 2024-09-11T20:53:19Z

I would go with the second option, since kubevirt_vmi_info holds basic information about the VMI and adding these labels would be too much. Also, it should serve a specific purpose, so I would even call it something like kubevirt_vmi_features_status_info to make it clear what it's about. With this approach, it should be easy to query all the VMIs that have a specific feature enabled or disabled.

Best,
Shirly
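
A sketch of what the proposed metric could look like in the Prometheus text exposition format. Only the metric name kubevirt_vmi_features_status_info comes from the comment above; the label names (namespace, name, feature, status) and the input shape are assumptions made for illustration.

```python
METRIC = "kubevirt_vmi_features_status_info"


def render_feature_metric(vmis):
    """Render one info-style sample (value 1) per VMI and feature.

    `vmis` is a list of dicts with "namespace", "name" and a "features"
    mapping of feature name to bool (hypothetical input shape).
    """
    lines = [f"# TYPE {METRIC} gauge"]
    for vmi in vmis:
        for feature, enabled in sorted(vmi["features"].items()):
            labels = {
                "namespace": vmi["namespace"],
                "name": vmi["name"],
                "feature": feature,
                "status": "enabled" if enabled else "disabled",
            }
            label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{METRIC}{{{label_str}}} 1")
    return "\n".join(lines)
```

With labels shaped like this, selecting every VMI that has a given feature disabled becomes a simple label match on feature and status.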

0xFelix · 2024-09-12T06:58:40Z

I think the problem we have with this approach is that we don't have a list of active/disabled features on a VM? In case of features which are configured in the VM's spec it is easier to detect. But what about features that do not expose configurables on the VM level?

jcanocan · 2024-09-13T09:57:36Z

I think the problem we have with this approach is that we don't have a list of active/disabled features on a VM? In case of features which are configured in the VM's spec it is easier to detect. But what about features that do not expose configurables on the VM level?

True. Maybe we should limit the metric to running VMIs with "feature issues", i.e., features whose configuration is inconsistent with the cluster-wide configuration. WDYT @0xFelix @sradco?

0xFelix · 2024-09-13T10:04:22Z

Still, how do you know configurations are inconsistent? It comes back to the initial issue of not knowing which feature is used by which VM.

kubevirt-bot · 2024-09-27T10:10:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jcanocan · 2024-09-27T10:12:51Z

To adjust the scope of this design proposal, the alerting concern has been dropped, as has the detection of configurables that do not expose VMI API fields. Those concerns will be addressed in a follow-up design proposal.

dankenigsberg · 2024-09-29T08:27:52Z

design-proposals/configurable-features.md

+
+The checking in the VM controller could be added to let the user know if a VM has requested a feature configuration which 
+is different from what it is specified in the KubeVirt CR. This will provide an early feedback to the user before starting
+the VM. However, it should not prevent the user to start the VM, the VMI controller should take care of checks preventing


I don't understand the English here, e.g., what the "early feedback" you refer to is.

Can you rephrase the paragraph to ensure that:

- The controller must not start a VM that requires a feature that is not available in the cluster.
- The fact that a VM cannot start despite its owner asking to do so must be reported by the controller in the status field of the VM.

Done.
Let me clarify what I wanted to express with "early feedback". Let's suppose that feature-b is disabled cluster-wide and the user creates a VM object requesting this feature. The VM is not requested to start; only the VM object is created. The idea is that the VM controller updates the status field to reflect that this VM won't start because it is requesting a feature that is not available.
I'm not sure if this would be helpful for users or will just create more noise.

Ah. You would like to inform a stopped VM that it won't be able to generate a VMI. This can be quite expensive (as it requires you to track all stopped VMs for no immediate reason). I would not make this a requirement.

AFAIK, it is not that expensive. It's a check in the VM controller reconciliation loop. What is true is that if you have a stopped VM and disable the feature, the reconciliation loop is not triggered immediately; it takes some time. I will make this optional.

Done. Could you please add the lgtm again if you are fine with the change? Thanks.
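
The outcome of this thread can be modeled roughly as follows. This is a Python sketch under assumptions, not the KubeVirt controller: VM, reconcile_vm, the FeatureUnavailable condition type and the check_stopped switch are all hypothetical names. It shows a VM that requests an unavailable feature not being started and the reason being recorded in its status, with the check on stopped VMs left optional.

```python
from dataclasses import dataclass, field

CONDITION = "FeatureUnavailable"  # hypothetical condition type


@dataclass
class VM:
    name: str
    run_requested: bool
    requested_features: set = field(default_factory=set)
    conditions: list = field(default_factory=list)


def reconcile_vm(vm, enabled_features, check_stopped=False):
    """Return True if the VM may be started in this reconciliation pass."""
    # Drop any stale condition left by a previous pass.
    vm.conditions = [c for c in vm.conditions if c["type"] != CONDITION]
    # Stopped VMs are only inspected when the optional early check is on.
    if not vm.run_requested and not check_stopped:
        return False
    missing = sorted(vm.requested_features - enabled_features)
    if missing:
        vm.conditions.append({
            "type": CONDITION,
            "status": "True",
            "message": "requested features not available: " + ", ".join(missing),
        })
        return False
    return vm.run_requested
```

Leaving check_stopped off by default reflects the point made above that tracking stopped VMs should not be a requirement.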

0xFelix

Do I understand correctly that if a VM is unable to start, because it is requesting a non-valid configuration, a VMI that remains in pending state is still created? Wouldn't it be better to not create a VMI if we know early that it won't be able to start?

0xFelix · 2024-10-08T08:56:49Z

design-proposals/configurable-features.md

+The downward metrics feature exposes some metrics about the host node where the VMI is running to the guest. This
+information may be considered sensitive information.
+If there is no mechanism to disable the feature, any VMI could request the metrics and inspect information that, in some
+cases, the admin would like to hide, creating a potential security issue.


"Need to know principle" :)

Added this.

0xFelix · 2024-10-08T08:59:38Z

design-proposals/configurable-features.md

+    configB: string
+[...]
+```
+Please note that if the feature spec field is not present, the feature is assumed to be completely disabled.


Not true for common-instancetypes deployment, which is enabled by default (the opposite).

Dropped. IMO, this could be left up to the developers to decide.

0xFelix · 2024-10-08T09:02:30Z

design-proposals/configurable-features.md

+## Update/Rollback Compatibility
+
+The feature configurables should not affect forward or backward compatibility once the feature GA. A given feature,
+after 3 releases in Beta, all feature gates must be dropped. Those features that need a configurable should define it ahead


A given feature, after 3 releases in Beta, all feature gates must be dropped.

Do we want to mention this here? I was under the impression this proposal is not about features gates?

I think it is important to mention. Some old features have been using gates to configure whether they are off or on. This has to change: features that cannot be always on must add a configurable so that we can graduate the feature and drop the feature gate.

jcanocan · 2024-10-08T17:20:50Z

Do I understand correctly that if a VM is unable to start, because it is requesting a non-valid configuration, a VMI that remains in pending state is still created? Wouldn't it be better to not create a VMI if we know early that it won't be able to start?

No, it should not create the VMI object. Only if the VMI object is created directly, it will remain in Pending state. I've added a clarification sentence in the "About implementing the checking logic in the VM controller" Section.

0xFelix

/lgtm

Thanks! Time to get this merged.

0xFelix · 2024-10-09T07:50:14Z

design-proposals/configurable-features.md

+## Update/Rollback Compatibility
+
+The feature configurables should not affect forward or backward compatibility once the feature GA. A given feature,
+after 3 releases in Beta, all feature gates must be dropped. Those features that need a configurable should define it ahead


This design document states how features that require to have a mechanism to change it's state, e.g., enabled/disabled should be implemented in KubeVirt. Signed-off-by: Javier Cano Cano <[email protected]>

kubevirt-bot · 2024-10-11T07:07:18Z

New changes are detected. LGTM label has been removed.

dankenigsberg · 2024-10-13T15:30:45Z

Do I understand correctly that if a VM is unable to start, because it is requesting a non-valid configuration, a VMI that remains in pending state is still created? Wouldn't it be better to not create a VMI if we know early that it won't be able to start?

No, it should not create the VMI object. Only if the VMI object is created directly, it will remain in Pending state. I've added a clarification sentence in the "About implementing the checking logic in the VM controller" Section.

In one sense it is better to fail the creation of the VMI (fewer redundant objects in the system), but in another sense it is worse, because this would force the VM controller to replicate logic that is already in the VMI controller. I prefer the cleaner design, with a clear separation of responsibilities.

kubevirt-bot added the dco-signoff: yes label Aug 14, 2024

kubevirt-bot requested review from aburdenthehand and cwilkers August 14, 2024 15:21

kubevirt-bot added the size/M label Aug 14, 2024

kubevirt-bot requested review from 0xFelix and lyarwood August 14, 2024 16:51

dankenigsberg suggested changes Aug 15, 2024

kubevirt-bot assigned dankenigsberg Aug 15, 2024

jcanocan force-pushed the configurable-features branch from 84420a7 to 7e4b3db August 19, 2024 14:53

kubevirt-bot added size/L and removed size/M labels Aug 19, 2024

dankenigsberg suggested changes Aug 20, 2024

jcanocan force-pushed the configurable-features branch from 7e4b3db to 3104058 August 20, 2024 14:36

lyarwood reviewed Aug 20, 2024

jcanocan force-pushed the configurable-features branch from 3104058 to fe68749 August 21, 2024 14:47

jcanocan mentioned this pull request Aug 22, 2024: Graduate downwardMetrics to feature KV lifecycle kubevirt/kubevirt#12650 (Closed)

dankenigsberg reviewed Aug 22, 2024

jcanocan force-pushed the configurable-features branch from fe68749 to 99c7e4d August 22, 2024 13:49

rmohr reviewed Aug 22, 2024

jcanocan force-pushed the configurable-features branch from 99c7e4d to 342c91e August 26, 2024 12:44

iholder101 reviewed Aug 27, 2024

jcanocan force-pushed the configurable-features branch from 342c91e to 898b450 August 28, 2024 13:37

dankenigsberg reviewed Sep 1, 2024

dankenigsberg suggested changes Sep 1, 2024

jcanocan force-pushed the configurable-features branch from 898b450 to d275eb8 September 2, 2024 14:05

dankenigsberg reviewed Sep 3, 2024

jcanocan force-pushed the configurable-features branch from d275eb8 to a825f81 September 3, 2024 11:56

jcanocan mentioned this pull request Sep 18, 2024: instancetype: Graduate CommonInstancetypesDeploymentGate and introduce configurable to control deployment kubevirt/kubevirt#12753 (Merged)

jcanocan force-pushed the configurable-features branch from a0c7ba0 to 86dcaa3 September 27, 2024 10:10

dankenigsberg suggested changes Sep 29, 2024

jcanocan force-pushed the configurable-features branch from 86dcaa3 to 5092239 October 7, 2024 12:57

dankenigsberg approved these changes Oct 7, 2024

kubevirt-bot added the lgtm label Oct 7, 2024

jcanocan force-pushed the configurable-features branch from 5092239 to b311f1a October 8, 2024 07:20

kubevirt-bot removed the lgtm label Oct 8, 2024

dankenigsberg approved these changes Oct 8, 2024

kubevirt-bot added the lgtm label Oct 8, 2024

0xFelix reviewed Oct 8, 2024

jcanocan force-pushed the configurable-features branch from b311f1a to 852893d October 8, 2024 17:14

kubevirt-bot removed the lgtm label Oct 8, 2024

0xFelix approved these changes Oct 9, 2024

kubevirt-bot assigned 0xFelix Oct 9, 2024

kubevirt-bot added the lgtm label Oct 9, 2024

design-proposal: Feature configurables

31bdca3

This design document states how features that require to have a mechanism to change it's state, e.g., enabled/disabled should be implemented in KubeVirt. Signed-off-by: Javier Cano Cano <[email protected]>

jcanocan force-pushed the configurable-features branch from 852893d to 31bdca3 October 11, 2024 07:07

kubevirt-bot removed the lgtm label Oct 11, 2024
