📖 ✨ 🧑🤝🧑 add proposal for Node Bootstrapping working group #11407
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
This PR is currently missing an area label, which is used to identify the modified component when generating release notes. Area labels can be added by org members by writing a `/area` comment. Please see the labels list for possible areas. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @t-lo. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
"@elmiko" "@eljohnson92" I took the liberty to add you as co-stakeholders to the WG proposal - IIRC you expressed interest in participating. I hope that's OK? |
@@ -0,0 +1,88 @@
---
title: ClusterAPI Node Provisioning working group
I'm a bit concerned about carrying over the conflation between bootstrap and provisioning into the working group...
The filename and the title on line 2 have "node provisioning" in them, but the h1 heading below says "node bootstrapping".
Also, I'm not sure what we mean by "cluster provisioning".
To clarify, the two concerns which are currently intertwined are the following:
- Generating cloud-config or Ignition configuration and executing it on nodes to make OS customizations, file manipulations on disk etc.
- Generating kubeadm configuration and executing kubeadm to turn a host into a k8s node.
I don't care too much what we call each concern as long as the two are clearly demarcated and we know which one we're discussing at any given moment.
In my WIP design document I offered to call no. 1 "provisioning" and no. 2 "bootstrap".
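To make the two concerns concrete, here is a minimal illustrative sketch (not taken from the proposal or from any CAPI template; all values are made up): the first document is OS customisation expressed as cloud-init user data (concern no. 1), the second is the kubeadm configuration used to join the host as a Kubernetes node (concern no. 2).

```yaml
# Concern no. 1 ("provisioning"): OS-level customisation, here as cloud-init user data.
# In a real user-data file the "#cloud-config" header is the very first line.
#cloud-config
write_files:
  - path: /etc/sysctl.d/99-kubernetes.conf
    content: |
      net.ipv4.ip_forward = 1
---
# Concern no. 2 ("bootstrap"): turning the host into a Kubernetes node via kubeadm.
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: abcdef.0123456789abcdef
    apiServerEndpoint: "10.0.0.1:6443"
    unsafeSkipCAVerification: true
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
```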
Good catch, I was trying to use "cluster provisioning" and "node bootstrapping" consistently but apparently it escaped me in the title.
Clarified wording in the PR.
Force-pushed from 49bf126 to bf5ce21
Force-pushed from 22ad278 to 6353aad
Initial focus of the working group is on separating node bootstrapping and node
OS configuration / customisation ("node provisioning" from here on).
At the time of writing, bootstrapping and provisioning are architecturally
intertwined; both are implemented in the bootstrap provider.
This increases maintenance effort and effectively prevents re-using provisioning
features across bootstrap providers.
I might be wrong, but I think this statement is hinting more at a solution (separation) than at a problem.
Based on my understanding, we are trying to solve two problems:
- to offer users more combinations of provisioners/bootstrappers.
- to ensure those combinations can evolve freely, taking benefit from the greatest and latest of the underlying technologies.
Separation is a possible way to get there, but IMO, users don't care about separation at all.
In fact, users will always need both a provisioner and a bootstrapper to get a working cluster.
If we split the provisioner and the bootstrapper into two separate components, I fear we are not doing a good service to the users, nor to ourselves (as project maintainers), nor to the community that builds on top of Cluster API. Why?
- we are adding another dimension of complexity to the system, which makes this change highly disruptive for users
- we are imagining a change that will impact not only the users, but the entire CAPI ecosystem; think e.g. of the CAPI operator, or of all the products built on top of Cluster API.
- we are adding an additional operational burden on the users (they have to take care of yet another component/provider, its upgrades, its compatibility matrix, its CVEs, etc.).
- we are imagining a change so complex and extensive that, being very pragmatic, the chances of it actually happening drop drastically (e.g. it will impact clusterctl, ClusterClass, runtime extensions, all the controllers, almost everything; in fact this is why this problem has been lying around for so long)
Frankly speaking, this seems too much to digest for everyone.
The good news is that I think it is possible to look at the problem above from a different angle and find a better solution.
How do we solve the two problems above (offer more combinations / ensure those combinations can evolve freely) without introducing separation?
That's up to the working group to figure out, but if I can give my two cents, what we are looking for is a solution that doesn't impact users, but makes it easier to implement and evolve different provisioners (e.g. we want to bump Ignition!), while re-using existing bootstrappers (e.g. kubeadm).
I think that the key point in the sentence above is software re-use.
So, a possible solution might be a combination of federated development/separate repositories (providers, as we call them in CAPI) + something that makes it easy to reuse code, e.g. making the kubeadm provider a library (details TBD of course, but I hope you get the idea).
I'm pretty sure that if there had been something like a kubeadm library three years ago, today we would have a "kubeadm Ignition provider" in a separate repository and everyone would be happy. Why not make this happen now?
Please also see https://docs.google.com/document/d/1Fz5vWwhWA-d25_QDqep0LWF6ae0DnTqd5-8k8N0vDDM/edit?disco=AAABX57uOnU where I make other considerations about the same point.
Happy to chat about it if you are interested in this angle.
These are good points - will rework the proposal accordingly. Admittedly I wrote the first draft with the existing proposal in mind.
Thank you for your review!
Thanks @fabriziopandini. I completely understand the concerns you've raised and am likewise worried by them. I'm also 100% for maximizing code reuse.
If I understand correctly, what you are proposing is to keep provisioners (e.g. Ignition) tightly coupled with bootstrappers (e.g. kubeadm) and offer multiple "flavors" of bootstrap providers for each bootstrapper, one for each provisioner. So, for example, we would have a kubeadm + cloud-init bootstrap provider and a kubeadm + Ignition bootstrap provider in separate repositories.
If that's what you meant, I'm seeing the following problems with this approach:
- The cartesian product of bootstrappers and provisioners can grow very large. True, with 1 bootstrapper and 2 provisioners we would have only 2 repositories to maintain, but what if we have 3 bootstrappers and 3 provisioners? That's 9 repositories already. AFAICT this isn't premature optimization since we seem to already have 8 supported bootstrappers and at least 3 provisioners (cloud-init and Ignition, with CloudBase Init either implemented or planned - not sure). That's 24 repos if we don't separate. In addition, we should think not only about the number of repos but also about project personas and separation of concerns: IMO, ideally the maintainer of a provider should only touch his/her codebase when something changes in the technology specific to that provider. For example, as the maintainer of the kubeadm provider, I don't want to worry about changes in Ignition or cloud-init -- that's for the maintainers of the relevant provisioner providers to take care of.
- Even if we decide the number of repos is manageable, it's not clear to me how we would achieve good code reuse when we have cloud-config fields embedded in types such as `KubeadmConfigSpec`. Example:
```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: capi-azure-control-plane
  namespace: default
spec:
  kubeadmConfigSpec:
    ...
    diskSetup:
      filesystems:
        - device: /dev/disk/azure/scsi1/lun0
          extraOpts:
            - -E
            - lazy_itable_init=1,lazy_journal_init=1
          filesystem: ext4
          label: etcd_disk
        ...
      partitions:
        - device: /dev/disk/azure/scsi1/lun0
          layout: true
          overwrite: false
          tableType: gpt
```
In the snippet above, `diskSetup` is a field under `KubeadmConfigSpec`. This field is specific to cloud-init. That means `KubeadmConfigSpec` assumes cloud-init is the only provisioner. True, we could implement "translators" under the hood for Ignition and any other provisioner, but there is no guarantee of feature parity among all provisioners, so IMO it's only a matter of time until we run into issues where some feature is supported by cloud-init but not by other provisioners.
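To make the "translator" problem concrete, here is a rough, hand-written sketch of roughly the same disk setup expressed as a Butane config (the YAML front end that compiles to Ignition JSON). This is illustrative only, not the output of any existing translator; the partition label and the by-partlabel device path are assumptions:

```yaml
# Hand-written illustration only -- not generated by any existing CAPI code.
variant: flatcar
version: 1.0.0
storage:
  disks:
    - device: /dev/disk/azure/scsi1/lun0
      partitions:
        - label: etcd          # assumed partition label
          number: 1
  filesystems:
    - device: /dev/disk/by-partlabel/etcd
      format: ext4
      label: etcd_disk
      wipe_filesystem: false
      # mkfs options; note that not every cloud-init diskSetup knob has a direct
      # analogue (e.g. layout/tableType do not map 1:1, since Ignition only writes GPT).
      options:
        - -E
        - lazy_itable_init=1,lazy_journal_init=1
```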
I think we have a major design flaw in the current core API (mixing bootstrap and provisioning in the data model). True, it will be very disruptive to change this, but from my understanding you're justifiably concerned about building more complexity (e.g. Ignition 3) on top of a suboptimal foundation, and to me it seems we'd be doing exactly that if we keep the API as is and keep adding layers on top of it.
Also, I know that in the past some maintainers have expressed similar concerns: #4172 (comment)
We already have a bootstrapper contract. What I'm proposing is to formalize a provisioner contract, add the first two provisioner providers (cloud-init and Ignition), and make the necessary breaking changes in the relevant CRDs to support a loose coupling of bootstrappers and provisioners while devising a migration path for users. Such a contract would make bootstrappers and provisioners fully independent.
Bootstrapper contract TL;DR: "The provider should generate bootstrap configuration and store it at a well-known location in the k8s datastore (secrets, CRs or whatever)."
Provisioner contract TL;DR: "The provider should read the bootstrap configuration as well as any user-specified custom provisioning code, generate unified configurations for master and worker nodes, and store them at a well-known location in the k8s datastore so that infra providers can read them and inject them into machines."
This way we can flexibly combine bootstrappers and provisioners by relying solely on k8s constructs, make everything more maintainable and likely become future proof in terms of evolving both bootstrappers and provisioners over time.
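As a purely illustrative sketch of how such a pair of contracts could be wired together using plain Kubernetes objects (every kind, API group, field, and name below is hypothetical and does not exist in CAPI today):

```yaml
# Hypothetical objects for illustration only; none of these kinds or fields exist in CAPI today.

# Bootstrapper contract: the bootstrap provider (e.g. kubeadm) writes its
# provisioner-agnostic output to a well-known Secret.
apiVersion: v1
kind: Secret
metadata:
  name: my-machine-bootstrap            # hypothetical name
type: cluster.x-k8s.io/bootstrap-data   # hypothetical Secret type
stringData:
  join-configuration.yaml: |
    apiVersion: kubeadm.k8s.io/v1beta3
    kind: JoinConfiguration
    # ...
---
# Provisioner contract: a hypothetical provisioner object reads the bootstrap Secret
# plus user-supplied customisation, renders the final user data (Ignition, cloud-init, ...)
# and publishes the Secret that the infrastructure provider reads.
apiVersion: provisioning.cluster.x-k8s.io/v1alpha1   # hypothetical API group
kind: IgnitionConfig                                  # hypothetical kind
metadata:
  name: my-machine-provisioning
spec:
  bootstrapSecretRef:
    name: my-machine-bootstrap
  additionalFiles:
    - path: /etc/example.conf
      contents: "example=1"
status:
  dataSecretName: my-machine-userdata    # consumed by the infrastructure provider
```

Whether the glue ends up being a dedicated CRD like this, an extension of the existing bootstrap contract, or something else entirely is of course for the working group to decide.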
To summarize, I think that if we avoid tackling the design problem head on and look for ways around it, we'd find ourselves in a similar situation later on, only then we'd be disrupting even more areas of the project.
That said, I'm open to changing my mind on this, and from Microsoft's perspective if you see a shorter/simpler path which would allow us to graduate Ignition support to GA and add v3 support - we'd be happy of course 🙂
Adding to Fabrizio's comment.
I think we also have to keep in mind that Cluster API is at a point where we can't freely re-shuffle fundamental pieces of the project anymore. We made some very disruptive changes in v1alpha2 and v1alpha3 (4-5 years ago). We declared Cluster API production ready 3 years ago (Kubernetes Cluster API reaches production readiness with version 1.0).
We have to be very, very conscious of all the folks relying on Cluster API as a stable foundation (because we told them it is - 3 years ago). We really have to be careful not to blow up the ecosystem by making disruptive changes.
To summarize, I think we have to be very very careful with anything that basically comes down to a Cluster API v2.
(this should come down to evolving Cluster API responsibly and keeping everything backwards compatible, xref: https://cluster-api.sigs.k8s.io/user/manifesto#the-right-to-be-unfinished)
@johananl I really appreciate the fact that you clearly went miles down this path, but I (and probably others too) have to catch up on your work and make up our own minds about this problem (and this was probably one of the key drivers behind the idea of proposing a working group, I guess).
What I'm advocating for is that this working group should be open to other options, because if you tie this charter to a single option, this will implicitly prevent or limit others' contributions.
Also, please note that I'm bringing this up mostly because, based on my experience, limiting others' contributions in the initial phase might become an issue for the success of the working group down the line, when broader consensus will be required, e.g. to get proposals approved.
I would defer all other discussion to the working group, but I also don't want to be impolite and seem like someone who is ignoring your other comments, so I'll drop some points of discussion for the working group sessions.
WRT the initial assumption about the use cases we have in front of us, I get that theoretically we have to address a possibly high number of combinations, but on the other hand, as a matter of fact only the Ignition team has raised this concern so far.
I'm not sure why, but I think we should consider that either not all combinations of provisioners/bootstrappers make sense, or there is simply no demand for many of those combinations (but getting a better understanding of the problem scope is again part of the working group's goal IMO).
WRT the possible solution, frankly speaking I think it is too early to get into the weeds of different options without a better understanding of the context.
But just to avoid misunderstandings, I'm well aware that the current KubeadmConfig API is not in an ideal state for what we are discussing. As a matter of fact, it wasn't designed with re-use in mind.
However, just to be clear, what I'm advocating for is being open to exploring options for making re-use possible while minimising the number of breaking changes and the impact on users.
The idea I dropped in the previous note is just meant to make the case that other options might exist; it should not be taken as a fully defined alternative (sorry if this was not clear in my previous comment).
But finding a solution to this problem is a discussion I will be happy to have if the charter of this working group opens up to such options. Otherwise, why even start a working group if it cannot shape the target solution?
Hey @fabriziopandini. Thanks for your follow up.
Of course I'm open to discussing other solutions and would love to get contributions from others 🙂 I hope I didn't write anything which implies otherwise.
The specific solution in my WIP proposal definitely isn't the only one, and it also isn't a solution I had in mind in advance but rather what I could come up with after gathering a list of current issues and concerns from both users and maintainers around bootstrap and provisioning. I'm not attached to this proposal and am happy to drop it or replace it with a better proposal right away if we decide that's the way to go.
To be clear, the reason I became involved with this work initially was to improve the integration between Flatcar Container Linux and CAPI. So, with my Microsoft hat on I'd be very glad to avoid doing any major reworking of core CAPI pieces if it's not absolutely necessary. We'd be happy to simply keep Ignition supported, add support for Ignition v3 and work towards graduating Ignition support from experimental mode to a stable "GA".
However, I also have the OSS contributor hat, and with that hat on I always try to do what's right for the projects I'm involved in rather than solving ad-hoc issues which stand in my way while working on use cases relevant mainly to my company.
The reason I've gone into the weeds of #5294 and come up with this proposal is that previously we heard very clearly from some of the maintainers that before we can consider making Ignition a fully-supported, "GA" provisioner, we must tackle #5294. If that's no longer the case or if I somehow misunderstood the intention there -- great, as this would make our work around Ignition much shorter and easier 👍
I'm also in favor of deferring the technical discussions to the work group.
I understand your point (as well as Stefan's) about keeping the API stable. That's very important indeed. The only reason I proposed changing core API objects is that as far as I understood the CAPI maintainers (or at least some of them) required that we cleanly separate bootstrap from provisioning, and I couldn't find a way to truly separate the two without touching the API because the two concerns are unfortunately entangled in the API.
Let's continue this discussion in the first iteration of the work group then 🙂
Force-pushed from 6353aad to f61f4ee
Thank you Johanan, Fabrizio, and Stefan for tuning in! This is immensely helpful. Made a few changes to the proposal; reworked the whole user story part to focus on goals instead of implementations, and rephrased the "problem statement" section a bit to not hint at a solution when describing the issue. Added a new section on stability and compatibility - this really was the proverbial elephant in the room for me since in Flatcar, we put a massive (occasionally painful) focus on never breaking user workloads ever - kudos to Stefan for calling this out. I'll make sure to hold the working group proposals to an equally high standard. I don't think we're quite there yet but we're definitely making progress. Ready for another round of feedback!
Force-pushed from f61f4ee to dbc9ed4
Co-Authored-by: Johanan Liebermann <[email protected]> Signed-off-by: Thilo Fromm <[email protected]>
Force-pushed from dbc9ed4 to 857dd3d
@sbueringer, @fabriziopandini what do you think? Could you give it another pass?
Thanks! Added a few comments.
Currently, the kubeadm bootstrap provider implements two provisioning systems:
cloud-init and ignition.
As there is currently no abstraction of node configuration, both implementations
What do we mean by "node configuration"? Is it different than "provisioning"?
If so, maybe we should clarify what this term means since "configuration" is a very overloaded term.
If not, I suggest we use consistent terminology to avoid further confusion around this matter.
MicroK8s currently only supports cloud-init.
Outside of ClusterAPI, both [cloud-init](https://github.com/canonical/cloud-init)
and [ignition](https://github.com/coreos/ignition) provisioning is widely adopted
and [ignition](https://github.com/coreos/ignition) provisioning is widely adopted
and [ignition](https://github.com/coreos/ignition) provisioning are widely adopted |
It is likely that more provisioning systems exist; a clean separation and guidelines
will make it easier to add support to ClusterAPI as well as to share implementations
across bootstrap providers.
It is likely that more provisioning systems exist; a clean separation and guidelines
will make it easier to add support to ClusterAPI as well as to share implementations
across bootstrap providers.
It is likely that more provisioning systems exist; a clean separation between bootstrap and provisioning
and guidelines on how each is developed will make it easier to add support to ClusterAPI as well as to
share implementations across bootstrap providers. |
**Compatibility and API guarantees** |
Does this mean this work group can't work on any changes which may result in breaking changes? IMO this will severely limit the possible solutions we could come up with.
Maybe we should mention that backward-compatible changes are preferable and that in case breaking changes are introduced, a well-documented migration path with a proper deprecation process should be put in place?
## Scope
1. Propose architectural improvements that ease integration of node configuration |
Again, not clear on what "node configuration" refers to.
1. Propose architectural improvements that ease integration of node configuration
implementations.
Decouple generic bootstrapping from configuration language specific concerns.
2. Deliver an example implementation in the kubeadm bootstrapper. |
2. Deliver an example implementation in the kubeadm bootstrapper.
2. Deliver an example implementation in the kubeadm bootstrap provider. |
Better to use the official terminology to avoid confusion. Ditto all occurrences below.
What this PR does / why we need it:
Propose a working group for node bootstrapping and cluster provisioning.
The need for this working group originated from an ongoing discussion around separating cluster provisioning and node bootstrapping, as stated in the WG's User Story.
Which issue(s) this PR fixes
CC
Tags
/area provider/bootstrap-kubeadm
/area bootstrap
/kind documentation
/kind proposal