Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit ironic UID and GID assignment #582

Open
elfosardo opened this issue Oct 24, 2024 · 14 comments
Open

Revisit ironic UID and GID assignment #582

elfosardo opened this issue Oct 24, 2024 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@elfosardo
Copy link
Member

currently the ironic user and group have fixed UID and GID assigned, specifically UID is 997 and GID is 994
Unfortunately, with the increase of default system services, these numbers can easily conflict with other already installed services, for example in the CentOS Stream 10 image they're assigned by default to chrony and dockerroot.
We have at least two options:

  • remove fixed UID and GID and let the system generate them -> this would be the preferred solution
  • change UID and GID and assign higher numbers (ideally over 1000) -> with this approach we risk to see the same issue happening again, unless we use very high number
@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label Oct 24, 2024
@tuminoid
Copy link
Member

It should be noted that those UID/GID are present in BMO manifests, docs, guides, ironic-standalone-operator as well. Auto-generated UID/GID is a bit problematic for this reason, and probably we get off much easier if we just pick UID/GID at high range. Picking UID/GID under 1024 in the first place was kind of asking trouble later.

Just for completeness sake tho:

  • BMO manifest issue: having Ironic-image manifests in BMO is a WTF itself
  • Our fix to having Ironic manifests in BMO is ironic-standalone-operator
  • Hence ironic-standalone-operator can redefine the UID/GID freely
  • Docs is never a real issue

If this need fixing short term, I say picking a high range number is the choice right now. Long term, ironic-standalone-operator can let it auto-generate as its the sole controller for Ironic.

@elfosardo
Copy link
Member Author

@tuminoid thanks for the reply, I'm aware of the various places where the ironic UID/GID are used, that's why I opted for opening an issue instead of just going for a direct change.
The initial choice for UID/GID below 1024 was done to keep consistency with the RPM settings, but the fixed IDs have been removed since a while, that's why I would opt for auto-generated IDs.
I believe it would be possible then to just use the user and group name (ironic) instead of the specific IDs in all the places where they are used, the chain of changes is simplified in this way.
It's not a very urgent change, but the sooner, the better, as this conflicts with CentOS Stream 10 migration.
I think we should discuss this during one of the next metal3 meetings.

@permissiongranted
Copy link

permissiongranted commented Oct 25, 2024

Is this possibly the reason I'm seeing an issue with the the ironic-ipa-downloader container? e.g. possibly to do with securitycontexts of the pod

If I spin up the pod, it will go to completion and be healthy, however if the pod is killed and lands on the same node, then the init will fail with mkdir: cannot create directory '/shared': Permission denied

Edit: It looks like the /shared folder is root:root, but all subfolders are ironic:ironic, so there's probably something going on for my contexts using the BMO

@dtantsur
Copy link
Member

It looks like the /shared folder is root:root, but all subfolders are ironic:ironic, so there's probably something going on for my contexts using the BMO

Is /shared a host directory for you? I think you need privileged containers for that to work.

@permissiongranted
Copy link

permissiongranted commented Oct 25, 2024

I don't want to derail the issue, since this seems like a "my cluster" problem.

It's not a host directory; it's running longhorn CSI as it's provisioner with PVC as RWX, fstype ext4. So my guess is that longhorn is probably not respecting securityContext.fsGroup: 994 for some reason.

@tuminoid
Copy link
Member

I believe it would be possible then to just use the user and group name (ironic) instead of the specific IDs in all the places where they are used, the chain of changes is simplified in this way.

Issue with that is in k8s pod's securityContext, you cannot use user names, as k8s cannot resolve them. They're defined as UID/GID, see docs and ironic template in bmo

In general, the UID/GID change is pain, as Ironic manifests are in BMO, which leads to coupling and means people need to adapt the manifests if they use "non-coupled" versions of BMO/Ironic together. Not a blocker by any means, but major annoyance.

We have more and more cases coming where the Ironic manifests in BMO is really painful, and while our choice of solution is to wait for Ironic Standalone Operator, I'm starting to feel we need to address this sooner than that. But that is discussion we can have in another issue or meeting.

@dtantsur
Copy link
Member

Note: the operator needs fixing too https://github.com/metal3-io/ironic-standalone-operator/blob/main/pkg/ironic/containers.go#L22-L23

Unfortunately, it will not solve the problem of hardcoding UID/GID when building ironic-image: https://github.com/metal3-io/ironic-image/blob/main/prepare-image.sh#L29-L30

@Rozzii
Copy link
Member

Rozzii commented Nov 6, 2024

/triage accepted
/king bug

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Nov 6, 2024
@Rozzii
Copy link
Member

Rozzii commented Nov 6, 2024

/kind bug

@metal3-io-bot metal3-io-bot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 6, 2024
@dtantsur
Copy link
Member

dtantsur commented Nov 6, 2024

A wild idea: we can use a privileged init container to make upgrades possible without breakages.

So, state 1: ironic-image has OLD_ID, BMO has OLD_ID.

We add an init container whose only job is to take new ID's via environment variables and change the ownership of the files we care about.

State 2: ironic-image has OLD_ID, BMO scripts tell the init container to re-own files to NEW_ID and then deploy Ironic with NEW_ID.

At this point, we have an upgrade path. We only need to tell the users to upgrade ironic-image first.

State 3: ironic-image has NEW_ID, BMO scripts tell the init container to re-own files to NEW_ID and then deploy Ironic with NEW_ID.

After this, we can drop the workaround.

State 4: ironic-image has NEW_ID, BMO scripts deploy Ironic with NEW_ID.

@Rozzii
Copy link
Member

Rozzii commented Nov 7, 2024

We add an init container whose only job is to take new ID's via environment variables and change the ownership of the files we care about.

I am bit confused about "state1" if the BMO scripts still have the old ids in the security context how would the init container help? I mean it would reown stuff just fine but then the security context coming from BMO kustomize would want to run the containers with the old IDs.

@elfosardo
Copy link
Member Author

After the official release of CentOS Stream 10 last week https://blog.centos.org/2024/12/introducing-centos-stream-10/ we verified that the conflicting UID/GID are present only in the CS10 bootc container image, being associated to the kube user.
Now the CS10 migration patch is passing CI, waiting for the epel10 repo being 100% ready and especially for the inotify package, all looks good.
This is still an issue as the conflict may happen in the future, so let's keep working on it while reducing the urgency.

@tuminoid
Copy link
Member

Good update. Not having to do it right away helps us, as we can do the switch smoother later when we use IRSO. Don't need to care about BMO sync anymore.

@tuminoid
Copy link
Member

/lifecycle frozen

@metal3-io-bot metal3-io-bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants