rke2 start fails after rke2-killall.sh execution #6571

Open
mynktl opened this issue Aug 15, 2024 · 11 comments
Labels
area/install-script Issues that seem to be related to install.sh or the other shell scripts

Comments

mynktl commented Aug 15, 2024

Environmental Info:
RKE2 Version:

rke2 version v1.30.1+rke2r1 (e7f87c6)
go version go1.22.2 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux server0 4.18.0-372.41.1.el8_6.x86_64 #1 SMP Thu Jan 5 13:56:06 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server

Describe the bug:

For RKE2, we are using two mount points: one for the etcd database and one for rke2.
The structure of the /var/lib/rancher directory is as below:

[root@server0 rancher]# tree /var/lib/rancher/ -L 3
/var/lib/rancher/
`-- rke2
    |-- agent
    |   |-- client-ca.crt
    |   |-- client-kubelet.crt
    |   |-- client-kubelet.key
    |   |-- client-kube-proxy.crt
    |   |-- client-kube-proxy.key
    |   |-- client-rke2-controller.crt
    |   |-- client-rke2-controller.key
    |   |-- containerd
    |   |-- etc
    |   |-- images
    |   |-- kubelet.kubeconfig
    |   |-- kubeproxy.kubeconfig
    |   |-- logs
    |   |-- pod-manifests
    |   |-- rke2controller.kubeconfig
    |   |-- server-ca.crt
    |   |-- serving-kubelet.crt
    |   `-- serving-kubelet.key
    |-- bin -> /var/lib/rancher/rke2/data/v1.30.1-rke2r1-c42b85364830/bin
    |-- data
    |   `-- v1.30.1-rke2r1-c42b85364830
    `-- server
        |-- agent-token -> /var/lib/rancher/rke2/server/token
        |-- cred
        |-- db
        |-- etc
        |-- manifests
        |-- node-token -> /var/lib/rancher/rke2/server/token
        |-- tls
        `-- token

16 directories, 16 files

and the mount configuration is as below:

[root@server0 rancher]# lsblk
NAME                          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                             8:0    0  128G  0 disk
|-sda1                          8:1    0  500M  0 part /boot
|-sda2                          8:2    0   63G  0 part
| |-rootvg-tmplv              253:11   0    2G  0 lvm  /tmp
| |-rootvg-usrlv              253:12   0   10G  0 lvm  /usr
| |-rootvg-homelv             253:13   0    1G  0 lvm  /home
| |-rootvg-varlv              253:14   0    8G  0 lvm  /var
| `-rootvg-rootlv             253:15   0   22G  0 lvm  /
|-sda14                         8:14   0    4M  0 part
`-sda15                         8:15   0  495M  0 part /boot/efi
sdb                             8:16   0   16G  0 disk
`-uipathetcdvg-etcdlv         253:10   0   16G  0 lvm  /var/lib/rancher/rke2/server/db
sdd                             8:48   0  256G  0 disk
|-uipathvg-rancherlv          253:1    0  185G  0 lvm  /var/lib/rancher
|-uipathvg-kubeletlv          253:2    0   56G  0 lvm  /var/lib/kubelet
sdf                             8:80   0   32G  0 disk

To mount /var/lib/rancher/rke2/server/db automatically, we have added a dependency on this db mount to rke2-server.service.

[root@server0 rancher]# cat /etc/systemd/system/rke2-server.service.d/custom.conf
[Unit]
After=var-lib-rancher-rke2-server-db.mount var-lib-rancher.mount var-lib-kubelet.mount datadisk-insights.mount datadisk-monitoring.mount datadisk-objectstore.mount datadisk-registry.mount
Requires=var-lib-rancher-rke2-server-db.mount var-lib-rancher.mount var-lib-kubelet.mount datadisk-insights.mount datadisk-monitoring.mount datadisk-objectstore.mount datadisk-registry.mount

So whenever systemctl start rke2-server is executed, it performs the db mount and then starts the rke2 server.


Issue:
When we execute the rke2-killall.sh script, it unmounts /var/lib/rancher/rke2/server/db and deletes this directory. Even though we don't have a mount point specific to /var/lib/rancher/rke2, do_unmount_and_remove still acts on /var/lib/rancher/rke2/server/db because of the grep "^$1" prefix match in the do_unmount_and_remove function.
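
For reference, the helper follows roughly this pattern (a condensed reconstruction, not a verbatim copy of rke2-killall.sh; see also the fragment quoted further down in this thread):

do_unmount_and_remove() {
    MOUNTS=
    while read ignore mount ignore; do
        MOUNTS="${mount}\n${MOUNTS}"
    done </proc/self/mounts
    # "^$1" is a prefix match, so do_unmount_and_remove /var/lib/rancher/rke2
    # also selects the nested /var/lib/rancher/rke2/server/db mount.
    MOUNTS=$(printf ${MOUNTS} | grep "^$1" | sort -r)
    umount ${MOUNTS}
    rm -rf --one-file-system ${MOUNTS}
}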

After this execution, starting rke2 via the systemctl command fails with the error below:

-- Unit var-lib-rancher-rke2-server-db.mount has begun starting up.
Aug 15 02:48:25 server0 mount[1527607]: mount: /var/lib/rancher/rke2/server/db: mount point does not exist.
Aug 15 02:48:25 server0 systemd[1]: var-lib-rancher-rke2-server-db.mount: Mount process exited, code=exited status=32
Aug 15 02:48:25 server0 systemd[1]: var-lib-rancher-rke2-server-db.mount: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit var-lib-rancher-rke2-server-db.mount has entered the 'failed' state with result 'exit-code'.
Aug 15 02:48:25 server0 systemd[1]: Failed to mount /var/lib/rancher/rke2/server/db.
-- Subject: Unit var-lib-rancher-rke2-server-db.mount has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit var-lib-rancher-rke2-server-db.mount has failed.
--
-- The result is failed.

This behaviour occurs with SELinux enabled.

[root@server0 rancher]# getenforce
Enforcing

As we have a separate mount point for the etcd db, there is an additional risk of the etcd data getting deleted. In the rke2-killall.sh script,
the umount will normally succeed since it is a local mount point, but the script is missing set -e. If the umount fails, rm -rf --one-file-system ${MOUNTS} will delete the contents of the directory, which is unexpected.
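
A minimal sketch of the kind of guard we have in mind (not a patch against the actual script): only remove a path if its umount actually succeeded.

for m in ${MOUNTS}; do
    if umount "${m}"; then
        rm -rf --one-file-system "${m}"
    else
        echo "WARNING: failed to unmount ${m}, leaving it in place" >&2
    fi
done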

Steps To Reproduce:

  • Installed RKE2:

Expected behavior:

Actual behavior:

Additional context / logs:

rajivml commented Aug 15, 2024

cc @brandond

@brandond (Member)

This sounds like a duplicate of #6557

mynktl (Author) commented Aug 16, 2024

@brandond As the do_unmount_and_remove call for the rke2 directory has been removed in https://github.com/vitorsavian/rke2/blob/587fb7f22469a4827ea9040b36bfb23d14f9d0c5/bundle/bin/rke2-killall.sh, that will fix the problem for the rke2-related directories. But do you see any need for error handling or set -e in do_unmount_and_remove?

We are unmounting the directory and then removing it to clear the mount point. But if the unmount fails, the script still removes the contents of the directory, which may result in data loss.

brandond (Member) commented Aug 16, 2024

if the unmount fails, the script still removes the contents of the directory, which may result in data loss.

That shouldn't be the case; we use rm -rf --one-file-system ${MOUNTS}, which would not traverse across a filesystem boundary into a path that failed to unmount.

If you believe you're having problems with this not working as intended, please provide steps to reproduce.

@maxlillo

  1. cd /var/lib/rancher/rke2/server/db
  2. run bash -x rke2-killall.sh

Relevant output below:

+ do_unmount_and_remove /var/lib/rancher/rke2
+ umount /var/lib/rancher/rke2/server/db
umount: /var/lib/rancher/rke2/server/db: target is busy.
+ rm -rf --one-file-system /var/lib/rancher/rke2/server/db
rm: cannot remove '/var/lib/rancher/rke2/server/db': Device or resource busy
+ do_unmount_and_remove /var/lib/kubelet/pods
+ umount /var/lib/kubelet/pods/fd004727-387a-4336-a5

This was run with the script version from before the /var/lib/rancher/rke2 cleanup was removed.

But it seems like a good idea to do error checking, in case things like this happen that you don't expect. You never know.

@brandond (Member)

I don't think any changes are necessary. Please test on the release that no longer cleans up mounts under /var/lib/rancher/rke2.

maxlillo commented Aug 17, 2024

Why do you think adding error checking is unnecessary? Is there some concern, or do you have some coding standard or style guide you are adhering to?

The issue was not really with rm -rf --one-file-system ${MOUNTS}. In your code you execute:

MOUNTS=
while read ignore mount ignore; do
        MOUNTS="${mount}\n${MOUNTS}"
done </proc/self/mounts
MOUNTS=$(printf ${MOUNTS} | grep "^$1" | sort -r)

The last command results in MOUNTS being a collection of directories. I believe the purpose of the sort -r is to make sure you remove the mounts in order, deepest paths first.

echo $MOUNTS
/var/lib/rancher/rke2/server/db
/var/lib/rancher/rke2

Therefore, when rm is executed it runs rm -rf --one-file-system /var/lib/rancher/rke2/server/db, which removes the contents of the db directory because the prior umount command failed.
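
To illustrate the point with a standalone example (hypothetical paths, nothing rke2-specific): --one-file-system only skips mounts nested below the argument; it does not protect the contents when the argument is the mount point itself, because those contents sit on the same filesystem as the argument.

# /mnt/data is a separate filesystem with important files on it
cd /mnt/data                        # keeps the mount busy
umount /mnt/data                    # fails: target is busy
rm -rf --one-file-system /mnt/data  # the mount point itself survives ("Device or resource busy"),
                                    # but the files inside it are deleted, since they are on the
                                    # same filesystem as the command-line argument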

We found this issue because a customer ran it and it deleted their etcd db on the node. They were shutting down all nodes, but luckily this only happened on one node. If the script had error checking to begin with, this would never have happened.

In the current iteration, no commands catch my eye as unsafe, but I am always surprised by what I miss. And perhaps in the future something bad will get introduced.

I checked some online style guides, and this seems to be a best practice:

  1. https://google.github.io/styleguide/shellguide.html#calling-commands
  2. https://tldp.org/LDP/abs/html/external.html - could not find it stated explicitly, but the examples all include error checking

maxlillo commented Aug 28, 2024

@brandond there are still problems with the new script. It was updated to no longer remove "${RKE2_DATA_DIR}".

But we use topolvm, and it mounts different resources under /var/lib/kubelet/pods. (Not sure if other CSI plugins do this.)

In the same way we saw the etcd data accidentally get deleted, we see the contents of some of our StatefulSet PVCs get deleted.

rm: cannot remove '/var/lib/kubelet/pods/f891db50-3a58-4ba2-a1e7-cb7c90a42741/volumes/kubernetes.io~local-volume/insights-lookerdir-pv-autosuitea': Device or resource busy
rm: cannot remove '/var/lib/kubelet/pods/f891db50-3a58-4ba2-a1e7-cb7c90a42741/volumes/kubernetes.io~local-volume/insights-looker-datadir-pv-autosuitea': Device or resource busy

The content of these PVs was fully deleted.

Simply adding an error check to the whole operation would fix this. I would rather not see this in the wild like we saw with etcd getting deleted.

@brandond (Member)

I'll reopen this for follow-up.

If you can provide any steps for our QA team to reproduce the issue that would be appreciated. What are you using that fails to unmount, but does allow files to be removed?

@brandond brandond reopened this Aug 29, 2024

maxlillo commented Sep 2, 2024

@brandond I am not sure what originally caused the issue with etcd. My guess is that they have some sort of security scanning going on.

I think you could simulate this by making a host mount in a container, navigating to that directory under /var/lib/kubelet/pods, and then running rke2-killall.sh. Because the working directory of the shell is one of the folders being unmounted, the umount command fails. If that does not work, then you would need to install a CSI driver like topolvm.
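
A hypothetical way to simulate the busy mount without a CSI driver (every name and path below is invented for illustration):

# create a fake "pod volume" backed by a bind mount and keep it busy
mkdir -p /srv/fake-data /var/lib/kubelet/pods/fake-pod/volumes/fake-pv
echo important > /srv/fake-data/file
mount --bind /srv/fake-data /var/lib/kubelet/pods/fake-pod/volumes/fake-pv
cd /var/lib/kubelet/pods/fake-pod/volumes/fake-pv   # the shell's cwd keeps the mount busy
rke2-killall.sh
# If the script behaves as described above, the umount fails with "target is busy"
# and the rm -rf --one-file-system that follows deletes /srv/fake-data/file
# through the bind mount.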

We have not seen one of the PVs get wiped in a customer environment. But as part of our RCA we ask the question of "Can this happen again? What could have prevented this?" etc.

Since error checking was not added to the script, it was our conclusion that yes, this could happen again, and we probably cannot predict all the edge cases where it could happen. But losing data in this manner is unacceptable and preventable.

Just adding set -e at the start would prevent this, and as far as I am aware, adding an error check is common best practice.
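
A rough sketch of where that would sit (not the actual script; it assumes the umount is a plain command, not buried in a pipeline or condition where set -e would not trigger):

#!/bin/sh
set -e   # exit on the first failing command
# ...
umount ${MOUNTS}                     # a failure here now aborts the script,
rm -rf --one-file-system ${MOUNTS}   # so this rm is never reached after a failed umount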


This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@brandond brandond added area/install-script Issues that seem to be related to install.sh or the other shell scripts and removed status/stale labels Oct 24, 2024
@brandond brandond added this to the 2024-12 Release Cycle milestone Oct 24, 2024