rke2 start fails after rke2-killall.sh execution #6571

Open
mynktl opened this issue Aug 15, 2024 · 11 comments
Labels
area/install-script Issues that seem to be related to install.sh or the other shell scripts

Comments

mynktl commented Aug 15, 2024

Environmental Info:
RKE2 Version:

rke2 version v1.30.1+rke2r1 (e7f87c6)
go version go1.22.2 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux server0 4.18.0-372.41.1.el8_6.x86_64 #1 SMP Thu Jan 5 13:56:06 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server

Describe the bug:

For RKE2, we are using two mount points: one for the etcd database and one for rke2.
The structure of the /var/lib/rancher directory is as below:

[root@server0 rancher]# tree /var/lib/rancher/ -L 3
/var/lib/rancher/
`-- rke2
    |-- agent
    |   |-- client-ca.crt
    |   |-- client-kubelet.crt
    |   |-- client-kubelet.key
    |   |-- client-kube-proxy.crt
    |   |-- client-kube-proxy.key
    |   |-- client-rke2-controller.crt
    |   |-- client-rke2-controller.key
    |   |-- containerd
    |   |-- etc
    |   |-- images
    |   |-- kubelet.kubeconfig
    |   |-- kubeproxy.kubeconfig
    |   |-- logs
    |   |-- pod-manifests
    |   |-- rke2controller.kubeconfig
    |   |-- server-ca.crt
    |   |-- serving-kubelet.crt
    |   `-- serving-kubelet.key
    |-- bin -> /var/lib/rancher/rke2/data/v1.30.1-rke2r1-c42b85364830/bin
    |-- data
    |   `-- v1.30.1-rke2r1-c42b85364830
    `-- server
        |-- agent-token -> /var/lib/rancher/rke2/server/token
        |-- cred
        |-- db
        |-- etc
        |-- manifests
        |-- node-token -> /var/lib/rancher/rke2/server/token
        |-- tls
        `-- token

16 directories, 16 files

and the mount configuration is as below:

[root@server0 rancher]# lsblk
NAME                          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                             8:0    0  128G  0 disk
|-sda1                          8:1    0  500M  0 part /boot
|-sda2                          8:2    0   63G  0 part
| |-rootvg-tmplv              253:11   0    2G  0 lvm  /tmp
| |-rootvg-usrlv              253:12   0   10G  0 lvm  /usr
| |-rootvg-homelv             253:13   0    1G  0 lvm  /home
| |-rootvg-varlv              253:14   0    8G  0 lvm  /var
| `-rootvg-rootlv             253:15   0   22G  0 lvm  /
|-sda14                         8:14   0    4M  0 part
`-sda15                         8:15   0  495M  0 part /boot/efi
sdb                             8:16   0   16G  0 disk
`-uipathetcdvg-etcdlv         253:10   0   16G  0 lvm  /var/lib/rancher/rke2/server/db
sdd                             8:48   0  256G  0 disk
|-uipathvg-rancherlv          253:1    0  185G  0 lvm  /var/lib/rancher
|-uipathvg-kubeletlv          253:2    0   56G  0 lvm  /var/lib/kubelet
sdf                             8:80   0   32G  0 disk

To mount /var/lib/rancher/rke2/server/db automatically, we have added a dependency on this db mount to rke2-server.service.

[root@server0 rancher]# cat /etc/systemd/system/rke2-server.service.d/custom.conf
[Unit]
After=var-lib-rancher-rke2-server-db.mount var-lib-rancher.mount var-lib-kubelet.mount datadisk-insights.mount datadisk-monitoring.mount datadisk-objectstore.mount datadisk-registry.mount
Requires=var-lib-rancher-rke2-server-db.mount var-lib-rancher.mount var-lib-kubelet.mount datadisk-insights.mount datadisk-monitoring.mount datadisk-objectstore.mount datadisk-registry.mount

So whenever systemctl start rke2-server is executed, it performs the db mount and then starts the rke2 server.


Issue:
When we execute the rke2-killall.sh script, it unmounts /var/lib/rancher/rke2/server/db and deletes this directory. Even though we don't have a mount point specific to /var/lib/rancher/rke2, do_unmount_and_remove still acts on /var/lib/rancher/rke2/server/db because of the grep "^$1" prefix match in the do_unmount_and_remove function.
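
For reference, the helper follows roughly this pattern (a condensed reconstruction, not a verbatim copy of rke2-killall.sh; see also the fragment quoted further down in this thread):

do_unmount_and_remove() {
    MOUNTS=
    while read ignore mount ignore; do
        MOUNTS="${mount}\n${MOUNTS}"
    done </proc/self/mounts
    # "^$1" is a prefix match, so do_unmount_and_remove /var/lib/rancher/rke2
    # also selects the nested /var/lib/rancher/rke2/server/db mount.
    MOUNTS=$(printf ${MOUNTS} | grep "^$1" | sort -r)
    umount ${MOUNTS}
    rm -rf --one-file-system ${MOUNTS}
}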

After this execution, starting rke2 via the systemctl command fails with the error below:

-- Unit var-lib-rancher-rke2-server-db.mount has begun starting up.
Aug 15 02:48:25 server0 mount[1527607]: mount: /var/lib/rancher/rke2/server/db: mount point does not exist.
Aug 15 02:48:25 server0 systemd[1]: var-lib-rancher-rke2-server-db.mount: Mount process exited, code=exited status=32
Aug 15 02:48:25 server0 systemd[1]: var-lib-rancher-rke2-server-db.mount: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit var-lib-rancher-rke2-server-db.mount has entered the 'failed' state with result 'exit-code'.
Aug 15 02:48:25 server0 systemd[1]: Failed to mount /var/lib/rancher/rke2/server/db.
-- Subject: Unit var-lib-rancher-rke2-server-db.mount has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit var-lib-rancher-rke2-server-db.mount has failed.
--
-- The result is failed.

This behaviour occurs with SELinux enabled.

[root@server0 rancher]# getenforce
Enforcing

As we have a separate mount point for the etcd db, there is an additional risk of the etcd data getting deleted. In the rke2-killall.sh script,
the umount will normally succeed since it is a local mount point, but the script is missing set -e. If the umount fails, rm -rf --one-file-system ${MOUNTS} will delete the contents of the directory, which is unexpected.
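
A minimal sketch of the kind of guard we have in mind (not a patch against the actual script): only remove a path if its umount actually succeeded.

for m in ${MOUNTS}; do
    if umount "${m}"; then
        rm -rf --one-file-system "${m}"
    else
        echo "WARNING: failed to unmount ${m}, leaving it in place" >&2
    fi
done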

Steps To Reproduce:

  • Installed RKE2:

Expected behavior:

Actual behavior:

Additional context / logs:

rajivml commented Aug 15, 2024

cc @brandond

@brandond (Member)

This sounds like a duplicate of #6557

mynktl (Author) commented Aug 16, 2024

@brandond As the do_unmount_and_remove call for the rke2 directory has been removed in https://github.com/vitorsavian/rke2/blob/587fb7f22469a4827ea9040b36bfb23d14f9d0c5/bundle/bin/rke2-killall.sh, that will fix the problem for the rke2-related directories. But do you see any need for error handling or set -e in do_unmount_and_remove?

We are unmounting the directory and then removing it to clear the mount point. But if the unmount fails, the script still removes the contents of the directory, which may result in data loss.

brandond (Member) commented Aug 16, 2024

if the unmount fails, the script still removes the contents of the directory, which may result in data loss.

That shouldn't be the case; we use rm -rf --one-file-system ${MOUNTS}, which would not traverse across a filesystem boundary into a path that failed to unmount.

If you believe you're having problems with this not working as intended, please provide steps to reproduce.

@maxlillo

  1. cd /var/lib/rancher/rke2/server/db
  2. run bash -x rke2-killall.sh

Relevant output below:

+ do_unmount_and_remove /var/lib/rancher/rke2
+ umount /var/lib/rancher/rke2/server/db
umount: /var/lib/rancher/rke2/server/db: target is busy.
+ rm -rf --one-file-system /var/lib/rancher/rke2/server/db
rm: cannot remove '/var/lib/rancher/rke2/server/db': Device or resource busy
+ do_unmount_and_remove /var/lib/kubelet/pods
+ umount /var/lib/kubelet/pods/fd004727-387a-4336-a5

This was run with the script version from before the /var/lib/rancher/rke2 cleanup was removed.

But it seems like a good idea to do error checking, in case things like this happen that you don't expect. You never know.

@brandond (Member)

I don't think any changes are necessary. Please test on the release that no longer cleans up mounts under /var/lib/rancher/rke2.

maxlillo commented Aug 17, 2024

Why do you think adding error checking is unnecessary? Is there some concern, or do you have some coding standard or style guide you are adhering to?

The issue was not really with rm -rf --one-file-system ${MOUNTS}. In your code you execute:

MOUNTS=
while read ignore mount ignore; do
        MOUNTS="${mount}\n${MOUNTS}"
done </proc/self/mounts
MOUNTS=$(printf ${MOUNTS} | grep "^$1" | sort -r)

The last command results in MOUNTS being a collection of directories. I believe the purpose of the sort -r is to make sure you remove the mounts in order, deepest paths first.

echo $MOUNTS
/var/lib/rancher/rke2/server/db
/var/lib/rancher/rke2

Therefore, when rm is executed it runs rm -rf --one-file-system /var/lib/rancher/rke2/server/db, which removes the contents of the db directory because the prior umount command failed.
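
To illustrate the point with a standalone example (hypothetical paths, nothing rke2-specific): --one-file-system only skips mounts nested below the argument; it does not protect the contents when the argument is the mount point itself, because those contents sit on the same filesystem as the argument.

# /mnt/data is a separate filesystem with important files on it
cd /mnt/data                        # keeps the mount busy
umount /mnt/data                    # fails: target is busy
rm -rf --one-file-system /mnt/data  # the mount point itself survives ("Device or resource busy"),
                                    # but the files inside it are deleted, since they are on the
                                    # same filesystem as the command-line argument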

We found this issue because a customer ran it and it deleted their etcd db on the node. They were shutting down all nodes, but luckily this only happened on one node. If the script had error checking to begin with, this would never have happened.

In the current iteration, no commands catch my eye as unsafe, but I am always surprised by what I miss. And perhaps in the future something bad will get introduced.

I checked some online style guides, and this seems to be a best practice:

  1. https://google.github.io/styleguide/shellguide.html#calling-commands
  2. https://tldp.org/LDP/abs/html/external.html - could not find it stated explicitly, but the examples all include error checking

maxlillo commented Aug 28, 2024

@brandond there are still problems with the new script. It was updated to no longer remove "${RKE2_DATA_DIR}".

But we use topolvm, and it mounts different resources under /var/lib/kubelet/pods. (Not sure if other CSI plugins do this.)

In the same way we saw the etcd data accidentally get deleted, we see the contents of some of our StatefulSet PVCs get deleted.

rm: cannot remove '/var/lib/kubelet/pods/f891db50-3a58-4ba2-a1e7-cb7c90a42741/volumes/kubernetes.io~local-volume/insights-lookerdir-pv-autosuitea': Device or resource busy
rm: cannot remove '/var/lib/kubelet/pods/f891db50-3a58-4ba2-a1e7-cb7c90a42741/volumes/kubernetes.io~local-volume/insights-looker-datadir-pv-autosuitea': Device or resource busy

The content of these PVs was fully deleted.

Simply adding an error check to the whole operation would fix this. I would rather not see this in the wild like we saw with etcd getting deleted.

@brandond (Member)

I'll reopen this for follow-up.

If you can provide any steps for our QA team to reproduce the issue that would be appreciated. What are you using that fails to unmount, but does allow files to be removed?

@brandond brandond reopened this Aug 29, 2024

maxlillo commented Sep 2, 2024

@brandond I am not sure what originally caused the issue with etcd. My guess is that they have some sort of security scanning going on.

I think you could simulate this by making a host mount in a container, navigating to that directory under /var/lib/kubelet/pods, and then running rke2-killall.sh. Because the working directory of the shell is one of the folders being unmounted, the umount command fails. If that does not work, then you would need to install a CSI driver like topolvm.
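
A hypothetical way to simulate the busy mount without a CSI driver (every name and path below is invented for illustration):

# create a fake "pod volume" backed by a bind mount and keep it busy
mkdir -p /srv/fake-data /var/lib/kubelet/pods/fake-pod/volumes/fake-pv
echo important > /srv/fake-data/file
mount --bind /srv/fake-data /var/lib/kubelet/pods/fake-pod/volumes/fake-pv
cd /var/lib/kubelet/pods/fake-pod/volumes/fake-pv   # the shell's cwd keeps the mount busy
rke2-killall.sh
# If the script behaves as described above, the umount fails with "target is busy"
# and the rm -rf --one-file-system that follows deletes /srv/fake-data/file
# through the bind mount.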

We have not seen one of the PVs get wiped in a customer environment. But as part of our RCA we ask the question of "Can this happen again? What could have prevented this?" etc.

Since error checking was not added to the script, it was our conclusion that yes, this could happen again, and we probably cannot predict all the edge cases where it could happen. But losing data in this manner is unacceptable and preventable.

Just adding set -e at the start would prevent this, and as far as I am aware, adding an error check is common best practice.
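
A rough sketch of where that would sit (not the actual script; it assumes the umount is a plain command, not buried in a pipeline or condition where set -e would not trigger):

#!/bin/sh
set -e   # exit on the first failing command
# ...
umount ${MOUNTS}                     # a failure here now aborts the script,
rm -rf --one-file-system ${MOUNTS}   # so this rm is never reached after a failed umount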


This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@brandond brandond added area/install-script Issues that seem to be related to install.sh or the other shell scripts and removed status/stale labels Oct 24, 2024
@brandond brandond added this to the 2024-12 Release Cycle milestone Oct 24, 2024