Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rootfs: make pivot_root(2) dance handle initramfs case #4434

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .cirrus.yml
Original file line number Diff line number Diff line change
Expand Up @@ -78,8 +78,8 @@ task:
HOME: /root
CIRRUS_WORKING_DIR: /home/runc
GO_VERSION: "1.23"
BATS_VERSION: "v1.9.0"
RPMS: gcc git iptables jq glibc-static libseccomp-devel make criu fuse-sshfs container-selinux
BATS_VERSION: "v1.11.0"
RPMS: cpio gcc git iptables jq qemu-kvm glibc-static libseccomp-devel make criu fuse-sshfs container-selinux
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the QEMU test should be executed only on a single GHA Ubuntu job

Copy link
Member Author

@cyphar cyphar Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but the test itself runs quite quickly (1.6s on my machine) so I don't think it's worth making the test runner logic more complicated.

# yamllint disable rule:key-duplicates
matrix:
DISTRO: almalinux-8
Expand Down
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ jobs:
- name: install deps
run: |
sudo apt update
sudo apt -y install libseccomp-dev sshfs uidmap
sudo apt -y install cpio libseccomp-dev qemu-kvm sshfs uidmap

- name: install CRIU
if: ${{ matrix.criu == '' }}
Expand Down Expand Up @@ -140,7 +140,7 @@ jobs:
- name: Setup Bats and bats libs
uses: bats-core/[email protected]
with:
bats-version: 1.9.0
bats-version: 1.11.0
support-install: false
assert-install: false
detik-install: false
Expand Down
4 changes: 3 additions & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
ARG GO_VERSION=1.23
ARG BATS_VERSION=v1.9.0
ARG BATS_VERSION=v1.11.0
ARG LIBSECCOMP_VERSION=2.5.5

FROM golang:${GO_VERSION}-bookworm
Expand All @@ -16,6 +16,7 @@ RUN KEYFILE=/usr/share/keyrings/criu-repo-keyring.gpg; \
criu \
gcc \
gcc-multilib \
cpio \
curl \
gawk \
gperf \
Expand All @@ -24,6 +25,7 @@ RUN KEYFILE=/usr/share/keyrings/criu-repo-keyring.gpg; \
kmod \
pkg-config \
python3-minimal \
qemu-kvm \
sshfs \
sudo \
uidmap \
Expand Down
1 change: 1 addition & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -164,6 +164,7 @@ localunittest: all
integration: runcimage
$(CONTAINER_ENGINE) run $(CONTAINER_ENGINE_RUN_FLAGS) \
-t --privileged --rm \
-v /boot:/boot:ro \
-v /lib/modules:/lib/modules:ro \
-v $(CURDIR):/go/src/$(PROJECT) \
$(RUNC_IMAGE) make localintegration TESTPATH="$(TESTPATH)"
Expand Down
2 changes: 1 addition & 1 deletion Vagrantfile.fedora
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ Vagrant.configure("2") do |config|
cat << EOF | dnf -y --exclude=kernel,kernel-core shell && break
config install_weak_deps false
update
install iptables gcc golang-go make glibc-static libseccomp-devel bats jq git-core criu fuse-sshfs container-selinux
install cpio iptables gcc golang-go make qemu-kvm glibc-static libseccomp-devel bats jq git-core criu fuse-sshfs container-selinux
ts run
EOF
done
Expand Down
5 changes: 3 additions & 2 deletions create.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,8 +44,9 @@ command(s) that get executed on start, edit the args parameter of the spec. See
Usage: "specify the file to write the process id to",
},
cli.BoolFlag{
Name: "no-pivot",
Usage: "do not use pivot root to jail process inside rootfs. This should be used whenever the rootfs is on top of a ramdisk",
Name: "no-pivot",
Usage: "(deprecated; do not use)",
Hidden: true,
},
cli.BoolFlag{
Name: "no-new-keyring",
Expand Down
74 changes: 61 additions & 13 deletions libcontainer/rootfs_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -202,10 +202,19 @@ func prepareRootfs(pipe *syncSocket, iConfig *initConfig) (err error) {
return err
}

if config.NoPivotRoot {
err = msMoveRoot(config.Rootfs)
} else if config.Namespaces.Contains(configs.NEWNS) {
if config.Namespaces.Contains(configs.NEWNS) {
err = pivotRoot(config.Rootfs)
if config.NoPivotRoot {
logrus.Warnf("--no-pivot is deprecated and may be removed or silently ignored in a future version of runc -- see <https://github.com/opencontainers/runc/issues/4435> for more details")
cyphar marked this conversation as resolved.
Show resolved Hide resolved
if err != nil {
// Always try to do pivot_root(2) because it's safe, and only fallback
// to the unsafe MS_MOVE+chroot(2) dance if pivot_root(2) fails.
logrus.Warnf("your container failed to start with pivot_root(2) (%v) -- please open a bug report to let us know about your usecase", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End-users may not understand who are "us", as they don't execute runc directly.

We may link https://github.com/opencontainers/runc/issues for explicitness, but we may potentially get a report about some third-party product that we have never heard of.

Copy link
Member Author

@cyphar cyphar Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already provide a link to the tracking issue for this deprecation if you pass --no-pivot (regardless of whether pivot_root(2) works), I don't think we need to provide the same link twice?

err = msMoveRoot(config.Rootfs)
} else {
logrus.Warnf("despite setting --no-pivot, this container successfully started using pivot_root(2) -- consider removing the --no-pivot flag")
}
}
} else {
err = chroot()
}
Expand Down Expand Up @@ -1068,19 +1077,58 @@ func pivotRoot(rootfs string) error {
}
defer unix.Close(oldroot) //nolint: errcheck

newroot, err := unix.Open(rootfs, unix.O_DIRECTORY|unix.O_RDONLY, 0)
if err != nil {
return &os.PathError{Op: "open", Path: rootfs, Err: err}
}
defer unix.Close(newroot) //nolint: errcheck

// Change to the new root so that the pivot_root actually acts on it.
if err := unix.Fchdir(newroot); err != nil {
return &os.PathError{Op: "fchdir", Path: "fd " + strconv.Itoa(newroot), Err: err}
if err := os.Chdir(rootfs); err != nil {
return err
cyphar marked this conversation as resolved.
Show resolved Hide resolved
}

if err := unix.PivotRoot(".", "."); err != nil {
return &os.PathError{Op: "pivot_root", Path: ".", Err: err}
pivotErr := unix.PivotRoot(".", ".")
if errors.Is(pivotErr, unix.EINVAL) {
// If pivot_root(2) failed with -EINVAL, one of the possible reasons is
Copy link
Member

@lifubang lifubang Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are six reasons will cause -EINVAL:

       EINVAL new_root is not a mount point.

       EINVAL put_old is not at or underneath new_root.

       EINVAL The current root directory is not a mount point (because
              of an earlier [chroot(2)](https://man7.org/linux/man-pages/man2/chroot.2.html)).

       EINVAL The current root is on the rootfs (initial ramfs) mount;
              see NOTES.

       EINVAL Either the mount point at new_root, or the parent mount of
              that mount point, has propagation type MS_SHARED.

       EINVAL put_old is a mount point and has the propagation type
              MS_SHARED.

This PR tries to deal with the forth one, does this pr would cause some unexpected behaviors for the fifth one? (Maybe need to check once have a time)
The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.
Before this patch, it will return an error immediately, but after this patch, it will do some magic operation to let pivot_root(2) work, I think we should be careful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.

For this case, this patch has the same security issue like --no-pivot, even this flag is not provided. I can escape from the container.
So, I think we should split this PR into two:

  1. Deprecate the flag no-pivot, this should be in milestone 1.2.1;
  2. Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.

Copy link
Member Author

@cyphar cyphar Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case, this patch has the same security issue like --no-pivot, even this flag is not provided. I can escape from the container.

How? After pivot_root(2) the whole host mount tree is gone, no? chroot_fs_refs should move the current->fs->root to the pivot_root(2)'d path so the intermediate chroot(2) should not have any effect on the final mount setup AFAICS. Can you share a reproducer?

For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.

We also set the parent to MS_PRVIATE in rootfsParentMountPrivate. (It already should not be the case that we get -EINVAL for that reason because we explicitly make sure that it works with rootfsParentMountPrivate.) Is that not sufficient?

The bind-mount of / keeps the propagation settings the same as well. So if the propagation was actually wrong, the second pivot_root(2) would still fail. I also don't see how you could use that to break out of the container -- AFAIK the reason the kernel requires the mount to be !shared is so that the mount tree changes during pivot_root(2) don't propagate to the host (and break it).

Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.

The "magic" done is just creating a bind-mount of the root that can be used for pivot_root(2) (in an MS_PRIVATE mount so the mount itself will be MS_PRIVATE, to answer your question).

This is the only way of doing it (I spoke to the VFS maintainers a few weeks ago, and this is the only "right" way of doing it -- FWIW Lennart implied that systemd does this but they don't it seems).

It's a bit hard to understand what the problem is without an actual reproducer.

Deprecate the flag no-pivot, this should be in milestone 1.2.1;

We can't deprecate it without having a solution for minikube and kata...

Copy link
Member

@lifubang lifubang Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't deprecate it without having a solution for minikube and kata...

Mark it deprecate not means we remove it right now, it's only to remind users that they should not use it except there is no way to run their case, and to inform users that we can't guarantee the container's security with --no-privot.
But yes, it's a little strange do it in the situation that we still have no solution for minikube and kata or other projects. I think we can find a solution to fix it. I suggest to split into two because maybe we need more time to find out this solution, but I think we should release v1.2.1 ASAP, then we can let runc v1.2.* could be tested by more and more users, especially in some downstream rc versions.

We also set the parent to MS_PRVIATE in rootfsParentMountPrivate. (It already should not be the case that we get -EINVAL for that reason because we explicitly make sure that it works with rootfsParentMountPrivate.) Is that not sufficient?

No. I have concern about it for a long time after #4417 merged, the commit is: 13a6f56. This commit has changed the behavior of rootfsParentMountPrivate:
1. Before this commit, we make / to MS_PRIVATE mount;
2. After this commit, if the rootfs is a overlay mount, we will only make rootfs to MS_PRIVATE mount.
@kolyshkin I don't know this behavior change will cause some issues, but we will not set parent to MS_PRIVATE is a fact, but maybe make rootfs MS_PRIVATE is enough.

EDIT: Sorry, I'm wrong, the parent mount point has always been changed to MS_PRIVATE by rootfsParentMountPrivate; So this bug is not related to this, I have sent the email to @cyphar, if you don't receive it, please let me know.

It's a bit hard to understand what the problem is without an actual reproducer.

I'll send you a email ASAP.

// that we are in early boot and trying pivot_root on top of the
// initramfs (which isn't allowed because initramfs/rootfs doesn't have
// a parent mount).
//
// Traditionally, users were told to pass --no-pivot (which used chroot
// instead) but this is very insecure (even with the hardenings we've
// put into our chroot() wrapper).
//
// A much better solution is to create a bind-mount clone of / (which
// would have a parent) and then chroot into that clone so that we are
// properly rooted within a mount that has a parent mount. Then we can
// retry the pivot_root().

// Clone / on top of . to create a version of / that has a parent and
// so can be pivot-rooted.
if err := unix.Mount("/", ".", "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
err := &os.PathError{Op: "make clone of / mount", Path: rootfs, Err: err}
return fmt.Errorf("error during fallback for failed pivot_root (%w): %w", pivotErr, err)
}
// Switch to the cloned mount. We have to use the full path here
// because we need to get the kernel to move us into the new mount
// (chdir(".") will keep us in the old non-cloned / mount).
if err := os.Chdir(rootfs); err != nil {
return fmt.Errorf("error during fallback for failed pivot_root (%w): switch to cloned mount: %w", pivotErr, err)
}
// Move the cloned mount to /.
if err := unix.Mount(".", "/", "", unix.MS_MOVE, ""); err != nil {
err := &os.PathError{Op: "move / clone mount to /", Path: rootfs, Err: err}
return fmt.Errorf("error during fallback for failed pivot_root (%w): %w", pivotErr, err)
}
// Update current->fs->root to be the cloned / (to be pivot_root'd).
if err := unix.Chroot("."); err != nil {
err := &os.PathError{Op: "chroot into cloned /", Path: rootfs, Err: err}
return fmt.Errorf("error during fallback for failed pivot_root (%w): %w", pivotErr, err)
}

// Go back to the container rootfs and retry pivot_root.
if err := os.Chdir(rootfs); err != nil {
return fmt.Errorf("error during fallback for failed pivot_root (%w): %w", pivotErr, err)
}
pivotErr = unix.PivotRoot(".", ".")
}
if pivotErr != nil {
return &os.PathError{Op: "pivot_root", Path: rootfs, Err: pivotErr}
}

// Currently our "." is oldroot (according to the current kernel code).
Expand Down
5 changes: 3 additions & 2 deletions run.go
Original file line number Diff line number Diff line change
Expand Up @@ -57,8 +57,9 @@ command(s) that get executed on start, edit the args parameter of the spec. See
Usage: "disable the use of the subreaper used to reap reparented processes",
},
cli.BoolFlag{
Name: "no-pivot",
Usage: "do not use pivot root to jail process inside rootfs. This should be used whenever the rootfs is on top of a ramdisk",
Name: "no-pivot",
Usage: "(deprecated; do not use)",
Hidden: true,
},
cli.BoolFlag{
Name: "no-new-keyring",
Expand Down
44 changes: 40 additions & 4 deletions tests/integration/helpers.bash
Original file line number Diff line number Diff line change
Expand Up @@ -39,18 +39,54 @@ ARCH=$(uname -m)
# Seccomp agent socket.
SECCCOMP_AGENT_SOCKET="$BATS_TMPDIR/seccomp-agent.sock"

# Wrapper for runc.
function runc() {
run __runc "$@"
function sane_run() { # --pipe --timeout=<timeout>
local getopt
getopt="$(getopt -o + --long pipe,timeout: -- "$@")"
eval set -- "$getopt"

pipe=
timeout=
while true; do
case "$1" in
--pipe)
pipe=1
shift
;;
--timeout)
timeout="$2"
shift 2
;;
--)
shift
break
;;
*)
fail "unknown argument $1 ${2:-} to sane_run"
;;
esac
done
cmd_prefix=()
[ -n "$pipe" ] && cmd_prefix+=("bats_pipe")
[ -n "$timeout" ] && cmd_prefix+=("timeout" "--foreground" "--signal=KILL" "$timeout")

local cmd="$1"
shift

run "${cmd_prefix[@]}" "$cmd" "$@"

# Some debug information to make life easier. bats will only print it if the
# test failed, in which case the output is useful.
# shellcheck disable=SC2154
echo "$(basename "$RUNC") $* (status=$status):" >&2
echo "$cmd $* (status=$status):" >&2
# shellcheck disable=SC2154
echo "$output" >&2
}

# Wrapper for runc.
function runc() {
sane_run __runc "$@"
}

# Raw wrapper for runc.
function __runc() {
"$RUNC" ${RUNC_USE_SYSTEMD+--systemd-cgroup} \
Expand Down
Loading
Loading