From 7ae25ff532da1a2438bcb11b8e505b41be2a0ee7 Mon Sep 17 00:00:00 2001
From: David Karlsson <35727626+dvdksn@users.noreply.github.com>
Date: Tue, 17 Dec 2024 10:59:13 +0100
Subject: [PATCH 1/2] vale: add NUMA, BSD to acronym exceptions

Signed-off-by: David Karlsson <35727626+dvdksn@users.noreply.github.com>
---
 _vale/Docker/Acronyms.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/_vale/Docker/Acronyms.yml b/_vale/Docker/Acronyms.yml
index 8bad91e72b8..512319cdf37 100644
--- a/_vale/Docker/Acronyms.yml
+++ b/_vale/Docker/Acronyms.yml
@@ -17,6 +17,7 @@ exceptions:
   - AWS
   - BIOS
   - BPF
+  - BSD
   - CI
   - CISA
   - CLI
@@ -73,6 +74,7 @@ exceptions:
   - NFS
   - NOTE
   - NTLM
+  - NUMA
   - NVDA
   - OCI
   - OS

From 0a4d8cf004680d2f77b81edf4b0f8adaad6abbe7 Mon Sep 17 00:00:00 2001
From: David Karlsson <35727626+dvdksn@users.noreply.github.com>
Date: Tue, 17 Dec 2024 10:59:46 +0100
Subject: [PATCH 2/2] engine: refresh seccomp page

Signed-off-by: David Karlsson <35727626+dvdksn@users.noreply.github.com>
---
 _vale/config/vocabularies/Docker/accept.txt | 14 +--
 content/manuals/engine/security/seccomp.md | 112 ++++++++++----------
 2 files changed, 64 insertions(+), 62 deletions(-)

diff --git a/_vale/config/vocabularies/Docker/accept.txt b/_vale/config/vocabularies/Docker/accept.txt
index b5ec8d27138..205fcac3ab8 100644
--- a/_vale/config/vocabularies/Docker/accept.txt
+++ b/_vale/config/vocabularies/Docker/accept.txt
@@ -20,8 +20,8 @@ Couchbase
 Datadog
 Ddosify
 Debootstrap
-Dev Environments?
 Dev
+Dev Environments?
 Django
 Docker Build Cloud
 Docker Business
@@ -73,8 +73,8 @@ Nuxeo
 OAuth
 OTel
 Okta
-Paketo
 PKG
+Paketo
 Postgres
 PowerShell
 Python
@@ -98,8 +98,9 @@ WireMock
 Zscaler
 Zsh
 [Aa]utobuild
-[Bb]uildx
+[Aa]llowlist
 [Bb]uildpack(s)?
+[Bb]uildx
 [Cc]odenames?
 [Cc]ompose
 [Dd]istroless
@@ -134,6 +135,10 @@ Zsh
 [Ss]ysfs
 [Tt]oolchains?
 [Uu]narchived?
+[Uu]ngated
+[Uu]ntrusted
+[Uu]serland
+[Uu]serspace
 [Vv]irtiofs
 [Vv]irtualize
 [Ww]alkthrough
@@ -178,8 +183,5 @@ systemd
 tmpfs
 ufw
 umask
-ungated
-userland
-untrusted
 vSphere
 vpnkit
diff --git a/content/manuals/engine/security/seccomp.md b/content/manuals/engine/security/seccomp.md
index 1ea65a0b9d0..094bdbffe0a 100644
--- a/content/manuals/engine/security/seccomp.md
+++ b/content/manuals/engine/security/seccomp.md
@@ -26,13 +26,13 @@ protective while providing wide application compatibility. The default Docker
 profile can be found
 [here](https://github.com/moby/moby/blob/master/profiles/seccomp/default.json).
 
-In effect, the profile is an allowlist which denies access to system calls by
-default, then allowlists specific system calls. The profile works by defining a
+In effect, the profile is an allowlist that denies access to system calls by
+default and then allows specific system calls. The profile works by defining a
 `defaultAction` of `SCMP_ACT_ERRNO` and overriding that action only for specific
 system calls. The effect of `SCMP_ACT_ERRNO` is to cause a `Permission Denied`
 error. Next, the profile defines a specific list of system calls which are fully
 allowed, because their `action` is overridden to be `SCMP_ACT_ALLOW`. Finally,
-some specific rules are for individual system calls such as `personality`, and others, 
+some specific rules are for individual system calls such as `personality`, and others,
 to allow variants of those system calls with specific arguments.
 
 `seccomp` is instrumental for running Docker containers with least privilege. It
@@ -53,61 +53,61 @@ $ docker run --rm \
 
 Docker's default seccomp profile is an allowlist which specifies the calls that
 are allowed. The table below lists the significant (but not all) syscalls that
-are effectively blocked because they are not on the Allowlist. The table includes
+are effectively blocked because they are not on the allowlist. The table includes
 the reason each syscall is blocked rather than white-listed.
 
-| Syscall | Description |
-|---------------------|---------------------------------------------------------------------------------------------------------------------------------------|
-| `acct` | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. |
-| `add_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
-| `bpf` | Deny loading potentially persistent bpf programs into kernel, already gated by `CAP_SYS_ADMIN`. |
-| `clock_adjtime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
-| `clock_settime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
-| `clone` | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for CLONE_* flags, except `CLONE_NEWUSER`. |
-| `create_module` | Deny manipulation and functions on kernel modules. Obsolete. Also gated by `CAP_SYS_MODULE`. |
-| `delete_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
-| `finit_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
-| `get_kernel_syms` | Deny retrieval of exported kernel and module symbols. Obsolete. |
-| `get_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
-| `init_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
-| `ioperm` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
-| `iopl` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
-| `kcmp` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
-| `kexec_file_load` | Sister syscall of `kexec_load` that does the same thing, slightly different arguments. Also gated by `CAP_SYS_BOOT`. |
-| `kexec_load` | Deny loading a new kernel for later execution. Also gated by `CAP_SYS_BOOT`. |
-| `keyctl` | Prevent containers from using the kernel keyring, which is not namespaced. |
-| `lookup_dcookie` | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by `CAP_SYS_ADMIN`. |
-| `mbind` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
-| `mount` | Deny mounting, already gated by `CAP_SYS_ADMIN`. |
-| `move_pages` | Syscall that modifies kernel memory and NUMA settings. |
-| `nfsservctl` | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. |
-| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. |
-| `perf_event_open` | Tracing/profiling syscall, which could leak a lot of information on the host. |
-| `personality` | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
-| `pivot_root` | Deny `pivot_root`, should be privileged operation. |
-| `process_vm_readv` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
-| `process_vm_writev` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
+| Syscall | Description |
+| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `acct` | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. |
+| `add_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
+| `bpf` | Deny loading potentially persistent BPF programs into kernel, already gated by `CAP_SYS_ADMIN`. |
+| `clock_adjtime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
+| `clock_settime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
+| `clone` | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for CLONE\_\* flags, except `CLONE_NEWUSER`. |
+| `create_module` | Deny manipulation and functions on kernel modules. Obsolete. Also gated by `CAP_SYS_MODULE`. |
+| `delete_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
+| `finit_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
+| `get_kernel_syms` | Deny retrieval of exported kernel and module symbols. Obsolete. |
+| `get_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
+| `init_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
+| `ioperm` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
+| `iopl` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
+| `kcmp` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
+| `kexec_file_load` | Sister syscall of `kexec_load` that does the same thing, slightly different arguments. Also gated by `CAP_SYS_BOOT`. |
+| `kexec_load` | Deny loading a new kernel for later execution. Also gated by `CAP_SYS_BOOT`. |
+| `keyctl` | Prevent containers from using the kernel keyring, which is not namespaced. |
+| `lookup_dcookie` | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by `CAP_SYS_ADMIN`. |
+| `mbind` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
+| `mount` | Deny mounting, already gated by `CAP_SYS_ADMIN`. |
+| `move_pages` | Syscall that modifies kernel memory and NUMA settings. |
+| `nfsservctl` | Deny interaction with the kernel NFS daemon. Obsolete since Linux 3.1. |
+| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. |
+| `perf_event_open` | Tracing/profiling syscall, which could leak a lot of information on the host. |
+| `personality` | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulnerabilities. |
+| `pivot_root` | Deny `pivot_root`, should be privileged operation. |
+| `process_vm_readv` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
+| `process_vm_writev` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
 | `ptrace` | Tracing/profiling syscall. Blocked in Linux kernel versions before 4.8 to avoid seccomp bypass. Tracing/profiling arbitrary processes is already blocked by dropping `CAP_SYS_PTRACE`, because it could leak a lot of information on the host. |
-| `query_module` | Deny manipulation and functions on kernel modules. Obsolete. |
-| `quotactl` | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. |
-| `reboot` | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. |
-| `request_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
-| `set_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
-| `setns` | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. |
-| `settimeofday` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
-| `stime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
-| `swapon` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
-| `swapoff` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
-| `sysfs` | Obsolete syscall. |
-| `_sysctl` | Obsolete, replaced by /proc/sys. |
-| `umount` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
-| `umount2` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
-| `unshare` | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. |
-| `uselib` | Older syscall related to shared libraries, unused for a long time. |
-| `userfaultfd` | Userspace page fault handling, largely needed for process migration. |
-| `ustat` | Obsolete syscall. |
-| `vm86` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
-| `vm86old` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
+| `query_module` | Deny manipulation and functions on kernel modules. Obsolete. |
+| `quotactl` | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. |
+| `reboot` | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. |
+| `request_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
+| `set_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
+| `setns` | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. |
+| `settimeofday` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
+| `stime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
+| `swapon` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
+| `swapoff` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
+| `sysfs` | Obsolete syscall. |
+| `_sysctl` | Obsolete, replaced by /proc/sys. |
+| `umount` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
+| `umount2` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
+| `unshare` | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. |
+| `uselib` | Older syscall related to shared libraries, unused for a long time. |
+| `userfaultfd` | Userspace page fault handling, largely needed for process migration. |
+| `ustat` | Obsolete syscall. |
+| `vm86` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
+| `vm86old` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
 
 ## Run without the default seccomp profile
 
@@ -115,6 +115,6 @@ You can pass `unconfined` to run a container without the default seccomp
 profile.
 
 ```console
-$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
+$ docker run --rm -it --security-opt seccomp=unconfined debian:latest \
     unshare --map-root-user --user sh -c whoami
 ```