-
Notifications
You must be signed in to change notification settings - Fork 553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
seccomp: Allow specification of syscalls as numbers #1102
base: main
Are you sure you want to change the base?
Conversation
The motivation here is the same as moby/moby#41671, which was closed as requiring a spec change (hence this PR). In short, certain applications use high syscall numbers (e.g. 1000+) for private communication with an associated ptracer. Since these are not real syscalls, there is no corresponding string mapping for them in the runtime. Currently users simply run such applications in privileged containers, which is of course the absolute worst option. I would like to be able to provide these users a seccomp profile that works, but without being able to specify these pseudo-syscalls by number, this is not possible.
@@ -649,7 +649,7 @@ The following parameters can be specified to set up seccomp: | |||
For example, if `defaultAction` is `SCMP_ACT_KILL` and `syscalls` is empty or unset, the kernel will kill the container process on its first syscall. | |||
Each entry has the following structure: | |||
|
|||
* **`names`** *(array of strings, REQUIRED)* - the names of the syscalls. | |||
* **`names`** *(array of strings, REQUIRED)* - the names (or decimal strings of the number) of the syscalls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
positive integers?
@Keno looks like your commit message is missing a DCO sign-off; https://github.com/opencontainers/runtime-spec/pull/1102/checks?check_run_id=2310376370 |
So, generally, I understand and am "positive" on the idea but the devil may be in the details. Let me write down some thoughts 1. architecture-dependentOne reason for (currently) using named syscalls is that conversion of a syscall to its numeric value must take into account what platform and architecture the container will be deployed on, because numeric values will differ: Picking a random syscall ( SYS_EPOLL_CREATE = 213 // amd64
SYS_EPOLL_CREATE = 250 // arm
SYS_EPOLL_CREATE = 249 // s390x
SYS_EPOLL_CREATE = 5207 // mips64le This conversion is (I think) currently handled by libseccomp, and as such will always match the actual architecture; https://github.com/seccomp/libseccomp-golang/blob/febb5ff482c15f8cafd1f7269f3a5eba047f9a5f/seccomp.go#L452 result := C.seccomp_syscall_resolve_name_arch(arch.toNative(), cString) But pushing this "higher up the stack" to higher level runtimes where the profile is generated, it means that those runtimes must be aware of this conversion and might complicate things if (e.g.) the profile is generated on an orchestrator in a mixed-architecture cluster. So taking the above into account, named values would be more "portable". 2. description should include accepted format for numeric valuesGiven that the numeric values should be passed as a string, it's important to describe what formats are accepted. The value will have to be converted from a string to an integer, and things can go wrong in this translation if it's not explicitly specified (marshalling/unmarshalling JSON can differ between implementations). Thinking of strings that could (either intentionally or unintentionally) be regarded numeric values (hexadecimal, octal, etc.): 3. description should be non-ambiguousThis repository is a specification we should make absolutely sure that there's no ambiguity in how runtimes should handle these. For a specification, the goal is to describe exactly how things should be handled, and "explicit > implicit" is important. Given that the existing ("named") syscalls are already widely in use, we'll have to deal with a transition (some runtimes may, and some runtimes may not support numeric). The spec must describe how runtimes should handle this. Some options: A. make support for "numeric" values
|
I agree with your point (1) though the usecase mentioned in this issue is specific to "synthetic" syscalls which have syscall numbers the kernel doesn't use at all (and thus kernel weirdness with syscall numbers don't apply). I completely agree with point (2). However with point (3) I think you're missing one of the key behaviours that runtimes have today -- we ignore unknown syscalls. This means that current behaviour is (3A) and in fact (3B) would be both a backwards compatibility breakage (we cannot add REQUIRED to existing fields in a non-major version) and it would make runc not spec-compliant. (As an aside, when it comes to the fact this makes deny-lists insecure -- deny-lists are fundamentally insecure, so it's more important to make container configurations work on different kernel and libseccomp versions than it is to make deny-lists theoretically more secure.) |
Are we fine to accept this with (3A)? cc @kolyshkin |
The motivation here is the same as moby/moby#41671, which was closed
as requiring a spec change (hence this PR). In short, certain applications use high syscall numbers
(e.g. 1000+) for private communication with an associated ptracer. Since these are not real syscalls,
there is no corresponding string mapping for them in the runtime. Currently users simply run such
applications in privileged containers, which is of course the absolute worst option. I would like to be
able to provide these users a seccomp profile that works, but without being able to specify these
pseudo-syscalls by number, this is not possible.
cc @thaJeztah @staticfloat