Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seccomp: Allow specification of syscalls as numbers #1102

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Keno
Copy link

@Keno Keno commented Apr 10, 2021

The motivation here is the same as moby/moby#41671, which was closed
as requiring a spec change (hence this PR). In short, certain applications use high syscall numbers
(e.g. 1000+) for private communication with an associated ptracer. Since these are not real syscalls,
there is no corresponding string mapping for them in the runtime. Currently users simply run such
applications in privileged containers, which is of course the absolute worst option. I would like to be
able to provide these users a seccomp profile that works, but without being able to specify these
pseudo-syscalls by number, this is not possible.

cc @thaJeztah @staticfloat

The motivation here is the same as moby/moby#41671, which was closed
as requiring a spec change (hence this PR). In short, certain applications use high syscall numbers
(e.g. 1000+) for private communication with an associated ptracer. Since these are not real syscalls,
there is no corresponding string mapping for them in the runtime. Currently users simply run such
applications in privileged containers, which is of course the absolute worst option. I would like to be
able to provide these users a seccomp profile that works, but without being able to specify these
pseudo-syscalls by number, this is not possible.
@@ -649,7 +649,7 @@ The following parameters can be specified to set up seccomp:
For example, if `defaultAction` is `SCMP_ACT_KILL` and `syscalls` is empty or unset, the kernel will kill the container process on its first syscall.
Each entry has the following structure:

* **`names`** *(array of strings, REQUIRED)* - the names of the syscalls.
* **`names`** *(array of strings, REQUIRED)* - the names (or decimal strings of the number) of the syscalls.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

positive integers?

@thaJeztah
Copy link
Member

@Keno looks like your commit message is missing a DCO sign-off; https://github.com/opencontainers/runtime-spec/pull/1102/checks?check_run_id=2310376370

@thaJeztah
Copy link
Member

thaJeztah commented Apr 15, 2021

So, generally, I understand and am "positive" on the idea but the devil may be in the details. Let me write down some thoughts

1. architecture-dependent

One reason for (currently) using named syscalls is that conversion of a syscall to its numeric value must take into account what platform and architecture the container will be deployed on, because numeric values will differ:

Picking a random syscall (epoll_create), and looking at amd64, arm, s390x, and mips64le

SYS_EPOLL_CREATE = 213  // amd64
SYS_EPOLL_CREATE = 250  // arm
SYS_EPOLL_CREATE = 249  // s390x
SYS_EPOLL_CREATE = 5207 // mips64le

This conversion is (I think) currently handled by libseccomp, and as such will always match the actual architecture; https://github.com/seccomp/libseccomp-golang/blob/febb5ff482c15f8cafd1f7269f3a5eba047f9a5f/seccomp.go#L452

result := C.seccomp_syscall_resolve_name_arch(arch.toNative(), cString)

But pushing this "higher up the stack" to higher level runtimes where the profile is generated, it means that those runtimes must be aware of this conversion and might complicate things if (e.g.) the profile is generated on an orchestrator in a mixed-architecture cluster.

So taking the above into account, named values would be more "portable".

2. description should include accepted format for numeric values

Given that the numeric values should be passed as a string, it's important to describe what formats are accepted. The value will have to be converted from a string to an integer, and things can go wrong in this translation if it's not explicitly specified (marshalling/unmarshalling JSON can differ between implementations).

Thinking of strings that could (either intentionally or unintentionally) be regarded numeric values (hexadecimal, octal, etc.): "OxFF", "020", "DEADBEEF"

3. description should be non-ambiguous

This repository is a specification we should make absolutely sure that there's no ambiguity in how runtimes should handle these. For a specification, the goal is to describe exactly how things should be handled, and "explicit > implicit" is important.

Given that the existing ("named") syscalls are already widely in use, we'll have to deal with a transition (some runtimes may, and some runtimes may not support numeric). The spec must describe how runtimes should handle this.

Some options:

A. make support for "numeric" values OPTIONAL?

In this case, runtimes MUST / SHOULD log "unknown" syscalls as a warning, and proceed without them

This approach makes the transition easier: "higher level" runtimes can pass both named and numeric values, and depending on what "lower level" runtime is used, those numeric values would be ignored or not.

But this needs further detailing:

  • Seccomp profiles can be an "allow-list" or a "deny-list": ignoring unknown syscalls can therefore reduce security of the container (a syscall that should be blocked is now "allowed")
  • Are "mixed" (named and numeric) values allowed?
  • How should duplicates be handled? (["epoll_create", "213"])
    • pick only the numeric ones (if supported), and pick "named" ones if not supported?
    • pick both (later overrides former)
    • what if there's conflicting options (different options set for "named" and "numeric")?

B. make support for "numeric" values REQUIRED?

In this case, runtimes MUST produce an error, and reject the spec if they don't support numeric values

I think this would be the current behavior; any current runtime would not have the numeric values in their list, and would currently produce an error.

  • This is a safer option; unknown syscalls are not silently ignored, reducing the risk of "not" blocking syscalls that should be blocked
  • But it makes specifying profiles more complicated: higher level runtimes now need to know if the "lower" runtime does, or does not support numeric values

in a "mixed kernel version" cluster, syscalls that are not supported by the kernel on a node would result in a failure. To an extend, this situation can be (edit: apparently I was writing something here, but didn't include it; ignore for now (but perhaps I find it back 😅)

C. Separate fields for "numeric" and "named" values

Separate fields may reduce ambiguity, but I'd have to give this more thinking:

  • (Possibly) would allow (require?) numeric values to be passed per-architecture
  • Can remove the "duplicate entries" ambiguity, depending on how the new field should be treated, which could be:
    • "runtimes that support the new (numeric) field probably MUST pick (one of the fields), and ignore the other one"

However transition could be complicated:

  • runtimes that do not support the new (numeric) fields quite likely would "ignore" it, and as such would fully skip the seccomp profile, which is (again) likely undesirable.
  • during the transition, "higher level runtimes" MUST/SHOULD include both "named" and "numeric" variants in the profile to make sure the profile can be used.

@cyphar
Copy link
Member

cyphar commented Apr 20, 2021

@thaJeztah

I agree with your point (1) though the usecase mentioned in this issue is specific to "synthetic" syscalls which have syscall numbers the kernel doesn't use at all (and thus kernel weirdness with syscall numbers don't apply).

I completely agree with point (2).

However with point (3) I think you're missing one of the key behaviours that runtimes have today -- we ignore unknown syscalls. This means that current behaviour is (3A) and in fact (3B) would be both a backwards compatibility breakage (we cannot add REQUIRED to existing fields in a non-major version) and it would make runc not spec-compliant.

(As an aside, when it comes to the fact this makes deny-lists insecure -- deny-lists are fundamentally insecure, so it's more important to make container configurations work on different kernel and libseccomp versions than it is to make deny-lists theoretically more secure.)

@AkihiroSuda
Copy link
Member

AkihiroSuda commented Sep 1, 2022

Are we fine to accept this with (3A)?

cc @kolyshkin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants