Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ProcessTerminationResult, for better alignment with POSIX #244

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

EdSchouten
Copy link
Collaborator

@EdSchouten EdSchouten commented Mar 8, 2023

POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

There are two kinds of process termination:

  1. Normal termination occurs by a return from main(), when requested
    with the exit(), _exit(), or _Exit() functions; or when the last
    thread in the process terminates by returning from its start
    function, by calling the pthread_exit() function, or through
    cancellation.

  2. Abnormal termination occurs when requested by the abort() function
    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional oneof cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the kill utility.
Not the the actual SIG* constants. Adding an enumeration would also be
unwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).

Fixes: #240

@EdSchouten EdSchouten force-pushed the eschouten/20230308-termination branch 3 times, most recently from 25af301 to e3cce78 Compare March 15, 2023 15:52
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional `oneof` cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the `kill` utility.
Not the the actual `SIG*` constants. Adding an enumeration would also be
unwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).

Fixes: bazelbuild#240
@EdSchouten EdSchouten force-pushed the eschouten/20230308-termination branch from e3cce78 to 19fd68b Compare March 15, 2023 15:54
@EdSchouten
Copy link
Collaborator Author

Hi @EricBurnett @juergbi,

Could you folks please take a look at the latest version of this PR? I think I have processed your feedback accordingly:

  • The exit_code field is no longer marked as deprecated. Instead, each of the oneof arms of ProcessTerminationResult.reason now documents how its value maps to exit_code.
  • What @juergbi brought up during yesterday's meeting is correct: exit codes may also be exception values. But what's interesting is that they seemingly don't live in a distinct space from exit code. A process terminating because of an exception will under the hood merely call TerminateProcess() with the exception value as the exit code. This means that the Signaled arm in oneof will remain exclusively used by POSIX-like systems.

Thanks!

@@ -1312,6 +1323,57 @@ message OutputSymlink {
NodeProperties node_properties = 4;
}

// The termination result of a process, as reported by the operating
// system to the parent process (e.g., using wait*() on POSIX-like
// systems, or GetExitCodeProcess() on Microsoft Windows).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing or weakening the statement "as reported by the operating system..." . This structure of exit_code + process_termination_result I think would actually extend to other concepts as well in the future.

For example, a pattern I expect to need at some point is "flake due to intermediate on-worker infrastructure", where there is a logical test that we care about the result of, but also some intermediate scaffolding that's been injected to run it where we'd like to be able to distinguish flakes due to the scaffolding itself (that doesn't necessarily imply a problem with the test) from flakes of the test.

(To be clear, don't want to extend your PR to cover this; that can be a future discussion. I'd just say that the "process termination result" need not be only OS-provided information.

string signal_name = 1;
}

oneof reason {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about not using a oneof and simply having a set of optional fields? signal_name implying the lack of exit_code makes sense in this particular case, but I'm not sure every future result is going to follow the pattern of being mutually exclusive. E.g. if we also provided info from the syslog or Windows Event Log so you could know why the OS killed the process, say, where you'd get both a SIGKILL and some indication that it was killed due to OOM or too many handles or whatever.

So e.g.

message ProcessTerminationResult {
    //...
    int64 exit_code = 1;
    //...
    string signal_name = 2;
    //...
    string human_readable_explanation_for_signal = 3; 
}

With comments detailing the relationship between the various fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that open the door to problems with default values? I'm thinking about exit_code in particular.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. For many the empty field is trivially uninteresting (e.g. string signal, empty==missing) but you're right about exit_code.

A few options:

  1. Do nothing, since exit_code can be distinguished fairly readily: the only case of concern is when it's 0, where we shouldn't actually need to distinguish if its set or not. (Do we really need to provide a ProcessTerminationResult just to provide a 0 exit_code? Given that the top-level field already exists, it seems just as reasonable to omit the whole message entirely in that case. So "0 means unset" for exit_code is just fine, since it introduces no real ambiguity in practice.)
  2. Do something special for exit_code, since so far it's the only one that has an ambiguous default. A second "has_exit_code" field would do, for example.
  3. If you know a set that will definitely be mutually exclusive, a oneof suffices, and then we can use optional fields for the rest. So e.g. exit_code and signal_name within the oneof, but with the intent for augmenting fields (like the human_readable_explanation_for_signal example) to be optionals outside of it. Which I guess is...what you already have, just with the understanding that maybe not everything will go in the oneof?
  4. Make these structs with a single field instead of raw fields, since you can check for the presence of a struct. E.g.
message ProcessTerminationResult {
    message ExitCode {
        //...
        int64 code = 1;
    }
    ExitCode exit_code = 1;
    message Signal {
        //...
        string name = 1;
        string human_readable_explanation_for_signal = 2; 
    }
    Signal signal = 2;
}

I think I vote for (1): doesn't really feel like there's "real" ambiguity being introduced in this case, even for exit_code.

Copy link

@bduffany bduffany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this change; our impl currently returns -1 for exit code in this case, mostly as a consequence of using exec.Command(...).Wait() in Golang which returns -1 whenever the process doesn't exit(). It would be nice to surface more details as to whether it's an abort() or something else.

// Abnormal process termination occurred.
//
// On POSIX-like systems, this is achieved by calling abort(), or by
// receiving a signal for which no signal handler is installed.
Copy link

@bduffany bduffany Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: SIGKILL (e.g. OOM-kill) and SIGSTOP (probably less common in practice) in particular can't be handled, so maybe this wording is slightly clearer:

Suggested change
// receiving a signal for which no signal handler is installed.
// receiving a signal that is not handled by the process.

@@ -1156,6 +1156,17 @@ message ActionResult {
// The exit code of the command.
int32 exit_code = 4;

// An optional reason the process launched by this action terminated,
Copy link

@bduffany bduffany Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is optional, it seems that checking ActionResult.exit_code == 0 is still how the client should check whether an execution was successful or not, rather than inspecting this new field.

To avoid confusion now that this additional field is being added (which also contains exit code information), should the ActionResult.exit_code field be further specified so that it must be non-zero if the action did not succeed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Let exit_code be better aligned with C/POSIX
4 participants