Implement GNU Make 4.4+ jobserver fifo / semaphore client support #2450

hundeboll · 2024-05-17T14:08:22Z

The principle of such a job server is rather simple: Before starting a new job (edge in ninja-speak), a token must be acquired from an external entity. On posix systems, that entity is simply a fifo filled with N characters. On win32 systems it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity.

This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize cpu utilization, but if each compile job uses all available cores, the system is over loaded.

Note: this is a re-implementation of the last part[1] of the previous attempt to implement jobserver functionality. I have left out the server[2] part, and the older "pipe"[3] methods from here, as I don't need those. Doing so allows for a much simpler implementation.

Note note: I don't have windows or mac systems available. I would greatly appreciate anyone who can test on those for me.

[1] #2263
[2] #2260
[3] #1140

jhasse

Thanks for the PR!
My suggestions:

Declare variables when you use them, not C89-style at the beginning of the scope.
Move all function definitions to .cc files, not in a .h
Use {} even for one line statements

src/jobserver.h

src/build.h

hundeboll · 2024-05-18T07:05:47Z

Any idea why the windows build fails?

jhasse · 2024-05-18T07:32:56Z

missing #include <cassert>

hundeboll · 2024-05-18T07:42:33Z

missing #include <cassert>

Aah, missed that.

I was referring to the other error:

D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,20): error C2062: type 'unknown-type' unexpected [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]

hundeboll · 2024-05-18T07:58:09Z

missing #include <cassert>

Aah, missed that.

I was referring to the other error:

D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,20): error C2062: type 'unknown-type' unexpected [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]

Fixed now.

src/jobserver.h

src/build.cc

src/jobserver-win32.cc

hundeboll · 2024-05-19T17:06:34Z

May 19, 2024 19:03:13 David Turner ***@***.***>:

***@***.**** commented on this pull request. ---------------------------------------- In src/build.cc[#2450 (comment)]: > @@ -789,7 +805,7 @@ bool Builder::Build(string* err) { while (plan_.more_to_do()) { // See if we can start any more commands. if (failures_allowed) { - size_t capacity = command_runner_->CanRunMore(); + size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled()); while (capacity > 0) { I think I found it: FindWork() will return null if the token could not be acquired, so this effectively limits the number of processes that Ninja will ever spawn. And the Acquire() / Release() methods are never blocking (related to my other comment in jobserver.h). Now I am curious to understand what Ninja does where there are no spawned processes anymore, and no tokens available. Do you know if Ninja would be busy-looping in this case?

It breaks out of the inner while loop, and descents into WaitForCommand() below (still in the outer while loop). That command uses ppoll()/select() to wait for any command to return without busy looping. (Not at my laptop now, so details might differ slightly...) // Martin

…

— Reply to this email directly, view it on GitHub[#2450 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AAA6PIGB4SYSPHG36IPCBI3ZDDLM7AVCNFSM6AAAAABH4G5C3CVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRVGEYTKMJRHE]. You are receiving this because you authored the thread. [Tracking image][https://github.com/notifications/beacon/AAA6PICPL7UQFARDWJXE7WLZDDLM7A5CNFSM6AAAAABH4G5C3CWGG33NNVSW45C7OR4XAZNRKB2WY3CSMVYXKZLTORJGK5TJMV32UY3PNVWWK3TUL5UWJTT3C4T66.gif]

src/jobserver-posix.cc

src/jobserver.h

hundeboll · 2024-05-19T17:15:40Z

May 19, 2024 19:13:55 David Turner ***@***.***>:

***@***.**** commented on this pull request. ---------------------------------------- In src/jobserver.h[#2450 (comment)]: > + +struct Jobserver { + Jobserver(); + ~Jobserver(); + void Init(); + bool Enabled() const; + bool Acquire(); + void Release(); + +private: + bool ParseJobserverAuth(const char *type); + bool AcquireToken(); + void ReleaseToken(); + + std::string jobserver_name_; + size_t token_count_; I recommend using default initialization for members here to simplify the source code, e.g.: * size_t token_count_ = 0; #ifdef _WIN32 HANDLE sem_ = INVALID_HANDLE_VALUE; #else int fd_ = -1; #endif * wdyt?

All for it! Just didn't look enough around to see if it was something done elsewhere too...

…

— Reply to this email directly, view it on GitHub[#2450 (review)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AAA6PIDSQ3L6RLFEJSKVOXDZDDMVFAVCNFSM6AAAAABH4G5C3CVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRVGEYTMNJXGA]. You are receiving this because you authored the thread. [Tracking image][https://github.com/notifications/beacon/AAA6PIFFW2K3PZJLMJVPAJTZDDMVFA5CNFSM6AAAAABH4G5C3CWGG33NNVSW45C7OR4XAZNRKB2WY3CSMVYXKZLTORJGK5TJMV32UY3PNVWWK3TUL5UWJTT3C4WZU.gif]

hundeboll · 2024-05-19T17:45:01Z

May 19, 2024 19:37:01 David Turner ***@***.***>:

***@***.**** commented on this pull request. ---------------------------------------- In src/build.cc[#2450 (comment)]: > @@ -789,7 +805,7 @@ bool Builder::Build(string* err) { while (plan_.more_to_do()) { // See if we can start any more commands. if (failures_allowed) { - size_t capacity = command_runner_->CanRunMore(); + size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled()); while (capacity > 0) { Thanks, replying here to your email answer,

My answer shows up on GitHub too :) which mentions that the outer loop ends up blocking in SubprocessSet::DoWork(). It looks like, from the implementation, that if there is no running commands, this would either block infintely or just busy-loop calling *perror("ppoll")*, depending on whether ppoll() returns an error where there are no fds and no timeout provided to the syscall. There should always be at least one running command (that's why there's special casing on the _token_count == 0 in jobserver.cc

Neither is really good, but this is probably an edge case that we should document somewhere, and shouldn't stop submitting this PR. And please don't be mistaken by my remarks, this is truly an excellent feature, so thank you for uploading it :-)

You're very much welcome. And thanks for the review!

Server mode, and client-server passthrough will probably require much more complicated changes, but this hits the sweet spot for client-only support!

The new make-4.4 approach is so excellent, because it just requires the MAKEFLAGS env to be passed on to child commands. Which makes me wonder: does ninja filter the env when starting commands? // Martin

…

— Reply to this email directly, view it on GitHub[#2450 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AAA6PIGHQPIX7HM4BOACJB3ZDDPLZAVCNFSM6AAAAABH4G5C3CVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRVGEYTSMZTGI]. You are receiving this because you authored the thread. [Tracking image][https://github.com/notifications/beacon/AAA6PICMNKLQ5BZZ6CB46ADZDDPLZA5CNFSM6AAAAABH4G5C3CWGG33NNVSW45C7OR4XAZNRKB2WY3CSMVYXKZLTORJGK5TJMV32UY3PNVWWK3TUL5UWJTT3C44GI.gif]

digit-google · 2024-05-20T09:19:39Z

At the moment, Ninja always passes its environment to sub-commands, so the MAKEFLAGS value will be passed to them as well.

When using a named FIFO mechanism, either Posix or Windows, this is enough for them to participate properly in token negotiation (Ninja taking implicit token for each sub-command before launching it, as expected).

The file descriptor-based scheme will fail though (because Ninja doesn't try to keep these open in the spawned processes), and it's probably not something worthy of supporting, though this should be documented.

I am trying to setup some tests on top of your commits to see how we can ensure everything works as expected, and that we never regress in the future.

OT: Your answer appears in the general conversation for the PR, and not in the specific comment's thread. This loses context and can make things hard to follow. On the other hand, Github doesn't preserve comments when new commit are force-pushed to upload fixes (unlike Gerrit which tracks these very well), so these are not ideal either. Feel free to use whatever you prefer :)

hundeboll · 2024-05-24T08:09:04Z

@jhasse @digit-google I have fixed most of the comments, and responded to the remaining ones. Should I mark the fixed ones as resolved, or do you want to do that?

Is there anything else I need to address?

hundeboll · 2024-05-28T19:58:07Z

Rebased on master and removed the #define NOMINMAX from jobserver.h

jhasse

Thanks for the changes! The documentation is awesome :)

I've added several nitpick comments.

For the parsing I think it might be a good idea to add a unit test which check for the error/warning cases, too (i.e. invalid MAKEFLAGS).

Should I mark the fixed ones as resolved, or do you want to do that?

Feel free to resolve the comments yourself.

src/build.cc

src/build.h

src/jobserver.cc

src/jobserver.h

src/build_test.cc

jhasse · 2024-06-08T20:12:04Z

Would be great if someone could test it and comment here.

robUx4 · 2024-06-11T09:28:49Z

I did some test on VLC. We build 100+ contribs at once from autotools, CMake and meson projects. The build is started from a Makefile calling ninja for CMake and meson projects. We build ninja beforehand to have jobserver support. We use a prebuilt version of the Kitware version on our Docker.

This build is with the Kitware version of ninja with jobserver. This other build on the same machine as the same time, is done with this jobserver branch.

The first thing to notice is that this branch does build successfully. The other thing is that they take about the same time to build (10m51s vs 10m32s), suggesting the parallel usage (on this 48 cores machine) is working as expected. It's even slightly faster but I don't think we can really conclude it's faster.

hundeboll · 2024-06-11T09:36:47Z

@robUx4 thanks for testing. I'm afraid you need to tweak the build system to use the fifo style jobserver instead of the old-style pipe-fd method:
https://code.videolan.org/robUx4/vlc/-/jobs/1800466#L1738

hundeboll · 2024-06-11T10:00:25Z

@robUx4 btw: it's probably just a matter of updating make to version 4.4 or later...

robUx4 · 2024-06-11T10:03:54Z

We use whatever Debian is giving us. It seems Debian doesn't provide make 4.4 yet: https://packages.debian.org/bookworm/make, even in sid: https://packages.debian.org/sid/make

digit-google · 2024-06-17T17:08:04Z

Hello, I could experiment today with this patch applied to a local Ninja binary, used to build a small subset of Fuchsia targets.
This subset involves launching, in the end, about 8 sub-Ninja builds in parallel which fight for CPU resources concurrently to buils around 19,000+ targets each (moderated by default with -j5 on a very powerful workstation). This includes many Rust and C++ compilation / link commands (whose toolchain support MAKEFLAGS natively).

Good news, I see a decent improvement in build times when all remote builds are disabled (which is not our default configuration): 13m47s -> 12m39s.

For fully remote builds, we go: 5m54s -> 4m56s which is even nicer.

So this PR looks really good to me.

NOTE: I wrote a Python script to setup and serve the tokens, then invoking Ninja, see d6c0c1a (probably not the final version).

jdrouhard · 2024-06-17T18:41:06Z

Tested this briefly with our build, and for whatever reason, more tokens are returned to the pool than were originally acquired. This causes more and more parallel jobs to start as the build proceeds which eventually brings the system to a crawl.

At the end of our build:

make: INTERNAL: Exiting with 176 jobserver tokens available; should be 36!

I have 36 CPUs on the build machine, and specify -j 36 to the make invocation. make is 4.4.1 and the command recipe for doing the build spawns ninja with a + ninja logs which fifo it's using for the jobserver so I confirmed make is passing the appropriate file down and this PR is being used.

Is it guaranteed that there will be in equal number of FindWork() calls to EdgeFinished() calls? Briefly looking through the source, I'm seeing more EdgeFinished() calls nested within other aspects of the build process, such as NodeFinished(), EdgeMaybeReady(), etc. Seems like maybe it's possible to call EdgeFinished() more times than the initial FindWork() call.

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork().

This fixes the issue I'm seeing (applied to this PR):

diff --git a/src/build.cc b/src/build.cc
index f05e31e..a1e808e 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -170,6 +170,7 @@ Edge* Plan::FindWork() {
   }
 
   Edge* work = ready_.top();
+  work->acquired_job_server_token_ = jobserver_.Enabled();
   ready_.pop();
   return work;
 }
@@ -207,7 +208,7 @@ bool Plan::EdgeFinished(Edge* edge, EdgeResult result, string* err) {
   edge->pool()->RetrieveReadyEdges(&ready_);
 
   // Return the token acquired for this very edge to the jobserver
-  if (jobserver_.Enabled()) {
+  if (edge->acquired_job_server_token_) {
     jobserver_.Release();
   }
 
diff --git a/src/graph.h b/src/graph.h
index 314c442..f908d75 100644
--- a/src/graph.h
+++ b/src/graph.h
@@ -227,6 +227,7 @@ struct Edge {
   bool deps_loaded_ = false;
   bool deps_missing_ = false;
   bool generated_by_dep_loader_ = false;
+  bool acquired_job_server_token_ = false;
   TimeStamp command_start_time_ = 0;
 
   const Rule& rule() const { return *rule_; }

hundeboll · 2024-06-17T19:05:35Z

For fully remote builds, we go: 5m54s -> 4m56s which is even nicer.

So this PR looks really good to me.

Nice numbers.

NOTE: I wrote a Python script to setup and serve the tokens, then invoking Ninja, see d6c0c1a (probably not the final version).

Looks good. I've also written something similar, albeit less feature rich:
https://lore.kernel.org/openembedded-core/[email protected]/

hundeboll · 2024-06-17T19:06:27Z

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork()

Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...

jdrouhard · 2024-06-17T19:10:36Z

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork()

Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...

See my edit, really small diff that fixes the problem!

The principle of such a job server is rather simple: Before starting a new job (edge in ninja-speak), a token must be acquired from an external entity. On posix systems, that entity is simply a fifo filled with N characters. On win32 systems it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity. This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize cpu utilization, but if each compile job uses all available cores, the system is over loaded.

Implement proper testing of the MAKEFLAGS parsing, and the token acquire/release logic in the jobserver class.

hundeboll · 2024-06-17T19:24:51Z

See my edit, really small diff that fixes the problem!

Nice.

See my edit, really small diff that fixes the problem!

Thanks! Pushed the change with some comments added :)

mcprat · 2024-07-22T18:35:25Z

@hundeboll good job so far

I have left out ... the older "pipe"[3] methods from here, as I don't need those.

In the openwrt project, we have been using stefan's PR, but recently I have noticed issues with it. I want to test this one but we actually need the "UNIX pipe" method instead of a fifo. IMO this is not ready to merge until you implement both forms. Also I would not refer to the other method as the "old" one, which suggests it is deprecated or less effective when it isn't...

since it appears your system uses the fifo by default, you have to pass --jobserver-style=pipe to test it.
otherwise, testing jobserver client implementations for ninja is top of my todo list right now, so I'll be testing this intensely once you add support for the simple pipe method, if I don't figure it out myself first.

(it should be easy for you, since you have gotten this far 😃 )

mcprat · 2024-07-22T21:43:42Z

src/jobserver.cc

+  // Ignore the argument if the length or the value of the type value doesn't
+  // match the requested type (i.e. "fifo" on posix or "sem" on windows).
+  if (strlen(type) != static_cast<size_t>(str_colon - str_begin) ||
+      strncmp(str_begin, type, str_colon - str_begin)) {
+    Warning("invalid jobserver type: got %.*s; expected %s",
+        str_colon - str_begin, str_begin, type);
+    return false;
+  }
+
+  // Advance the string pointer to just after the : character
+  str_begin = str_colon + 1;
+


I believe this is unnecessary. I think it should be the job of a different function, platform-specific, to make sure the value is sane and let this function simply collect the value and make sure it's not empty.

According to the docs for Windows jobserver implementation, apparently there is no colon at all (or at least it's not guarenteed)

If you really want to detect whether the value is sane here, this function should be setting the type value instead of being told which one as an input, i.e.:

A. if a \ is found, type = sem
B. if a : is found, type = fifo
C. if a , is found, type = pipe
D. if neither is found, assume sem

then you can match string values instead of lengths for the type in the next function to handle the jobserver string value, if you don't handle it all here

but like I said, sanity checks should really not be done here. It is possible although unlikely for a colon or comma to be in the name of the semaphore object itself.

jhasse · 2024-07-23T08:04:08Z

I want to test this one but we actually need the "UNIX pipe" method instead of a fifo.

Why?

mcprat · 2024-07-23T17:02:29Z

I want to test this one but we actually need the "UNIX pipe" method instead of a fifo.

Why?

My machine has Make 4.3. I suppose I can compile my own Make for myself, but at some point, if we want to incorporate this for the entire openwrt community (and other projects) to see how it goes it would have to have support for some older versions, and otherwise it would default to no jobserver support for many users and maybe even a buildbot, which means spawning way too many jobs that would slow things down.

For example, when ninja is built on my system the jobs number defaults to 6. My machine has only 4 cores. Now imagine ninja being executed by make -j 4 with 6 extra jobs alongside whatever else. The load average easily goes beyond 12 depending on what ninja is building.

We can figure out a patch after the fact, but why not handle it upstream while it is being worked on? I find it strange to add support only for the latest and greatest version of Make first and then add a small change later to make it work for everyone when it can all be done together.

jhasse · 2024-07-23T19:10:30Z

So you're running an "old" make version which is why we refer to the pipe version as the "old" method ;)

[...] which suggests it is deprecated or less effective when it isn't...

I'm not an expert but from the comments I read on both PRs it is less effective.

We're very wary of changes that increase the complexity of Ninja, so a PR that implements both methods while one of them is technical superior and results in less code in Ninja (and to my understand that's the case for fifo), is very unlikely to get merged.

eli-schwartz · 2024-07-23T19:43:46Z

In the openwrt project, we have been using stefan's PR, but recently I have noticed issues with it. I want to test this one but we actually need the "UNIX pipe" method instead of a fifo. IMO this is not ready to merge until you implement both forms. Also I would not refer to the other method as the "old" one, which suggests it is deprecated or less effective when it isn't...

* New feature: The --jobserver-style command line option and named pipes
  A new jobserver method is used on systems where mkfifo(3) is supported.
  This solves a number of obscure issues related to using the jobserver
  and recursive invocations of GNU Make.  This change means that sub-makes
  will connect to the jobserver even if they are not marked as recursive.
  It also means that other tools that want to participate in the jobserver
  will need to be enhanced as described in the GNU Make manual.
  You can force GNU Make to use the simple pipe-based jobserver (perhaps if
  you are integrating with other tools or older versions of GNU Make) by
  adding the '--jobserver-style=pipe' option to the command line of the
  top-level invocation of GNU Make, or via MAKEFLAGS or GNUMAKEFLAGS.
  To detect this change search for 'jobserver-fifo' in the .FEATURES variable.

Sure sounds like the GNU Make documentation claims the anonymous pipe mode is less effective...

How about: https://www.gnu.org/software/make/manual/make.html#POSIX-Jobserver

On POSIX systems the jobserver is implemented in one of two ways: on systems that support it, GNU make will create a named pipe and use that for the jobserver. [...]

If the system doesn’t support named pipes, or if the user provided the --jobserver-style option and specified ‘pipe’, then the jobserver will be implemented as a simple UNIX pipe. [...]

So if possible, GNU Make prefers the fifo based jobserver. Sure sounds like the anonymous pipes are soft deprecated...

...

Overall, I do understand that you want support for anonymous pipes for backwards compatibility with older make. I just don't see why you are arguing that the need to support legacy systems is synonymous with being equally effective and equally un-deprecated. Anonymous pipes are clearly both deprecated and less effective, even if they are sometimes your only available option.

mcprat · 2024-07-24T02:20:22Z

I understand your points. of course, the difference between the methods is not arbitrary, there was a reason for it, but the keyword is "obscure", and every major project that builds with Makefiles has been writing their Makefiles, in some cases for decades, in a way that handles the noted obscure caveat in order to support every version of Make up until the most recent one 2 years ago. Unless the caveat is not being handled, the two methods ought to be considered functionally equivalent, especially since Make does not care which order the tokens are received and returned as long as it is the same token, and I can't imagine there being some kind of performance difference...

I don't mean to be argumentative on this point, but the way I see deprecated options is that it would be labeled as "deprecated" or otherwise planned to no longer be recognized after a certain point in time. The people at GNU often make this perfectly clear instead of just "suggesting" it.

Speaking of complexity, I think it would be a fairly small diff in order to accommodate the support for the "old" way. If you want to prevent excess complexity from making it's way into ninja's source then put some focus on my review comment.

I understand it may not be urgent in your perspective, and that's fine as I do not make decisions here, but please be aware that in the perspective of any downstream project, without "full" support, the feature you're actually adding is instability in the form of code variability, where the code affects some group of users one way, and another group of user another way. Otherwise, if you're certain on not handling it in this PR I would be happy to open my own PR here afterward to add support for the older method if no one else wants to do so. However, it would be nice to hear if you support the idea in general, at least, so that the next version would have "full" support, regardless of what we do in this PR.

mcprat · 2024-07-24T20:56:14Z

src/jobserver.cc

+bool Jobserver::Acquire() {
+  // The first token is implicitly handed to the ninja process, so don't
+  // acquire it from the jobserver
+  if (token_count_ == 0 || AcquireToken()) {
+    token_count_++;
+    return true;
+  }
+
+  return false;
+}
+
+void Jobserver::Release() {
+  assert(token_count_ >= 1);
+  token_count_--;
+
+  // Don't return first token to the jobserver, as it is implicitly handed
+  // to the ninja process
+  if (token_count_ > 0) {
+    ReleaseToken();
+  }
+}
+


these functions are so small, the logic here might as well be combined with the implementation-specific function that they call, unless there is some reason to split the functions between private and public members of the class. If so please correct me.

If that happens, there's only one function left. Merging all the functions in jobserver.cc to the platform-specific files would also allow simplification of the parse function, making it smaller for each build-type and further reduce complexity.

I put this logic in a common place as it is quite important (and a bit non-intuitive) to not acquire/release the first token. Having it duplicated in platform specific implementations increases the risk of getting it wrong.

I'm not really following... I think comments are enough to point out the importance and lack of intuition of those lines. Get it right once and if it needs to be adjusted for a new platform it would start with a copy/paste of the previous one.

mcprat · 2024-07-24T21:24:53Z

src/jobserver-posix.cc

+void Jobserver::ReleaseToken() {
+  char token = '+';


from the Make manual:

It’s important that when you release the job slot, you write back the same character you read. Don’t assume that all tokens are the same character; different characters may have different meanings to GNU make.

I have not looked at Make's code relevant to this yet, so I don't know the significance. It might be that the character being anything other than a + is a rare occurence... maybe someone else can clarify how important it is.

I will follow up with another comment with more information as I continue to play with the PR's changes and review Make's source.

Yeah, I've noticed the same from the Make manual. It might make sense to store the acquired token in the (recently added) Edge::acquired_job_server_token_ member. I'll look into it.

After looking at GNU Make source history, I changed my mind. I don't think this is important. There is no sign that they are going to introduce new characters that have special meaning, and if they somehow do that, it would be pretty easy to add support for it later. Let's save the idea for the future...

digit-google · 2024-07-25T09:59:30Z

src/build.h

 };

 /// CommandRunner is an interface that wraps running the build
 /// subcommands.  This allows tests to abstract out running commands.
 /// RealCommandRunner is an implementation that actually runs commands.
 struct CommandRunner {
  virtual ~CommandRunner() {}
-  virtual size_t CanRunMore() const = 0;
+  virtual size_t CanRunMore(bool jobserver_enabled) const = 0;


I suggest not changing this method's signature at all, to minimize changes.

Since the flag value is not going to change during a build, you can simply pass plan_.JobserverEnabled() to the RealCommandRunner constructor to record its value, and have RealCommandRunner::CanRunMore() return SIZE_MAX when the flag is enabled.

This should work identically, without having to change the DryCommandRunner and TestCommandRunner interfaces / implementations.

Good point.

digit-google · 2024-07-25T10:06:46Z

src/jobserver.cc

+  }
+
+  const char* jobserver_auth = "--jobserver-auth=";
+  const char* str_begin = strstr(makeflags, jobserver_auth);


According to https://www.gnu.org/software/make/manual/html_node/Job-Slots.html:

Be aware that the MAKEFLAGS variable may contain multiple instances of the --jobserver-auth= option. Only the last instance is relevant.

Hence the code should probably try to reflect that.

Also it looks like there is no specific prefix on Win32 for the semaphore name (https://www.gnu.org/software/make/manual/html_node/Windows-Jobserver.html) but that --jobserver-style=sem should be specified in MAKEFLAGS as well.

I am also wary of multiple --jobserver-style values in the input string. I would assume that only the last one should be followed, and that --jobserver-style=pipe should be a condition to ignore the jobserver master.

Given this, I suggest creating a standalone static function that parses a MAKEFLAGS string value as input, and returns a struct describing the wanted jobserver type + argument (fifo path or semaphore name) from it (it doesn't have to contain Win32/Posix specific code paths), so this can be unit-tested properly with lots of edge cases.

Good point.

mcprat · 2024-08-02T10:19:42Z

@jhasse @hundeboll I have been working on my own version of this PR for a while. It would take a long time to explain and demonstrate some of the changes I made. Would it be appropriate to create my own PR with your commit, adding myself as a signoff, and we can review each other's ideas directly?

This could expedite the process of actually getting the feature accepted and merged, in case that the ninja developers are looking to hurry up the process instead of putting it off until the release after the next upcoming release...

Also I'm wondering, is there a timeline for versions or is it just based on how many changes since last release?

hundeboll · 2024-08-02T10:20:33Z

My machine has Make 4.3.

I'm actually surprised that Debian / Ubuntu hasn't upgraded make in 1½ years by now. Maybe we should try to bug them a bit:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1029106

hundeboll · 2024-08-02T10:21:52Z

@jhasse @hundeboll I have been working on my own version of this PR for a while. It would take a long time to explain and demonstrate some of the changes I made. Would it be appropriate to create my own PR with your commit, adding myself as a signoff, and we can review each other's ideas directly?

Sure, go ahead. My goal is to add jobserver support in OpenEmbedded, so adding it to ninja is just a means to that.

jhasse reviewed May 17, 2024

View reviewed changes

src/jobserver.h Outdated Show resolved Hide resolved

src/jobserver.h Outdated Show resolved Hide resolved

src/build.h Outdated Show resolved Hide resolved

hundeboll force-pushed the jobserver branch from 91fda41 to 5b613ea Compare May 18, 2024 07:02

hundeboll force-pushed the jobserver branch from 5b613ea to a81fb9f Compare May 18, 2024 07:39

hundeboll force-pushed the jobserver branch 2 times, most recently from 27a269b to cc1044e Compare May 18, 2024 07:56