-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GNU Make 4.4+ jobserver fifo / semaphore client support #2450
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
My suggestions:
- Declare variables when you use them, not C89-style at the beginning of the scope.
- Move all function definitions to .cc files, not in a .h
- Use {} even for one line statements
Any idea why the windows build fails? |
missing |
Aah, missed that. I was referring to the other error:
|
27a269b
to
cc1044e
Compare
Fixed now. |
May 19, 2024 19:03:13 David Turner ***@***.***>:
***@***.**** commented on this pull request.
----------------------------------------
In src/build.cc[#2450 (comment)]:
> @@ -789,7 +805,7 @@ bool Builder::Build(string* err) {
while (plan_.more_to_do()) {
// See if we can start any more commands.
if (failures_allowed) {
- size_t capacity = command_runner_->CanRunMore();
+ size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled());
while (capacity > 0) {
I think I found it: FindWork() will return null if the token could not be acquired, so this effectively limits the number of processes that Ninja will ever spawn. And the Acquire() / Release() methods are never blocking (related to my other comment in jobserver.h).
Now I am curious to understand what Ninja does where there are no spawned processes anymore, and no tokens available. Do you know if Ninja would be busy-looping in this case?
It breaks out of the inner while loop, and descents into WaitForCommand() below (still in the outer while loop). That command uses ppoll()/select() to wait for any command to return without busy looping.
(Not at my laptop now, so details might differ slightly...)
// Martin
…
—
Reply to this email directly, view it on GitHub[#2450 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AAA6PIGB4SYSPHG36IPCBI3ZDDLM7AVCNFSM6AAAAABH4G5C3CVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRVGEYTKMJRHE].
You are receiving this because you authored the thread.
[Tracking image][https://github.com/notifications/beacon/AAA6PICPL7UQFARDWJXE7WLZDDLM7A5CNFSM6AAAAABH4G5C3CWGG33NNVSW45C7OR4XAZNRKB2WY3CSMVYXKZLTORJGK5TJMV32UY3PNVWWK3TUL5UWJTT3C4T66.gif]
|
May 19, 2024 19:13:55 David Turner ***@***.***>:
***@***.**** commented on this pull request.
----------------------------------------
In src/jobserver.h[#2450 (comment)]:
> +
+struct Jobserver {
+ Jobserver();
+ ~Jobserver();
+ void Init();
+ bool Enabled() const;
+ bool Acquire();
+ void Release();
+
+private:
+ bool ParseJobserverAuth(const char *type);
+ bool AcquireToken();
+ void ReleaseToken();
+
+ std::string jobserver_name_;
+ size_t token_count_;
I recommend using default initialization for members here to simplify the source code, e.g.:
* size_t token_count_ = 0;
#ifdef _WIN32
HANDLE sem_ = INVALID_HANDLE_VALUE;
#else
int fd_ = -1;
#endif
*
wdyt?
All for it! Just didn't look enough around to see if it was something done elsewhere too...
…
—
Reply to this email directly, view it on GitHub[#2450 (review)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AAA6PIDSQ3L6RLFEJSKVOXDZDDMVFAVCNFSM6AAAAABH4G5C3CVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRVGEYTMNJXGA].
You are receiving this because you authored the thread.
[Tracking image][https://github.com/notifications/beacon/AAA6PIFFW2K3PZJLMJVPAJTZDDMVFA5CNFSM6AAAAABH4G5C3CWGG33NNVSW45C7OR4XAZNRKB2WY3CSMVYXKZLTORJGK5TJMV32UY3PNVWWK3TUL5UWJTT3C4WZU.gif]
|
May 19, 2024 19:37:01 David Turner ***@***.***>:
***@***.**** commented on this pull request.
----------------------------------------
In src/build.cc[#2450 (comment)]:
> @@ -789,7 +805,7 @@ bool Builder::Build(string* err) {
while (plan_.more_to_do()) {
// See if we can start any more commands.
if (failures_allowed) {
- size_t capacity = command_runner_->CanRunMore();
+ size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled());
while (capacity > 0) {
Thanks, replying here to your email answer,
My answer shows up on GitHub too :)
which mentions that the outer loop ends up blocking in SubprocessSet::DoWork(). It looks like, from the implementation, that if there is no running commands, this would either block infintely or just busy-loop calling *perror("ppoll")*, depending on whether ppoll() returns an error where there are no fds and no timeout provided to the syscall.
There should always be at least one running command (that's why there's special casing on the _token_count == 0 in jobserver.cc
Neither is really good, but this is probably an edge case that we should document somewhere, and shouldn't stop submitting this PR. And please don't be mistaken by my remarks, this is truly an excellent feature, so thank you for uploading it :-)
You're very much welcome. And thanks for the review!
Server mode, and client-server passthrough will probably require much more complicated changes, but this hits the sweet spot for client-only support!
The new make-4.4 approach is so excellent, because it just requires the MAKEFLAGS env to be passed on to child commands. Which makes me wonder: does ninja filter the env when starting commands?
// Martin
…
—
Reply to this email directly, view it on GitHub[#2450 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AAA6PIGHQPIX7HM4BOACJB3ZDDPLZAVCNFSM6AAAAABH4G5C3CVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDANRVGEYTSMZTGI].
You are receiving this because you authored the thread.
[Tracking image][https://github.com/notifications/beacon/AAA6PICMNKLQ5BZZ6CB46ADZDDPLZA5CNFSM6AAAAABH4G5C3CWGG33NNVSW45C7OR4XAZNRKB2WY3CSMVYXKZLTORJGK5TJMV32UY3PNVWWK3TUL5UWJTT3C44GI.gif]
|
At the moment, Ninja always passes its environment to sub-commands, so the When using a named FIFO mechanism, either Posix or Windows, this is enough for them to participate properly in token negotiation (Ninja taking implicit token for each sub-command before launching it, as expected). The file descriptor-based scheme will fail though (because Ninja doesn't try to keep these open in the spawned processes), and it's probably not something worthy of supporting, though this should be documented. I am trying to setup some tests on top of your commits to see how we can ensure everything works as expected, and that we never regress in the future. OT: Your answer appears in the general conversation for the PR, and not in the specific comment's thread. This loses context and can make things hard to follow. On the other hand, Github doesn't preserve comments when new commit are force-pushed to upload fixes (unlike Gerrit which tracks these very well), so these are not ideal either. Feel free to use whatever you prefer :) |
69bf358
to
e92f95b
Compare
@jhasse @digit-google I have fixed most of the comments, and responded to the remaining ones. Should I mark the fixed ones as resolved, or do you want to do that? Is there anything else I need to address? |
Rebased on master and removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! The documentation is awesome :)
I've added several nitpick comments.
For the parsing I think it might be a good idea to add a unit test which check for the error/warning cases, too (i.e. invalid MAKEFLAGS).
Should I mark the fixed ones as resolved, or do you want to do that?
Feel free to resolve the comments yourself.
6d81a64
to
d4f279a
Compare
Would be great if someone could test it and comment here. |
I did some test on VLC. We build 100+ contribs at once from autotools, CMake and meson projects. The build is started from a Makefile calling ninja for CMake and meson projects. We build ninja beforehand to have jobserver support. We use a prebuilt version of the Kitware version on our Docker. This build is with the Kitware version of ninja with jobserver. This other build on the same machine as the same time, is done with this jobserver branch. The first thing to notice is that this branch does build successfully. The other thing is that they take about the same time to build (10m51s vs 10m32s), suggesting the parallel usage (on this 48 cores machine) is working as expected. It's even slightly faster but I don't think we can really conclude it's faster. |
@robUx4 thanks for testing. I'm afraid you need to tweak the build system to use the fifo style jobserver instead of the old-style pipe-fd method: |
@robUx4 btw: it's probably just a matter of updating |
We use whatever Debian is giving us. It seems Debian doesn't provide make 4.4 yet: https://packages.debian.org/bookworm/make, even in sid: https://packages.debian.org/sid/make |
Hello, I could experiment today with this patch applied to a local Ninja binary, used to build a small subset of Fuchsia targets. Good news, I see a decent improvement in build times when all remote builds are disabled (which is not our default configuration): 13m47s -> 12m39s. For fully remote builds, we go: 5m54s -> 4m56s which is even nicer. So this PR looks really good to me. NOTE: I wrote a Python script to setup and serve the tokens, then invoking Ninja, see d6c0c1a (probably not the final version). |
Tested this briefly with our build, and for whatever reason, more tokens are returned to the pool than were originally acquired. This causes more and more parallel jobs to start as the build proceeds which eventually brings the system to a crawl. At the end of our build:
I have 36 CPUs on the build machine, and specify Is it guaranteed that there will be in equal number of EDIT: This fixes the issue I'm seeing (applied to this PR): diff --git a/src/build.cc b/src/build.cc
index f05e31e..a1e808e 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -170,6 +170,7 @@ Edge* Plan::FindWork() {
}
Edge* work = ready_.top();
+ work->acquired_job_server_token_ = jobserver_.Enabled();
ready_.pop();
return work;
}
@@ -207,7 +208,7 @@ bool Plan::EdgeFinished(Edge* edge, EdgeResult result, string* err) {
edge->pool()->RetrieveReadyEdges(&ready_);
// Return the token acquired for this very edge to the jobserver
- if (jobserver_.Enabled()) {
+ if (edge->acquired_job_server_token_) {
jobserver_.Release();
}
diff --git a/src/graph.h b/src/graph.h
index 314c442..f908d75 100644
--- a/src/graph.h
+++ b/src/graph.h
@@ -227,6 +227,7 @@ struct Edge {
bool deps_loaded_ = false;
bool deps_missing_ = false;
bool generated_by_dep_loader_ = false;
+ bool acquired_job_server_token_ = false;
TimeStamp command_start_time_ = 0;
const Rule& rule() const { return *rule_; } |
Nice numbers.
Looks good. I've also written something similar, albeit less feature rich: |
Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome... |
See my edit, really small diff that fixes the problem! |
The principle of such a job server is rather simple: Before starting a new job (edge in ninja-speak), a token must be acquired from an external entity. On posix systems, that entity is simply a fifo filled with N characters. On win32 systems it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity. This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize cpu utilization, but if each compile job uses all available cores, the system is over loaded.
Implement proper testing of the MAKEFLAGS parsing, and the token acquire/release logic in the jobserver class.
Nice.
Thanks! Pushed the change with some comments added :) |
@hundeboll good job so far
In the openwrt project, we have been using stefan's PR, but recently I have noticed issues with it. I want to test this one but we actually need the "UNIX pipe" method instead of a fifo. IMO this is not ready to merge until you implement both forms. Also I would not refer to the other method as the "old" one, which suggests it is deprecated or less effective when it isn't... since it appears your system uses the fifo by default, you have to pass (it should be easy for you, since you have gotten this far 😃 ) |
// Ignore the argument if the length or the value of the type value doesn't | ||
// match the requested type (i.e. "fifo" on posix or "sem" on windows). | ||
if (strlen(type) != static_cast<size_t>(str_colon - str_begin) || | ||
strncmp(str_begin, type, str_colon - str_begin)) { | ||
Warning("invalid jobserver type: got %.*s; expected %s", | ||
str_colon - str_begin, str_begin, type); | ||
return false; | ||
} | ||
|
||
// Advance the string pointer to just after the : character | ||
str_begin = str_colon + 1; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is unnecessary. I think it should be the job of a different function, platform-specific, to make sure the value is sane and let this function simply collect the value and make sure it's not empty.
According to the docs for Windows jobserver implementation, apparently there is no colon at all (or at least it's not guarenteed)
If you really want to detect whether the value is sane here, this function should be setting the type value instead of being told which one as an input, i.e.:
A. if a \
is found, type = sem
B. if a :
is found, type = fifo
C. if a ,
is found, type = pipe
D. if neither is found, assume sem
then you can match string values instead of lengths for the type in the next function to handle the jobserver string value, if you don't handle it all here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but like I said, sanity checks should really not be done here. It is possible although unlikely for a colon or comma to be in the name of the semaphore object itself.
Why? |
My machine has Make 4.3. I suppose I can compile my own Make for myself, but at some point, if we want to incorporate this for the entire openwrt community (and other projects) to see how it goes it would have to have support for some older versions, and otherwise it would default to no jobserver support for many users and maybe even a buildbot, which means spawning way too many jobs that would slow things down. For example, when ninja is built on my system the jobs number defaults to 6. My machine has only 4 cores. Now imagine ninja being executed by We can figure out a patch after the fact, but why not handle it upstream while it is being worked on? I find it strange to add support only for the latest and greatest version of Make first and then add a small change later to make it work for everyone when it can all be done together. |
So you're running an "old" make version which is why we refer to the pipe version as the "old" method ;)
I'm not an expert but from the comments I read on both PRs it is less effective. We're very wary of changes that increase the complexity of Ninja, so a PR that implements both methods while one of them is technical superior and results in less code in Ninja (and to my understand that's the case for fifo), is very unlikely to get merged. |
Sure sounds like the GNU Make documentation claims the anonymous pipe mode is less effective... How about: https://www.gnu.org/software/make/manual/make.html#POSIX-Jobserver
So if possible, GNU Make prefers the fifo based jobserver. Sure sounds like the anonymous pipes are soft deprecated... ... Overall, I do understand that you want support for anonymous pipes for backwards compatibility with older make. I just don't see why you are arguing that the need to support legacy systems is synonymous with being equally effective and equally un-deprecated. Anonymous pipes are clearly both deprecated and less effective, even if they are sometimes your only available option. |
I understand your points. of course, the difference between the methods is not arbitrary, there was a reason for it, but the keyword is "obscure", and every major project that builds with Makefiles has been writing their Makefiles, in some cases for decades, in a way that handles the noted obscure caveat in order to support every version of Make up until the most recent one 2 years ago. Unless the caveat is not being handled, the two methods ought to be considered functionally equivalent, especially since Make does not care which order the tokens are received and returned as long as it is the same token, and I can't imagine there being some kind of performance difference... I don't mean to be argumentative on this point, but the way I see deprecated options is that it would be labeled as "deprecated" or otherwise planned to no longer be recognized after a certain point in time. The people at GNU often make this perfectly clear instead of just "suggesting" it. Speaking of complexity, I think it would be a fairly small diff in order to accommodate the support for the "old" way. If you want to prevent excess complexity from making it's way into ninja's source then put some focus on my review comment. I understand it may not be urgent in your perspective, and that's fine as I do not make decisions here, but please be aware that in the perspective of any downstream project, without "full" support, the feature you're actually adding is instability in the form of code variability, where the code affects some group of users one way, and another group of user another way. Otherwise, if you're certain on not handling it in this PR I would be happy to open my own PR here afterward to add support for the older method if no one else wants to do so. However, it would be nice to hear if you support the idea in general, at least, so that the next version would have "full" support, regardless of what we do in this PR. |
bool Jobserver::Acquire() { | ||
// The first token is implicitly handed to the ninja process, so don't | ||
// acquire it from the jobserver | ||
if (token_count_ == 0 || AcquireToken()) { | ||
token_count_++; | ||
return true; | ||
} | ||
|
||
return false; | ||
} | ||
|
||
void Jobserver::Release() { | ||
assert(token_count_ >= 1); | ||
token_count_--; | ||
|
||
// Don't return first token to the jobserver, as it is implicitly handed | ||
// to the ninja process | ||
if (token_count_ > 0) { | ||
ReleaseToken(); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions are so small, the logic here might as well be combined with the implementation-specific function that they call, unless there is some reason to split the functions between private and public members of the class. If so please correct me.
If that happens, there's only one function left. Merging all the functions in jobserver.cc to the platform-specific files would also allow simplification of the parse function, making it smaller for each build-type and further reduce complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this logic in a common place as it is quite important (and a bit non-intuitive) to not acquire/release the first token. Having it duplicated in platform specific implementations increases the risk of getting it wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really following... I think comments are enough to point out the importance and lack of intuition of those lines. Get it right once and if it needs to be adjusted for a new platform it would start with a copy/paste of the previous one.
void Jobserver::ReleaseToken() { | ||
char token = '+'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the Make manual:
It’s important that when you release the job slot, you write back the same character you read. Don’t assume that all tokens are the same character; different characters may have different meanings to GNU make.
I have not looked at Make's code relevant to this yet, so I don't know the significance. It might be that the character being anything other than a +
is a rare occurence... maybe someone else can clarify how important it is.
I will follow up with another comment with more information as I continue to play with the PR's changes and review Make's source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've noticed the same from the Make manual. It might make sense to store the acquired token in the (recently added) Edge::acquired_job_server_token_
member. I'll look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at GNU Make source history, I changed my mind. I don't think this is important. There is no sign that they are going to introduce new characters that have special meaning, and if they somehow do that, it would be pretty easy to add support for it later. Let's save the idea for the future...
}; | ||
|
||
/// CommandRunner is an interface that wraps running the build | ||
/// subcommands. This allows tests to abstract out running commands. | ||
/// RealCommandRunner is an implementation that actually runs commands. | ||
struct CommandRunner { | ||
virtual ~CommandRunner() {} | ||
virtual size_t CanRunMore() const = 0; | ||
virtual size_t CanRunMore(bool jobserver_enabled) const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest not changing this method's signature at all, to minimize changes.
Since the flag value is not going to change during a build, you can simply pass plan_.JobserverEnabled()
to the RealCommandRunner
constructor to record its value, and have RealCommandRunner::CanRunMore()
return SIZE_MAX
when the flag is enabled.
This should work identically, without having to change the DryCommandRunner and TestCommandRunner interfaces / implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
} | ||
|
||
const char* jobserver_auth = "--jobserver-auth="; | ||
const char* str_begin = strstr(makeflags, jobserver_auth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://www.gnu.org/software/make/manual/html_node/Job-Slots.html:
Be aware that the MAKEFLAGS variable may contain multiple instances of the --jobserver-auth= option. Only the last instance is relevant.
Hence the code should probably try to reflect that.
Also it looks like there is no specific prefix on Win32 for the semaphore name (https://www.gnu.org/software/make/manual/html_node/Windows-Jobserver.html) but that --jobserver-style=sem
should be specified in MAKEFLAGS
as well.
I am also wary of multiple --jobserver-style
values in the input string. I would assume that only the last one should be followed, and that --jobserver-style=pipe
should be a condition to ignore the jobserver master.
Given this, I suggest creating a standalone static function that parses a MAKEFLAGS string value as input, and returns a struct describing the wanted jobserver type + argument (fifo path or semaphore name) from it (it doesn't have to contain Win32/Posix specific code paths), so this can be unit-tested properly with lots of edge cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
@jhasse @hundeboll I have been working on my own version of this PR for a while. It would take a long time to explain and demonstrate some of the changes I made. Would it be appropriate to create my own PR with your commit, adding myself as a signoff, and we can review each other's ideas directly? This could expedite the process of actually getting the feature accepted and merged, in case that the ninja developers are looking to hurry up the process instead of putting it off until the release after the next upcoming release... Also I'm wondering, is there a timeline for versions or is it just based on how many changes since last release? |
I'm actually surprised that Debian / Ubuntu hasn't upgraded make in 1½ years by now. Maybe we should try to bug them a bit: |
Sure, go ahead. My goal is to add jobserver support in OpenEmbedded, so adding it to ninja is just a means to that. |
The principle of such a job server is rather simple: Before starting a new job (edge in ninja-speak), a token must be acquired from an external entity. On posix systems, that entity is simply a fifo filled with N characters. On win32 systems it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity.
This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize cpu utilization, but if each compile job uses all available cores, the system is over loaded.
Note: this is a re-implementation of the last part[1] of the previous attempt to implement jobserver functionality. I have left out the server[2] part, and the older "pipe"[3] methods from here, as I don't need those. Doing so allows for a much simpler implementation.
Note note: I don't have windows or mac systems available. I would greatly appreciate anyone who can test on those for me.
[1] #2263
[2] #2260
[3] #1140