Speed up parsing #2519
base: master
Conversation
f96c667 to 19396e6
Thanks, cool ideas in there. While I didn't look into it for very long, a few remarks:
I was about to hit merge when I saw digit's comments. These are all small self-contained commits; I thought the PR was in pretty good shape. (In particular, I don't think it necessarily needs to be split. Splitting it wouldn't hurt either, of course, but it's more work, and I'd rather merge it in this form than not merge it at all over review back-and-forth.)
The third_party suggestion is good.
I think a hash change does need a manifest version bump.
If the arena
Sending the hash map mingw fix upstream would be nice (but not a blocker).
Thanks for the PR!
Oh, and this fails to build because I didn't add
After fixing this manually, I am also seeing multiple failures in BuildLog unit-tests. Please take care of these as well.
src/arena.h
Outdated
struct Arena {
 public:
  static constexpr size_t kAlignTarget = sizeof(uint64_t);
Regrettably, a static constexpr size_t member will need an empty definition in the corresponding .cc file, or some compilers will complain in debug mode. But this definition seems to only be used inside of Alloc(); do you really need it, i.e. could you turn it into a simple function-local variable?
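For context, a minimal illustration of the pre-C++17 rule being referenced and the suggested alternative (hypothetical code, not the PR's actual arena.h/arena.cc):

```cpp
#include <cstddef>
#include <cstdint>

// arena.h (sketch)
struct Arena {
  static constexpr size_t kAlignTarget = sizeof(uint64_t);  // in-class declaration
  char* Alloc(size_t num_bytes);
};

// arena.cc (sketch): if kAlignTarget is odr-used (e.g. bound to a const
// reference by std::max), pre-C++17 compilers require this out-of-class
// definition or the link fails -- typically only visible in debug builds,
// where the use isn't optimized away.
constexpr size_t Arena::kAlignTarget;

// The suggested alternative: a function-local constant needs no such definition.
char* Arena::Alloc(size_t num_bytes) {
  const size_t align_target = sizeof(uint64_t);
  size_t rounded = (num_bytes + align_target - 1) / align_target * align_target;
  (void)rounded;   // bump allocation elided in this sketch
  return nullptr;  // placeholder
}
```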
src/arena.cc
Outdated
size_t to_allocate = std::max(next_size_, num_bytes);

Block new_block;
new_block.mem.reset(new char[to_allocate]);
nit: Use an aligned new here, since there is technically no guarantee that this will happen (yes, I know most allocators would use size_t as a minimum). For now the arena is only used to store character sequences, so this doesn't matter, but it could become problematic if it is later used to allocate other things.
That requires C++17, no? Is that allowed?
Good point. In theory you should be able to use a dedicated type with an alignas() specifier, and use it in a new AlignedType[(to_allocate + sizeof(AlignedType) - 1) / sizeof(AlignedType)], but this is getting ugly, and frankly there is no need for this feature here. What do you think about dropping the alignment requirement entirely, and just calling this StringPieceArena for clarity?
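For what it's worth, a sketch of that alignas() approach (hypothetical code assuming an 8-byte alignment target as above; not part of the PR):

```cpp
#include <cstddef>
#include <cstdint>

// Over-aligned wrapper type: new[] of this type returns storage aligned to
// sizeof(uint64_t) without needing C++17 aligned operator new.
struct alignas(sizeof(uint64_t)) AlignedChunk {
  char bytes[sizeof(uint64_t)];
};

char* AllocAligned(size_t to_allocate) {
  // Round up to a whole number of chunks.
  size_t num_chunks = (to_allocate + sizeof(AlignedChunk) - 1) / sizeof(AlignedChunk);
  // Note: freeing must go through the same type, i.e. delete[] on an AlignedChunk*.
  return reinterpret_cast<char*>(new AlignedChunk[num_chunks]);
}
```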
The arena is taken out now, so we'll deal with it in a separate PR if we get there.
I think everything should be taken care of now, short of the alignment issue. For the memory usage, I wonder if we should consider adding a short-string optimization to StringPiece? It should be possible to have a 15-byte string inline, without allocating anything on the arena. We could then also possibly bring back the concatenation optimization in EvalString, so that if you do AddString("foo"); AddString("bar"); you get a single short-string StringPiece with 6 bytes in it.
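To make the idea concrete, a rough sketch of what such a short-string StringPiece could look like (hypothetical and simplified; not ninja's actual StringPiece):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

struct SmallStringPiece {
  static const size_t kInlineCapacity = 15;

  SmallStringPiece(const char* str, size_t len) : external_(nullptr), len_(len) {
    if (len <= kInlineCapacity) {
      memcpy(inline_buf_, str, len);  // inline: no arena allocation at all
    } else {
      external_ = str;                // long string: reference external memory,
    }                                 // which must outlive this piece
  }

  const char* data() const { return external_ ? external_ : inline_buf_; }
  size_t size() const { return len_; }
  std::string AsString() const { return std::string(data(), size()); }

 private:
  const char* external_;
  size_t len_;
  char inline_buf_[kInlineCapacity];
};
```

The AddString("foo"); AddString("bar"); concatenation case would then just memcpy into the inline buffer as long as the combined length stays within the 15-byte capacity.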
Since we're talking about StringPiece / string lookup performance, the Android Ninja fork has introduced a HashedStringView type that embeds a (pointer + 32-bit size + 32-bit hash) to speed up hash table lookups considerably (with a corresponding HashedString type, which is an std::string + 32-bit hash). I had ported the idea on top of my fork some time ago, and using this improved things noticeably (between 5% and 11% for no-op Ninja builds, depending on the build graph). This could be another way to improve speed, at the cost of extra complexity though, and historically Ninja maintainers have been very reluctant to accept such changes. It would be nice to know if @nico and @jhasse would be ok with this before trying to implement these schemes.
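For context, the layout being described is roughly the following (a sketch reconstructed from the description above, not the Android fork's actual code):

```cpp
#include <cstdint>
#include <string>

// A view that carries its hash, so hash-table lookups never rehash the key.
struct HashedStringView {
  const char* data;
  uint32_t size;
  uint32_t hash;
};

// Owning counterpart: an std::string plus its precomputed 32-bit hash.
struct HashedString {
  std::string str;
  uint32_t hash;
};

// Hash functor for use with a hash map: just return the stored value.
struct HashedStringViewHasher {
  size_t operator()(const HashedStringView& v) const { return v.hash; }
};
```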
I looked a bit more at this; the issue isn't that we have a lot of short strings (we don't). It is simply that EvalString doesn't need to live past the end of ManifestParser::ParseDefault(). So we could simply have a small arena that lives only in ManifestParser, and is cleared (with memory available for reuse) at the end of ParseEdge(). It would be a little more complex, but we would probably get rid of all of the memory bloat. What do you think?
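A sketch of that "clear but keep the memory" arrangement (hypothetical; not code from this PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Bump allocator whose Clear() rewinds to the first block but keeps all
// blocks alive, so the next ParseEdge() call reuses the same memory.
class ScratchArena {
 public:
  char* Alloc(size_t n) {
    // Advance to (or create) a block with enough free space.
    while (cur_ < blocks_.size() && blocks_[cur_].used + n > blocks_[cur_].size) {
      ++cur_;
      if (cur_ < blocks_.size())
        blocks_[cur_].used = 0;  // only holds data from before the last Clear()
    }
    if (cur_ == blocks_.size()) {
      size_t block_size = std::max<size_t>(n, 64 << 10);
      Block b;
      b.mem.reset(new char[block_size]);
      b.size = block_size;
      b.used = 0;
      blocks_.push_back(std::move(b));
    }
    char* p = blocks_[cur_].mem.get() + blocks_[cur_].used;
    blocks_[cur_].used += n;
    return p;
  }

  // Called at the end of ParseEdge(): earlier allocations become invalid,
  // but the memory stays allocated for reuse.
  void Clear() {
    cur_ = 0;
    if (!blocks_.empty()) blocks_[0].used = 0;
  }

 private:
  struct Block {
    std::unique_ptr<char[]> mem;
    size_t size;
    size_t used;
  };
  std::vector<Block> blocks_;
  size_t cur_ = 0;
};
```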
I see some autobuilders are failing, too:
Thanks a ton @sesse. I have tried your latest patch with a moderate Fuchsia build plan. I wanted to see how much the non-arena-related improvements impacted performance, and after a small rebase, I got a newer version of your patchset (see attached patch) that actually runs slightly faster, without increasing memory at all. So it looks like the arena is not helping after all.
The macOS compilation issue seems to be a bug; this should definitely compile as C++11. I think this is entirely unrelated to this CL. This is bad :-(
For the spelling issue, the codespell invocation in
For linting,
I definitely see differences between the arena and non-arena; there's a measurement there right at the first patch. But like I said, maybe the arena can do with a smaller scope/lifetime. We can review the non-arena parts first, and then come back to it after the other stuff is in? I'm not that keen on all the re-measuring to get the commit messages right, though.
Looking at the macOS issue, it looks like the
Let's try to fix the
FYI: I have uploaded my rebase at https://github.com/digit-google/ninja/pull/new/sesse-ninja-pr2519-790f571-without-arena if that can help you (you do not have to use it, and I want you to get full credit for this work, just to be clear).
For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.76 to 5.48 seconds.
I took out the arena (I intend to re-measure it after we have all of the other stuff in). It was my own rebase, but I looked to yours for confirmation in a couple of places, so that was useful; thanks.
Arrgh, modifying
Thx for the PR :) For the spelling errors / trailing whitespace I would create a PR on the upstream project.
Hm ... why not?
I doubt you can fix everything anyway; one of them was Windows line endings, and I doubt rapidhash would take in such a change.
Apparently, this would have nearly the same effect. Looking at
I don't think there is a best solution here; whatever suits you, I guess.
As a drive-by comment from someone who has confused incompatible ninja versions before, the "build log version" message is much more helpful than just rebuilding everything. The "command line changed" explanation wouldn't be seen by default, right?
Thanks a ton @sesse, this is great, here are my benchmarking results for the following versions:
First, regarding performance:

$ ARGS="-C out/default --quiet -n"
$ hyperfine --runs=5 "/tmp/ninja-upstream $ARGS" "/tmp/ninja-sesse1 $ARGS" "/tmp/ninja-sesse1-no-arena $ARGS" "/tmp/ninja-sesse3 $ARGS" "/tmp/ninja-sesse3-arena $ARGS"
Benchmark 1: /tmp/ninja-upstream -C out/default -n --quiet
Time (mean ± σ): 11.448 s ± 0.062 s [User: 6.813 s, System: 4.612 s]
Range (min … max): 11.397 s … 11.550 s 5 runs
Benchmark 2: /tmp/ninja-sesse1 -C out/default -n --quiet
Time (mean ± σ): 10.999 s ± 0.054 s [User: 6.099 s, System: 4.877 s]
Range (min … max): 10.921 s … 11.061 s 5 runs
Benchmark 3: /tmp/ninja-sesse1-no-arena -C out/default -n --quiet
Time (mean ± σ): 10.663 s ± 0.025 s [User: 6.101 s, System: 4.537 s]
Range (min … max): 10.619 s … 10.682 s 5 runs
Benchmark 4: /tmp/ninja-sesse3 -C out/default -n --quiet
Time (mean ± σ): 10.586 s ± 0.032 s [User: 6.076 s, System: 4.485 s]
Range (min … max): 10.561 s … 10.642 s 5 runs
Warning: The first benchmarking run for this command was significantly slower than the rest (10.642 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Benchmark 5: /tmp/ninja-sesse3-arena -C out/default -n --quiet
Time (mean ± σ): 10.820 s ± 0.040 s [User: 6.160 s, System: 4.636 s]
Range (min … max): 10.779 s … 10.869 s 5 runs
Summary
/tmp/ninja-sesse3 -C out/default -n --quiet ran
1.01 ± 0.00 times faster than /tmp/ninja-sesse1-no-arena -C out/default -n --quiet
1.02 ± 0.00 times faster than /tmp/ninja-sesse3-arena -C out/default -n --quiet
1.04 ± 0.01 times faster than /tmp/ninja-sesse1 -C out/default -n --quiet
1.08 ± 0.01 times faster than /tmp/ninja-upstream -C out/default -n --quiet

Second, regarding peak RAM usage:

$ /usr/bin/time -f%M /tmp/ninja-upstream $ARGS
1808180
$ /usr/bin/time -f%M /tmp/ninja-sesse3 $ARGS
1803832
$ /usr/bin/time -f%M /tmp/ninja-sesse3-arena $ARGS
1918924

Conclusion: for this specific build plan, the current patchset at
@tsniatowski: You make an excellent point. The current patchset contains the log version change to 7 and seems to be a great improvement; I say let's ship it!
I'm surprised you don't win more than 8%, but I guess perhaps your time is dominated by something other than parsing? It's impossible to say for sure without seeing a profile, although 4.6 seconds of system time hints at some heavy OS involvement (stat-ing, perhaps?). (Also, how are you running these against the same build directory without hitting the problem of the hash changing and getting a full rebuild?) In any case, the arena is now not part of the PR, so we can discuss that separately, I believe.
You don't know until you try :) Where do the changes in lexer.cc come from? Technically (unlike with e.g. the zlib license) the MIT license is infectious, meaning that we would have to distribute the copyright and the license terms with binary distributions of ninja after this PR. 99% of people aren't aware of this and break this rule all the time though, not sure if it really matters.
I don't know what changed lexer.cc, I guess something in the build system does? I haven't modified it by hand. :-)
Ah yes, I think building via Python changes it in-place using re2c. Can you remove that change from your commit? It's only needed when modifying lexer.in.cc.
I don't really trust hyperfine's statistics, but since you wanted comparative measurements:
This is on a 5950X (Zen 3), with a fairly normal NVMe SSD and Debian unstable.
This very often holds only a single RAW token, so we do not need to allocate elements on an std::vector for it in the common case. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.48 to 5.14 seconds. Note that this opens up a potential optimization where EvalString::Evaluate() could just return a StringPiece, without making a std::string out of it (which requires allocation; this is about 5% of remaining runtime). However, this would also require that CanonicalizePath() somehow learned to work with StringPiece (presumably allocating a new StringPiece if and only if changes were needed).
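To illustrate the shape of this optimization, a simplified sketch (the actual change in this commit may store things differently):

```cpp
#include <string>
#include <utility>
#include <vector>

// Keep the common single-token case out of the vector entirely; only spill
// to the vector when a second token shows up.
struct EvalStringSketch {
  enum TokenType { RAW, SPECIAL };

  void AddToken(std::string text, TokenType type) {
    if (parsed_.empty() && !has_single_) {
      single_ = std::make_pair(std::move(text), type);
      has_single_ = true;
      return;
    }
    if (has_single_) {
      parsed_.push_back(std::move(single_));  // rare: a second token arrives
      has_single_ = false;
    }
    parsed_.push_back(std::make_pair(std::move(text), type));
  }

  size_t TokenCount() const { return has_single_ ? 1 : parsed_.size(); }

 private:
  std::pair<std::string, TokenType> single_;
  bool has_single_ = false;
  std::vector<std::pair<std::string, TokenType>> parsed_;  // multi-token case only
};
```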
This is much faster than std::unordered_map, and also slightly faster than phmap::flat_hash_map that was included in PR ninja-build#2468. It is MIT-licensed, and we just include the .h file wholesale. I haven't done a detailed test of all the various unordered_maps out there, but this is the overall highest-ranking contender on https://martin.ankerl.com/2022/08/27/hashmap-bench-01/ except for ankerl::unordered_dense::map, which requires C++17. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.14 to 4.62 seconds.
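For anyone curious what the drop-in use looks like, a minimal usage sketch (the header name is the one from the upstream emhash project; the vendored path and the key/value types used in ninja will differ):

```cpp
#include <cstdint>
#include <string>

#include "hash_table8.hpp"  // emhash8 single header (vendored path may differ)

int main() {
  // emhash8::HashMap follows the std::unordered_map interface closely, so
  // switching a typedef is mostly enough at the call sites.
  emhash8::HashMap<std::string, uint64_t> mtimes;
  mtimes["out/obj/foo.o"] = 1710000000;
  return mtimes.find("out/obj/foo.o") != mtimes.end() ? 0 : 1;
}
```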
This is the currently fastest hash that passes SMHasher and does not require special instructions (e.g. SIMD). Like emhash8, it is MIT-licensed, and we include the .h file directly. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 4.62 to 4.22 seconds. (NOTE: This is a more difficult measurement than the previous ones, as it necessarily involves removing the entire build log and doing a clean build. However, just switching the HashMap hash takes it to 4.47 seconds or so.)
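As a sketch of how such a hash typically gets wired into the map (illustrative; the PR's actual hasher and include path may differ):

```cpp
#include <cstdint>
#include <string>

#include "rapidhash.h"  // vendored single header; path may differ

// Hash functor forwarding to rapidhash(pointer, length).
struct RapidStringHash {
  size_t operator()(const std::string& s) const {
    return static_cast<size_t>(rapidhash(s.data(), s.size()));
  }
};
```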
I took out the lexer.cc changes.
ftell() must go ask the kernel for the file offset, in case someone knew the underlying file descriptor number and seeked it. Thus, we can save a couple hundred thousand syscalls by just caching the offset and maintaining it ourselves. This cuts another ~170ms off a no-op Chromium build.
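A sketch of the idea (a hypothetical wrapper, not the PR's actual code):

```cpp
#include <cstdio>

// Track the read offset ourselves so Tell() never has to call ftell(), which
// may issue an lseek syscall to stay correct if the underlying fd was moved.
struct TrackedFile {
  explicit TrackedFile(FILE* f) : f_(f), offset_(0) {}

  size_t Read(void* buf, size_t len) {
    size_t n = fread(buf, 1, len, f_);
    offset_ += static_cast<long>(n);  // maintained locally on every read
    return n;
  }

  long Tell() const { return offset_; }  // no syscall

 private:
  FILE* f_;
  long offset_;
};
```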
This cuts off another ~100 ms, most likely because the compiler doesn't have smart enough alias analysis to do the same (trivial) transformation.
Can you benchmark only the hashmap change against MSVC's std::unordered_map and libc++ (e.g. on macOS)?
I haven't really had a Windows machine since 2001 or so, so that's a bit tricky :-) I have a Mac at work, so I can make the test there next week. Or I can probably make a test with libc++ on Linux (with Clang) if that works? |
Yes, it would be even more interesting to have a direct comparison on the same system between libstdc++ and libc++ :)
Linux, still 5950X:
@jhasse Is there anything missing here for this to be merged?
This patch series speeds up ninja parsing (as measured by a no-op Chromium build) by about 40–50%. The main win is reducing allocation rate by punting StringPiece allocation to an arena/bump allocator, but we also switch out the hash table and hash functions in use.
The series might seem large, but a) most of it is vendoring emhash8 and rapidhash, and b) most of the rest is just piping an arena pointer through to the various functions and tests.