Speed up parsing #2519

Open: sesse wants to merge 8 commits into master

Conversation

@sesse commented Nov 1, 2024

This patch series speeds up ninja parsing (as measured by a no-op Chromium build) by about 40–50%. The main win is reducing the allocation rate by punting StringPiece allocation to an arena/bump allocator, but we also switch out the hash table and hash functions in use.

The series might seem large, but a) most of it is vendoring EmHash8 and rapidhash, and b) most of the rest is just piping an arena pointer through to the various functions and tests.
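For readers who want the shape of the main change, here is a minimal sketch of a bump/arena allocator of the kind described above (illustrative only; names and block sizes are not taken from the actual patch):

    // Minimal bump/arena allocator sketch (illustrative only; not the PR's arena.h).
    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <vector>

    struct Arena {
      // Returns num_bytes of storage owned by the arena; freed only when the
      // arena itself is destroyed, so there is no per-string free().
      char* Alloc(size_t num_bytes) {
        if (num_bytes > static_cast<size_t>(end_ - pos_))
          NewBlock(num_bytes);
        char* ret = pos_;
        pos_ += num_bytes;
        return ret;
      }

      // Copies a string into the arena so that cheap (pointer, length) views
      // of it stay valid for the arena's lifetime.
      const char* CopyString(const char* str, size_t len) {
        char* mem = Alloc(len);
        memcpy(mem, str, len);
        return mem;
      }

     private:
      void NewBlock(size_t min_bytes) {
        size_t size = std::max(min_bytes, next_size_);
        next_size_ *= 2;  // Grow geometrically so the number of blocks stays small.
        blocks_.emplace_back(new char[size]);
        pos_ = blocks_.back().get();
        end_ = pos_ + size;
      }

      std::vector<std::unique_ptr<char[]>> blocks_;
      char* pos_ = nullptr;
      char* end_ = nullptr;
      size_t next_size_ = 65536;
    };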

@sesse force-pushed the master branch 2 times, most recently from f96c667 to 19396e6 on November 1, 2024 at 12:40
@digit-google (Contributor)

Thanks, cool ideas in there. While I didn't look into it for very long, a few remarks:

  • When adding new classes or new methods, provide unit tests for them.

  • Please use a "_" suffix when adding new class members (e.g. ins_, outs_, validations_ instead of ins, outs, validations).

  • Put third-party headers under src/third_party/xxx to make their origin clearer. Add a README.ninja file as well, indicating the original URL, SPDX License tag, etc. If patches were applied, there must be some way to show that, to make future updates easier (e.g. Chromium usually provides a patches/ sub-directory that contains the patches applied to the local copy).

  • Separating these into multiple PRs will help them get accepted quicker.

  • StringPiece::Persist() should be Arena::PersistStringPiece(). The StringPiece class should not know or depend on Arena at all.

  • I don't think the log version needs to be changed if the hash is computed differently. @jhasse, what do you think?

  • Make the arena a member of the State class, which will remove 90% of the code you added to send its pointer to various layers of the code, and make everything much easier to understand. It looks like there is no benefit to using an Arena whose lifetime exceeds that of State, but let me know if I'm wrong. (See the sketch after this list.)

  • Would appreciate measurements related to peak RAM usage before/after the patch is applied for large build graphs. I'll try to get these numbers myself for the Fuchsia build (which is about 10x larger than the Chromium one) next week.
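A minimal sketch of what the Arena::PersistStringPiece() and arena-in-State suggestions above could look like together (illustrative names and a deliberately simplified Alloc(), not the PR's code):

    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <vector>

    struct StringPiece {
      const char* str_;
      size_t len_;
    };

    struct Arena {
      // Trivial backing store for illustration; a real arena would bump-allocate
      // out of large blocks.
      char* Alloc(size_t num_bytes) {
        blocks_.emplace_back(new char[num_bytes]);
        return blocks_.back().get();
      }

      // "Arena::PersistStringPiece()": copy the viewed bytes into arena-owned
      // storage and return a view over the copy. StringPiece itself never needs
      // to know about Arena.
      StringPiece PersistStringPiece(StringPiece s) {
        char* mem = Alloc(s.len_);
        memcpy(mem, s.str_, s.len_);
        return StringPiece{mem, s.len_};
      }

     private:
      std::vector<std::unique_ptr<char[]>> blocks_;
    };

    struct State {
      Arena arena_;  // Same lifetime as the build graph it backs.
      // ... nodes, edges, pools, bindings ...
    };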

@nico (Collaborator) commented Nov 1, 2024

I was about to hit merge when I saw digit's comments. These are all small, self-contained commits; I thought the PR was in pretty good shape. (In particular, I don't think it necessarily needs to be split. Splitting it wouldn't hurt either ofc, but it's more work, and I'd rather merge it in this form than not merge it at all over review back-and-forth.)

The third_party suggestion is good.

I think a hash change does need a manifest version bump.

If the arena-in-State suggestion works out, that's also a good one.

Sending the hash map mingw fix upstream would be nice (but not a blocker).

Thanks for the PR!

@digit-google (Contributor)

Oh, and this fails to build because arena.cc was not added to either configure.py or CMakeLists.txt!

@digit-google (Contributor)

After fixing this manually, I am also seeing multiple failures in BuildLog unit-tests. Please take care of these as well.

src/arena.h (outdated):

struct Arena {
public:
static constexpr size_t kAlignTarget = sizeof(uint64_t);
@digit-google (Contributor) commented Nov 1, 2024

Regrettably, a static constexpr size_t member will need an empty definition in the corresponding .cc file, or some compilers will complain in debug mode. But this constant seems to only be used inside of Alloc(); do you really need it as a class member, i.e. could you turn it into a simple function-local variable?
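If it helps, something like this is probably what is being suggested (a sketch, not the actual diff):

    #include <cstddef>
    #include <cstdint>

    // Keep the alignment target as a function-local constant instead of a
    // static constexpr class member, so no out-of-line definition is needed
    // before C++17. Round each request up to a multiple of it.
    inline size_t AlignUp(size_t num_bytes) {
      const size_t kAlignTarget = sizeof(uint64_t);
      return (num_bytes + kAlignTarget - 1) & ~(kAlignTarget - 1);
    }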

src/arena.cc (outdated):
size_t to_allocate = std::max(next_size_, num_bytes);

Block new_block;
new_block.mem.reset(new char[to_allocate]);
@digit-google (Contributor)

nit: Use an aligned new here, since there is technically no guarantee that this allocation will be suitably aligned (yes, I know most allocators use size_t alignment as a minimum). For now the arena is only used to store character sequences, so this doesn't matter, but it could become problematic if the arena is later used to allocate other things.

@sesse (Author)

That requires C++17, no? Is that allowed?

@digit-google (Contributor)

Good point. In theory you should be able to use a dedicated type with an alignas() specifier, and use it in a new AlignedType[(to_allocate + sizeof(AlignedType) - 1) / sizeof(AlignedType)], but this is getting ugly, and frankly there is no need for this feature here. What do you think about dropping the alignment requirement entirely, and just calling this StringPieceArena for clarity?
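For reference, a C++11-compatible sketch of the alignas() idea mentioned above (illustrative only; as discussed, the requirement ends up being dropped):

    #include <cstddef>
    #include <cstdint>

    // A type whose alignment matches the target; arrays of it are allocated
    // with plain new[], then handed out as raw character storage.
    struct alignas(sizeof(uint64_t)) AlignedUnit {
      char bytes[sizeof(uint64_t)];
    };

    // Ownership handling is omitted; the caller would keep the AlignedUnit*
    // around (e.g. in a unique_ptr<AlignedUnit[]>) and delete[] it later.
    char* AllocateAligned(size_t to_allocate) {
      size_t units = (to_allocate + sizeof(AlignedUnit) - 1) / sizeof(AlignedUnit);
      return reinterpret_cast<char*>(new AlignedUnit[units]);
    }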

@sesse (Author)

The arena is taken out now, so we'll deal with it in a separate PR if we get there.

src/arena.cc (outdated, resolved)
src/arena.h (outdated, resolved)
src/eval_env.cc (outdated, resolved)
@sesse (Author) commented Nov 4, 2024

I think everything should be taken care of now, short of the alignment issue.

For the memory usage, I wonder if we should consider adding a short-string optimization to StringPiece? It should be possible to keep a 15-byte string inline, without allocating anything on the arena. We could then also possibly bring back the concatenation optimization in EvalString, so that if you do AddString("foo"); AddString("bar"); you get a single short-string StringPiece with 6 bytes in it.
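A rough sketch of the kind of type that would be (hypothetical, and not what the current StringPiece is):

    #include <cstddef>
    #include <cstring>

    // Value type that stores strings of up to 15 bytes inline and falls back
    // to a (pointer, length) view otherwise. The inline copy needs no arena
    // allocation at all.
    struct SmallStringPiece {
      static const size_t kInlineCapacity = 15;

      SmallStringPiece(const char* str, size_t len) {
        if (len <= kInlineCapacity) {
          memcpy(u_.inline_buf, str, len);
          inline_len_ = static_cast<unsigned char>(len);
        } else {
          u_.external = External{str, len};
          inline_len_ = kNotInline;
        }
      }

      const char* data() const {
        return inline_len_ == kNotInline ? u_.external.str : u_.inline_buf;
      }
      size_t size() const {
        return inline_len_ == kNotInline ? u_.external.len : inline_len_;
      }

     private:
      static const unsigned char kNotInline = 0xff;
      struct External { const char* str; size_t len; };
      union {
        char inline_buf[kInlineCapacity];
        External external;
      } u_;
      unsigned char inline_len_;
    };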

@digit-google (Contributor)

I think StringPiece is really designed to be a view over content (just like std::string_view). What you describe would be a completely different type, but that's an interesting idea. Maybe for another PR?

Since we're talking about StringPiece / string lookup performance, the Android Ninja fork has introduced a HashedStringView type that embeds a (pointer + 32-bit size + 32-bit hash) to speed up hash table lookups considerably (with a corresponding HashedString type, which is an std::string + 32-bit hash).

I had ported the idea on top of my fork some time ago, and using this improved things noticeably (between 5% and 11% for no-op Ninja builds, depending on the build graph). This could be another way to improve speed, at the cost of extra complexity, though, and historically Ninja maintainers have been very reluctant to accept such changes.

It would be nice to know if @nico and @jhasse would be ok with this before trying to implement these schemes.
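To make the idea concrete, a rough sketch of such a type (details of the Android fork's actual HashedStringView may differ):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Carry a precomputed 32-bit hash next to the pointer/size so hash-table
    // lookups can skip rehashing and reject most mismatches before comparing
    // any bytes.
    struct HashedStringView {
      const char* str;
      uint32_t size;
      uint32_t hash;  // Computed once, e.g. when the token is lexed.

      bool operator==(const HashedStringView& other) const {
        return hash == other.hash && size == other.size &&
               memcmp(str, other.str, size) == 0;
      }
    };

    struct HashedStringViewHasher {
      size_t operator()(const HashedStringView& v) const { return v.hash; }
    };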

@sesse (Author) commented Nov 4, 2024

I looked a bit more at this; the issue isn't that we have a lot of short strings (we don't). It is simply that EvalString doesn't need to live past the end of ManifestParser::ParseDefault(). So we could simply have a small arena that lives only in ManifestParser, and is cleared (with the memory available for reuse) at the end of ParseEdge(). It would be a little more complex, but we would probably get rid of all of the memory bloat. What do you think?
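Sketch of that shape, with hypothetical names (the arena is parser-scoped rather than State-scoped; the backing allocation is deliberately simplified):

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    struct ScopedArena {
      char* Alloc(size_t n) {
        blocks_.emplace_back(new char[n]);  // Trivial backing for illustration.
        return blocks_.back().get();
      }
      // A real version would keep the blocks around and just reset a bump
      // pointer, so the memory gets reused for the next edge.
      void Clear() { blocks_.clear(); }

     private:
      std::vector<std::unique_ptr<char[]>> blocks_;
    };

    struct ManifestParser {
      bool ParseEdge(std::string* /*err*/) {
        // ... lex ins/outs/bindings into EvalStrings backed by arena_ ...
        // ... evaluate them and hand the results over to State ...
        arena_.Clear();  // Nothing allocated here needs to outlive this edge.
        return true;
      }

      ScopedArena arena_;  // Lives only as long as the parser, not the State.
    };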

@sesse (Author) commented Nov 4, 2024

I see some autobuilders are failing, too:

  • Linux / Fedora complains that rapidhash.h has trailing spaces. Should I fix this?
  • macOS seems to build in C++03 mode, so e.g. constexpr is not allowed. This seems to conflict with the desire to use C++11 range-based for loops in this review?
  • Linux / build complains about spelling errors in third-party code. What about this one?

@digit-google (Contributor)

Thanks a ton @sesse. I have tried your latest patch with a moderate Fuchsia build plan (e.g. fx set core.x64 --with //bundles/tests) and there is a great improvement in no-op time (e.g. from 11.6s to 10.8s, i.e. -7%). On the other hand, peak RAM usage (measured with /usr/bin/time) goes from 1.7 GiB to 2.4 GiB, which is really considerable (+40%).

I wanted to see how much the non-arena related improvements impacted performance, and after a small rebase, I got a newer version of your patchset (see attached patch) that actually runs slightly faster, without increasing memory at all. So it looks like the arena is not helping after all.

0001-Remove-arena-from-pr-2519-commit-790f571.patch.zip

@digit-google (Contributor)

The macOS compilation issue seems to be a bug; this should definitely compile as C++11. I think it is entirely unrelated to this CL. This is bad :-(

For the spelling issue, the codespell invocation in .github/workflows/linux.yml needs to be updated to add a command-line option, probably something like --skip="src/third_party/*" or similar.

For linting, misc/ci.py needs to be modified to skip src/third_party as well.

@sesse (Author) commented Nov 4, 2024

I definitely see differences between the arena and non-arena versions; there's a measurement right in the first patch. But like I said, maybe the arena can do with a smaller scope/lifetime.

We can review the non-arena parts first, and then come back to it after the other stuff is in? I'm not that keen on all the re-measuring to get the commit messages right, though.

@digit-google (Contributor)

Looking at the macOS issue, it looks like the CMakeLists.txt only forces C++11 when compiling libninja sources. For anything else, e.g. tests and binaries, the compiler's defaults are being used. I guess adding -DCMAKE_CXX_STANDARD=11 to the CMake configuration command might help.

@digit-google (Contributor)

Let's try to fix the third_party/ and MacOS issues in a separate PR to get it approved more quickly... I'll prepare one.

@digit-google (Contributor)

FYI: I have uploaded my rebase at https://github.com/digit-google/ninja/pull/new/sesse-ninja-pr2519-790f571-without-arena if that can help you (you do not have to use it, and I want you to get full credit for this work, just to be clear).

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 5.76 to 5.48 seconds.
@sesse (Author) commented Nov 4, 2024

I took out the arena (I intend to re-measure it after we have all of the other stuff in). It was my own rebase, but I looked to yours for confirmation in a couple of places, so that was useful; thanks.

@digit-google (Contributor)

Arrgh, modifying misc/ci.py is quite easy, but it seems there is no way to tell codespell to skip a given directory; only filename globs are supported in the skip: section (or the --skip command-line argument).

@jhasse (Collaborator) commented Nov 5, 2024

Thx for the PR :)

For the spelling errors / trailing whitespace I would create a PR on the upstream project.

I don't think the log version needs to be changed if the hash is computed differently. @jhasse, what do you think?

Hm ... why not?

@sesse (Author) commented Nov 5, 2024

For the spelling errors / trailing whitespace I would create a PR on the upstream project.

I doubt you can fix everything anyway; one of them was Windows line endings, and I doubt rapidhash would accept such a change.

@digit-google (Contributor)

I don't think the log version needs to be changed if the hash is computed differently. @jhasse, what do you think?
Hm ... why not?

Apparently, this would have nearly the same effect. Looking at DependencyScan::RecomputeOutputDirty() and comparing these two cases:

  1. Change the hash value + the version number: the next incremental build will print a warning such as "build log version too old...", clear the log, then all commands will be rebuilt with an explanation of "command line not found in log for {target}"

  2. Change the hash value only: the next incremental build will rebuild all commands with an explanation of "command line changed for {target}"

I don't think there is a best solution here; whatever suits you, I guess.
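A purely illustrative sketch of the two paths described above (not the actual RecomputeOutputDirty code):

    #include <cstdint>

    struct LogEntry { uint64_t command_hash; };

    // entry is the build-log entry for the target, or nullptr if the log was
    // cleared because of a version bump (case 1).
    const char* ExplainRebuild(const LogEntry* entry, uint64_t new_command_hash) {
      if (entry == nullptr)
        return "command line not found in log";  // Case 1: version bump cleared the log.
      if (entry->command_hash != new_command_hash)
        return "command line changed";           // Case 2: only the hash function changed.
      return nullptr;  // Up to date.
    }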

@tsniatowski

As a drive-by comment from someone who has confused incompatible ninja versions before, the "build log version" message is much more helpful than just rebuilding everything. The "command line changed" explanation wouldn't be seen by default, right?

@digit-google (Contributor)

Thanks a ton @sesse, this is great, here are my benchmarking results for the following versions:

  • ninja-upstream: Unmodified upstream Ninja sources
  • ninja-sesse1: Your original patchset with the initial arena implementation
  • ninja-sesse1-no-arena: The same without the arena (my version)
  • ninja-sesse3: The version you just uploaded as commit a0f10b9, which is currently the master commit in your repo.
  • ninja-sesse3-arena: The WIP version that is the arena commit in your repo right now (777574b).

First, regarding performance:

$ ARGS="-C out/default --quiet -n"
$ hyperfine --runs=5 "/tmp/ninja-upstream $ARGS" "/tmp/ninja-sesse1 $ARGS" "/tmp/ninja-sesse1-no-arena $ARGS" "/tmp/ninja-sesse3 $ARGS" "/tmp/ninja-sesse3-arena $ARGS"
Benchmark 1: /tmp/ninja-upstream -C out/default -n --quiet
  Time (mean ± σ):     11.448 s ±  0.062 s    [User: 6.813 s, System: 4.612 s]
  Range (min … max):   11.397 s … 11.550 s    5 runs
 
Benchmark 2: /tmp/ninja-sesse1 -C out/default -n --quiet
  Time (mean ± σ):     10.999 s ±  0.054 s    [User: 6.099 s, System: 4.877 s]
  Range (min … max):   10.921 s … 11.061 s    5 runs
 
Benchmark 3: /tmp/ninja-sesse1-no-arena -C out/default -n --quiet
  Time (mean ± σ):     10.663 s ±  0.025 s    [User: 6.101 s, System: 4.537 s]
  Range (min … max):   10.619 s … 10.682 s    5 runs
 
Benchmark 4: /tmp/ninja-sesse3 -C out/default -n --quiet
  Time (mean ± σ):     10.586 s ±  0.032 s    [User: 6.076 s, System: 4.485 s]
  Range (min … max):   10.561 s … 10.642 s    5 runs
 
  Warning: The first benchmarking run for this command was significantly slower than the rest (10.642 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
 
Benchmark 5: /tmp/ninja-sesse3-arena -C out/default -n --quiet
  Time (mean ± σ):     10.820 s ±  0.040 s    [User: 6.160 s, System: 4.636 s]
  Range (min … max):   10.779 s … 10.869 s    5 runs
 
Summary
  /tmp/ninja-sesse3 -C out/default -n --quiet ran
    1.01 ± 0.00 times faster than /tmp/ninja-sesse1-no-arena -C out/default -n --quiet
    1.02 ± 0.00 times faster than /tmp/ninja-sesse3-arena -C out/default -n --quiet
    1.04 ± 0.01 times faster than /tmp/ninja-sesse1 -C out/default -n --quiet
    1.08 ± 0.01 times faster than /tmp/ninja-upstream -C out/default -n --quiet

Second, regarding peak RAM usage:

$ /usr/bin/time -f%M /tmp/ninja-upstream $ARGS
1808180

$ /usr/bin/time -f%M /tmp/ninja-sesse3 $ARGS
1803832

$ /usr/bin/time -f%M /tmp/ninja-sesse3-arena $ARGS
1918924

Conclusion: for this specific build plan, the current patchset at a0f10b9 is the best. Surprisingly the arena is still slightly slower here, but the memory increase has been drastically reduced. I am now curious about numbers for your build plan :)

@digit-google (Contributor)

@tsniatowski: You make an excellent point. The current patchset contains the log version change to 7 and seems to be a great improvement; I say let's ship it!

@sesse (Author) commented Nov 5, 2024

I'm surprised you don't win more than 8%, but I guess your time is dominated by something other than parsing? It's impossible to say for sure without seeing a profile, although 4.6 seconds of system time hints at some heavy OS involvement (stat-ing, perhaps?). (Also, how are you running these against the same build directory without hitting the problem of the hash changing and causing a full rebuild?)

In any case, the arena is now not part of the PR, so we can discuss that separately, I believe.

@jhasse (Collaborator) commented Nov 5, 2024

one of them was Windows line endings, and I doubt rapidhash would take in such a change.

You don't know until you try :)

Where do the changes in lexer.cc come from?

Technically (unlike with e.g. the zlib license) the MIT license is infectious, meaning that we would have to distribute the copyright and the license terms with binary distributions of ninja after this PR. 99% of people aren't aware of this and break this rule all the time though, not sure if it really matters.
Also not sure about BSD-2-Clause.

@sesse (Author) commented Nov 5, 2024

I don't know what changed lexer.cc, I guess something in the build system does? I haven't modified it by hand. :-)

src/deps_log.cc (outdated, resolved)
@jhasse (Collaborator) commented Nov 5, 2024

I don't know what changed lexer.cc, I guess something in the build system does? I haven't modified it by hand. :-)

Ah yes, I think building via Python changes it in-place using re2c. Can you remove that change from your commit? Only needed when modifying lexer.in.cc.

@sesse (Author) commented Nov 5, 2024

I don't really trust hyperfine's statistics, but since you wanted comparative measurements:

Benchmark 1: ./ninja-old -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      3.097 s ±  0.010 s    [User: 2.745 s, System: 0.351 s]
  Range (min … max):    3.089 s …  3.113 s    5 runs
 
Benchmark 2: ./ninja -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      2.122 s ±  0.009 s    [User: 1.730 s, System: 0.391 s]
  Range (min … max):    2.115 s …  2.136 s    5 runs
 
Summary
  ./ninja -C ~/chromium/src/out/Default -n --quiet ran
    1.46 ± 0.01 times faster than ./ninja-old -C ~/chromium/src/out/Default -n --quiet

This is on a 5950X (Zen 3), with a fairly normal NVMe SSD and Debian unstable.

Steinar H. Gunderson added 5 commits November 5, 2024 18:11
This very often holds only a single RAW token, so we do not
need to allocate elements on an std::vector for it in the
common case.

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 5.48 to 5.14 seconds.

Note that this opens up a potential optimization where
EvalString::Evaluate() could just return a StringPiece, without
making a std::string out of it (which requires allocation; this is
about 5% of remaining runtime). However, this would also require
that CanonicalizePath() somehow learned to work with StringPiece
(presumably allocating a new StringPiece if and only if changes
were needed).
This is much faster than std::unordered_map, and also slightly faster
than phmap::flat_hash_map that was included in PR ninja-build#2468.
It is MIT-licensed, and we just include the .h file wholesale.

I haven't done a detailed test of all the various unordered_maps
out there, but this is the overall highest-ranking contender on

  https://martin.ankerl.com/2022/08/27/hashmap-bench-01/

except for ankerl::unordered_dense::map, which requires C++17.

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 5.14 to 4.62 seconds.
This is the currently fastest hash that passes SMHasher and does not
require special instructions (e.g. SIMD). Like emhash8, it is
MIT-licensed, and we include the .h file directly.

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 4.62 to 4.22 seconds.
(NOTE: This is a more difficult measurement than the previous ones,
as it necessarily involves removing the entire build log and doing
a clean build. However, just switching the HashMap hash takes it
to 4.47 seconds or so.)
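As a sketch of how the two commits above fit together (the include path and functor name are illustrative, not taken from the patch): rapidhash() supplies the string hash, and the resulting functor is usable as the Hash parameter of the emhash8 map that replaces std::unordered_map.

    #include <cstddef>
    #include <cstdint>
    #include <string>

    #include "rapidhash.h"  // Vendored header; exact path in the tree may differ.

    // Hash functor feeding the raw bytes to rapidhash(); usable as the Hash
    // parameter of the path/command hash maps.
    struct RapidStringHasher {
      size_t operator()(const std::string& s) const {
        return static_cast<size_t>(rapidhash(s.data(), s.size()));
      }
    };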
@sesse (Author) commented Nov 5, 2024

I took out the lexer.cc changes.

Steinar H. Gunderson added 2 commits November 5, 2024 18:21
ftell() must go ask the kernel for the file offset, in case
someone knew the underlying file descriptor number and seeked it.
Thus, we can save a couple hundred thousand syscalls by just
caching the offset and maintaining it ourselves.

This cuts another ~170ms off a no-op Chromium build.
This cuts off another ~100 ms, most likely because the compiler
doesn't have smart enough alias analysis to do the same (trivial)
transformation.
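A minimal standalone sketch of the ftell-caching idea from the first of these two commits (the actual change lives inside ninja's file reading code):

    #include <cstddef>
    #include <cstdio>

    // Track the file position ourselves instead of calling ftell(), which has
    // to query the kernel in case someone seeked the underlying descriptor
    // behind stdio's back.
    struct CountingReader {
      explicit CountingReader(FILE* f) : file_(f), offset_(0) {}

      size_t Read(char* buf, size_t len) {
        size_t n = fread(buf, 1, len, file_);
        offset_ += static_cast<long>(n);  // Maintain the offset ourselves...
        return n;
      }

      long Tell() const { return offset_; }  // ...so Tell() needs no syscall.

     private:
      FILE* file_;
      long offset_;
    };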
@jhasse (Collaborator) commented Nov 7, 2024

Can you benchmark only the hashmap change against MSVC's std::unordered_map and libc++ (e.g. on macOS)?

@sesse (Author) commented Nov 7, 2024

I haven't really had a Windows machine since 2001 or so, so that's a bit tricky :-) I have a Mac at work, so I can run the test there next week. Or I can probably run a test with libc++ on Linux (with Clang), if that works?

@jhasse (Collaborator) commented Nov 7, 2024

Yes, it would be even more interesting to have a direct comparison on the same system between libstdc++ and libc++ :)

@sesse (Author) commented Nov 7, 2024

Linux, still 5950X:

Benchmark 1: ./ninja-libstdc++ -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      5.262 s ±  0.095 s    [User: 4.153 s, System: 1.108 s]
  Range (min … max):    5.094 s …  5.317 s    5 runs
 
Benchmark 2: ./ninja-libc++ -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      5.192 s ±  0.101 s    [User: 4.097 s, System: 1.093 s]
  Range (min … max):    5.080 s …  5.275 s    5 runs
 
Benchmark 3: ./ninja-libstdc++-emhash8 -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      4.913 s ±  0.066 s    [User: 3.808 s, System: 1.104 s]
  Range (min … max):    4.800 s …  4.973 s    5 runs
 
Benchmark 4: ./ninja-libc++-emhash8 -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      4.815 s ±  0.101 s    [User: 3.720 s, System: 1.094 s]
  Range (min … max):    4.752 s …  4.990 s    5 runs
 
Summary
  ./ninja-libc++-emhash8 -C ~/chromium/src/out/Default -n --quiet ran
    1.02 ± 0.03 times faster than ./ninja-libstdc++-emhash8 -C ~/chromium/src/out/Default -n --quiet
    1.08 ± 0.03 times faster than ./ninja-libc++ -C ~/chromium/src/out/Default -n --quiet
    1.09 ± 0.03 times faster than ./ninja-libstdc++ -C ~/chromium/src/out/Default -n --quiet

@sesse (Author) commented Nov 13, 2024

@jhasse Is there anything missing here for this to be merged?
