string/aarch64: optimizing SVE routines #65

Open · wants to merge 1 commit into master
Conversation

@dssgabriel commented Jan 24, 2024

Optimizing SVE string routines

This PR brings compelling optimizations to some of the SVE-implemented string routines.

About

On the Arm Neoverse V1 microarchitecture, the `INCx` instructions used to increment the loop offset can cause significant slowdowns. One solution is to hoist the retrieval of the SVE register width out of the loop with `CNTx` in the loop prelude and replace the `INCx` with a simple `ADD`. This change is only applied on paths where the whole vector is known to be valid, i.e., when no elements in the register are masked out by a predicate. Conveniently, this is the critical path in most scenarios.
This change should recover the performance lost to `INCx` instructions without affecting performance on other SVE-supporting microarchitectures (e.g., Neoverse V2).
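
For illustration, here is a minimal sketch of the transformation on a byte-element loop. The loop body, register choices, and predicate handling are illustrative only, not taken from the patched routines:

```asm
	// Hypothetical loop skeleton (byte elements), not the actual routine.

	// Before: INCB advances the offset by the vector length, but
	// re-deriving the length on every iteration is slow on Neoverse V1.
0:	ld1b	{z0.b}, p0/z, [x0, x2]	// load one vector of bytes
	// ... test the loaded data, branch out on match/terminator ...
	incb	x2			// offset += vector length (bytes)
	b	0b

	// After: read the vector length once with CNTB in the prelude and
	// advance the offset with a plain ADD on the whole-vector-valid path.
	cntb	x3			// x3 = vector length in bytes (hoisted)
0:	ld1b	{z0.b}, p0/z, [x0, x2]	// load one vector of bytes
	// ... test the loaded data, branch out on match/terminator ...
	add	x2, x2, x3		// offset += vector length (plain ADD)
	b	0b
```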

The proposed changes consistently yield better performance across all tested buffer sizes, with improvements of up to 45% depending on the routine.

Context

This contribution was carried out as part of the EMOPASS European Project at the Laboratory of Parallelism, Networks, and Distributed Algorithmics (LI-PaRAD) of the University of Versailles Saint-Quentin-en-Yvelines (UVSQ, Paris-Saclay University), in collaboration with industrial partners SiPearl and Eviden.

While benchmarking the performance of HPC applications on the Arm Neoverse V1 microarchitecture (in particular on the AWS Graviton3 and Graviton3E processors), we observed that using `INCx` SVE instructions to increment the loop offset can lead to noticeable performance degradation. The same slowdown was also observed in code generated for Data-Dependent Exit (DDE) loops that process null-terminated strings, during a collaboration with the SiPearl compiler team. It seems to be a known issue, as it has been previously mentioned in this LLVM commit.

Tests

The build and all the tests validating the correct behavior of the modified routines pass (the SVE routines were enabled by adding `-march=armv8.4-a+sve` to the `config.mk` file).

Benchmarks

Thorough benchmarks demonstrating the performance improvements are available at the following GitHub repository: https://github.com/dssgabriel/sve-string-routines-benchmarks

Benchmarking was performed on an AWS Graviton3E CPU (hpc7g.16xlarge instance).


I hope the benchmark repository linked above answers most of your questions. As this PR doesn't come from Arm staff, I'd be happy to discuss any specific details below. 🙂

Similar optimizations may be possible for other routines, particularly in the math or pl subprojects, but I haven't had the chance to look into them yet.

Should this PR be merged, how should I send in the signed Arm contributor agreement?

Optimized the following SVE string routines: `memcmp`, `strchr`, `strcmp`, `strcpy`, `strlen`, `strncmp`, `strnlen`, `strrchr`.

On Arm Neoverse V1 microarchitectures, `INCx` instructions used to increment the loop offset can cause significant slowdowns. One solution is to hoist the retrieval of the SVE register width out of the loop using `CNTx` in the loop prelude and replace the `INCx` with a simple `ADD`.
This change should not incur any performance penalty on other SVE-supporting microarchitectures (e.g., Neoverse V2, A64FX).
@zatrazz (Contributor) commented Jan 26, 2024

Thanks for working on this, but if you change your benchmark to call the libc routines instead, you will see that all the patched SVE optimized-routines implementations are slower than the ones provided by glibc [1]. In the memcpy implementation [2] (which is also used in glibc), you will see that SVE is used solely for small sizes (for large sizes ldp/stp yields better performance).

Back when I added these SVE implementations, I just copied them from the old Linaro cortex-strings project; the original author (Richard Henderson) added them mainly to enable his QEMU SVE work. So although these implementations use SVE in a straightforward, manual way, they are far from optimized for real SVE hardware.

Even the routine that shows a slight improvement over glibc, strncmp, has some drawbacks depending on size and alignment. You can check by using a branch [3] where I added the strncmp variant as an experiment, and running the glibc bench tests:

$ make bench-build
$ ./testrun.sh benchtests/bench-strncmp > bencht-strncmp.out
$ <glibc-src>/benchtests/scripts/compare_strings.py -i ./bencht-strncmp.out -s <glibc-src>/benchtests/scripts/benchout_strings.schema.json -a timings,len,align1,align2

You will see that for small and unaligned inputs the SVE variant performs worse than the default one. Also, your benchmark is not representative of str* usage: these functions are generally called on small strings (up to 64 bytes), and possibly unaligned ones. So you should also add benchmarks for such inputs to get a better idea of whether the implementation is indeed better than the base one.

So I think it would be better to just delete the memcmp, strchr, strcmp, strcpy, strlen, strncmp, strnlen, and strrchr SVE variants, since they currently do not show any improvement on current SVE hardware.

[1] https://sourceware.org/git/?p=glibc.git
[2] https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memcpy-sve.S
[3] https://github.com/zatrazz/glibc/tree/azanella/graviton-sve

@Wilco1 (Contributor) commented Jan 30, 2024

I agree with Adhemerval - the SVE implementations are experimental and not optimized at all. For example the main loop of the patched __strlen_aarch64_sve is still 3 times slower than the existing strlen implementations. I believe the way I did the SVE memcpy is the correct approach: start from the fastest implementation and then introduce SVE instructions where it improves performance further.

Deleting the SVE implementations may be too much given they have been in AOR for years now, but maybe we could move them into a different directory.

@zatrazz (Contributor) commented Jan 30, 2024

> I agree with Adhemerval - the SVE implementations are experimental and not optimized at all. For example the main loop of the patched __strlen_aarch64_sve is still 3 times slower than the existing strlen implementations. I believe the way I did the SVE memcpy is the correct approach: start from the fastest implementation and then introduce SVE instructions where it improves performance further.
>
> Deleting the SVE implementations may be too much given they have been in AOR for years now, but maybe we could move them into a different directory.

AFAIK the NDK does not support SVE [1], nor does bionic support SVE for dynamic dispatch (ifunc) [2]. And taking into consideration that none of these functions actually shows any performance improvement, I still think it would be better to just remove them, to avoid projects starting to use them.

[1] https://android.googlesource.com/platform/bionic
[2] https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/arch-arm64/dynamic_function_dispatch.cpp
