-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
string/aarch64: optimizing SVE routines #65
base: master
Are you sure you want to change the base?
Conversation
Optimized the following SVE string routines: `memcmp`, `strchr`, `strcmp`, `strcpy`, `strlen`, `strncmp`, `strnlen`, `strrchr`. On Arm Neoverse V1 microarchitectures, `INCx` instructions used to increment the loop offset can cause significant slowdowns. One solution is to hoist the retrieval of the SVE register width out of the loop using `CNTx` in the loop prelude and replace the `INCx` with a simple `ADD`. This change should not incur any performance penalty on other SVE-supporting microarchitectures (e.g., Neoverse V2, A64FX, etc...).
Thanks for working on this, but if you check by changing your benchmark call to the libc instead you will see that all the fixed SVE optimized-routines implementations are slower than the one provided by glibc. On the memcpy one [2] (which is also used on glibc), you will see that SVE is used solely for small sizes (for large sizes ldp/stp yields better performance). Back when I added these SVE implementations, I just copied them from old Linaro cortex-strings and the original author (Richard Henderson) added them mainly for his QEMU AVE work enablement. So although these implementations use SVE in a manual-oriented way, they are far from optimized in real SVE implementations. Even the string that shows a slight improvement over glibc, strncmp, shows some drawbacks depending on the size and alignment. You can check by using a branch [3] where I added the strncmp as an experiment and running glibc bench tests:
You will see that for small and unaligned sizes the SVE variant shows worse performance than the default one. Also, your benchmark is not representative of str* functions: they are generally used on small strings (up to 64 bytes) and possibly unaligned ones. So you should also add benchmarks for such inputs to get a better idea of whether the implementation is indeed better than the base one. So I think it would be better to just delete the memcmp, strchr, strcmp, strcpy, strlen, strncmp, strnlen, and strrchr SVE since currently they do not show any possible improvement on current SVE hardware. [1] https://sourceware.org/git/?p=glibc.git |
I agree with Adhemerval - the SVE implementations are experimental and not optimized at all. For example the main loop of the patched __strlen_aarch64_sve is still 3 times slower than the existing strlen implementations. I believe the way I did the SVE memcpy is the correct approach: start from the fastest implementation and then introduce SVE instructions where it improves performance further. Deleting the SVE implementations may be too much given they have been in AOR for years now, but maybe we could move them into a different directory. |
Afaik NDK does not support SVE [1] nor bionic supports SVE for dynamic dispatch (ifunc) [2]. And taking in consideration that none of these functions do actually show any performance improvement, I still think it would be better to just remove them to avoid projects to actually start using them. [1] https://android.googlesource.com/platform/bionic |
Optimizing SVE string routines
This PR brings compelling optimizations to some of the SVE-implemented string routines.
About
On the Arm Neoverse V1 microarchitecture,
INCx
instructions used to increment the loop offset can cause significant slowdowns. One solution is to hoist the retrieval of the SVE register width out of the loop usingCNTx
in the loop prelude and replace theINCx
with a simpleADD
. This change is only used in paths where the whole vector is known to be valid, i.e., when no elements in the register are masked using a predicate. Conveniently, this is the critical path in most scenarios.This change should allow reclaiming the performance penalty incurred by
INCx
instructions without affecting performance on other SVE-supporting microarchitectures (e.g., Neoverse V2).The proposed changes systematically yield better performance across all tested buffer sizes, with performance improvements of up to 45% depending on the routine.
Context
This contribution was carried out as part of the EMOPASS European Project at the Laboratory of Parallelism and Distributed Networks and Algorithm (LI-PaRAD) of the University of Versailles Saint-Quentin-en-Yvelines (UVSQ — Paris-Saclay University), in collaboration with industrial partners: SiPearl and Eviden.
While benchmarking the performance of HPC applications on Arm Neoverse V1 microarchitectures (in particular on the AWS Graviton3 and Graviton3E processors), we have observed that using
INCx
SVE instructions to increment the loop offset can lead to noticeable performance degradation. Such slowdown was also observed in code for Data-Dependent Exit (DDE) loops for processing null-terminated strings while collaborating with the SiPearl compiler team. It seems to be a known issue, as it has been previously mentioned in this LLVM commit.Tests
The build and all the tests validating the correct behavior of the modified routines have passed (enabled by adding
-march=armv8.4-a+sve
in theconfig.mk
file).Benchmarks
Thorough benchmarks demonstrating the performance improvements are available at the following GitHub repository: https://github.com/dssgabriel/sve-string-routines-benchmarks
Benchmarking was performed on an AWS Graviton3E CPU (
hpc7g.16xlarge
instance).I hope the benchmark repository given above answers most of your questions. As this PR doesn't come from Arm staff, I'd happily discuss any specific details below. 🙂
Similar optimizations may be possible for other routines, particularly in the
math
orpl
subprojects, but I haven't had the chance to look into it yet.Should this PR be merged, how should I send in the signed Arm contributor agreement?