Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273
Change the implementation of `commonMemSetLargePattern` to use the largest pattern word size supported by the backend into which the pattern can be divided. That is, use 4-byte words if the pattern size is a multiple of 4, otherwise use 2-byte words for even sizes and 1-byte words for odd sizes.

Keep the idea of filling the entire destination region with the first word and only starting the strided fills from the second word, but implement it correctly. The previous implementation produced incorrect results for any pattern size which wasn't a multiple of 4. For example, a pattern of `std::array<uint8_t,5>{{1,2,3,4,5}}` produced `1 2 3 4 5 2 3 4 1 5 3 4 1 2 5 ...` instead of `1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 ...`.

For HIP, the strided fill remains always in 1-byte increments because the HIP API doesn't provide strided multi-byte memset functions like CUDA does. For CUDA, both the initial memset and the strided fills use the largest possible word size.
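As a rough illustration (not the adapter's actual code), here is a minimal sketch of this strategy against the CUDA driver API. The function name `memsetLargePatternSketch` is hypothetical, error handling is omitted, and `size` is assumed to be a multiple of `patternSize`:

```cpp
#include <cuda.h>
#include <cstdint>
#include <cstring>

void memsetLargePatternSketch(CUdeviceptr Ptr, size_t Size, const void *Pattern,
                              size_t PatternSize, CUstream Stream) {
  // Largest word size (in bytes) that evenly divides the pattern.
  const size_t WordSize =
      (PatternSize % 4 == 0) ? 4 : (PatternSize % 2 == 0) ? 2 : 1;
  const size_t NumWords = PatternSize / WordSize;
  const auto *Bytes = static_cast<const uint8_t *>(Pattern);

  // Step 1: fill the whole destination with the first pattern word.
  if (WordSize == 4) {
    uint32_t Word;
    std::memcpy(&Word, Bytes, 4);
    cuMemsetD32Async(Ptr, Word, Size / 4, Stream);
  } else if (WordSize == 2) {
    uint16_t Word;
    std::memcpy(&Word, Bytes, 2);
    cuMemsetD16Async(Ptr, Word, Size / 2, Stream);
  } else {
    cuMemsetD8Async(Ptr, Bytes[0], Size, Stream);
  }

  // Step 2: for every remaining pattern word, overwrite its position in each
  // repetition of the pattern with a strided (2D) memset: width = 1 word,
  // height = number of pattern repetitions, pitch = pattern size in bytes.
  const size_t NumRepeats = Size / PatternSize;
  for (size_t I = 1; I < NumWords; ++I) {
    const CUdeviceptr Dst = Ptr + I * WordSize;
    if (WordSize == 4) {
      uint32_t Word;
      std::memcpy(&Word, Bytes + I * 4, 4);
      cuMemsetD2D32Async(Dst, PatternSize, Word, 1, NumRepeats, Stream);
    } else if (WordSize == 2) {
      uint16_t Word;
      std::memcpy(&Word, Bytes + I * 2, 2);
      cuMemsetD2D16Async(Dst, PatternSize, Word, 1, NumRepeats, Stream);
    } else {
      cuMemsetD2D8Async(Dst, PatternSize, Bytes[I], 1, NumRepeats, Stream);
    }
  }
}
```

The strided calls treat the buffer as `NumRepeats` rows of `PatternSize` bytes, so each call fixes up one word position across all repetitions. This keeps the number of API calls proportional to the pattern size rather than the buffer size.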
Add a new optimisation that skips the strided fills entirely when the pattern is equal to the first word repeated throughout. This is most commonly the case for a pattern of all zeros, but other cases are possible. This optimisation is implemented in both the CUDA and HIP adapters.
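A minimal sketch of that check, with a hypothetical helper name (`patternIsFirstWordRepeated` is not an actual symbol from this patch):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true if every word of the pattern equals its first word, in which
// case the initial whole-buffer memset already produces the final result and
// the strided fills can be skipped.
bool patternIsFirstWordRepeated(const void *Pattern, size_t PatternSize,
                                size_t WordSize) {
  const auto *Bytes = static_cast<const uint8_t *>(Pattern);
  for (size_t Offset = WordSize; Offset < PatternSize; Offset += WordSize) {
    if (std::memcmp(Bytes, Bytes + Offset, WordSize) != 0)
      return false;
  }
  return true; // e.g. an all-zeros pattern of any size
}
```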
CUDA correctness and performance before (left) and after (right) this patch:
We can see the new solution has O(1) complexity for the optimised case of the first word repeated throughout the pattern. For a varied pattern, the number of strided fills equals the number of pattern words, so the cost depends on the pattern size S: it is O(S) when S is a multiple of 4 (4-byte words), roughly twice that when S is merely even (2-byte words), and roughly four times that when S is odd (1-byte words).
Similarly for HIP we see the following, where all results are now correct and we get O(1) performance for the all-zeros pattern. We also see that we don't get the performance boost for pattern sizes that are multiples of 2 and 4 bytes, due to the lack of strided 2-byte and 4-byte memset functions in the HIP API.
The validation/microbenchmark code is available here:
https://gist.github.com/rafbiels/807cfff4b023d8c53a0663be99ffa2ef
and was adapted into a new e2e test in the intel/llvm PR.
intel/llvm PR: intel/llvm#15991