Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

Change the implementation of commonMemSetLargePattern to use the largest pattern word size supported by the backend that evenly divides the pattern size. That is, use 4-byte words if the pattern size is a multiple of 4, 2-byte words for other even sizes, and 1-byte words for odd sizes.

Keep the idea of filling the entire destination region with the first word and only starting the strided fill from the second word, but implement it correctly. The previous implementation produced incorrect results for any pattern size that wasn't a multiple of 4.

For HIP, the strided fill is still always done in 1-byte increments because the HIP API doesn't provide strided multi-byte memset functions like CUDA does. For CUDA, both the initial memset and the strided ones use the largest possible word size.

Add a new optimisation that skips the strided fills completely if the pattern is equal to the first word repeated throughout. This is most commonly the case for a pattern of all zeros, but other cases are possible. This optimisation is implemented in both the CUDA and HIP adapters.
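The description above can be illustrated with a minimal host-side sketch. This is not the adapter code: `choosePatternWordSize`, `isFirstWordRepeated` and `fillWithPattern` are hypothetical names introduced here, and plain `std::memcpy` loops stand in for the device memset and strided memset calls that the real CUDA and HIP adapters would issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: largest word size (4, 2 or 1 bytes) that evenly
// divides the pattern size.
inline std::size_t choosePatternWordSize(std::size_t patternSize) {
  if (patternSize % 4 == 0) return 4;
  if (patternSize % 2 == 0) return 2;
  return 1;
}

// Hypothetical helper: true if every word of the pattern equals the first
// word, e.g. a pattern of all zeros.
inline bool isFirstWordRepeated(const unsigned char *pattern,
                                std::size_t patternSize,
                                std::size_t wordSize) {
  for (std::size_t off = wordSize; off < patternSize; off += wordSize) {
    if (std::memcmp(pattern, pattern + off, wordSize) != 0) return false;
  }
  return true;
}

// Host-side model of the fill: returns a buffer holding `count` copies of
// the pattern. Each loop stands in for one device memset call (assumption).
inline std::vector<unsigned char> fillWithPattern(const unsigned char *pattern,
                                                  std::size_t patternSize,
                                                  std::size_t count) {
  assert(pattern != nullptr && patternSize > 0);
  const std::size_t wordSize = choosePatternWordSize(patternSize);
  std::vector<unsigned char> dst(patternSize * count);

  // Step 1: fill the entire destination with the first word.
  for (std::size_t off = 0; off < dst.size(); off += wordSize) {
    std::memcpy(dst.data() + off, pattern, wordSize);
  }

  // Step 2: if the pattern is just the first word repeated, we are done.
  if (isFirstWordRepeated(pattern, patternSize, wordSize)) return dst;

  // Step 3: for each remaining word, write it at a stride of patternSize.
  // Each outer iteration models one strided memset on the device.
  const std::size_t wordsPerPattern = patternSize / wordSize;
  for (std::size_t w = 1; w < wordsPerPattern; ++w) {
    for (std::size_t i = 0; i < count; ++i) {
      std::memcpy(dst.data() + i * patternSize + w * wordSize,
                  pattern + w * wordSize, wordSize);
    }
  }
  return dst;
}
```

For a 6-byte pattern the sketch selects 2-byte words, so the model issues one full fill followed by two strided passes. For an all-zero pattern of any size, only the initial fill runs.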

Commits on Nov 4, 2024
