Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

Open
wants to merge 1 commit into
base: main
from

Commits on Nov 4, 2024

  1. Fix incorrect outputs and improve performance of commonMemSetLargePattern
    
    Change the implementation of commonMemSetLargePattern to use the largest
    word size supported by the backend that evenly divides the pattern size.
    That is, use 4-byte words if the pattern size is a multiple of 4, 2-byte
    words for other even sizes and 1-byte words for odd sizes.
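
The word-size rule described above can be sketched as a small host-side helper. This is only an illustration: the name `patternWordSize` is hypothetical and is not taken from the adapter sources.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): pick the largest word size,
// out of 4, 2 and 1 bytes, that evenly divides the pattern size.
inline std::size_t patternWordSize(std::size_t PatternSize) {
  assert(PatternSize > 0 && "pattern must not be empty");
  if (PatternSize % 4 == 0)
    return 4; // multiples of 4 use 4-byte words
  if (PatternSize % 2 == 0)
    return 2; // remaining even sizes use 2-byte words
  return 1;   // odd sizes fall back to single bytes
}
```

For example, a 12-byte pattern would be handled as three 4-byte words, a 6-byte pattern as three 2-byte words, and a 3-byte pattern as three single bytes.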
    
    Keep the idea of filling the entire destination region with the first word,
    and only start strided fill from the second, but implement it correctly.
    The previous implementation produced incorrect results for any pattern size
    which wasn't a multiple of 4. For HIP, the strided fill is still always
    done in 1-byte increments because the HIP API doesn't provide strided
    multi-byte memset functions like CUDA does. For CUDA, both the initial
    memset and the strided ones use the largest possible word size.
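
To make the fill order concrete, here is a minimal host-side model of the two steps described above. It operates on a plain byte vector and calls no CUDA or HIP API; the function name and signature are hypothetical, and it assumes the destination size is a multiple of the pattern size and the pattern size is a multiple of the word size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side model only; fillLargePatternModel is a hypothetical name.
inline void fillLargePatternModel(std::vector<std::uint8_t> &Dst,
                                  const std::vector<std::uint8_t> &Pattern,
                                  std::size_t WordSize) {
  const std::size_t PatternSize = Pattern.size();
  assert(WordSize > 0 && PatternSize > 0);
  assert(PatternSize % WordSize == 0 && Dst.size() % PatternSize == 0);

  // Step 1: fill the entire destination with the first pattern word.
  for (std::size_t Off = 0; Off < Dst.size(); Off += WordSize)
    std::memcpy(Dst.data() + Off, Pattern.data(), WordSize);

  // Step 2: for every later word, write it once per pattern repetition,
  // starting at that word's offset and stepping by the pattern size.
  const std::size_t NumWords = PatternSize / WordSize;
  for (std::size_t W = 1; W < NumWords; ++W) {
    const std::uint8_t *Word = Pattern.data() + W * WordSize;
    for (std::size_t Off = W * WordSize; Off < Dst.size(); Off += PatternSize)
      std::memcpy(Dst.data() + Off, Word, WordSize);
  }
}
```

In this model each iteration of the outer loop in step 2 stands in for one strided memset on the device. Passing `WordSize = 1` for step 2 mimics the byte-wise strided fill that the commit message describes for HIP.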
    
    Add a new optimisation skipping the strided fills completely if the pattern
    is equal to the first word repeated throughout. This is most commonly the
    case for a pattern of all zeros, but other cases are possible. This
    optimisation is implemented in both CUDA and HIP adapters.
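
The check behind this optimisation might look like the sketch below. The helper name `isFirstWordRepeated` is hypothetical, and the code assumes the pattern size is a multiple of the word size; the actual adapters may implement the comparison differently.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the PR): returns true when every word of
// the pattern equals the first word, so the initial whole-region fill
// already produces the final result and strided fills can be skipped.
inline bool isFirstWordRepeated(const void *Pattern, std::size_t PatternSize,
                                std::size_t WordSize) {
  const auto *Bytes = static_cast<const unsigned char *>(Pattern);
  for (std::size_t Off = WordSize; Off < PatternSize; Off += WordSize) {
    if (std::memcmp(Bytes, Bytes + Off, WordSize) != 0)
      return false; // found a word that differs from the first one
  }
  return true;
}
```

An all-zero pattern passes this check for any word size, as does a pattern such as `{1, 2, 1, 2}` with 2-byte words.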
    rafbiels committed Nov 4, 2024
    cc52821