`remove` vectorization #4987 (base: main)
Conversation
Irony: a PR that adds vectorization entitled "remove vectorization".
Modern AMD data would be interesting.
```cpp
uint8_t _Shuf[_Size_v][_Size_h];
uint8_t _Size[_Size_v];
```
I want to elaborate on the table element width decision, though it probably doesn't deserve a code comment. `_Shuf` is explicitly widened for the 32-bit and 64-bit cases using `_mm256_cvtepu8_epi32(_mm_loadu_si64(...))`; `_Size` is implicitly widened in all cases. This widening costs some extra instructions for `_Shuf`. For `_Size` it is apparently free (`movzx` instead of `mov`).
I prefer having to widen over having larger tables. Larger tables consume more cache, which is hard to notice in a synthetic benchmark but likely to be noticeable in a realistic program.
Currently the `_Shuf` table, the larger of the two, is 128 bytes for 64-bit elements, 2 KiB for 8-bit and 32-bit elements, and 4 KiB for 16-bit elements. Typical L1 cache size is 64 KiB per core.
I'm now seeing multiple solutions for how to do …
📜 The algorithm
In-place `remove` algorithm that uses the bit mask of vector comparisons as a shuffle index to remove certain elements. The destination is advanced by a size taken from a lookup table too, although `popcnt` could have been used. The details vary depending on element size:

- 8-bit and 16-bit elements use `pshufb`/`_mm_shuffle_epi8` to remove elements.
- 32-bit and 64-bit elements use `vpermd`/`_mm256_permutevar8x32_epi32` to remove elements, which is cross-lane. SSE fallbacks are used with smaller tables, still surprisingly more efficient than scalar.
- To obtain the mask, 8-bit elements use `pmovmskb`. 32-bit and 64-bit elements use `vmovmskps`/`vmovmskpd`; though they are for floating types, they fit well and avoid the need for cross-lane swizzling to compress the mask. For 16-bit elements, `packsswb` is used, although `pshufb` could have been used as well.

🔍 Find first!
Before even starting, a find is performed to locate the first element to be removed. This is done for correctness, and there are also performance reasons why it is good:
The existing `find` implementation is called. Hypothetically I could implement it inline and save some instructions in some cases, but such an optimization has a negligible effect on performance while increasing complexity noticeably. This might be revisited for a future `remove_copy`, if that and this end up sharing the implementation.

The algorithm removes elements from the source vector (of 8 or fewer elements) with a shuffle operation, so that the non-removed elements are placed contiguously in that vector. Then it writes the whole vector to the destination and advances the destination pointer by the number of non-removed elements.
As a result:
I have no doubts that overwriting elements in the resulting range to some intermediate values before setting them to the expected values is correct. The write and the data race (in abstract machine terms) exist anyway, so the extra write is not observable.
I have concerns regarding damaging the removed range. Changing these values is observable.
I'd appeal to the fact that elements in the removed range stay in a valid-but-unspecified state, although I understand that the purpose of the standard's wording is to enable moving non-trivially-copyable types, not to do what I did.
Note that a `remove_copy` vectorized in a similar way would have to avoid superfluous writes anyway.

🗄️ Memory usage
Unlike most other vectorized algorithms, this one uses large lookup tables: the 8-bit and 32-bit variants use a 2 KiB table, and the 16-bit variant uses a 4 KiB table.
This has different performance characteristics compared to purely computational optimizations. In particular, it tends to behave worse in programs whose critical path doesn't fit the cache well. This doesn't apply to benchmarks, but unfortunately it often applies to realistic programs, especially ones not written with performance in mind.
I believe the optimization is still good, or at least not bad, most of the time where it is needed.
⏱️ Benchmark results