TL/MLX5: Fix segmentation fault in a2a mpi test #996
Merged
Related bug: https://redmine.mellanox.com/issues/3706049
What
Set rcache alignment back from ucc_get_page_size() to UCS_PGT_ADDR_ALIGN
Re-activate tl/mlx5 alltoall
Why?
This bug reproduces only with UCX versions older than openucx/ucx@85d2d9d0f, the commit that introduced dynamic rcache alignment.
#877 (specifically b13b87d) set the alignment to ucc_get_page_size(), whereas it was UCS_PGT_ADDR_ALIGN before.
Setting the alignment back to UCS_PGT_ADDR_ALIGN solves the bug. The root cause has not yet been identified.
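For readers unfamiliar with what the alignment parameter changes, here is a minimal sketch of how a cache that rounds regions to an alignment boundary treats the same buffer under a small fixed alignment versus a page-sized one. It does not use the UCX or UCC APIs and does not explain the segmentation fault: every name below is hypothetical, and the values 16 and 4096 are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the real UCX/UCC definitions. */
#define EXAMPLE_FINE_ALIGN 16u   /* assumed small, fixed alignment */
#define EXAMPLE_PAGE_ALIGN 4096u /* assumed system page size */

/* Half-open address range [start, end) covered by a cached region. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} example_region_t;

/*
 * Round the buffer [addr, addr + len) outward to the given alignment:
 * the start is rounded down and the end is rounded up.
 * The alignment must be a non-zero power of two.
 */
static example_region_t example_align_region(uintptr_t addr, size_t len,
                                             uintptr_t align)
{
    example_region_t region;

    assert((align != 0) && ((align & (align - 1)) == 0));

    region.start = addr & ~(align - 1);
    region.end   = (addr + len + align - 1) & ~(align - 1);
    return region;
}
```

With a buffer at 0x1234 of length 0x100, the fine alignment yields the region [0x1230, 0x1340), while the page alignment yields [0x1000, 0x2000). The same user buffer therefore maps to a much larger cached region under page alignment.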
Performance tests
Below is a performance comparison with tl/ucp and hcoll.