Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTEST/IB: Don't fatal if an SL is blocked #10197

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tvegas1
Copy link
Contributor

@tvegas1 tvegas1 commented Oct 1, 2024

What

Proposal to keep running gtests when TB does not support some SL.

Why ?

We have test bed that does not let traffic through for SL > 7, and it is causing gtest crash. In this specific traffic test, we can keep going, signaling failure anyways.

How ?

Ignore fatal.

@brminich
Copy link
Contributor

brminich commented Oct 1, 2024

can you pls post an output of filed test with this PR?

@tvegas1
Copy link
Contributor Author

tvegas1 commented Oct 1, 2024

Log is still dumped and it contains the corresponding SL.

[ RUN      ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/1 <rc_mlx5/mlx5_2:1>
[     INFO ] < Transport retry count exceeded on mlx5_2:1/IB (synd 0x15 vend 0x81 hw_synd 0/0) >
[     INFO ] < RC QP 0x6782 wqe[0]: SEND --e [inl len 18] [rqpn 0x6780 dlid=17 sl=8 port=1 src_path_bits=0 dgid=fe81::ec0d:9a03:c0:45bc flow_label=0x0 sgid_index=0 traffic_class=0] >
uct/ib/test_ib.cc:74: Failure
Expected equality of these values:
  sizeof(send_data)
    Which is: 8
  recv_buffer->length
    Which is: 0

...

[     INFO ] < Transport retry count exceeded on mlx5_2:1/IB (synd 0x15 vend 0x81 hw_synd 0/0) >
[     INFO ] < RC QP 0x679e wqe[0]: SEND --e [inl len 18] [rqpn 0x679c dlid=17 sl=15 port=1 src_path_bits=0 dgid=fe81::ec0d:9a03:c0:45bc flow_label=0x0 sgid_index=0 traffic_class=0] >
uct/ib/test_ib.cc:74: Failure
Expected equality of these values:
  sizeof(send_data)
    Which is: 8
  recv_buffer->length
    Which is: 0
[  FAILED  ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/1, where GetParam() = rc_mlx5/mlx5_2:1 (129841 ms)
[ RUN      ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/2 <rc_mlx5/mlx5_3:1>
[     SKIP ] (OOO SL mask couldn't be detected for mlx5_3:1)
[       OK ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/2 (59 ms)
[ RUN      ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/3 <rc_mlx5/mlx5_4:1>
[       OK ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/3 (822 ms)
[----------] 4 tests from rc_mlx5/test_uct_ib_sl (131975 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (131975 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] rc_mlx5/test_uct_ib_sl.check_ib_sl_config/1, where GetParam() = rc_mlx5/mlx5_2:1

 1 FAILED TEST

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure it solves the issue, since we still want to fail if the SL is mapped to a valid VL

@tvegas1
Copy link
Contributor Author

tvegas1 commented Oct 7, 2024

i'm not sure it solves the issue, since we still want to fail if the SL is mapped to a valid VL

The test still fails as it's behavior is not changed. What is changed is that previously when it would be crashing, now it fails instead.
I think it's ok to add because the test checks are related to ability to transfer data, so the transport retry count failure is somehow related to the test, not an unrelated error.

@tvegas1
Copy link
Contributor Author

tvegas1 commented Oct 14, 2024

/azp run UCX PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants