tpetra: broken unit tests with cuda 12.4 + h100 gpus #13399
Comments
@vasylivy Relevant machine is down for upgrades. We will compare against our configuration and try to reproduce when it comes back up.
Tested config 1 with UVM turned off (-DKokkos_ENABLE_CUDA_UVM=OFF): the unit tests pass, so it would appear to be UVM related.
Yaro
@vasylivy I built all the unit tests the way the perf tests build on Hops and they all pass. The RDC build failed because evidently you need cuSPARSE enabled to build with RDC (why?). I will fix that and report back when the build finishes. I can try a UVM one as well w/o RDC. As an aside, I just got new MPI settings from @jjellio that I need to try.
@vasylivy Yeah, it appears to be UVM, because RDC by itself has exactly 1 failing test.
@vasylivy With UVM on, the tests on vortex passed. I'm going to try the CEE A100s and H100s to see if this is machine-specific or accelerator-specific.
Edit: CEE V100 & A100 cuda-12.4 tests all pass.
Second edit: CEE H100 cuda-12.4 has a number of failing tests. So our problem is not CUDA-version-specific; it is hardware-specific.
@csiefer2 Had one failure in tpetra on the Ada arch w/ UVM, so it would indeed appear specific to Hopper.
Hi,
Reporting broken unit tests with cuda 12.4 + h100 gpus. See configuration 1 reported here #13397.
The tests that time out at 300s were fine with the non-UVM config. I'll have to retry these later. If you have a recommended timeout, let me know.
Thanks,
Yaro