muelu: broken unit tests with cuda 12.4 + h100 gpus #13397
Comments
Automatic mention of the @trilinos/muelu team
Am I reading this right that "approach 1", the build with passing tests, has cuSPARSE, UVM, and CUDA RDC disabled but is otherwise the same as the failing build? Or did I miss some other change? Were those three changes together necessary to make all the tests pass?
The machine is down, so I haven't isolated which of the options was the culprit, but yes: approach 1 with those options enabled had those failing tests. The second snippet (also using cmake directly, but with those options disabled) had all tests pass. The last config, using spack env / install with the shown env and variants, also reproduced some of the same issues as approach 1. Yaro
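One way to isolate the culprit once the machine is back is to toggle the three options one at a time before trying the full on/off matrix. The sketch below only enumerates the configurations to build. The labels are the option names used in this thread, not actual CMake variable names, which are not reproduced here and would need to be mapped by hand.

```python
from itertools import product

# Labels taken from the discussion; these are placeholders, NOT real
# CMake variable names (those are an assumption left to the reader).
OPTIONS = ("cuSPARSE", "UVM", "CUDA RDC")


def option_matrix(options=OPTIONS):
    """Return every ON/OFF combination of the options as a list of dicts."""
    return [dict(zip(options, states))
            for states in product((False, True), repeat=len(options))]


def single_toggles(options=OPTIONS):
    """Return configs that disable exactly one option (cheapest first pass)."""
    return [{name: name != off for name in options} for off in options]


if __name__ == "__main__":
    for config in single_toggles():
        disabled = [name for name, enabled in config.items() if not enabled]
        print("disable only:", ", ".join(disabled))
```

Three single-toggle builds are usually enough to find a single culprit; the full matrix of eight builds is only needed if the failures depend on a combination of options.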
I'm not surprised stuff doesn't work with RDC enabled. I don't think an RDC-enabled build is tested on any platform, because it takes sooooooo long. @sebrowne Correct me if I'm wrong.
Tested config 1 with the following turned off
With those turned off, the unit tests pass, so the failures appear to be UVM-related. Yaro
Triaging the failing tests for config 1 shows the following errors
Various 300s timeouts that do not occur otherwise and do not report any errors
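To rerun only the MueLu tests with the same 300 s timeout when checking whether the timeouts persist, something like the following sketch could be used. It assumes the MueLu test names match the regex `MueLu`, and the build directory passed to `run_tests` is a hypothetical path. The `-R`, `--timeout`, and `--output-on-failure` flags are standard ctest options, but `--timeout` only sets the default, so tests that carry their own TIMEOUT property may not be affected.

```python
import subprocess


def ctest_command(name_regex="MueLu", timeout_s=300):
    """Build a ctest invocation limited to matching tests with a default timeout."""
    return ["ctest", "-R", name_regex,
            "--timeout", str(timeout_s),
            "--output-on-failure"]


def run_tests(build_dir, **kwargs):
    """Run ctest inside build_dir; returns ctest's exit code (non-zero on failure)."""
    return subprocess.run(ctest_command(**kwargs), cwd=build_dir).returncode


if __name__ == "__main__":
    # Print the command instead of running it; build_dir is site-specific.
    print(" ".join(ctest_command()))
```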
Hi,
I'm seeing various errors in the muelu unit tests on an nvidia h100 gpu using cuda 12.4 with the kokkos uvm flag enabled. I'm not sure which approach is preferred for reporting here, but I've tested two different approaches to building Trilinos for h100s. Can someone with access to h100s try to reproduce the failures?
Approach 1. Use cmake directly and build Trilinos master SHA bf922e75428. The config file is shown below.
The following set of muelu unit tests fails; note that some of these are timeouts. Doubling the timeout didn't help in other testing, so I've tried to be consistent and set it to 300 s across all tests.
I see various errors across these tests, including cuda errors, e.g. from triaging some of these failures
If, on the other hand, I build using the following modified trilinos config, then ALL tests pass within the same 300s timeout and on the same machine.
Approach 2. Instead of using cmake directly, use spack to install trilinos master with the following spack env activated. You may need to tweak the yaml file depending on the machines / modules available.
Note that no one has updated the spack develop branch yet, so you need to update kokkos to 4.4.00 by making the following modifications to the built-in package.py found under /path/to/spack/var/spack/repos/builtin/packages/package-name
When installing, use --keep-stage to keep the build directories and run ctest from there after finishing. Note that the Amesos tests don't build, so you probably want to turn those off. When I tested things using this approach, the Trilinos master SHA that was checked out was 2ad26029, and the following set of MueLu tests failed.
I did not see the same set of errors as with the direct cmake approach (the options are probably slightly different), but the tests are still hit with various errors, e.g.
Thanks,
Yaro