
Inaccurate system information generated by CM submission tree script #364

Open
rysc3 opened this issue Oct 8, 2024 · 2 comments
Labels
bug Something isn't working scc24

Comments

rysc3 commented Oct 8, 2024

There are some inaccuracies in the information that the script generates, some more important than others, I believe. For example, it sets our OS to Ubuntu since that is the operating system inside the container, even though we are running Rocky outside the container (I figure this is not a big deal). More importantly, when I generate results using the given default cm run offline script for both base and main for scc24:

https://docs.mlcommons.org/cm4mlperf-inference/

it ends up saying we are using 3x H100 NVL. Our system has 4x H100 NVL, and they are all accessible. nvidia-smi yields the correct result inside the container, and at multiple steps during runtime we can see it iterate over CUDA devices and list indexes 0..3.
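To illustrate the discrepancy, here is a small sketch (not part of CM) that counts GPUs the same way nvidia-smi reports them; the `count_gpus` helper and the sample output below are illustrative, not taken from the CM script:

```python
import csv
import io
import subprocess

def count_gpus(csv_text=None):
    """Count visible GPUs by parsing the output of
    `nvidia-smi --query-gpu=index,name --format=csv,noheader`.
    If csv_text is None, run nvidia-smi directly."""
    if csv_text is None:
        csv_text = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
            text=True,
        )
    # Each non-empty CSV row corresponds to one visible GPU.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return len(rows)

# Sample output matching the system described above (4x H100 NVL):
sample = """0, NVIDIA H100 NVL
1, NVIDIA H100 NVL
2, NVIDIA H100 NVL
3, NVIDIA H100 NVL
"""
print(count_gpus(sample))  # 4
```

If the detection logic counted GPUs this way, it would report 4x rather than 3x on this system.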

Furthermore, I'm not sure whether this is the expected behavior or an error, but by default, without making any new configurations, shouldn't it run on only a single GPU and record that in the result accordingly? I've run it manually, monitored it, and verified that it only ever uses the same GPU (index 0), so I would expect it to report only a single H100 being utilized.
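For reference, the usual way to deliberately pin a run to a single GPU is to set CUDA_VISIBLE_DEVICES before the benchmark process starts; a minimal sketch (CUDA_VISIBLE_DEVICES is standard CUDA runtime behavior, not a CM-specific setting):

```python
import os

# Expose only GPU index 0 to the process; the CUDA runtime then
# renumbers it as device 0 and hides the other GPUs entirely.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0
```

If the run were restricted this way, reporting 1x H100 would at least match what the process can actually see.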

Either way, I figure this should report 1x or 4x. The cm run scripts I'm referencing are here:

https://docs.mlcommons.org/inference/benchmarks/text_to_image/reproducibility/scc24/

And see my submissions on the leaderboard here, which show the 3x H100s mentioned above:
https://docs.mlcommons.org/cm4mlperf-inference/

@arjunsuresh
Contributor

Hi @rysc3, yes, that's a bug and it should be fixed here.

Running on a single GPU: this is happening with the reference implementation, right? Actually, that's a problem with the reference implementation, and there will be points if you can make it run using all the GPUs and submit a PR to the inference repository.

For Nvidia implementation - all GPUs are expected to be used.

@arjunsuresh arjunsuresh added bug Something isn't working scc24 labels Oct 8, 2024

rysc3 commented Oct 8, 2024

Yes, this is when using the reference implementation.
