
Azure GPU investigation #61

Open
vsoch opened this issue Sep 25, 2024 · 0 comments

vsoch commented Sep 25, 2024

When parsing the data I noticed that the ECC setting (yes/no) was inconsistent, and at times it seemed random. I think this could be an important finding for our study: from what I am reading, NVIDIA GPUs have ECC (error-correcting code) memory that allows the system to detect when memory errors occur. That sounds great, but enabling it slows down VRAM. Specifically:

Turning ECC on:

  • It reduces the amount of available memory by 12.5%.
  • It makes context synchronization more expensive.
  • Uncoalesced memory transactions are more expensive when ECC is enabled than otherwise.
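To see which mode a given node is actually running in, something like the following could be run on each node. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and that the `ecc.mode.current` query field is supported by the installed driver; the function names `parse_ecc_modes` and `query_ecc_modes` are made up for illustration and are not part of our parsing code.

```python
import subprocess


def parse_ecc_modes(output):
    """Turn nvidia-smi output (one line per GPU) into True/False/None values."""
    modes = []
    for line in output.splitlines():
        value = line.strip()
        if not value:
            continue
        if value == "Enabled":
            modes.append(True)
        elif value == "Disabled":
            modes.append(False)
        else:
            # Anything else (for example "[N/A]") is treated as unknown.
            modes.append(None)
    return modes


def query_ecc_modes():
    """Ask nvidia-smi for the current ECC mode of every GPU on this node."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=ecc.mode.current", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_modes(result.stdout)
```

Calling `query_ecc_modes()` on a GPU node should return one entry per GPU, which could then be recorded alongside the rest of the run metadata.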

That is from https://www.cudahandbook.com/. It makes me wonder about the implications of having ECC on for some GPUs and off for others. I think we will need to look at the data more closely to see how consistent the setting is within each experiment environment and size.
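A rough way to do that check, once the parsed data is in hand, might look like the sketch below. The record layout (`environment`, `size`, and `ecc` keys) and the sample values are assumptions for illustration only, not the actual schema or results from our parsed data.

```python
from collections import defaultdict


def ecc_by_group(records):
    """Collect the distinct ECC values seen for each (environment, size) pair."""
    groups = defaultdict(set)
    for rec in records:
        groups[(rec["environment"], rec["size"])].add(rec["ecc"])
    return {key: sorted(values) for key, values in groups.items()}


def mixed_groups(records):
    """Return the (environment, size) pairs where ECC was not consistent."""
    return [key for key, values in ecc_by_group(records).items() if len(values) > 1]


# Made-up placeholder records, only to show the expected input shape.
example = [
    {"environment": "env-a", "size": 4, "ecc": "yes"},
    {"environment": "env-a", "size": 4, "ecc": "no"},
    {"environment": "env-b", "size": 8, "ecc": "yes"},
]
print(mixed_groups(example))
```

Any pair reported by `mixed_groups` is a place where runs with ECC on and off were mixed, so timings from those runs may not be directly comparable.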
