CUDA support #71

Open
ocaisa opened this issue Feb 10, 2021 · 7 comments

@ocaisa
Member

ocaisa commented Feb 10, 2021

I was experimenting with CUDA support within EESSI and ran into the issue that, when using a CUDA toolkit compiled with the EESSI stack, the CUDA driver libraries from the host are not seen by the executables created by nvcc. This is because the executables look for the driver libraries in the prefix, where they do not exist. There are a few viable solutions:
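
For reference, a minimal probe (a sketch, assuming Python is available inside the prefix environment) that shows whether the driver library can be resolved at all from within the compatibility layer:

```python
import ctypes

# Minimal sketch (assumption: run from within the EESSI/Gentoo Prefix environment)
# to show the failure mode: the prefix loader only searches its own paths, so the
# host's CUDA driver library may not be resolvable.
try:
    libcuda = ctypes.CDLL("libcuda.so.1")
except OSError as err:
    print(f"libcuda.so.1 not found by the prefix loader: {err}")
else:
    rc = libcuda.cuInit(0)  # CUresult: 0 (CUDA_SUCCESS) means the driver initialised
    print(f"cuInit returned {rc}")
```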

@peterstol
Contributor

For systems with glibc in the prefix, using /usr/lib64/nvidia as the location may be a good choice.

On my Bright Computing system, the CUDA libraries can simply be linked from /cm/local/apps/cuda/libs/current/lib64.
Creating the symlinks requires admin privileges, though, and the location can vary between systems.
Could EESSI provide different CUDA versions and use the one matching the kernel driver, in a similar way to how the architecture is matched with Archspec?
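
Something along those lines could be sketched by querying the driver for the CUDA version it supports and picking a matching toolkit; the mapping below is purely illustrative, not an authoritative compatibility table:

```python
import ctypes

# Illustrative sketch of Archspec-style matching for CUDA: ask the host driver
# which CUDA version it supports and pick a toolkit accordingly. The mapping is
# a made-up example, not an authoritative driver/toolkit compatibility list.
TOOLKIT_FOR_DRIVER = {
    11000: "CUDA/11.0",  # hypothetical module names
    11020: "CUDA/11.2",
}

libcuda = ctypes.CDLL("libcuda.so.1")  # assumes the host's libcuda is already reachable
version = ctypes.c_int(0)
libcuda.cuDriverGetVersion(ctypes.byref(version))  # e.g. 11020 for a CUDA 11.2 driver

supported = [v for v in TOOLKIT_FOR_DRIVER if v <= version.value]
print(TOOLKIT_FOR_DRIVER[max(supported)] if supported else "no matching CUDA toolkit")
```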

@bedroge
Collaborator

bedroge commented Feb 17, 2021

Bart reported on Slack yesterday that the libcuda.so of newer CUDA versions no longer links against the versioned libnvidia-fatbinaryloader.so.XXX.YY:

Removed libnvidia-fatbinaryloader.so from the driver package. This functionality is now built into other driver libraries.

This means that, with these newer versions, we could also just make symlinks to the host's libcuda. The only annoying thing is that the location can apparently differ between distros, so I guess we would need something like a variant symlink that allows a site to override the location of libcuda.so if necessary.
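
As an illustration of those distro differences, a small sketch that probes a few commonly used locations (the list is an assumption, not exhaustive):

```python
from pathlib import Path

# Probe a few locations where the host's libcuda.so.1 is commonly installed;
# this list is an assumption and certainly not exhaustive, which is exactly why
# a site-overridable (variant-symlink-style) location would be needed.
CANDIDATE_DIRS = [
    "/usr/lib64",                               # e.g. RHEL/CentOS
    "/usr/lib/x86_64-linux-gnu",                # e.g. Debian/Ubuntu
    "/usr/lib64/nvidia",
    "/cm/local/apps/cuda/libs/current/lib64",   # Bright Computing, as mentioned above
]

found = [d for d in CANDIDATE_DIRS if (Path(d) / "libcuda.so.1").exists()]
print(f"host libcuda.so.1 found in: {found or 'none of the candidate directories'}")
```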

@ocaisa
Member Author

ocaisa commented Feb 17, 2021

I don't understand this stuff well enough, but I think a symlink might only get us out of jail for pure CUDA code? Comparing the lib64 directory to the stubs directory inside a CUDA toolkit installation, it looks like you also need libnvidia-ml.so.

I took a look at the OpenGL configuration for JSC and if we want to use visualisation capabilities on the available GPU, we would also need some of the other driver libraries.
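
One way to see which driver-side libraries a toolkit expects is to list the stubs it ships; a sketch, with the toolkit path being a hypothetical example:

```python
from pathlib import Path

# The stub libraries shipped with a CUDA toolkit (libcuda.so, libnvidia-ml.so, ...)
# are exactly the ones that have to come from the host driver at run time, so
# listing them gives a rough inventory of what needs to be exposed.
cuda_home = Path("/path/to/CUDA/11.x")  # hypothetical toolkit installation prefix
stubs = cuda_home / "targets" / "x86_64-linux" / "lib" / "stubs"

for lib in sorted(stubs.glob("lib*.so")):
    print(lib.name)
```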

@bedroge
Collaborator

bedroge commented Feb 17, 2021

Yes, you probably need more, but I guess the same thing could be done for those? The advantage of that approach could be that it would, for instance, work out of the box on all systems that have these libraries in /usr/lib64. On other systems, the variable for the variant symlink would have to be overridden in the CVMFS configuration.

@ocaisa
Member Author

ocaisa commented Feb 17, 2021

We could probably learn a lot from what happens with containers and GPUs, since they have to deal with the same issues around passing through driver libraries.

@ocaisa
Member Author

ocaisa commented Apr 8, 2021

Thanks to #91, the 2021.03 release of EESSI can successfully compile and run CUDA code if symlinks to the driver libraries are placed in /opt/eessi/lib.
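
For completeness, a hedged sketch of that symlink step (the host directory is an example and differs per system; creating the links typically needs admin privileges):

```python
import os
from pathlib import Path

# Expose the host's driver libraries under /opt/eessi/lib so executables built
# with the EESSI stack can resolve them. The host directory is an example; it
# differs between systems (see the candidate locations earlier in this thread).
host_dir = Path("/usr/lib64")        # assumption: where the host driver libraries live
eessi_dir = Path("/opt/eessi/lib")   # location mentioned above
eessi_dir.mkdir(parents=True, exist_ok=True)  # typically requires admin privileges

for name in ("libcuda.so.1", "libnvidia-ml.so.1"):
    target = host_dir / name
    link = eessi_dir / name
    if target.exists() and not link.exists():
        os.symlink(target, link)
        print(f"{link} -> {target}")
```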

@bedroge bedroge assigned ocaisa, bedroge and boegel and unassigned bedroge Nov 15, 2021
poksumdo pushed a commit to poksumdo/compatibility-layer that referenced this issue Jun 8, 2023
Add jsonschema as test dependency for archspec