
Add AMD GPU AutoDetect=rsmi support #2057

Open

wants to merge 1 commit into base: 3.x

Conversation

antonycleave

This allows AutoDetect=rsmi to be used in the Slurm gres.conf.
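
For reference, a minimal gres.conf enabling this could look like the following (node names are just placeholders):

# gres.conf
AutoDetect=rsmi
# or only for selected nodes (node names are placeholders):
NodeName=gpu[01-04] AutoDetect=rsmi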

Technically no modification to the spec file is needed, but this serves as a reminder that the rocm-smi-lib RPM needs to be installed as a build requirement on the OBS nodes.

The RPM is included in the AMD ROCm repos, which can be found here:
https://repo.radeon.com/rocm/el9/6.2.2/main

I have published a release here with the RPMs attached to verify that it works:
https://github.com/antonycleave/openhpc-slurm-with-rocm/releases


Test Results

0 files (-18)   0 suites (-18)   0s ⏱️ (-27s)
0 tests (-53)   0 ✅ (-49)   0 💤 (-4)   0 ❌ (±0)
0 runs (-66)   0 ✅ (-62)   0 💤 (-4)   0 ❌ (±0)

Results for commit f1007ab. Comparison against base commit 5926e6b.

@adrianreber
Member

Thanks for the pull request. Also thanks for providing a build with the feature enabled. That makes things easier.

I will bring this to the technical steering committee to see what they think.

An external repository providing a dependency is not the usual approach in OpenHPC, but from the description at AMD it seems to be all open source. We also include the Intel repository in our build system, so we already do something similar.

It is not clear to me how exactly this works. There seems to be a new Slurm plugin called gpu_rsmi.so. Does this plugin have a runtime dependency on any package in the AMD repository? Looking at the source code, it seems to.

Without talking to the TSC I would say the plugin needs to be in a separate sub-package to not pull in dependencies for people who are not interested in this feature.
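
Just as a sketch (the sub-package name and dependencies are not meant to be final), something along these lines in the spec file:

# sketch only, layout to be discussed
%package rsmi
Summary: RSMI-based GPU autodetection plugin for Slurm
Requires: %{name}%{?_isa} = %{version}-%{release}
Requires: rocm-smi-lib

%description rsmi
Slurm gpu_rsmi plugin built against the AMD ROCm SMI library.

%files rsmi
%{_libdir}/slurm/gpu_rsmi.so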

As this probably also requires runtime packages installed we need the correct runtime dependencies expressed in the RPM and a way to easily enable the AMD repository. For the Intel repository we ship a DNF repository definition.
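
Roughly analogous to what we ship for Intel, a repository definition could look like this (the repo id matches the AMD repository above; the gpgkey URL is an assumption):

[ROCm-6.2]
name=AMD ROCm 6.2
baseurl=https://repo.radeon.com/rocm/el9/6.2.2/main
enabled=1
gpgcheck=1
# gpgkey URL is an assumption, please verify against the AMD documentation
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key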

This also needs to be added to the documentation as an optional step. If people enable this, the recipes should automatically enable the repository and install the corresponding runtime dependencies.

GitHub Actions also needs to deal correctly with this runtime dependency.

We would also need some tests to be able to verify this change actually works. The minimal test would be that the AMD repository is correctly enabled and the plugin is installed. Actually testing something with AMD GPUs would be even better.
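
The minimal check could be as simple as something like this (the plugin path and package setup are assumptions based on this pull request):

# is the AMD repository enabled?
dnf repolist --enabled | grep -i rocm
# is the plugin installed?
test -f /usr/lib64/slurm/gpu_rsmi.so && echo "gpu_rsmi plugin present"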

Do you have a system which the OpenHPC project could access to run the corresponding tests?

@@ -16,6 +16,7 @@
%global _with_slurmrestd 1
%global _with_multiple_slurmd 1
%global _with_freeipmi 1
%global _with_rsmi /opt/rocm/lib
Member

Why is this a path? This is never used anywhere else.

Author

That was left over from my very first try when I was building manually; I had different versions of the ROCm stack installed and wanted to be sure it was using the right one. With a fresh RPM install the path is not required anymore, and this can just be 1.
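
In other words, the line should simply become:

%global _with_rsmi 1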

@@ -94,6 +94,7 @@ Patch0: slurm.conf.example.patch
%bcond_with lua
%bcond_with numa
%bcond_with pmix
%bcond_with rsmi
Member

The repository you mentioned seems to be for RHEL. So this only needs to be enabled on RHEL builds for now.

Author

Yes, this is true.

Author

There are no ARM packages either; will this matter for arm64 builds?

Member

Right, that also needs to be excluded. Better make it x86_64 only.
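
Something along these lines at the top of the spec file would probably do it (sketch only, untested):

# sketch: only enable rsmi support on x86_64
%ifarch x86_64
%global _with_rsmi 1
%endif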

@@ -452,7 +456,6 @@ module load hwloc
%{?_with_nvml} \
--with-hwloc=%{OHPC_LIBS}/hwloc \
%{?_with_cflags} || { cat config.log && exit 1; }

Member

Try to avoid unnecessary whitespace changes.

Author

Oops, my bad! I'll fix this this afternoon, along with the unnecessary path in _with_rsmi.

@antonycleave
Author

Yes, there is a new plugin and it ends up in /usr/lib64/slurm/gpu_rsmi.so.

Does this plugin have a runtime dependency on any package in the AMD repository?

It has no hard dependencies on packages in the AMD repos, either on the slurmctld host or on the compute nodes; this is just a build-time dependency.
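
This can be double checked on the built packages, for example (the package name is only illustrative):

# no output expected, i.e. no hard requirement on any ROCm package
rpm -qR slurm-ohpc | grep -i rocm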

We have tested the current rpms on:

  1. a VM with no GPU and no rocm-smi-lib or AMD repos configured
  2. a compute node with AMD GPUs and rocm-smi-lib installed from the AMD repos

In case 2 everything works and the GPUs are detected.

In case 1 with the defaults (i.e. no autodetection enabled in gres.conf), nothing changes at all.
In case 1 with AutoDetect=rsmi in gres.conf, slurmd continues to start, but the autodetection fails to detect any GPUs, with the message "no GPUs detected". If you try this on a system where rsmi support is not built in, you will instead see a different message saying that "slurm was not built with rsmi support".

I would imagine that on a system with AMD GPUs and no rocm-smi-lib installed it would still fail to detect any GPUs, but that is unlikely to occur if you have installed the drivers and any of the ROCm stack needed to use the GPUs for compute.

Do you have a system which the OpenHPC project could access to run the corresponding tests?

Yes. I need to officially check about access, but as long as you give us some notice beforehand I doubt there will be any issues. We currently have MI250X available.

@antonycleave
Author

I just ripped out rocm-smi-lib on an active compute node running Rocky Linux 8.10 and restarted slurmd in the foreground:

[rocky@nscale-compute-gpu-16 ~]$ sudo rpm -e rocm-smi-lib-7.3.0.60200-66.el8.x86_64 --nodeps
[rocky@nscale-compute-gpu-16 ~]$ sudo stop slurmd
[rocky@nscale-compute-gpu-16 ~]$ sudo slurmd --conf-server=nscale-control-0 -Dvvv
slurmd: debug:  Log file re-opened
slurmd-nscale-compute-gpu-16: debug2: hwloc_topology_init
slurmd-nscale-compute-gpu-16: debug2: hwloc_topology_load
slurmd-nscale-compute-gpu-16: debug2: hwloc_topology_export_xml
slurmd-nscale-compute-gpu-16: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
slurmd-nscale-compute-gpu-16: debug:  cgroup/v1: init: Cgroup v1 plugin loaded
slurmd-nscale-compute-gpu-16: debug2: hwloc_topology_init
slurmd-nscale-compute-gpu-16: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurm/hwloc_topo_whole.xml) found
slurmd-nscale-compute-gpu-16: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
slurmd-nscale-compute-gpu-16: debug2: gres.conf: AutoDetect GPU flags were locally set, so ignoring global flags
slurmd-nscale-compute-gpu-16: debug:  gres/gpu: init: loaded
slurmd-nscale-compute-gpu-16: Configured with rsmi, but that lib wasn't found.
slurmd-nscale-compute-gpu-16: debug:  gpu/generic: init: init: GPU Generic plugin loaded
slurmd-nscale-compute-gpu-16: warning: Ignoring file-less GPU gpu:MI250 from final GRES list

And putting it back:

[rocky@nscale-compute-gpu-16 ~]$ sudo dnf install rocm-smi-lib
Last metadata expiration check: 2:34:23 ago on Wed 13 Nov 2024 12:26:53 PM UTC.
Dependencies resolved.
=============================================================================================================================================================================================================================================
 Package                                                   Architecture                                        Version                                                           Repository                                             Size
=============================================================================================================================================================================================================================================
Installing:
 rocm-smi-lib                                              x86_64                                              7.3.0.60200-66.el8                                                ROCm-6.2                                              772 k

Transaction Summary
=============================================================================================================================================================================================================================================
Install  1 Package

Total download size: 772 k
Installed size: 2.6 M
Is this ok [y/N]: y
Downloading Packages:
rocm-smi-lib-7.3.0.60200-66.el8.x86_64.rpm                                                                                                                                                                   739 kB/s | 772 kB     00:01
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                                                                                                        737 kB/s | 772 kB     00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                                                                                                                     1/1
  Installing       : rocm-smi-lib-7.3.0.60200-66.el8.x86_64                                                                                                                                                                              1/1
  Running scriptlet: rocm-smi-lib-7.3.0.60200-66.el8.x86_64                                                                                                                                                                              1/1
  Verifying        : rocm-smi-lib-7.3.0.60200-66.el8.x86_64                                                                                                                                                                              1/1

Installed:
  rocm-smi-lib-7.3.0.60200-66.el8.x86_64

Complete!

[rocky@nscale-compute-gpu-16 ~]$ sudo slurmd --conf-server=nscale-control-0 -Dvv
slurmd: debug:  Log file re-opened
slurmd-nscale-compute-gpu-16: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
slurmd-nscale-compute-gpu-16: debug:  cgroup/v1: init: Cgroup v1 plugin loaded
slurmd-nscale-compute-gpu-16: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
slurmd-nscale-compute-gpu-16: debug:  gres/gpu: init: loaded
slurmd-nscale-compute-gpu-16: debug:  gpu/rsmi: init: init: GPU RSMI plugin loaded
slurmd-nscale-compute-gpu-16: debug:  gpu/rsmi: _get_system_gpu_list_rsmi: AMD Graphics Driver Version: 6.8.5
slurmd-nscale-compute-gpu-16: debug:  gpu/rsmi: _get_system_gpu_list_rsmi: RSMI Library Version: 0
slurmd-nscale-compute-gpu-16: gpu/rsmi: _get_system_gpu_list_rsmi: 8 GPU system device(s) detected
slurmd-nscale-compute-gpu-16: debug:  Gres GPU plugin: Merging configured GRES with system GPUs

@adrianreber
Member

In today's TSC meeting everyone was in favour of this change. We will continue to work with you here to get this merged.

With 3.2 released this week, we will target this change for the 3.3 release which might be in May 2025.

@adrianreber added this to the 3.3 milestone on Nov 13, 2024
@antonycleave
Author

That's great news! I hope to get time to finish cleaning this up next week.
