Unable To Use The GPU Node Pool On Azure AKS #906
Comments
I am having a similar issue. The plugin installs and my node has the capacity listed. However, any pod running on the GPU node cannot detect the device. Doing kubectl exec into the plugin pod and running nvidia-smi returns
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
I have no problems deploying the GPU Operator on an AKS cluster and using the GPU. The values I tested are shown below.
# These values were validated on v24.6.1 of the NVIDIA GPU Operator.
driver:
  enabled: true
toolkit:
  enabled: true
cdi:
  enabled: false
nfd:
  enabled: true
gfd:
  enabled: true
migManager:
  enabled: false
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
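One way to check that a pod can actually reach the GPU once the operator or device plugin is running is a throwaway pod that requests `nvidia.com/gpu` and runs `nvidia-smi`. The manifest below is only a sketch and is not taken from this thread: the pod name `gpu-smoke-test`, the CUDA image tag, and the `sku=gpu:NoSchedule` toleration are assumptions (the toleration only matters if the GPU node pool was created with that taint).

```yaml
# Hypothetical smoke-test pod; the name, image tag, and toleration are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Assumed image tag; any CUDA base image compatible with the node driver should do.
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          # Extended resource advertised by the NVIDIA device plugin.
          nvidia.com/gpu: 1
  tolerations:
    # Only needed if the GPU node pool carries a sku=gpu:NoSchedule taint.
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule
```

If the node does not advertise `nvidia.com/gpu` under Capacity, a pod like this would be expected to stay Pending. If it is scheduled but `nvidia-smi` fails, the problem is more likely in the driver or container toolkit layer than in the device plugin registration.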
Here is the setup of my AKS cluster:
AKS version: 1.29.2
Node pools: 3 (system pool, general node pool, and GPU node pool)
NVIDIA plugins tried: NVIDIA device plugin and GPU Operator
OS IMAGE: Ubuntu 22.04.4 LTS
KERNEL VERSION: 5.15.0-1068-azure
CONTAINER-RUNTIME: containerd://1.7.15-1
NVIDIA PLUGIN VERSIONS: nvcr.io/nvidia/k8s-device-plugin:v0.15.0 and nvcr.io/nvidia/k8s-device-plugin:v0.16.0
Documentation used to create the GPU node pool: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#install-nvidia-device-plugin
Here is the issue:
According to the document above, if the NVIDIA device plugin is installed successfully, the GPU should be listed under the Capacity section as nvidia.com/gpu: 1. However, I did not see that when I described my GPU-enabled node.
I also tried the gpu-operator, but that did not help either.