
Unable To Use The GPU Node Pool On Azure AKS #906

Open
sello4354 opened this issue Aug 14, 2024 · 1 comment

Comments

@sello4354

Here is the setup of my AKS cluster:

AKS version: 1.29.2
Node pools: 3 (system pool, general node pool, and GPU node pool)
NVIDIA plugins tried: NVIDIA device plugin and GPU Operator
OS image: Ubuntu 22.04.4 LTS
Kernel version: 5.15.0-1068-azure
Container runtime: containerd://1.7.15-1
NVIDIA device plugin images tried: nvcr.io/nvidia/k8s-device-plugin:v0.15.0 and nvcr.io/nvidia/k8s-device-plugin:v0.16.0

Documentation used to create the GPU node pool: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#install-nvidia-device-plugin

Here is the issue:

According to the document above, if the NVIDIA device plugin is installed successfully, the Capacity section of the node should list the GPU as nvidia.com/gpu: 1. However, I did not see that entry when I described my GPU-enabled node.

I also tried the GPU Operator, but that did not help either.
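For anyone reproducing this, the Capacity check described above can be scripted rather than read off `kubectl describe node` by hand. The following is a minimal sketch, not an official tool: it assumes `kubectl` is on the PATH and pointed at the cluster, and the helper names (`gpu_capacity_by_node`, `fetch_nodes`) and the sample node names are hypothetical. It reads `status.capacity` from the node list and reports how many `nvidia.com/gpu` resources each node advertises.

```python
import json
import subprocess

# Extended resource name that the NVIDIA device plugin registers on a node.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_capacity_by_node(node_list: dict) -> dict:
    """Map node name -> advertised GPU count (0 if the resource is absent)."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings in the Kubernetes API, e.g. "1".
        result[name] = int(capacity.get(GPU_RESOURCE, "0"))
    return result


def fetch_nodes() -> dict:
    """Return the parsed output of `kubectl get nodes -o json`.

    Assumption: kubectl is installed and configured for the target cluster.
    """
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


# Hypothetical sample shaped like the kubectl output (node names are made up).
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "aks-gpu-0"},
            "status": {"capacity": {"cpu": "6", "nvidia.com/gpu": "1"}},
        },
        {
            "metadata": {"name": "aks-general-0"},
            "status": {"capacity": {"cpu": "4"}},
        },
    ]
}

print(gpu_capacity_by_node(SAMPLE))
# -> {'aks-gpu-0': 1, 'aks-general-0': 0}
```

Against a live cluster, `gpu_capacity_by_node(fetch_nodes())` would give the same kind of mapping. A count of 0 on the GPU node corresponds to the symptom reported here: the plugin has not registered the resource with the kubelet.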

@edoyon90

edoyon90 commented Aug 21, 2024

I am having a similar issue. The plugin installs and my node lists the GPU capacity. However, no pod running on the GPU node can detect the device. Using kubectl exec to enter the plugin pod and running nvidia-smi returns "Failed to initialize NVML: Unknown Error". Running a pod with Python and attempting to use torch gives a similar result:

import torch
torch.cuda.is_available()
False
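To narrow down where detection fails inside a pod, a slightly broader check than `torch.cuda.is_available()` may help. This is only a sketch under assumptions: the function name `gpu_visibility_report` is hypothetical, and it assumes the NVIDIA driver exposes device files under `/dev/nvidia*` inside the container, which is the usual Linux layout but is not confirmed in this thread. It reports whether device files are visible, whether `nvidia-smi` is on the PATH, and what torch sees, if torch is installed.

```python
import glob
import shutil


def gpu_visibility_report() -> dict:
    """Collect basic signals about GPU visibility from inside a container."""
    report = {
        # Assumption: NVIDIA device files appear as /dev/nvidia0, /dev/nvidiactl, ...
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Path to nvidia-smi if it is on PATH, otherwise None.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "torch_installed": False,
        "cuda_available": False,
        "cuda_device_count": 0,
    }
    try:
        import torch  # optional dependency; skipped when not installed
    except ImportError:
        return report

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_device_count"] = torch.cuda.device_count()
    return report


for key, value in gpu_visibility_report().items():
    print(f"{key}: {value}")
```

If `device_nodes` is empty, the container was likely started without the devices injected at all. If the device files exist but `cuda_available` is still False, the failure is further down, which would be consistent with the NVML error above.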
