Unable To Use The GPU Node Pool On Azure AKS #906

Open
sello4354 opened this issue Aug 14, 2024 · 3 comments
@sello4354

Here is the setup of my AKS cluster:

AKS version: 1.29.2
Node pools: 3 (system pool, general node pool, and GPU pool)
NVIDIA driver plugins tried: NVIDIA device plugin and GPU Operator
OS image: Ubuntu 22.04.4 LTS
Kernel version: 5.15.0-1068-azure
Container runtime: containerd://1.7.15-1
NVIDIA device plugin images tried: nvcr.io/nvidia/k8s-device-plugin:v0.15.0 and nvcr.io/nvidia/k8s-device-plugin:v0.16.0

Documentation used to create the GPU node pool: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#install-nvidia-device-plugin
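
For reference, a minimal sketch of installing the upstream device plugin DaemonSet (the tag below mirrors the image versions listed above; the upstream static manifest deploys into kube-system):

# Sketch only: apply the upstream static DaemonSet for the device plugin.
# Pin the tag to the plugin version being tested (v0.16.0 here).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.0/deployments/static/nvidia-device-plugin.yml

# Confirm the plugin pod is actually scheduled and running on the GPU node.
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin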

Here is the issue:

As per the above document, if the NVIDIA device plugin is installed successfully, the GPU should appear under the node's Capacity section as nvidia.com/gpu: 1. However, I did not see that when I described my GPU-enabled node.

I also tried the GPU Operator, but that did not help either.
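
For anyone reproducing this, the advertised capacity can be checked directly (the node name is a placeholder):

# Check whether the device plugin has advertised the GPU to the kubelet.
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"

# Or list the allocatable GPU count per node (the dots in the resource name must be escaped).
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"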

@edoyon90

edoyon90 commented Aug 21, 2024

I am having a similar issue. The plugin installs and my node has the capacity listed. However, any pod running on the GPU node cannot detect the device. Doing kubectl exec into the plugin pod and running nvidia-smi returns Failed to initialize NVML: Unknown Error, and running a pod with Python and attempting to use torch hits a similar issue:

>>> import torch
>>> torch.cuda.is_available()
False
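
For what it's worth, a minimal smoke-test pod along these lines can help isolate whether the runtime exposes the device at all (the image tag and the sku=gpu toleration are assumptions based on the AKS docs linked above):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  # Assumes the sku=gpu:NoSchedule taint suggested in the AKS GPU docs.
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: cuda
    # Image tag is an assumption; use a CUDA base image compatible with the installed driver.
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

If nvidia-smi succeeds in that pod but torch.cuda.is_available() is still False, the problem is more likely the application image or CUDA/driver pairing than the device plugin itself.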


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label on Nov 20, 2024
@chipzoller
Contributor

I have no problems deploying the GPU Operator on an AKS cluster and invoking the GPU. The values I tested are shown below.

# These values validated on v24.6.1 of the NVIDIA GPU Operator.
driver:
  enabled: true
toolkit:
  enabled: true
cdi:
  enabled: false
nfd:
  enabled: true
gfd:
  enabled: true
migManager:
  enabled: false
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
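
For completeness, a typical install of the operator with a values file like the one above looks roughly like this (the release name and namespace are arbitrary choices; the chart version matches the comment in the values):

# Sketch of a GPU Operator install using the values above, saved as values.yaml.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v24.6.1 \
  -f values.yaml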

github-actions bot removed the lifecycle/stale label on Dec 19, 2024