Unable To Use The GPU Node Pool On Azure AKS #906

sello4354 · 2024-08-14T20:06:57Z

Here is the setup of my AKS cluster:

AKS version: 1.29.2
Node pools: 3 (system pool, general node pool, and GPU pool)
NVIDIA driver plugins tried: NVIDIA device plugin and GPU operator
OS image: Ubuntu 22.04.4 LTS
Kernel version: 5.15.0-1068-azure
Container runtime: containerd://1.7.15-1
NVIDIA plugin versions tried: nvcr.io/nvidia/k8s-device-plugin:v0.15.0 and nvcr.io/nvidia/k8s-device-plugin:v0.16.0

Documentation used to create the GPU node pool: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#install-nvidia-device-plugin

Here is the issue:

According to the document above, if the NVIDIA device plugin is installed successfully, the GPU should be listed under the node's Capacity section as nvidia.com/gpu: 1. However, I did not see that entry when I described my GPU-enabled node.
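For anyone reproducing this check, here is a minimal sketch of the same test done programmatically. It assumes `kubectl` is on the PATH and already pointed at the cluster. The helper names (`gpu_count`, `fetch_node`) and the sample node objects are hypothetical and only illustrate the shape of the data; the one thing taken from the AKS document is the `nvidia.com/gpu` resource name.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def gpu_count(node: dict, field: str = "capacity") -> int:
    """Return the GPU count a node object reports under status.<field>.

    `field` is "capacity" or "allocatable". A missing entry counts as 0.
    """
    quantities = node.get("status", {}).get(field, {})
    return int(quantities.get(GPU_RESOURCE, "0"))


def fetch_node(name: str) -> dict:
    """Fetch a node object as a dict by running `kubectl get node <name> -o json`."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hypothetical node objects, trimmed to the fields the check reads.
sample_with_gpu = {
    "status": {
        "capacity": {"cpu": "6", "nvidia.com/gpu": "1"},
        "allocatable": {"cpu": "6", "nvidia.com/gpu": "1"},
    }
}
sample_without_gpu = {"status": {"capacity": {"cpu": "6"}, "allocatable": {"cpu": "6"}}}

print("with plugin:", gpu_count(sample_with_gpu))
print("without plugin:", gpu_count(sample_without_gpu))
```

Against a live cluster, `gpu_count(fetch_node("<node-name>"))` returning 0 would match the symptom described here: the device plugin has not registered the GPU with the kubelet.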

I also tried the GPU operator, but that did not help either.

edoyon90 · 2024-08-21T18:49:44Z

I am having a similar issue. The plugin installs and my node has the capacity listed. However, no pod running on the GPU node can detect the device. Using kubectl exec to get into the plugin pod and running nvidia-smi returns "Failed to initialize NVML: Unknown Error". Running a pod with Python and attempting to use torch gives a similar result:

>>> import torch
>>> torch.cuda.is_available()
False
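A slightly fuller diagnostic can help narrow down why this returns False. The sketch below is an illustration, not part of the original report: `cuda_report` is a hypothetical helper, and it only uses standard PyTorch calls (`torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.device_count`, `torch.cuda.get_device_name`). If `built_with_cuda` is None, the installed wheel is CPU-only; if it is set but `cuda_available` is False, the container most likely cannot see the device or driver.

```python
def cuda_report() -> dict:
    """Collect basic facts about PyTorch's view of the GPU inside the pod."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; nothing more to inspect.
        return {"torch_installed": False, "cuda_available": False}

    available = torch.cuda.is_available()
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means the wheel was built without CUDA support.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": available,
        "device_count": torch.cuda.device_count() if available else 0,
    }
    if available:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(report["device_count"])
        ]
    return report


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running this in the same pod where nvidia-smi fails should show whether the problem sits in the Python environment or below it, in the container runtime and driver.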

github-actions · 2024-11-20T04:28:50Z

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

chipzoller · 2024-12-18T15:37:12Z

I have no problems deploying the GPU operator on an AKS cluster and using the GPU. The values I tested are shown below.

# These values validated on v24.6.1 of the NVIDIA GPU Operator.
driver:
  enabled: true
toolkit:
  enabled: true
cdi:
  enabled: false
nfd:
  enabled: true
gfd:
  enabled: true
migManager:
  enabled: false
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true

github-actions bot added the lifecycle/stale label Nov 20, 2024

github-actions bot removed the lifecycle/stale label Dec 19, 2024
