
gpu pod Pending #852

Open
imenselmi opened this issue Jul 30, 2024 · 2 comments

@imenselmi

I’m trying to prepare GPU worker nodes and enable GPU support on Kubernetes. I followed the steps in the README (link), but the pod always stays Pending and never runs. I tried the CUDA 10 sample as in the tutorial and also switched to CUDA 12, but neither works.
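For context, the setup the README walks through looks roughly like the sketch below (this is an assumption based on the public k8s-device-plugin Helm instructions, not a quote of the exact steps followed here; the release name and namespace are placeholders):

```bash
# Make the NVIDIA runtime the default runtime for Docker, which backs the cluster here.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

# Deploy the device plugin with Helm.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace
```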

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
  • CUDA version: 12.2
  • NVIDIA driver: NVIDIA-SMI 535.183.01, Driver Version 535.183.01, CUDA Version 12.2
  • Server type: NVIDIA L40S (link)
  • Container runtime type/version (e.g. Containerd, CRI-O, Docker): Docker version 27.1.1, build 6312585
  • Docker Compose version: v2.29.1
  • CRI-O version: 1.24.6
  • nvidia-container-toolkit version: 1.16.0-1
  • kubectl version:
    Client Version: v1.30.3
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.30.0
  • minikube version: v1.33.1
  • helm version: v3.15.3
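Given this environment, a quick way to rule out the driver and container-runtime layers before looking at Kubernetes (a hedged sketch; the CUDA image tag is an assumption, any tag matching the installed driver works):

```bash
# Driver sees the GPU on the host.
nvidia-smi

# Docker can pass the GPU into a container.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# The nvidia runtime is registered with Docker (and ideally set as the default).
docker info | grep -i runtime
```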

2. Issue or feature description

```
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
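This event means the scheduler sees no node advertising any `nvidia.com/gpu` capacity at all. One way to confirm that is to check the node's Capacity/Allocatable (a sketch; `<node-name>` is a placeholder, e.g. `minikube`):

```bash
# If nvidia.com/gpu does not appear here, the device plugin never registered
# the resource with the kubelet, and every GPU pod will stay Pending.
kubectl describe node <node-name> | grep -i -A7 capacity
kubectl describe node <node-name> | grep -i nvidia
```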

3. Additional information

```
kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
gpu-demo-vectoradd   0/1     Pending   0          12h
gpu-operator-test    0/1     Pending   0          13h
gpu-operator-test1   0/1     Pending   0          13h
gpu-pod              0/1     Pending   0          13h
```

```
kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:
Labels:
Annotations:
Status:           Pending
IP:
IPs:
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:
    Host Port:
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ww9jw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ww9jw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
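Since the node is not advertising the resource, the next thing to check is whether the device plugin pod itself is running on that node (a sketch; namespace and labels are assumptions — the static manifest deploys into kube-system with `name=nvidia-device-plugin-ds`, while a Helm install typically uses its own namespace and `app.kubernetes.io/name=nvidia-device-plugin`):

```bash
# Find the device plugin DaemonSet and its pod, wherever it was installed.
kubectl get pods -A -o wide | grep -i nvidia-device-plugin
kubectl get daemonsets -A | grep -i nvidia
```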

@FelixMertin

Did you deploy nvidia-device-plugin via helm? If so, which helm chart are you using? I am currently facing the same problem after upgrading from 0.14.0 to 0.16.1.
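If it was installed with Helm, something like the following shows which chart version is actually deployed, and pinning the chart back can help tell a regression apart from a configuration problem (a sketch; release name, namespace, and the rollback target are assumptions):

```bash
# Show the deployed release and its chart/app version.
helm list -A | grep -i nvidia-device-plugin

# Optionally pin back to the previously working chart version to compare behaviour.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  -n nvidia-device-plugin --version 0.14.0
```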

@elezar
Member

elezar commented Aug 14, 2024

@imenselmi / @FelixMertin could you please provide the logs for the k8s-device-plugin device-plugin container?
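Something along these lines should collect them, depending on how the plugin was deployed (namespaces and label selectors are assumptions; adjust to the actual install):

```bash
# Helm install (label used by the chart's pods).
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --all-containers --tail=-1

# Static manifest install in kube-system.
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --all-containers --tail=-1
```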
