
gpu pod Pending #852

Open
imenselmi opened this issue Jul 30, 2024 · 2 comments

@imenselmi

I’m trying to prepare GPU worker nodes and enable GPU support on Kubernetes. I followed the steps in the README (link), but the pod always stays Pending and never runs. I tried the CUDA 10 sample as in the tutorial and also switched to CUDA 12, but neither works.
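For context, the setup the README walks through looks roughly like the sketch below (this is an assumption based on the public k8s-device-plugin Helm instructions, not a quote of the exact steps followed here; the release name and namespace are placeholders):

```bash
# Make the NVIDIA runtime the default runtime for Docker, which backs the cluster here.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

# Deploy the device plugin with Helm.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace
```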

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
  • CUDA version: 12.2
  • NVIDIA driver: NVIDIA-SMI 535.183.01, Driver Version 535.183.01, CUDA Version 12.2
  • Server type: NVIDIA L40S (link)
  • Container runtime type/version (e.g. Containerd, CRI-O, Docker): Docker version 27.1.1, build 6312585
  • Docker Compose version: v2.29.1
  • CRI-O version: 1.24.6
  • nvidia-container-toolkit version: 1.16.0-1
  • kubectl version:
    Client Version: v1.30.3
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.30.0
  • minikube version: v1.33.1
  • helm version: v3.15.3
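Given this environment, a quick way to rule out the driver and container-runtime layers before looking at Kubernetes (a hedged sketch; the CUDA image tag is an assumption, any tag matching the installed driver works):

```bash
# Driver sees the GPU on the host.
nvidia-smi

# Docker can pass the GPU into a container.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# The nvidia runtime is registered with Docker (and ideally set as the default).
docker info | grep -i runtime
```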

2. Issue or feature description

```
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
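This event means the scheduler sees no node advertising any `nvidia.com/gpu` capacity at all. One way to confirm that is to check the node's Capacity/Allocatable (a sketch; `<node-name>` is a placeholder, e.g. `minikube`):

```bash
# If nvidia.com/gpu does not appear here, the device plugin never registered
# the resource with the kubelet, and every GPU pod will stay Pending.
kubectl describe node <node-name> | grep -i -A7 capacity
kubectl describe node <node-name> | grep -i nvidia
```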

3. Additional information

```
kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
gpu-demo-vectoradd   0/1     Pending   0          12h
gpu-operator-test    0/1     Pending   0          13h
gpu-operator-test1   0/1     Pending   0          13h
gpu-pod              0/1     Pending   0          13h
```

```
kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:
Labels:
Annotations:
Status:           Pending
IP:
IPs:
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:
    Host Port:
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ww9jw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ww9jw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
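Since the node is not advertising the resource, the next thing to check is whether the device plugin pod itself is running on that node (a sketch; namespace and labels are assumptions — the static manifest deploys into kube-system with `name=nvidia-device-plugin-ds`, while a Helm install typically uses its own namespace and `app.kubernetes.io/name=nvidia-device-plugin`):

```bash
# Find the device plugin DaemonSet and its pod, wherever it was installed.
kubectl get pods -A -o wide | grep -i nvidia-device-plugin
kubectl get daemonsets -A | grep -i nvidia
```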

@FelixMertin

Did you deploy nvidia-device-plugin via helm? If so, which helm chart are you using? I am currently facing the same problem after upgrading from 0.14.0 to 0.16.1.
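If it was installed with Helm, something like the following shows which chart version is actually deployed, and pinning the chart back can help tell a regression apart from a configuration problem (a sketch; release name, namespace, and the rollback target are assumptions):

```bash
# Show the deployed release and its chart/app version.
helm list -A | grep -i nvidia-device-plugin

# Optionally pin back to the previously working chart version to compare behaviour.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  -n nvidia-device-plugin --version 0.14.0
```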

@elezar
Member

elezar commented Aug 14, 2024

@imenselmi / @FelixMertin could you please provide the logs for the k8s-device-plugin device-plugin container?
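Something along these lines should collect them, depending on how the plugin was deployed (namespaces and label selectors are assumptions; adjust to the actual install):

```bash
# Helm install (label used by the chart's pods).
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --all-containers --tail=-1

# Static manifest install in kube-system.
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --all-containers --tail=-1
```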
