Why there is no GPU resource allocatable on a GPU cloud instance #834

shizhouhu · 2024-07-19T10:30:39Z

When I describe the node, no GPU resource is listed. Why?

Capacity:
  cpu:                48
  ephemeral-storage:  574137520Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  529125137556
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263501320Ki
  pods:               110

(this is the node description)
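For reference, the NVIDIA device plugin advertises GPUs under the resource name `nvidia.com/gpu`, so a node that has registered its GPUs would show that entry under both Capacity and Allocatable. A minimal sketch for checking saved `kubectl describe node` output follows; the function name `gpu_counts` and the trimmed sample text are illustrative assumptions, not part of any tool.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def gpu_counts(describe_text: str) -> dict:
    """Return the nvidia.com/gpu count per section (0 when the entry is missing)."""
    counts = {"Capacity": 0, "Allocatable": 0}
    section = None
    for line in describe_text.splitlines():
        stripped = line.strip()
        if not line.startswith((" ", "\t")):
            # Unindented (or blank) line: either a section header or the end of one.
            header = stripped.rstrip(":")
            section = header if header in counts else None
            continue
        if section and stripped.startswith(GPU_RESOURCE + ":"):
            counts[section] = int(stripped.split(":", 1)[1].strip())
    return counts


# Trimmed copy of the output posted above (hypothetical shortening).
sample = """Capacity:
  cpu:                48
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  memory:             263501320Ki
  pods:               110
"""

print(gpu_counts(sample))  # {'Capacity': 0, 'Allocatable': 0}
```

A result of zero in both sections matches the symptom reported here: the kubelet has not been told about any GPU devices.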

1. I have installed the NVIDIA driver.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       Off | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(this is the nvidia-smi output showing the driver and the Tesla P4 GPUs)

2. I have installed the NVIDIA Container Toolkit and configured it for the containerd runtime.

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

(this is the containerd config for nvidia container runtime)

3. I have installed the NVIDIA Kubernetes device plugin (nvidia-device-plugin).

NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-x2pzs                     1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-2k9mg                  1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-nr6tz                  1/1     Running   2 (16h ago)   7d18h
kube-system    etcd-ubuntu-2288h-v5                      1/1     Running   3 (16h ago)   7d18h
kube-system    kube-apiserver-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    kube-controller-manager-ubuntu-2288h-v5   1/1     Running   3 (16h ago)   7d18h
kube-system    kube-proxy-p6gk9                          1/1     Running   2 (16h ago)   7d18h
kube-system    kube-scheduler-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    metrics-server-6875467c8d-k6sd6           1/1     Running   2 (16h ago)   2d15h
kube-system    nvidia-device-plugin-daemonset-57kxg      1/1     Running   0             10h

(this is the pod list, including the NVIDIA device plugin for k8s)

Does anyone know what the problem is? Thanks.


jaffe-fly · 2024-07-24T09:20:38Z

Having the same problem

jaffe-fly · 2024-08-01T13:02:32Z

You need to install GFD (GPU Feature Discovery) or label your node.

Bugaoxingxx · 2024-08-27T13:02:33Z

Add the --set-as-default parameter when generating the containerd config:

nvidia-ctk runtime configure --runtime=containerd --set-as-default
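As far as I understand it, `--set-as-default` makes nvidia-ctk write `default_runtime_name = "nvidia"` into the containerd config, so pods that do not request a RuntimeClass (the device plugin pod included) run under the NVIDIA runtime. containerd needs a restart after the config changes. The sketch below checks a config's text for that setting; the helper name `default_runtime` and the config excerpt are illustrative assumptions, and the regex is a rough check, not a full TOML parser.

```python
import re


def default_runtime(config_text: str):
    """Return the default_runtime_name value found in a containerd config, or None."""
    match = re.search(
        r'^\s*default_runtime_name\s*=\s*"([^"]*)"', config_text, re.MULTILINE
    )
    return match.group(1) if match else None


# Hypothetical excerpt; a real config file contains many more keys.
example = '''
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
'''

print(default_runtime(example))  # nvidia
```

If this returns something other than `nvidia` for the node's config (commonly `/etc/containerd/config.toml`, though the path may differ per install), the runtime entry exists but is not the default, which would be consistent with the config excerpt in the original post.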

shizhouhu · 2024-09-17T05:08:16Z

> You need to install GFD (GPU Feature Discovery) or label your node.

Thanks, will try.

shizhouhu · 2024-09-17T05:08:39Z

> Add the --set-as-default parameter when generating the containerd config:
>
> nvidia-ctk runtime configure --runtime=containerd --set-as-default

Thanks.
