
Resources are not split when using “time slicing” with the NVIDIA device plugin for Kubernetes #990

Open
y-shida-tg opened this issue Oct 11, 2024 · 8 comments


@y-shida-tg

Following "GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes",
we have deployed the NVIDIA device plugin for Kubernetes and are trying out time slicing,
but we are running into a problem. Specifically, the GPU capacity on the node is reported as shown below:
only 1 GPU is shown instead of the 4 we expect from replicas: 4 in the YAML.
What could be the reason why Capacity is not increasing?

# kubectl describe node test-server
Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1

times.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
data:
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
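
For reference, this is the node view we would expect once the config above is actually loaded (a hypothetical excerpt for one physical GPU with replicas: 4, not output from the affected node):

## Hypothetical result once the time-slicing config takes effect:
# kubectl describe node test-server
Capacity:
  nvidia.com/gpu: 4
Allocatable:
  nvidia.com/gpu: 4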

Hardware Information:
Server: PowerEdge R750 (SKU=090E, ModelName=PowerEdge R750)
CPU: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz

GPGPU Information:
GPGPU: A100 80GB
CUDA Version: 12.2
Driver Version: 535.54.03
nvidia-container-runtime: runc version 1.0.2, spec: 1.0.2-dev, go: go1.16.7, libseccomp: 2.5.1

Linux Information:
OS: CentOS Linux release 8.5.2111
k8s environment:
kubectl version:
Client Version: version.Info{Major: "1", Minor: "23", GitVersion: "v1.23.6", GitCommit: "ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState: "clean", BuildDate: "2022-04-14T08:49:13Z", GoVersion: "go1.17.9", Compiler: "gc", Platform: "linux/amd64"}
Server Version: version.Info{Major: "1", Minor: "23", GitVersion: "v1.23.17", GitCommit: "953be8927218ec8067e1af2641e540238ffd7576", GitTreeState: "clean", BuildDate: "2023-02-22T13:27:46Z", GoVersion: "go1.19.6", Compiler: "gc", Platform: "linux/amd64"}
crio version: 1.23.5

NVIDIA device plugin for Kubernetes version used: v0.16.1

@klueska
Contributor

klueska commented Oct 11, 2024

The only reason this would happen is if your plugin on the node isn't actually pointing to this config. Did you launch the plugin pointing to this config map and then update the label on the node to point to the particular time-slicing config within that config map?

https://github.com/NVIDIA/k8s-device-plugin/tree/main?tab=readme-ov-file#multiple-config-file-example
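
A minimal sketch of those two steps, taken from the multiple-config example linked above (release, namespace, and file names are illustrative):

# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml

## Then point the node at the time-slicing config inside that config map:
# kubectl label nodes <node-name> --overwrite \
    nvidia.com/device-plugin.config=config0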

@y-shida-tg
Author

y-shida-tg commented Oct 22, 2024

Thank you for your reply.

I created dp-example-config0.yaml and dp-example-config1.yaml, applied the config with helm upgrade -i nvdp nvdp/nvidia-device-plugin, and then started the device plugin with kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml. However, the node information still shows only 1 GPU as follows:

Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1

The following config is added to the node labels:
Labels: nvidia.com/device-plugin.config=config0

Are there any other items to check in the device plugin, or any settings that need to be configured on the node side?
So far, the only node-side change we have made is for compatibility with the cri-o device plugin.
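
One way to narrow this down is to check which config the running plugin pod actually loaded (a hedged sketch; the namespace comes from the Helm install above, and because the config-aware deployment runs more than one container per pod, --all-containers is used):

# kubectl get pods -n nvidia-device-plugin -o wide
# kubectl logs -n nvidia-device-plugin <plugin-pod-name> --all-containers | grep -i config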

Detailed file contents and commands are shown below.

## Configuration file contents

# cat dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4


# cat dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

## Apply configuration file
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.1 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml \
    --set-file config.map.config1=dp-example-config1.yaml

# kubectl label nodes onp1-4-r750 --overwrite \
    nvidia.com/device-plugin.config=config0


## Check the contents of the ConfigMap
# kubectl describe configmaps -n nvidia-device-plugin
Name:         kube-root-ca.crt
Namespace:    nvidia-device-plugin
Labels:       <none>
Annotations:  kubernetes.io/description:
                Contains a CA bundle that can be used to verify the kube-apiserver when using internal endpoints such as the internal service IP or kubern...

Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTI0MDYxODA4NDQwNFoXDTM0MDYxNjA4NDQwNFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAN6a
txp2J29lwgQ7eEiQ+h2DOXYecFcnodeyXt0jTXy2YacPh7kvt3alZ7bm+NIuDhkt
2dAnx7qJQRSnnM5xEP6bliHjkqVRMDyQf5BqgfLyKf2+usuYyas3dAevtKqI0qFP
5MnoHhUI2z+T5xleCguWxdsl39kQErD8WjWmQ2tR2a1JQOvUE/8QBo4tP0peyBFE
BwurzgDwFuaVRjrzREBL1BCzdQbG3XtGCiEyMvcgm2yO1kNcjYibqK5kc5R/zQ31
p/yJRPs4tcQEcRlh62S9HgghhYpQQb1whVaK7mZP3BJ3a+ku7Dp1E8+rnNkVtRgO
icItv/Esv57OBX9MNwkCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFLol1Lsh1L1n76Nz1uay7TkdCYgnMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJz3AlS8e8CoyFxoBp3j
b/sbgeL6DXNfOPafOPUvMJrOfTw4ZhXuHmB2kY/dws9hPxSuiVO1Z3woymeYGHrl
aIFy1f5d4XtTrsjKWkV9aqcw+UZ4Z4H2R73F8A5VrVAq9zUSre3J45H7QVdAYIdP
PUI+uvtg0o+IBKIYZo43uBjMsZm1h2zQe03+Bf8DOQd8WByb/VEWM4/blYLwiMs7
4pvImNdTJChSrL3tbelM/X2M78RYXYXNZqkGw0iIRS07Tv9B688Xx8dUhs5WxjZU
9Ge7VFxK+W8lMjo0V3EFHhbYnS0LwMhuMpAryBpd3tcnktOVBh2lPZO2g6WseOVB
RNI=
-----END CERTIFICATE-----


BinaryData
====

Events:  <none>


Name:         nvdp-nvidia-device-plugin-configs
Namespace:    nvidia-device-plugin
Labels:       app.kubernetes.io/instance=nvdp
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nvidia-device-plugin
              app.kubernetes.io/version=0.16.1
              helm.sh/chart=nvidia-device-plugin-0.16.1
Annotations:  meta.helm.sh/release-name: nvdp
              meta.helm.sh/release-namespace: nvidia-device-plugin

Data
====
config0:
----
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
config1:
----
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

BinaryData
====

Events:  <none>

## Node state (without nvidia-device-plugin-daemonset)
# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:11:08 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>



## Node status (after starting nvidia-device-plugin-daemonset)
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
 
# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:12:09 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl                       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h                    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
  kube-system                 nvidia-device-plugin-daemonset-drdv2    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>

@klueska
Contributor

klueska commented Oct 22, 2024

I'm confused by this step that you reference:

and then started the device plugin with kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

The helm install/upgrade command already starts the device plugin configured to be aware of the configs you point it at. The static deployment from the URL you reference is not aware of these configs and would require a substantial amount of additional code to make it aware of them (which is why helm is the preferred installation method for the plugin).
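
If that static DaemonSet is still applied, a minimal cleanup sketch (assuming the v0.16.1 manifest applied above) would be to delete it so that only the Helm-managed, config-aware DaemonSet serves the node:

# kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
## Confirm the Helm-managed plugin pods are the ones running:
# kubectl get pods -n nvidia-device-plugin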

@y-shida-tg
Author

The helm install/upgrade command already starts the device plugin configured to be aware of the configs you point it at. The static deployment from the URL you reference is not aware of these configs and would require a substantial amount of additional code to make it aware of them (which is why helm is the preferred installation method for the plugin).

Following the documentation, I tried to proceed with Helm operations only, but the node information is as follows and the nvidia-device-plugin was not running.

Capacity:
  nvidia.com/gpu: 0
Allocatable:
  nvidia.com/gpu: 0

I believe the issue is that the nvidia-device-plugin does not start through the Helm operations alone.
Are there any items to check?

Below are the command and configuration details.

## Contents of the config file
# cat dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4

## Static DaemonSet manifest from the repository (applied in addition to dp-example-config0.yaml)
-----------------------------------------------------------------------
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins




# cat dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid


## Apply config file
# helm search repo nvdp --devel
NAME                            CHART VERSION   APP VERSION     DESCRIPTION
nvdp/gpu-feature-discovery      0.16.2          0.16.2          A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin       0.16.2          0.16.2          A Helm chart for the nvidia-device-plugin on Ku...
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml \
    --set-file config.map.config1=dp-example-config1.yaml

# kubectl label nodes onp1-4-r750 --overwrite \
    nvidia.com/device-plugin.config=config0


## Checking the contents of the config map
# kubectl describe configmaps -n nvidia-device-plugin
Name:         kube-root-ca.crt
Namespace:    nvidia-device-plugin
Labels:       <none>
Annotations:  kubernetes.io/description:
                Contains a CA bundle that can be used to verify the kube-apiserver when using internal endpoints such as the internal service IP or kubern...

Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTI0MDYxODA4NDQwNFoXDTM0MDYxNjA4NDQwNFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAN6a
txp2J29lwgQ7eEiQ+h2DOXYecFcnodeyXt0jTXy2YacPh7kvt3alZ7bm+NIuDhkt
2dAnx7qJQRSnnM5xEP6bliHjkqVRMDyQf5BqgfLyKf2+usuYyas3dAevtKqI0qFP
5MnoHhUI2z+T5xleCguWxdsl39kQErD8WjWmQ2tR2a1JQOvUE/8QBo4tP0peyBFE
BwurzgDwFuaVRjrzREBL1BCzdQbG3XtGCiEyMvcgm2yO1kNcjYibqK5kc5R/zQ31
p/yJRPs4tcQEcRlh62S9HgghhYpQQb1whVaK7mZP3BJ3a+ku7Dp1E8+rnNkVtRgO
icItv/Esv57OBX9MNwkCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFLol1Lsh1L1n76Nz1uay7TkdCYgnMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJz3AlS8e8CoyFxoBp3j
b/sbgeL6DXNfOPafOPUvMJrOfTw4ZhXuHmB2kY/dws9hPxSuiVO1Z3woymeYGHrl
aIFy1f5d4XtTrsjKWkV9aqcw+UZ4Z4H2R73F8A5VrVAq9zUSre3J45H7QVdAYIdP
PUI+uvtg0o+IBKIYZo43uBjMsZm1h2zQe03+Bf8DOQd8WByb/VEWM4/blYLwiMs7
4pvImNdTJChSrL3tbelM/X2M78RYXYXNZqkGw0iIRS07Tv9B688Xx8dUhs5WxjZU
9Ge7VFxK+W8lMjo0V3EFHhbYnS0LwMhuMpAryBpd3tcnktOVBh2lPZO2g6WseOVB
RNI=
-----END CERTIFICATE-----


BinaryData
====

Events:  <none>


Name:         nvdp-nvidia-device-plugin-configs
Namespace:    nvidia-device-plugin
Labels:       app.kubernetes.io/instance=nvdp
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nvidia-device-plugin
              app.kubernetes.io/version=0.16.1
              helm.sh/chart=nvidia-device-plugin-0.16.1
Annotations:  meta.helm.sh/release-name: nvdp
              meta.helm.sh/release-namespace: nvidia-device-plugin

Data
====
config0:
----
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
config1:
----
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

BinaryData
====

Events:  <none>

## Node status (without nvidia-device-plugin-daemonset)
# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:11:08 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>



## Node status (after nvidia-device-plugin-daemonset launch)
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
 
# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:12:09 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl                       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h                    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
  kube-system                 nvidia-device-plugin-daemonset-drdv2    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>


@y-shida-tg
Author

We have kept trying since then but have not been able to resolve this issue.
If you have any suggestions for things to try, we would appreciate a reply.

@y-shida-tg
Author

y-shida-tg commented Dec 13, 2024

We summarize our current situation as follows.

To briefly describe the problem: the Helm command does not start the "NVIDIA device plugin for Kubernetes".

We followed the steps you suggested, and the results are summarized below.
https://github.com/NVIDIA/k8s-device-plugin/tree/main?tab=readme-ov-file#multiple-config-file-example

(1) Edit the Helm config files
    See the attached config files (dp-example-config0.yaml, dp-example-config1.yaml).
(2) Apply the Helm config files
    # helm search repo nvdp --devel
(3) Start the NVIDIA device plugin for Kubernetes
    # kubectl describe nodes onp1-4-r750

Please refer to "procedure_and_result" below for details.

# kubectl describe nodes onp1-4-r750 (execution result)
Non-terminated Pods: (4 in total)
  Namespace     Name
  ---------     ----
  kube-system   calico-node-m8xcl
  kube-system   kube-multus-ds-cps4h
  kube-system   kube-proxy-zhwt4

Issue: nvidia-device-plugin-daemonset-drdv2 is not running
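
A hedged sketch for narrowing down why no plugin pod is scheduled on this node (commands only; the node name is taken from the output above):

# kubectl get pods -A -o wide | grep -i nvidia
# kubectl get daemonset -A | grep -i nvidia
## Taints or affinity mismatches that could block scheduling:
# kubectl describe node onp1-4-r750 | grep -i -A3 taint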

@y-shida-tg
Author

dp-example-config0.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

dp-example-config1.yaml

version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

procedure_and_result

# helm search repo nvdp --devel
NAME                            CHART VERSION   APP VERSION     DESCRIPTION
nvdp/gpu-feature-discovery      0.16.2          0.16.2          A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin       0.16.2          0.16.2          A Helm chart for the nvidia-device-plugin on Ku...
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml \
    --set-file config.map.config1=dp-example-config1.yaml

# kubectl label nodes onp1-4-r750 --overwrite \
    nvidia.com/device-plugin.config=config0

# kubectl describe configmaps -n nvidia-device-plugin
Name:         kube-root-ca.crt
Namespace:    nvidia-device-plugin
Labels:       <none>
Annotations:  kubernetes.io/description:
                Contains a CA bundle that can be used to verify the kube-apiserver when using internal endpoints such as the internal service IP or kubern...

Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTI0MDYxODA4NDQwNFoXDTM0MDYxNjA4NDQwNFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAN6a
txp2J29lwgQ7eEiQ+h2DOXYecFcnodeyXt0jTXy2YacPh7kvt3alZ7bm+NIuDhkt
2dAnx7qJQRSnnM5xEP6bliHjkqVRMDyQf5BqgfLyKf2+usuYyas3dAevtKqI0qFP
5MnoHhUI2z+T5xleCguWxdsl39kQErD8WjWmQ2tR2a1JQOvUE/8QBo4tP0peyBFE
BwurzgDwFuaVRjrzREBL1BCzdQbG3XtGCiEyMvcgm2yO1kNcjYibqK5kc5R/zQ31
p/yJRPs4tcQEcRlh62S9HgghhYpQQb1whVaK7mZP3BJ3a+ku7Dp1E8+rnNkVtRgO
icItv/Esv57OBX9MNwkCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFLol1Lsh1L1n76Nz1uay7TkdCYgnMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJz3AlS8e8CoyFxoBp3j
b/sbgeL6DXNfOPafOPUvMJrOfTw4ZhXuHmB2kY/dws9hPxSuiVO1Z3woymeYGHrl
aIFy1f5d4XtTrsjKWkV9aqcw+UZ4Z4H2R73F8A5VrVAq9zUSre3J45H7QVdAYIdP
PUI+uvtg0o+IBKIYZo43uBjMsZm1h2zQe03+Bf8DOQd8WByb/VEWM4/blYLwiMs7
4pvImNdTJChSrL3tbelM/X2M78RYXYXNZqkGw0iIRS07Tv9B688Xx8dUhs5WxjZU
9Ge7VFxK+W8lMjo0V3EFHhbYnS0LwMhuMpAryBpd3tcnktOVBh2lPZO2g6WseOVB
RNI=
-----END CERTIFICATE-----


BinaryData
====

Events:  <none>


Name:         nvdp-nvidia-device-plugin-configs
Namespace:    nvidia-device-plugin
Labels:       app.kubernetes.io/instance=nvdp
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nvidia-device-plugin
              app.kubernetes.io/version=0.16.1
              helm.sh/chart=nvidia-device-plugin-0.16.1
Annotations:  meta.helm.sh/release-name: nvdp
              meta.helm.sh/release-namespace: nvidia-device-plugin

Data
====
config0:
----
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
config1:
----
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

BinaryData
====

Events:  <none>

# kubectl describe nodes onp1-4-r750

Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:11:08 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>

@chipzoller
Contributor

You're setting the wrong Helm value to instruct the device plugin where to find the sharing configuration. You used config.default when it should be config.name. See the values reference here. When you have a preexisting ConfigMap, you supply config.name. When you define a plugin configuration stanza in-line with the Helm values, you supply config.map, which results in the device plugin creating a new ConfigMap from its value. Once the device plugin knows the name of the ConfigMap, your defined config should work.
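
A minimal sketch of that flow (the ConfigMap name is illustrative; file names are the ones used earlier in this thread): create the ConfigMap yourself, then hand its name to the chart via config.name.

# kubectl create configmap nvidia-plugin-configs \
    -n nvidia-device-plugin \
    --from-file=config0=dp-example-config0.yaml \
    --from-file=config1=dp-example-config1.yaml

# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --set config.name=nvidia-plugin-configs \
    --set config.default=config0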
