Resources are not split when using “time slicing” with the NVIDIA device plugin for Kubernetes #990
Comments
The only reason this would happen is if your plugin on the node isn't actually pointing to this config. Did you launch the plugin pointing to this config map and then update the label on the node to point to the particular |
Thank you for your reply. I created Capacity: The following config is added to the node labels: Are there any other items to check in the device plugin or settings to be configured on the node side? Detailed file contents and commands are shown below.
I'm confused by this step that you reference:
The helm install/upgrade command already starts the device plugin configured to be aware of the configs you point it at. The |
According to the document, I tried to proceed only with helm operations, but the node information is as follows, and the nvidia-device-plugin was not running. Capacity: I believe the issue is that the nvidia-device-plugin does not start with helm operations. Below are the command and configuration details.
We have been trying since then, but have not been able to resolve this issue. |
Here is a summary of our current situation. In brief, the Helm command does not start the “NVIDIA device plugin for Kubernetes”. We followed your suggested steps; the results are summarized below. (1) Edit Helm config file. Please refer to “procedure_and_result” for details.
dp-example-config0.yaml
dp-example-config1.yaml
procedure_and_result
You're setting the wrong Helm value to instruct the device plugin where to find the sharing configuration. You used |
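The comment above is cut off, so the specific value it names cannot be recovered here. As general background only, and assuming the chart layout described in the k8s-device-plugin README for v0.16.x, the sharing configuration is supplied through the `config` block of the Helm values. A minimal sketch of such a values file follows. The config names `config0` and `config1` are illustrative, chosen to mirror the attached file names, and the config bodies are assumptions rather than the reporter's actual files:

```yaml
# Sketch of Helm values for the device plugin chart.
# Assumption: the `config` block behaves as documented in the plugin README.
config:
  # Config applied to nodes that carry no per-node config label.
  default: config0
  # Named configs that the chart assembles into a ConfigMap.
  map:
    config0: |-
      version: v1
      flags:
        migStrategy: none
    config1: |-
      version: v1
      sharing:
        timeSlicing:
          resources:
          - name: nvidia.com/gpu
            replicas: 4
```

With values like these, a node would be switched to the non-default config by setting its `nvidia.com/device-plugin.config` label to the config name, which is the label step the first comment refers to.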
Following “GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes”,
we deployed the “NVIDIA device plugin for Kubernetes” and are trying out time slicing,
but have run into a problem. Specifically, the GPU capacity is displayed as follows,
with only “1” shown instead of the expected “4” (the YAML sets replicas: 4).
Why is “Capacity” not increasing?
times.yaml
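The contents of times.yaml are not reproduced in this thread. As a sketch only, assuming the time-slicing format documented in the plugin README, a config that requests four replicas of each GPU would look roughly like this:

```yaml
# Sketch of a time-slicing config; not the reporter's actual times.yaml.
version: v1
sharing:
  timeSlicing:
    resources:
    # Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources.
    - name: nvidia.com/gpu
      replicas: 4
```

If a config like this is actually in effect on a node with one physical GPU, the node would be expected to report `nvidia.com/gpu: 4` under Capacity, which matches the expectation stated above.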
Hardware Information:
Server: PowerEdge R750 (SKU=090E, ModelName=PowerEdge R750)
CPU: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
GPGPU Information:
GPGPU: A100 80GB
CUDA Version: 12.2
Driver Version: 535.54.03
nvidia-container-runtime: runc version 1.0.2, spec: 1.0.2-dev, go: go1.16.7, libseccomp: 2.5.1
Linux Information:
OS: CentOS Linux release 8.5.2111
k8s environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.17", GitCommit:"953be8927218ec8067e1af2641e540238ffd7576", GitTreeState:"clean", BuildDate:"2023-02-22T13:27:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
crio version: 1.23.5
NVIDIA device plugin for Kubernetes version used: v0.16.1