k3s: GPU Passthrough for Nvidia / AMD / Intel #288037

Open
superherointj opened this issue Feb 11, 2024 · 5 comments

Comments

@superherointj
Contributor

superherointj commented Feb 11, 2024

This issue tracks GPU passthrough in K3s.

K3s supports GPU passthrough, but it did not work in NixOS K3s the last time I tried.

I don't know whether this has been solved since. I think it has not, but:

  • there are notes on how to do GPU passthrough here:

Last time I tried, I had issues with Nix paths and the ldcache in the Nvidia driver, and I got lost in the process. I will keep updating this issue until GPU passthrough is:

  • supported by NixOS K3s,
  • properly documented, and
  • integrated into the NixOS K3s module (ideally as a simple toggle).

Other references:

@SomeoneSerge
Contributor

Related: #278969 #284507

@OlfillasOdikno

I was unable to get the official NVIDIA device plugin to work. Since all the heavy lifting is already done by containerd and the generated CDI JSON, I created a device plugin that reads the CDI JSON and instructs Kubernetes to inject the device.
Tested on an NVIDIA 3060.
https://github.com/OlfillasOdikno/generic-cdi-plugin
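For readers unfamiliar with CDI, here is a minimal Python sketch of what "using the CDI JSON" can look like: it reads a spec and lists the fully qualified device names (`kind=name`) it declares. The spec contents below are a hypothetical, heavily trimmed example, and the helper name `qualified_device_names` is made up for illustration. This is not the linked plugin's actual code.

```python
import json

# Hypothetical, trimmed-down CDI spec. A real generated spec carries many
# more containerEdits (library mounts, hooks, environment variables).
EXAMPLE_SPEC = """
{
  "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {"name": "0", "containerEdits": {"deviceNodes": [{"path": "/dev/nvidia0"}]}},
    {"name": "all", "containerEdits": {"deviceNodes": [{"path": "/dev/nvidia0"}]}}
  ]
}
"""


def qualified_device_names(spec_text: str) -> list[str]:
    """Return the CDI fully qualified device names (kind=name) in a JSON spec."""
    spec = json.loads(spec_text)
    kind = spec["kind"]
    return [f"{kind}={device['name']}" for device in spec.get("devices", [])]


if __name__ == "__main__":
    for name in qualified_device_names(EXAMPLE_SPEC):
        print(name)
```

A device plugin built on this idea would advertise one Kubernetes resource per listed device and hand the matching CDI name back to the container runtime on allocation. How the names are mapped to resource names is up to the plugin.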

@ahirner
Contributor

ahirner commented Jun 23, 2024

@OlfillasOdikno thanks, I use your plugin as well for now. I had problems where some containers saw neither libnvml.so.1 nor the generated CDIs.

Question regarding the scope listed in the issue description: does this issue include shared GPU use? I'm not sure how involved that is.

@Goorzhel

Goorzhel commented Aug 7, 2024

After four months of dead ends and failed hacks, I've arrived at this configuration for my k3s node and its GeForce 3070:

In NixOS

  1. Bodge LD_LIBRARY_PATH into the CDI generator's environment.
  2. Ensure /run/opengl is available.
  3. Enable CDI in k3s' bundled containerd.
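A rough way to sanity-check steps 1 and 2 on a node is sketched below in Python. It looks for a library file in a colon-separated search path and checks that a driver directory exists. The helper names `find_library` and `check_node` are hypothetical. The script inspects the environment of the process that runs it, not the CDI generator's environment, so treat the result as a hint only.

```python
import os
from pathlib import Path


def find_library(lib_name: str, search_path: str) -> list[str]:
    """Return each directory in an LD_LIBRARY_PATH-style string that holds lib_name."""
    hits = []
    for entry in search_path.split(os.pathsep):
        # Skip empty entries, which appear with leading/trailing separators.
        if entry and (Path(entry) / lib_name).exists():
            hits.append(entry)
    return hits


def check_node(lib_name: str, search_path: str, driver_dir: str) -> dict[str, bool]:
    """Summarise whether the library is findable and the driver directory exists."""
    return {
        "library_on_path": bool(find_library(lib_name, search_path)),
        "driver_dir_present": Path(driver_dir).is_dir(),
    }


if __name__ == "__main__":
    # Library name and directory are taken from this thread; adjust as needed.
    report = check_node(
        "libnvml.so.1",
        os.environ.get("LD_LIBRARY_PATH", ""),
        "/run/opengl",
    )
    print(report)
```

If `library_on_path` is false when run with the same environment as the CDI generator, the generated spec is likely to miss the driver libraries.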

In Kubernetes

  1. Install @OlfillasOdikno's CDI plugin (thank you!)
  2. Add spec.containers[].resources.limits."nvidia.com/gpu-0"=1 to the relevant pod specs.
  3. Enjoy massively improved video transcoding, etc.
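To make step 2 concrete, here is a small Python sketch that builds a Pod manifest with the GPU limit set on the container (in a Pod manifest, resource limits live under each entry of `spec.containers`). The pod name, image, and the helper `gpu_pod_manifest` are placeholders invented for illustration. Only the resource name `nvidia.com/gpu-0` comes from the steps above.

```python
import json


def gpu_pod_manifest(name: str, image: str,
                     resource: str = "nvidia.com/gpu-0", count: int = 1) -> dict:
    """Build a single-container Pod manifest that requests `count` of `resource`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested via limits.
                    "resources": {"limits": {resource: count}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute a real workload.
    manifest = gpu_pod_manifest("transcoder", "example.invalid/transcoder:latest")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the output can be piped into `kubectl apply -f -`.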

Stray notes

  • containerd/containerd@c8e8a093c will remove the need for NixOS step 3 whenever k3s bundles a version with that commit.
  • Like OlfillasOdikno, I hit a dead end with Nvidia's plugin.
  • Relevant software versions:
    • k3s 1.30.2+k3s2,
    • NixOS 24.05, and
    • Nvidia 550.78.

Nvidia monoculture aside, I also have a Radeon RX 7800 in my desktop. A brief web-search reveals a plugin for AMD GPUs, but I need more time to look into that.

@Goorzhel

Goorzhel commented Aug 13, 2024

> A brief web-search reveals a plugin for AMD GPUs, but I need more time to look into that.

Little did I know that is the official AMD plugin. I made a one-node k3s cluster of my desktop and installed the plugin's Helm chart—and that's all I needed. One unit of amd.com/gpu became available, without any abstruse hacks like my Nvidia odyssey above.
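One way to confirm that a device plugin has registered its resource is to look at the node's allocatable resources. The Python sketch below parses Node JSON, such as the output of `kubectl get node <name> -o json`, and keeps only the vendor-prefixed entries. The sample document and its quantities are made up, and `extended_resources` is a hypothetical helper name.

```python
import json

# Made-up, trimmed Node object; real output contains many more fields.
SAMPLE_NODE = """
{
  "kind": "Node",
  "status": {
    "allocatable": {
      "cpu": "16",
      "memory": "32780Mi",
      "pods": "110",
      "amd.com/gpu": "1"
    }
  }
}
"""


def extended_resources(node_json: str) -> dict[str, str]:
    """Return allocatable resources whose names carry a vendor domain prefix."""
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    # Built-in resources (cpu, memory, pods, ...) have no "/" in their names.
    return {name: qty for name, qty in allocatable.items() if "/" in name}


if __name__ == "__main__":
    print(extended_resources(SAMPLE_NODE))
```

An empty result on a real node would suggest that the plugin is not running or has not registered any devices.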

Some caveats:

  1. Like the Nvidia device plugin, the AMD one hands out whole-GPU leases by default. Unlike the Nvidia plugin, this is non-configurable.
  2. No news on CDI support yet.
  3. On a Jellyfin pod, with VAAPI selected, I got ~20 fps transcoding 2160p HEVC to AVC—in memory. The same video on a magnetic ZFS pool in my main cluster went through NVENC at ~200 fps.

EDIT: Same story with Intel GPUs. All one needs is the device plugin, which I've been using for more than a year.
