-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16 #180
Comments
I've encountered this issue using an amd gpu connected to a P9 host (no QEMU/VFIO) running kernel 5.1.15. Occasionally playing videos will hard reset the system and GUARD out cores. The issue does not seem present with kernels 4.19.52 and 5.0.9 (the two other versions I happen to have on my system). |
FYI - This is being discussed by our RAS team this week. Basically the reason for the guard action is because many decisions were made assuming a solid hypervisor rather than a development kernel. Not all of the error paths have been scrubbed to assume a buggy kernel vs hardware failures. |
I was able to get the machine to hang (randomly after about 2 days of uptime, didn't need to do anything like play video or use qemu) because of apparently the same amdgpu bug on 5.1.17 (with Radeon Pro WX 5100), but it didn't seem to GUARD out anything, I have a Talos 2 Lite with a single 18-core CPU and the latest stable firmware package 1.06. This is on serial console after I attempt a restart https://gist.github.com/q66/e26e10e3cd12ffc991d627351ce9d6a1 |
@q66 @shawnanastasio is there an upstream kernel bug report tracking this issue yet? |
When using QEMU vfio (Kernel 5.1, QEMU 4 GIT master) with VFIO PCIe passthrough and an AMD GPU (amdgpu driver), the test machine hard locks and GUARD entries are created for various slices (e.g. EQ0, EQ1, etc.) on subsequent reboot, gradually removing all functional slices from the system:
The PCIe device being passed through is on Proc1, not Proc0, and the QEMU guest kernel panics if QEMU is not pinned to Proc1.
Is this likely to indicate a defective CPU or a hostboot / kernel bug?
EDIT: Apparently it's related to just the amdgpu driver on 5.1.16 -- still, hostboot shouldn't GUARD out slices based on a kernel bug like this. It makes recovery a lot harder and gives a false impression of a failing CPU.
The text was updated successfully, but these errors were encountered: