Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16 #180

madscientist159 · 2019-07-07T00:23:50Z

When using QEMU (QEMU 4 git master, kernel 5.1) with VFIO PCIe passthrough of an AMD GPU (amdgpu driver), the test machine hard locks. On subsequent reboots, GUARD entries are created for various slices (e.g. EQ0, EQ1), gradually removing all functional slices from the system:

 10.82404|ISTEP  6. 9 - host_gard
 12.42766|================================================
 12.44729|Error reported by prdf (0xE500) PLID 0x9000009E
 12.44730|  PRD Signature            : 0x60000 0xC6D10010
 12.47015|  Signature Description    : pu.ex:k0:n0:s0:p00:c0 (L2FIR[16]) Cache line inhibited hit cacheable space
 12.49825|  UserData1   : 0x0006000000000101
 12.49826|  UserData2   : 0xc6d1001000000000
 12.49827|------------------------------------------------
 12.51947|  Callout type             : Hardware Callout
 12.51948|  CPU id                   : 4
 12.51950|  Target                   : Physical:/Sys0/Node0/Proc0/EQ0/EX0
 12.51951|  Deconfig State           : NO_DECONFIG
 12.51952|  GARD Error Type          : GARD_Fatal
 12.51952|  Priority                 : SRCI_PRIORITY_MED
 12.51953|------------------------------------------------
 12.51954|
 12.51954|------------------------------------------------
 12.51955|  System checkstop occurred during runtime on previous boot
 12.51956|------------------------------------------------
 12.51956|
 12.51957|------------------------------------------------
 12.51958|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.51959|================================================
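For anyone comparing reports like the one above across several boots, here is a minimal sketch of pulling the `key : value` fields out of the console text. It assumes each line has the `timestamp|  key : value` layout seen in this log; the function names `parse_callout` and `target_processor` are hypothetical helpers, not part of hostboot or any existing tool.

```python
import re

# Assumed layout: "<timestamp>|  <key>  : <value>", as in the log above.
FIELD_RE = re.compile(r"^\s*[\d.]+\|\s*(?P<key>[^:|]+?)\s*:\s*(?P<value>.+?)\s*$")


def parse_callout(log_text):
    """Collect 'key : value' fields from a console error report (first occurrence wins)."""
    fields = {}
    for line in log_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields.setdefault(match.group("key"), match.group("value"))
    return fields


def target_processor(target):
    """Return the processor index from a target path like '.../Proc0/EQ0/EX0', or None."""
    match = re.search(r"/Proc(\d+)(?:/|$)", target)
    return int(match.group(1)) if match else None


SAMPLE = """\
 12.47015|  Signature Description    : pu.ex:k0:n0:s0:p00:c0 (L2FIR[16]) Cache line inhibited hit cacheable space
 12.51947|  Callout type             : Hardware Callout
 12.51950|  Target                   : Physical:/Sys0/Node0/Proc0/EQ0/EX0
 12.51951|  Deconfig State           : NO_DECONFIG
 12.51952|  GARD Error Type          : GARD_Fatal
"""

report = parse_callout(SAMPLE)
print(report["Target"], report["GARD Error Type"], target_processor(report["Target"]))
```

Separator lines and lines without a colon are skipped, so the whole console capture can be passed in unchanged.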

The PCIe device being passed through is on Proc1, not Proc0, and the QEMU guest kernel panics if QEMU is not pinned to Proc1.
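As a rough illustration of the pinning mentioned above, the sketch below restricts the current process to the CPUs of one NUMA node before QEMU is launched from it, so that QEMU inherits the mask. It is Linux-only and rests on assumptions: the node number that corresponds to Proc1 must be looked up on the machine in question, `pin_self_to_node` and `parse_cpulist` are hypothetical helper names, and only CPU affinity is set (memory placement is not touched).

```python
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def pin_self_to_node(node):
    """Restrict the calling process to one NUMA node's CPUs (Linux only).

    `node` is an assumption supplied by the caller: check which node id
    maps to the processor that hosts the passed-through device.
    """
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return cpus
```

Because the affinity mask is inherited by child processes and kept across `exec`, calling `pin_self_to_node(...)` and then starting QEMU from the same process (for example with `os.execvp`) keeps all of QEMU's threads on that node.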

Is this likely to indicate a defective CPU or a hostboot / kernel bug?

EDIT: Apparently it's related to just the amdgpu driver on 5.1.16 -- still, hostboot shouldn't GUARD out slices based on a kernel bug like this. It makes recovery a lot harder and gives a false impression of a failing CPU.

shawnanastasio · 2019-07-07T01:30:48Z

I've encountered this issue using an AMD GPU connected directly to a P9 host (no QEMU/VFIO) running kernel 5.1.15. Occasionally, playing videos will hard reset the system and GUARD out cores.

The issue does not seem present with kernels 4.19.52 and 5.0.9 (the two other versions I happen to have on my system).

dcrowell77 · 2019-07-09T17:57:18Z

FYI - This is being discussed by our RAS team this week. Basically, the reason for the guard action is that many decisions were made assuming a solid hypervisor rather than a development kernel. Not all of the error paths have been scrubbed to distinguish a buggy kernel from a hardware failure.

q66 · 2019-07-14T16:04:14Z

I was able to get the machine to hang, apparently because of the same amdgpu bug, on 5.1.17 with a Radeon Pro WX 5100. It happened randomly after about 2 days of uptime; I didn't need to do anything like play video or use QEMU. It didn't seem to GUARD out anything, though. I have a Talos 2 Lite with a single 18-core CPU and the latest stable firmware package, 1.06. This is what appears on the serial console after I attempt a restart: https://gist.github.com/q66/e26e10e3cd12ffc991d627351ce9d6a1

madscientist159 · 2019-07-15T21:54:35Z

@q66 @shawnanastasio is there an upstream kernel bug report tracking this issue yet?

q66 · 2019-07-15T21:55:38Z

Yes: https://bugzilla.kernel.org/show_bug.cgi?id=204145

madscientist159 changed the title ~~Sudden machine hang / GUARD event when QEMU vfio used~~ Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16 Jul 7, 2019
