Linux kernel bugs

Brice Goglin edited this page May 14, 2019 · 24 revisions

The following hwloc error messages are caused by the Linux kernel reporting invalid topology information. Recent errors are listed first.

Invalid L3 cpuset on 24-core AMD EPYC processor

****************************************************************************
* hwloc 1.11.8 has encountered what looks like an error from the operating system.
*
* L3 (cpuset 0x60000060) intersects with NUMANode (P#0 cpuset 0x3f00003f nodeset 0x00000001) without inclusion!

Fixed in Linux 4.14 by this commit (also backported to 4.13.16):

commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c
Author: Suravee Suthikulpanit <[email protected]>
Date:   Mon Jul 31 10:51:59 2017 +0200

    x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
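
To make the "intersects ... without inclusion" wording concrete, here is a minimal Python sketch that treats the two cpusets from the message above as integer bitmasks. The `relation` helper is a hypothetical name used for illustration only; it is not part of hwloc, and hwloc's own check may differ in detail.

```python
def relation(a: int, b: int) -> str:
    """Classify how two cpusets (given as integer bitmasks) relate."""
    common = a & b
    if common == 0:
        return "disjoint"
    if common == a or common == b:
        # One set contains the other, or they are equal.
        return "included"
    return "intersects without inclusion"


l3 = 0x60000060    # L3 cpuset quoted in the error message
numa = 0x3F00003F  # NUMANode P#0 cpuset quoted in the error message

print(relation(l3, numa))
print(hex(l3 & numa))   # CPUs that belong to both objects
print(hex(l3 & ~numa))  # CPUs of the L3 that lie outside the NUMA node
```

The two objects share some CPUs, yet neither contains the other, so they cannot be placed in a parent/child relationship in the topology tree.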

Packages Cut in Halves on Intel Xeon E5 v3/v4 with Cluster-on-Die

Each dual-NUMA package is reported as two single-NUMA packages.

Fixed in Linux 3.18 in this commit:

commit cebf15eb09a2fd2fa73ee4faa9c4d2f813cf0f09
Author: Dave Hansen <[email protected]>
Date:   Thu Sep 18 12:33:34 2014 -0700

    x86, sched: Add new topology for multi-NUMA-node CPUs

Invalid PCI locality on Intel Xeon E5 v3/v4 with Cluster-on-Die

****************************************************************************
* hwloc 1.11.2 has encountered an incorrect PCI locality information.
* PCI bus 0000:80 is supposedly close to 2nd NUMA node of 1st package,
* however hwloc believes this is impossible on this architecture.
* Therefore the PCI bus will be moved to 1st NUMA node of 2nd package.
*
* If you feel this fixup is wrong, disable it by setting in your environment
* HWLOC_PCI_0000_80_LOCALCPUS= (empty value), and report the problem
* to the hwloc's user mailing list together with the XML output of lstopo.
*
* You may silence this message by setting HWLOC_HIDE_ERRORS=1 in your environment.

This problem may look similar to the previous one, but it is actually a BIOS bug, so there is nothing to fix in the kernel. hwloc detects the issue and corrects the PCI locality automatically.
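
The two environment variables named in the message can be prepared programmatically before launching an hwloc-based program. The sketch below is only an illustration: both helper names are hypothetical, and the variable-name pattern is an assumption generalized from the single `HWLOC_PCI_0000_80_LOCALCPUS` example in the message for bus `0000:80`.

```python
import os


def pci_localcpus_env_name(bus_id: str) -> str:
    """Build the override variable name for a PCI bus written as 'domain:bus'."""
    domain, bus = bus_id.split(":")
    return f"HWLOC_PCI_{domain}_{bus}_LOCALCPUS"


def env_without_fixup(bus_id: str, hide_errors: bool = False) -> dict:
    """Return a copy of the environment that disables the fixup for one bus."""
    env = dict(os.environ)
    env[pci_localcpus_env_name(bus_id)] = ""  # empty value disables the fixup
    if hide_errors:
        env["HWLOC_HIDE_ERRORS"] = "1"        # silences the message
    return env


env = env_without_fixup("0000:80", hide_errors=True)
print(pci_localcpus_env_name("0000:80"))
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when starting `lstopo` or another hwloc-based program.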

Invalid L3 cpuset on AMD 12-core Opteron 6200/6300 (Bulldozer and Piledriver)

****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object (L3 cpuset 0x000003f0) intersection without inclusion!

The fix was NEVER pushed to Linux.

Use hwloc >=1.11.2 and set HWLOC_COMPONENTS=x86 in your environment to work around the issue.
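
When the hwloc-based program is launched from a script, the workaround can be applied to that child process only, without changing the parent environment. This is a minimal sketch; `run_with_x86_backend` is a hypothetical helper, and the child command here is a stand-in that just prints the variable so the example runs without hwloc installed.

```python
import os
import subprocess
import sys


def run_with_x86_backend(cmd):
    """Run cmd with HWLOC_COMPONENTS=x86 set only for the child process."""
    env = dict(os.environ, HWLOC_COMPONENTS="x86")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in child process: prints the variable it received.
result = run_with_x86_backend(
    [sys.executable, "-c", "import os; print(os.environ['HWLOC_COMPONENTS'])"]
)
print(result.stdout.strip())
```

In practice the command would be the affected application, or `["lstopo"]` to verify that the error message is gone.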

Invalid NUMA cpuset on AMD Opteron 6200/6300 (Bulldozer and Piledriver)

****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x0000ffff,0x0) intersects with NUMANode (P#3 cpuset 0x0000ff00,0xff000000) without inclusion!

This is likely not a kernel bug but rather a BIOS reporting invalid SRAT information.

Upgrading the BIOS is the only chance to get a proper fix. Otherwise try hwloc >=1.11.2 and set HWLOC_COMPONENTS=x86 in your environment to work around the issue.
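
The cpusets in the message above span more than 32 CPUs and are printed as comma-separated words. The sketch below decodes them into CPU indexes, assuming each word holds 32 bits and the most significant word comes first; `parse_cpuset` and `cpus` are hypothetical helpers, not hwloc functions.

```python
def parse_cpuset(text: str) -> int:
    """Parse a comma-separated cpuset string into one integer bitmask.

    Assumes 32-bit words with the most significant word first.
    """
    mask = 0
    for word in text.split(","):
        mask = (mask << 32) | int(word, 16)
    return mask


def cpus(mask: int) -> list:
    """List the indexes of the bits set in mask."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]


socket = parse_cpuset("0x0000ffff,0x0")        # Socket P#2
node = parse_cpuset("0x0000ff00,0xff000000")   # NUMANode P#3

print(cpus(socket & node))   # CPUs shared by the socket and the NUMA node
print(cpus(node & ~socket))  # CPUs of the NUMA node outside the socket
print(cpus(socket & ~node))  # CPUs of the socket outside the NUMA node
```

Under that assumption, the NUMA node shares only part of its CPUs with the socket and also claims CPUs outside it, which is the inconsistency hwloc reports.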
