Incorrect CPU kinds on AMD Threadripper PRO 7000 #690

Open
mkuron opened this issue Sep 17, 2024 · 18 comments

Comments

@mkuron

mkuron commented Sep 17, 2024

What version of hwloc are you using?

2.10.0

Which operating system and hardware are you running on?

Alma Linux 8.10
Linux 4.18.0-553.5.1.el8_10.x86_64
Dell Precision 7875 Tower
BIOS version 1.6.2
AMD Ryzen Threadripper PRO 7975WX 32-Cores

Details of the problem

lstopo shows multiple CPU kinds on AMD Ryzen Threadripper PRO 7975WX due to variations in the max frequency (which looks excessively high) and lack of a base frequency (base frequencies in general do not seem to be reported by hwloc for AMD CPUs). The AMD Ryzen Threadripper "Storm Peak"/Zen 4 generation is a homogeneous CPU and should have all cores represented as the same kind.

$ lstopo --cpukinds
CPU kind #0 efficiency 0 cpuset 0x0000ffff,0x0000ffff
  FrequencyMaxMHz = 5352
CPU kind #1 efficiency 1 cpuset 0x00800000,0x00800000
  FrequencyMaxMHz = 5517
CPU kind #2 efficiency 2 cpuset 0x00400000,0x00400000
  FrequencyMaxMHz = 5677
CPU kind #3 efficiency 3 cpuset 0x00010000,0x00010000
  FrequencyMaxMHz = 5837
CPU kind #4 efficiency 4 cpuset 0x00040000,0x00040000
  FrequencyMaxMHz = 6001
CPU kind #5 efficiency 5 cpuset 0x00080000,0x00080000
  FrequencyMaxMHz = 6161
CPU kind #6 efficiency 6 cpuset 0x00020000,0x00020000
  FrequencyMaxMHz = 6321
CPU kind #7 efficiency 7 cpuset 0x00100000,0x00100000
  FrequencyMaxMHz = 6482
CPU kind #8 efficiency 8 cpuset 0x00200000,0x00200000
  FrequencyMaxMHz = 6646
CPU kind #9 efficiency 9 cpuset 0x40000000,0x40000000
  FrequencyMaxMHz = 6806
CPU kind #10 efficiency 10 cpuset 0x20000000,0x20000000
  FrequencyMaxMHz = 6966
CPU kind #11 efficiency 11 cpuset 0x80000000,0x80000000
  FrequencyMaxMHz = 7130
CPU kind #12 efficiency 12 cpuset 0x01000000,0x01000000
  FrequencyMaxMHz = 7290
CPU kind #13 efficiency 13 cpuset 0x10000000,0x10000000
  FrequencyMaxMHz = 7451
CPU kind #14 efficiency 14 cpuset 0x04000000,0x04000000
  FrequencyMaxMHz = 7611
CPU kind #15 efficiency 15 cpuset 0x0a000000,0x0a000000
  FrequencyMaxMHz = 7775
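
For readers who want to see which processing units each cpuset above covers, here is a minimal Python sketch. It assumes the masks are printed as comma-separated 32-bit hexadecimal words with the most significant word first; the helper name `cpuset_to_indices` is made up for this illustration and is not part of hwloc.

```python
def cpuset_to_indices(cpuset):
    """Decode a bitmap string such as "0x0000ffff,0x0000ffff" into PU indices.

    Assumption: comma-separated 32-bit hex words, most significant word first.
    Infinite bitmaps (e.g. a leading "0xf...f") are not handled here.
    """
    value = 0
    for word in cpuset.split(","):
        value = (value << 32) | int(word, 16)
    # Collect the position of every set bit, lowest index first.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    print(cpuset_to_indices("0x0000ffff,0x0000ffff"))  # CPU kind #0
    print(cpuset_to_indices("0x00800000,0x00800000"))  # CPU kind #1
```

Under that assumption, each of the later kinds covers very few PUs, each paired with the PU whose index is 32 higher.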

This issue bears some similarity to #634, though there the frequencies had only very minor variations and looked much more reasonable. I am not entirely sure whether this CPU really thinks it has such excessively high and varying frequencies, or if this is simply a bug in the BIOS, firmware, or Linux kernel that leads to incorrect reporting.

Notes

The data sheet for this CPU says that the boost frequency is 5.3 GHz (which actually coincides with CPU kind #0), but I can't imagine 7.7 GHz being achievable with any kind of cooling. https://openbenchmarking.org/s/AMD+Ryzen+Threadripper+PRO+7975WX+32-Cores has the lscpu output for the same machine and theirs even goes up to 8.1 GHz. https://www.phoronix.com/review/hp-z6-g5-a/3 actually stated 9 months ago that:

[...] the 7995WX doesn't clock up to 6.44GHz... That's an AMD P-State Linux driver bug not specific to the HP workstation but other Threadripper 7000 series too. I already reported the issue to AMD and they will be posting Linux driver patches soon for fixing that AMD P-State CPU frequency reporting.

As this bug remains unfixed at least in RHEL8's Linux kernel (didn't verify any others), a workaround for this hardware quirk inside hwloc would be desirable. The frequencies reported in /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq correspond to the ones reported by hwloc, so this is clearly not an hwloc bug, but could potentially be worked around in ways similar to #634/#635.
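
To check the raw kernel values independently of hwloc, something like the following sketch could be used. It groups CPUs by the content of their `cpuinfo_max_freq` file (which cpufreq reports in kHz). The function name and the `sysfs_cpu_dir` parameter are illustrative choices, not an existing tool.

```python
import glob
import os
import re
from collections import defaultdict


def group_cpus_by_max_freq(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Map each distinct cpuinfo_max_freq value (kHz) to the CPUs reporting it."""
    groups = defaultdict(list)
    pattern = os.path.join(sysfs_cpu_dir, "cpu[0-9]*", "cpufreq", "cpuinfo_max_freq")
    for path in glob.glob(pattern):
        # The path relative to the base directory starts with "cpuN".
        rel = os.path.relpath(path, sysfs_cpu_dir)
        cpu = int(re.match(r"cpu(\d+)", rel).group(1))
        with open(path) as f:
            groups[int(f.read().strip())].append(cpu)
    return {khz: sorted(cpus) for khz, cpus in sorted(groups.items())}


if __name__ == "__main__":
    for khz, cpus in group_cpus_by_max_freq().items():
        print(f"{khz // 1000} MHz: CPUs {cpus}")
```

On a machine affected by this issue, one would expect more than one group even though the CPU is homogeneous.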

@bgoglin
Contributor

bgoglin commented Sep 17, 2024

Hello. This is actually rather related to #502 and we have a workaround that may work on AMD too. Try setting HWLOC_CPUKINDS_MAXFREQ=adjust=50 to ignore frequency differences up to 50% and let us know if you get a single cpukind.
I seem to remember base frequencies might be coming to the amd pstate in the future but I'd need to check again.
Anyway, hybrid CPUs from Intel are supposed to expose /sys/devices/cpu_{atom,core} so one way to avoid this issue could be to ignore frequency differences on x86 as long as these files don't exist.
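
The idea in the last paragraph could be sketched roughly as below. This is only an illustration of the proposed policy, not hwloc code; both function names and the `sysfs_devices` parameter are hypothetical.

```python
import os


def intel_hybrid_pmu_present(sysfs_devices="/sys/devices"):
    """Return True if both the cpu_atom and cpu_core directories exist."""
    return all(
        os.path.isdir(os.path.join(sysfs_devices, name))
        for name in ("cpu_atom", "cpu_core")
    )


def should_compare_frequencies(sysfs_devices="/sys/devices"):
    # Hypothetical policy: on x86, only treat frequency differences as a sign
    # of different CPU kinds when the hybrid PMU directories are present.
    return intel_hybrid_pmu_present(sysfs_devices)
```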

@mkuron
Author

mkuron commented Sep 17, 2024

Try setting HWLOC_CPUKINDS_MAXFREQ=adjust=50 to ignore frequency differences up to 50% and let us know if you get a single cpukind.

It's unchanged, still reporting 16 different kinds.

hybrid CPUs from Intel are supposed to expose /sys/devices/cpu_{atom,core} so one way to avoid this issue could be to ignore frequency differences on x86 as long as these files don't exist.

Or simply ignore frequency differences on AMD entirely, at least until AMD starts making hybrid CPUs. At that point #587 should help tell the core types apart, similar to cpu_{atom,core} on Intel.

@bgoglin
Contributor

bgoglin commented Sep 17, 2024

Can you send the tarball foo.tar.bz2 generated by hwloc-gather-topology foo on this machine so that I can debug this from here? In the meantime, setting HWLOC_CPUKINDS_HOMOGENEOUS=1 should work around the issue.
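
For applications that launch hwloc tools from a script, the environment-variable workaround can be applied per process rather than globally. A small sketch follows; the helper name `homogeneous_env` is made up, and the `lstopo` call only runs if the tool is installed.

```python
import os
import shutil
import subprocess


def homogeneous_env(base_env=None):
    """Return a copy of the environment with HWLOC_CPUKINDS_HOMOGENEOUS=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HWLOC_CPUKINDS_HOMOGENEOUS"] = "1"
    return env


if __name__ == "__main__" and shutil.which("lstopo"):
    # The output depends on the machine, so nothing is asserted about it here.
    subprocess.run(["lstopo", "--cpukinds"], env=homogeneous_env(), check=False)
```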

The issue with #587 is that it's only in our x86 backend. It'd be easier if Linux exposed it in sysfs but Linux kernel devs aren't convinced it's useful. Intel added /sys/devices/cpu_{atom,core} for PMU but I don't know if AMD will do the same.

I agree ignoring cpukinds on AMD might be easier for now (with an envvar to reenable it if ever needed). But I am going to poke my AMD contacts to better know what's coming. There are some leaks of Zen 5 "strix point" coming with both P and E core soon.

@mkuron
Author

mkuron commented Sep 17, 2024

hwloc-gather-topology foo on this machine so that I can debug this from here?

Will do.

In the meantime, setting HWLOC_CPUKINDS_HOMOGENEOUS=1 should work around the issue.

Indeed it does.

I agree ignoring cpukinds on AMD might be easier for now

I just realized that there are some Zen 4 CPUs that mix Zen 4 and Zen 4c, see https://www.phoronix.com/review/amd-zen4-zen4c-scaling. So CPU kind detection is desirable even on current generation AMD CPUs.

@superm1

superm1 commented Sep 18, 2024

Threadripper 7000 doesn't mix Zen 4 and Zen 4c. I suspect this is actually tied to a preferred cores detection issue. AMD does do rankings via CPPC of which cores on the die are better, even if they can all clock identically.

There is a series that I submitted for 6.12-rc1 that I think will make this behave properly both with acpi-cpufreq and amd-pstate.

The PR for it is already merged, so if you want to try Linus' tree as of today you can see if it helps.

@superm1

superm1 commented Sep 18, 2024

I agree ignoring cpukinds on AMD might be easier for now (with an envvar to reenable it if ever needed). But I am going to poke my AMD contacts to better know what's coming. There are some leaks of Zen 5 "strix point" coming with both P and E core soon.

And yes https://www.amd.com/en/products/processors/laptop/ryzen/300-series/amd-ryzen-ai-9-hx-370.html is already public and Strix is on the market. You can see that SKU clocks at 5.1GHz for the performant cores and 3.3 GHz for efficient.
There is a problem with the Linux kernel's identification of the max frequency for the efficient cores; I have a patch series for it under internal review right now. It will probably be 6.13 material.

@mkuron
Author

mkuron commented Sep 18, 2024

There is a series that I submitted for 6.12-rc1 that I think will make this behave properly both with acpi-cpufreq and amd-pstate.

The PR for it is already merged, so if you want to try Linus' tree as of today you can see if it helps.

Thanks @superm1. I assume you are referring to torvalds/linux@9bcf303? I unfortunately don't have admin privileges on that Threadripper 7000 machine, but I'll see if I can get someone else to test it.

And yes https://www.amd.com/en/products/processors/laptop/ryzen/300-series/amd-ryzen-ai-9-hx-370.html is already public and Strix is on the market. You can see that SKU clocks at 5.1GHz for the performant cores and 3.3 GHz for efficient.

Is there a sysfs node that exposes whether a core is performant or efficient?

@superm1

superm1 commented Sep 18, 2024

Thanks @superm1. I assume you are referring to torvalds/linux@9bcf303? I unfortunately don't have admin privileges on that Threadripper 7000 machine, but I'll see if I can get someone else to test it.

Yes, that's the merge commit that pulls in all the 6.12 content, and I expect it helps here with either acpi-cpufreq or amd-pstate.
I should mention that 6.11 with amd-pstate should also work properly.

Is there a sysfs node that exposes whether a core is performant or efficient?

There's a CPUID explained in the APM volume 2 for it on page 213:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf

The appendix of volume 3 on page 646 explains more about it too:

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf

@bgoglin
Contributor

bgoglin commented Sep 18, 2024

Is there a sysfs node that exposes whether a core is performant or efficient?

There's a CPUID explained in the APM volume 2 for it on page 213: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf

The appendix of volume 3 on page 646 explains more about it too:

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf

Support for this CPUID in hwloc is pretty much ready in #587 but I don't have any platform to test it. If strix is already public, it would be nice if you could run "hwloc-gather-cpuid" on it and send me a tarball of the resulting "cpuid" directory (this tool dumps the output of all CPUID leaves on each core so that I can use them remotely in hwloc).

Regarding sysfs itself, as far as I know, the only way to get E-core vs P-core information on Intel is by reading /sys/devices/cpu_{atom,core}/cpus (that's where they store PMU info). Any chance this would be available for AMD too?

I requested the addition of a dedicated "type" in sysfs cpu files recently but it's not clear it'll ever happen, at least because the "atom vs core" type isn't enough on Intel when there are low-power cores (but they could tweak that "type" sysfs file to report something different for E-core and LPE-core).

@superm1

superm1 commented Sep 18, 2024

Support for this CPUID in hwloc is pretty much ready in #587 but I don't have any platform to test it. If strix is already public, it would be nice if you could run "hwloc-gather-cpuid" on it and send me a tarball of the resulting "cpuid" directory (this tool dumps the output of all CPUID leaves on each core so that I can use them remotely in hwloc).

Sure, this is with specifically the SKU AMD Ryzen AI 9 HX 370 w/ Radeon 890M.
strix.tar.gz

Regarding sysfs itself, as far as I know, the only way to get E-core vs P-core information on Intel is by reading /sys/devices/cpu_{atom,core}/cpus (that's where they store PMU info). Any chance this would be available for AMD too?
I requested the addition of a dedicated "type" in sysfs cpu files recently but it's not clear it'll ever happen, at least because the "atom vs core" type isn't enough on Intel when there are low-power cores (but they could tweak that "type" sysfs file to report something different for E-core and LPE-core).

As it's already available from the cpuid information, I would like to understand the use case to justify exporting it somewhere. What are you going to do with it? IMO, on its own it doesn't give you enough relational information: for example, whether the cores are on the same CCX or CCD, the family, etc. The CPUID tells you a lot more so you can make informed decisions on it.

@bgoglin
Contributor

bgoglin commented Sep 18, 2024

With your argument, should we remove sysfs cpu topology files? The vast majority of topology info from CPUID 0xb, 0x1f on Intel and 0x80000026 on AMD is already exposed in sysfs in a portable way. Hybrid core info is an important piece that is still missing in sysfs, for many users who are going to look at which core is small or big before binding tasks in parallel jobs. CPUID is far less convenient than sysfs because you have to bind to every single core to run Intel- or AMD-specific CPUID calls to get hybrid info (which is what #587 will do when the operating system doesn't expose it).
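
To illustrate why the CPUID route is more awkward than reading a sysfs file, here is a rough Linux-only sketch of the "bind to every single core" loop. Python cannot issue CPUID itself, so `probe` is a hypothetical callback standing in for that query; this is not how hwloc implements it.

```python
import os


def run_on_each_cpu(probe):
    """Call probe(cpu) while the calling thread is pinned to each allowed CPU.

    Linux-only: relies on os.sched_getaffinity and os.sched_setaffinity.
    `probe` is a placeholder for a per-core query such as a CPUID leaf read.
    """
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # pin to exactly one CPU
            results[cpu] = probe(cpu)
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return results
```

A sysfs file, by contrast, can be read for every CPU from wherever the process happens to be running.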

@superm1

superm1 commented Sep 18, 2024

With your argument, should we remove sysfs cpu topology files?

No; that would cause regressions in any software that uses them. Once you introduce such a file, you can't remove it. That's exactly why I want to make sure it makes sense to create before doing so. It's a maintenance burden to hang on to.

Hybrid core info is an important piece that is still missing in sysfs, for many users who are going to look at which core is small or big before binding tasks in parallel jobs

I have the view that this is the scheduler's job, not the user's job. The scheduler should be made aware of the capacity of the cores and place and migrate tasks based upon that.

Even without the hetero detection code I'm working on for 6.13, I would expect that amd-pstate does a relatively good job using preferred cores and CPPC highest perf values to rank them.

@bgoglin
Contributor

bgoglin commented Sep 18, 2024

Hybrid core info is an important piece that is still missing in sysfs, for many users who are going to look at which core is small or big before binding tasks in parallel jobs

I have the view that this is the scheduler's job, not the user's job. The scheduler should be made aware of the capacity of the cores and place and migrate tasks based upon that.

Even without the hetero detection code I'm working on for 6.13, I would expect that amd-pstate does a relatively good job using preferred cores and CPPC highest perf values to rank them.

That's the eternal debate between kernel developers saying the kernel can guess what userspace wants, and HPC users not trusting kernel for understanding anything correctly. In the past, it was only HPC users, but nowadays it's very common because parallel libraries are everywhere. For general purpose irregular workloads, the scheduler may be able to do good things. However when userspace knows what it's doing, it'll create one task per cpu and has better information about which one should go where.

Also, another use case for hybrid info is userspace apps running regular parallelism, where you want all your tasks to run at the same speed so that they don't slow each other down. If you have 8 E-cores and 4 P-cores, you'll want either 4 tasks on P-cores or 8 tasks on E-cores. But first you have to know how many P- and E-cores exist in the system.

@superm1

superm1 commented Sep 18, 2024

That's the eternal debate between kernel developers saying the kernel can guess what userspace wants, and HPC users not trusting kernel for understanding anything correctly. In the past, it was only HPC users, but nowadays it's very common because parallel libraries are everywhere. For general purpose irregular workloads, the scheduler may be able to do good things. However when userspace knows what it's doing, it'll create one task per cpu and has better information about which one should go where.

Of course affinitizing a task to a certain core could be helpful in some contexts for some users. The problem is you might not be able to correctly classify it against the available hardware performance capacity from userspace. It's alluded to in this series, but I'll mention that some hardware can actually feed back hints to the scheduler for information about tasks that should be migrated.

Also, another use case for hybrid info is userspace apps running regular parallelism, where you want all your tasks to run at the same speed so that they don't slow each other down

But the thing is it's not just raw max frequency. You have other factors like how much cache the cores have and which cores share that cache. You have to know which cores have not hit thermal limits and can boost for longer.

IMO I think if you're going to try to play with gaming which cores to put tasks on, you're better off using something like sched_ext and working out a userspace scheduler.

If you have 8 E-cores and 4 P-cores, you'll want either 4 tasks on P-cores or 8 tasks on E-cores. But first you have to know how many P- and E-cores exist in the system.

But not all performance cores are identical! Just looking at the raw CPPC highest performance characterization for that Strix system I had above let me show you:

$ grep -v foo /sys/bus/cpu/devices/*/cpufreq/amd_pstate_highest_perf
/sys/bus/cpu/devices/cpu0/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu10/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu11/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu12/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu13/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu14/cpufreq/amd_pstate_highest_perf:202
/sys/bus/cpu/devices/cpu15/cpufreq/amd_pstate_highest_perf:196
/sys/bus/cpu/devices/cpu16/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu17/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu18/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu19/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu1/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu20/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu21/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu22/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu23/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu2/cpufreq/amd_pstate_highest_perf:202
/sys/bus/cpu/devices/cpu3/cpufreq/amd_pstate_highest_perf:196
/sys/bus/cpu/devices/cpu4/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu5/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu6/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu7/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu8/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu9/cpufreq/amd_pstate_highest_perf:125

You can probably infer which of those cores are the Zen5c cores and which are the Zen5 without an extra sysfs file. But how would you want to parallelize things? Only a few of the Zen5 cores behave the same. You'll have two SMT pairs at 208, one SMT pair at 202 and 1 at 196. So would you put your workload on the Zen5c cores because they all can boost the same?

No, you would need to know which siblings make sense from shared cache, which make sense because they're SMT pairs (depending upon the workload). And you would need to know if you're going to be jumping from one CCD to another.
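
As an illustration of the grouping described above, the following sketch parses `grep`-style lines and buckets CPUs by their `amd_pstate_highest_perf` value. The function name is made up, and the sample is only a subset of the listing above.

```python
import re
from collections import defaultdict


def group_by_highest_perf(grep_output):
    """Group CPU numbers by the highest_perf value in "path:value" lines."""
    groups = defaultdict(list)
    for line in grep_output.splitlines():
        m = re.search(r"/cpu(\d+)/cpufreq/amd_pstate_highest_perf:(\d+)\s*$", line)
        if m:
            groups[int(m.group(2))].append(int(m.group(1)))
    # Highest performance value first.
    return {perf: sorted(cpus) for perf, cpus in sorted(groups.items(), reverse=True)}


# Subset of the listing above, for illustration only.
sample = """\
/sys/bus/cpu/devices/cpu0/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu12/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu14/cpufreq/amd_pstate_highest_perf:202
/sys/bus/cpu/devices/cpu2/cpufreq/amd_pstate_highest_perf:202
/sys/bus/cpu/devices/cpu15/cpufreq/amd_pstate_highest_perf:196
/sys/bus/cpu/devices/cpu3/cpufreq/amd_pstate_highest_perf:196
/sys/bus/cpu/devices/cpu16/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu4/cpufreq/amd_pstate_highest_perf:125
"""

if __name__ == "__main__":
    for perf, cpus in group_by_highest_perf(sample).items():
        print(perf, cpus)
```

As the comment above points out, such a grouping says nothing about cache sharing, SMT siblings or CCD boundaries, so it is only one input to a placement decision.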

@bgoglin
Contributor

bgoglin commented Sep 18, 2024

I'd be happy to use sched_ext, but there are tons of existing users and legacy apps who never trusted the kernel to schedule their apps correctly, and that's not going to change (especially because the hardware is much more complicated and they don't see why the kernel would do anything better than them). They just disable things like turboboost to mitigate the issue, ignore tiny differences like your 196 vs 208 above, etc. And they rely on a library like hwloc to get hardware info (including cache sharing and cache sizes, as you say) to do their own scheduling (which isn't really scheduling but often rather placing one task per thread).

Anyway, this discussion is diverging from the original issue. Do you think @mkuron's original issue with very different frequencies will go away in future releases, so that disabling hwloc's frequency comparison algorithm is enough for now?

@mkuron
Author

mkuron commented Sep 19, 2024

Can you send the tarball foo.tar.bz2 generated by hwloc-gather-topology foo on this machine so that I can debug this from here?

Sorry for the delay. Here is the topology data from the Threadripper 7975WX with the spurious 16 CPU kinds: threadripper7975WX.tar.gz. I'll be happy to test whatever workaround you might come up with inside hwloc, @bgoglin.

@bgoglin
Contributor

bgoglin commented Sep 20, 2024

Thanks @mkuron. The reason why HWLOC_CPUKINDS_MAXFREQ=adjust=50 didn't help is that there is no basefreq like in Intel pstate (I don't adjust max frequencies unless base frequencies are found and identical). I'll use acpi_cppc/nominal_freq instead when available. Anyway, the real workaround is HWLOC_CPUKINDS_HOMOGENEOUS=1 and hoping the kernel fix works.

@mkuron
Author

mkuron commented Sep 20, 2024

I'll use acpi_cppc/nominal_freq instead when available.

Sounds good. That value is consistently reported as 4001 on this machine.
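
The rule described in the exchange above (max frequencies are only adjusted when base frequencies are found and identical) could be modelled as in the sketch below. This is not hwloc's actual code: the function name, the tuple layout and the choice to measure the difference relative to the larger max frequency are all assumptions made for illustration.

```python
def can_merge_kinds(kind_a, kind_b, adjust_percent):
    """Hypothetical merge rule; each kind is a (base_mhz_or_None, max_mhz) tuple."""
    base_a, max_a = kind_a
    base_b, max_b = kind_b
    if base_a is None or base_b is None or base_a != base_b:
        # No identical base frequency: leave the max frequencies alone.
        return False
    # Assumption: the tolerance is relative to the larger max frequency.
    return abs(max_a - max_b) * 100 <= adjust_percent * max(max_a, max_b)


if __name__ == "__main__":
    # Without a base frequency, kinds #0 and #15 from this issue stay separate.
    print(can_merge_kinds((None, 5352), (None, 7775), 50))
    # With the nominal frequency of 4001 as base, they fall within 50%.
    print(can_merge_kinds((4001, 5352), (4001, 7775), 50))
```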

bgoglin added a commit that referenced this issue Sep 26, 2024
cpufreq/base_frequency is only available on Intel so far, and works well.

acpi_cppc/nominal_freq is already available on AMD (and ARM or soon),
so it's likely good for the future.
However it reports incorrect values on Intel SPR and MTL at least.

Hence try cpufreq/base_frequency first,
then fallback to acpi_cppc/nominal_freq.

Refs #690

Signed-off-by: Brice Goglin <[email protected]>
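
The fallback order in this commit message can be pictured with the short sketch below. It is a Python illustration, not the hwloc implementation; the function name is made up, and the units (kHz for cpufreq/base_frequency, MHz for acpi_cppc/nominal_freq) are assumptions stated in the code.

```python
import os


def read_base_freq_mhz(cpu, sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return the base frequency of `cpu` in MHz, or None if unavailable."""
    cpu_dir = os.path.join(sysfs_cpu_dir, f"cpu{cpu}")
    candidates = (
        ("cpufreq/base_frequency", 1000),  # tried first; assumed to be in kHz
        ("acpi_cppc/nominal_freq", 1),     # fallback; assumed to be in MHz
    )
    for relpath, divisor in candidates:
        try:
            with open(os.path.join(cpu_dir, relpath)) as f:
                return int(f.read().strip()) // divisor
        except (OSError, ValueError):
            continue  # missing or unreadable: try the next source
    return None
```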
bgoglin added a commit that referenced this issue Sep 26, 2024
cpufreq/base_frequency is only available on Intel so far, and works well.

acpi_cppc/nominal_freq is already available on AMD (and ARM or soon),
so it's likely good for the future.
However it reports incorrect values on Intel SPR and MTL at least.

Hence try cpufreq/base_frequency first,
then fallback to acpi_cppc/nominal_freq.

Refs #690

Signed-off-by: Brice Goglin <[email protected]>
(cherry picked from commit 2292110)