Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

balloons: add support for cpu frequency governor tuning #374

Merged
merged 1 commit into from
Oct 7, 2024

Conversation

fmuyassarov
Copy link
Collaborator

This commit introduces a new field in the balloons configuration CR, enabling users to specify CPU frequency governor mode preferences.

This commit introduces a new field in the balloons configuration CR,
enabling users to specify CPU frequency governor mode preferences.

Signed-off-by: Feruzjon Muyassarov <[email protected]>
@fmuyassarov
Copy link
Collaborator Author

@askervin @klihub PTAL. Note that I didn't add a commit for the docs assuming that there might be some changes with the naming or something else. What you guys are happy with the change I will add a commit to update the documentation on this same PR.

@klihub klihub requested review from klihub, askervin and kad October 4, 2024 12:52
Copy link
Collaborator

@klihub klihub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me. I skimmed through this once, and if I got the naming wrt. config hierarchy right, this looks fine to me.

My only question is that since ATM we can't test this functionality in our e2e test VMs, have you tested this on real HW in practice ?

@fmuyassarov
Copy link
Collaborator Author

LGTM to me. I skimmed through this once, and if I got the naming wrt. config hierarchy right, this looks fine to me.

My only question is that since ATM we can't test this functionality in our e2e test VMs, have you tested this on real HW in practice ?

Yes, I have tested it on my own machine and it works as expected. Fro example, I was checking the updated governor mode of a core belonging to a balloon by simply
cat /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor

Copy link
Collaborator

@askervin askervin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work, @fmuyassarov!

@askervin
Copy link
Collaborator

askervin commented Oct 7, 2024

I really wish we had some way to test this in our e2e tests, too. For instance, if we have idleCPUClass defined and no balloons (except for the reserved an possibly the default) to start with, we should have CPUs configured with idle class frequencies and governor.

@fmuyassarov, if this runs in a vm, do we get log error messages about failing to set a governor for a CPU? From those errors we could verify that we at least tried to do something sensible.

Another reason why this should have an e2e test is that we wish to see what happens when someone uses a balloons configuration with a governor in a vm. After all, it's very much possible that the policy runs in a vm, and might (possibly accidentally) have CPU classes defined, too.

@klihub
Copy link
Collaborator

klihub commented Oct 7, 2024

I really wish we had some way to test this in our e2e tests, too. For instance, if we have idleCPUClass defined and no balloons (except for the reserved an possibly the default) to start with, we should have CPUs configured with idle class frequencies and governor.

@fmuyassarov, if this runs in a vm, do we get log error messages about failing to set a governor for a CPU? From those errors we could verify that we at least tried to do something sensible.

Another reason why this should have an e2e test is that we wish to see what happens when someone uses a balloons configuration with a governor in a vm. After all, it's very much possible that the policy runs in a vm, and might (possibly accidentally) have CPU classes defined, too.

One possibility is to use similar plumbing than we use for testing controller-agnostic hook triggers in balloons/n4c16/test21-controller-check.

That would require a few changes to the cpu controller when ENABLE_TEST_APIS is set in the environment:

  • add/set cpuctl.testing when ENABLE_TEST_APIS is set
  • when testing, always claim uncore control is available
  • when testing, don't return an error from enforce{uncore,Cpufreq,CpufreqGovernor}
  • when testing, register a /cpu-control-state HTTP endpoint for querying controller state

In the handler for the HTTP endpoint you dump something like this as JSON

type state struct {
     Classes map[string]Class
     Assignments map[string]idset.ID // might be easier to have map[string]string instead
}

Then in the tests, we use this endpoint just like we now do for querying metrics, but we fetch data from /cpu-control-state instead of /metrics. Since the exposed data is JSON in a known format controlled by us, we can verify the expected test result with jq queries.

@klihub klihub self-requested a review October 7, 2024 08:07
@fmuyassarov
Copy link
Collaborator Author

@fmuyassarov, if this runs in a vm, do we get log error messages about failing to set a governor for a CPU? From those errors we could verify that we at least tried to do something sensible.

I can verify it. My previous tests were on physical machine.

@fmuyassarov
Copy link
Collaborator Author

Then in the tests, we use this endpoint just like we now do for querying metrics, but we fetch data from /cpu-control-state instead of /metrics. Since the exposed data is JSON in a known format controlled by us, we can verify the expected test result with jq queries.

As we discussed offline, let's keep it in mind and I will try to allocate some time for setting up something similar that you described here.

@klihub klihub merged commit 8a8d920 into containers:main Oct 7, 2024
4 checks passed
@fmuyassarov fmuyassarov deleted the c-states branch October 7, 2024 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants