
Use actual node resource utilization in the strategy "LowNodeUtilization" #225

Closed
zhiyxu opened this issue Feb 3, 2020 · 47 comments · Fixed by #1555
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@zhiyxu

zhiyxu commented Feb 3, 2020

Currently, pods' resource requests are used to compute node resource utilization in the "LowNodeUtilization" strategy. Would it be more rational to use actual node resource utilization as the criterion?

It is common for a pod's resource limit to be larger than its request, so after the default scheduler places pods (based on resource requests), the cluster probably looks balanced. But a pod's actual resource usage can be much larger than its request, which may put some nodes under pressure.

So would it be more reasonable to integrate with the metrics server and use actual node resource utilization?
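
For illustration, here is a rough Go sketch (not descheduler code; it assumes metrics-server is installed and uses the standard metrics.k8s.io client) of reading actual node usage and comparing it to allocatable capacity:

package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Load kubeconfig the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	coreClient := kubernetes.NewForConfigOrDie(config)
	metricsClient := metricsclient.NewForConfigOrDie(config)

	// Listing node metrics fails if metrics-server is not installed.
	nodeMetrics, err := metricsClient.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, nm := range nodeMetrics.Items {
		node, err := coreClient.CoreV1().Nodes().Get(context.TODO(), nm.Name, metav1.GetOptions{})
		if err != nil {
			continue
		}
		// Compare measured usage against allocatable, not against requests.
		usedCPU := nm.Usage[v1.ResourceCPU]
		allocCPU := node.Status.Allocatable[v1.ResourceCPU]
		pct := float64(usedCPU.MilliValue()) / float64(allocCPU.MilliValue()) * 100
		fmt.Printf("%s: %.1f%% CPU actually in use\n", nm.Name, pct)
	}
}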

@seanmalloy
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 3, 2020
@seanmalloy
Member

seanmalloy commented Feb 3, 2020

@zhiyxu it looks like you are not the first person to request this feature. See the discussions in #123, #118, and #7. Based on the discussions in those issues, it looks like the descheduler LowNodeUtilization strategy still uses requests because this aligns with how the k8s scheduler works. Also, this feature is mentioned in the roadmap.

@damemi @aveshagarwal @ravisantoshgudimetla has anything changed recently to enable the k8s scheduler to use real load metrics during scheduling? For example, could the new scheduler framework somehow enable this feature in the scheduler? Maybe a custom plugin using the scheduler framework could be created to take real load metrics into account?
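
As a rough illustration of that last idea, a score plugin skeleton might look like this (purely hypothetical; loadFor stands in for a metrics lookup that does not exist as a real API, and the framework import path matches recent Kubernetes releases):

package realloadscore

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// RealLoadScore is a hypothetical score plugin that prefers nodes whose
// measured utilization is lower.
type RealLoadScore struct{}

var _ framework.ScorePlugin = &RealLoadScore{}

func (pl *RealLoadScore) Name() string { return "RealLoadScore" }

// Score maps lower observed CPU utilization to a higher node score.
func (pl *RealLoadScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	utilizationPct := loadFor(nodeName) // hypothetical 0-100 metrics lookup
	return framework.MaxNodeScore - int64(utilizationPct), nil
}

func (pl *RealLoadScore) ScoreExtensions() framework.ScoreExtensions { return nil }

// loadFor is a placeholder; a real plugin would query the metrics API or a
// dedicated load watcher.
func loadFor(nodeName string) float64 { return 0 }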

@zhiyxu
Author

zhiyxu commented Feb 14, 2020

@seanmalloy @ravisantoshgudimetla @damemi @aveshagarwal Any update or plan about this feature?

@kangtiann

+1, we need this feature too.

Can we make a PR for this?

@seanmalloy @ravisantoshgudimetla @damemi @aveshagarwal

@seanmalloy
Member

seanmalloy commented Feb 25, 2020

@zhiyxu and @kangtiann here are my initial thoughts on what the API spec might look like. Please let me know what you think. I'm pretty confident the v1alpha2 LowNodeUtilization strategy will need to be adjusted.

I believe it would be a good idea to write a proposal for this and have SIG scheduling review it.

Create a new v1alpha1 LowNodeAllocation strategy. This strategy will work identically to the request-based v1alpha1 LowNodeUtilization strategy. The use of the word allocation is inspired by the discussion in #7.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeAllocation":
     enabled: true
     params:
       nodeResourceAllocatedThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds:
           "cpu" : 50
           "memory": 50
           "pods": 50

Create a new v1alpha2 LowNodeUtilization strategy. This strategy will get data from the metrics API to evict pods from nodes based on actual node utilization metrics. The below proposed YAML API spec is a rough draft and will need to be refined.

The HPA supports custom metrics. Does the descheduler need to support custom metrics too?

Keep in mind that the k8s scheduler does not take actual node utilization into account when scheduling pods. Pods evicted by this strategy could end up being scheduled on the same node again. Maybe this strategy could be paired with a yet to be created out of tree scheduler plugin that takes node utilization into account when scheduling pods. See the discussions in #123 and #118.

apiVersion: "descheduler/v1alpha2".   # Bump to v1alpha2
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
         targetThresholds:
           "cpu" : 50
           "memory": 50

@seanmalloy
Member

> Can we make a PR for this?

@kangtiann just want to clarify are you willing to implement this and submit a PR with the required code changes?

@seanmalloy
Member

Also, keep in mind that the kubelet will evict pods when a node starts running out of memory or disk, https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-signals.

@damemi
Contributor

damemi commented Feb 25, 2020

Would definitely like to get more feedback from the scheduling SIG on the feasibility of this. Getting "actual" pod usage has been a tricky problem I've personally hit trying to debug flaky e2es, and I'm not totally caught up on the current state of getting that info.

However, I like @seanmalloy's proposal because we already have this strategy that uses resource requests, which users may desire/prefer/expect, but I don't think it would require an entirely new strategy. I think a simple boolean on the current strategy to flip between spec resources and "actual" resources would be less confusing in code and usage.
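
For example, the existing params could grow a single flag; the field name useActualUtilization below is hypothetical, not an existing descheduler option:

package api

// NodeResourceUtilizationThresholds sketches one possible shape for the
// boolean idea; the threshold fields mirror today's strategy params.
type NodeResourceUtilizationThresholds struct {
	Thresholds       map[string]int `json:"thresholds"`
	TargetThresholds map[string]int `json:"targetThresholds"`
	// UseActualUtilization is hypothetical: false keeps request-based
	// accounting; true reads node usage from the metrics API instead.
	UseActualUtilization bool `json:"useActualUtilization,omitempty"`
}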

@zhiyxu
Author

zhiyxu commented Feb 28, 2020

@seanmalloy The proposal is great, and there are some further details to consider:

  1. Creating a v1alpha1 version of the LowNodeAllocation strategy to replace LowNodeUtilization is a good idea, but it will result in two resource types in v1alpha1 that have exactly the same effect, which would be a little confusing. Would it be better to create these two resource types directly in v1alpha2? Of course, LowNodeUtilization would then no longer be backward compatible anyway.

  2. @damemi Is it possible that customers would want to use both the LowNodeAllocation and LowNodeUtilization strategies simultaneously? Maybe these two strategies are not mutually exclusive.

  3. To make LowNodeUtilization take effect in v1alpha2, we need to build a scheduler framework plugin whose policy takes real-time node utilization into account. The customer then needs to either run a scheduler that includes the plugin as a pod in the cluster and change the spec.schedulerName of the relevant pods to that scheduler's name, or, more directly, replace the original kube-scheduler with the new scheduler containing the plugin so that it affects all pods in the cluster. Either way would greatly increase the difficulty of using the project.

  4. If the Metrics API is not installed in the customer's cluster, real-time node metrics can't be gathered, and this strategy would be completely useless (see the availability-check sketch after this list).

  5. Custom extended resources are another issue we need to consider further.
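
On point 4, a minimal availability check could let the strategy disable itself (or fall back to request-based accounting) when metrics-server is absent. A sketch using the standard discovery client:

package metricscheck

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// metricsAPIAvailable reports whether the cluster serves the
// metrics.k8s.io/v1beta1 group/version that metrics-server registers.
func metricsAPIAvailable(config *rest.Config) bool {
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return false
	}
	_, err = dc.ServerResourcesForGroupVersion("metrics.k8s.io/v1beta1")
	return err == nil
}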

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 28, 2020
@seanmalloy
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2020
@seanmalloy
Member

seanmalloy commented Aug 27, 2020

There was a recent proposal at the SIG Scheduling meeting to add a scheduler plugin to take real load metrics into account during scheduling.

https://docs.google.com/presentation/d/13tleXxfPHRnW_-desRTzOZwpRDJX5u4MPlnQxNs15IU/edit#slide=id.g8fcfb6bb75_2_14

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2020
@seanmalloy
Member

Here is the KEP document for Real Load Aware Scheduling: https://docs.google.com/document/d/1ffBpzhqELmhqJxdGMzYzIOoigxn3J0zlP1_nie34f9s/edit#

@seanmalloy
Member

Updated KEP document for Real Load Aware Scheduling:
kubernetes-sigs/scheduler-plugins#61

@pgiles

pgiles commented Oct 2, 2020

After evaluating Descheduler, we are very hopeful it will help us rebalance our clusters. However, we cannot move forward until this feature is implemented. In short, +1 for this feature request and we'll check back often to see when it is released. Thank you!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2020
@damemi
Contributor

damemi commented Jan 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@stefkkkk

stefkkkk commented Jan 22, 2021

Any updates on this, please? It would be a very useful feature.

@damemi
Contributor

damemi commented Jan 22, 2021

@Stefik95 the linked enhancements around real load aware scheduling are still being worked on (mainly in the scheduler-plugins repo, under the "Trimaran" name).

It was mentioned above, but getting actual pod consumption relies on access to the metrics api. To move forward with this, we should look into what we need to be able to access those metrics from within descheduler (and fallbacks/disable when those metrics aren't available). Any help with this step is welcome, it would likely follow a similar pattern to Trimaran's metrics collection.

As a side note, there were also metrics recently added to report the scheduler's "observed" usage based on limits/requests for administrators to compare to real usage (kubernetes/enhancements#1916 and kubernetes/kubernetes#94866). This is intended to help admins optimize their requests and limits to better reflect actual values.

@stefkkkk

stefkkkk commented Jan 22, 2021

> @Stefik95 the linked enhancements around real load aware scheduling are still being worked on (mainly in the scheduler-plugins repo, under the "Trimaran" name). […]

Thanks for the answer! Could you please tell me whether this is true: at the moment, LowNodeUtilization works on the requests that were set when the pod was deployed, not on changes to the pod's requests over time?

@damemi damemi pinned this issue Jan 27, 2022
@damemi
Contributor

damemi commented Jan 27, 2022

Pinning this issue as it is a common request

See also #225, #437, #270, #118, #90, #702

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2022
@damemi
Contributor

damemi commented May 2, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@metost

metost commented Jul 31, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2022
@robertchgo

This would be a really useful feature for us. Are there any updates on this?

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 4, 2022
@damemi
Contributor

damemi commented Nov 4, 2022

@robertchgo Not at the moment. A few people have offered to implement it, as discussed above, but there has been no progress so far. With other ongoing work, this is a backlog feature right now.

/lifecycle frozen

@binacs
Member

binacs commented Mar 11, 2023

Hello everyone! I have an MR (#1087) that tries to solve this problem, and I look forward to everyone's review comments to make it better.

I hope it helps.

@joenzx

joenzx commented Sep 14, 2023

This feature would be really useful. When will it be available?
