Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: optimize scheduling condition semantics #3741

Merged
merged 1 commit into from
Jul 18, 2023

Conversation

whitewindmills
Copy link
Member

@whitewindmills whitewindmills commented Jun 30, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #3586

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-scheduler`: Introduced new schedule condition reasons, NoClusterFit, SchedulerError, Unschedulable, Success.

@codecov-commenter
Copy link

codecov-commenter commented Jun 30, 2023

Codecov Report

Merging #3741 (6eb8f9a) into master (9bac51d) will decrease coverage by 0.94%.
The diff coverage is 23.52%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##           master    #3741      +/-   ##
==========================================
- Coverage   56.61%   55.68%   -0.94%     
==========================================
  Files         221      225       +4     
  Lines       20831    21335     +504     
==========================================
+ Hits        11794    11880      +86     
- Misses       8413     8822     +409     
- Partials      624      633       +9     
Flag Coverage Δ
unittests 55.68% <23.52%> (-0.94%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/scheduler/core/generic_scheduler.go 0.00% <0.00%> (ø)
pkg/scheduler/scheduler.go 17.88% <0.00%> (-0.12%) ⬇️
pkg/scheduler/core/division_algorithm.go 85.71% <100.00%> (ø)
pkg/scheduler/core/util.go 75.40% <100.00%> (+0.83%) ⬆️
pkg/scheduler/helper.go 93.54% <100.00%> (+1.71%) ⬆️

... and 9 files with indirect coverage changes

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign

pkg/scheduler/core/util.go Outdated Show resolved Hide resolved
@XiShanYongYe-Chang
Copy link
Member

/assign

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 6, 2023
@whitewindmills
Copy link
Member Author

@jwcesign
Copy link
Member

jwcesign commented Jul 7, 2023

LGTM

Copy link
Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot~
/lgtm

Ask @Garrybest to help take a look.
/cc @Garrybest

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2023
@XiShanYongYe-Chang
Copy link
Member

Hi @whitewindmills, can you help update the release note more clearly, for example, add the new Reason name? It will be more intuitive when the version is released.

@whitewindmills
Copy link
Member Author

Hi @whitewindmills, can you help update the release note more clearly, for example, add the new Reason name? It will be more intuitive when the version is released.

sure

@Garrybest
Copy link
Member

/assign

return result, &framework.FitError{
NumAllClusters: clusterInfoSnapshot.NumOfClusters(),
Diagnosis: diagnosis,
return result, &NoClusterError{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a new error type here.

scheduleSuccessMessage = "Binding has been scheduled"
scheduleSuccessReason = "BindingScheduled"
scheduleFailedReason = "BindingFailedScheduling"
noClusterAvailableReason = "NoClusterAvailable"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This two types is confusing, what's the difference?

I'd like to refer to what kubernetes does.

  1. We can't find any clusters that pass the filter plugins.
    message: 0/1 clusters are available: 1 cluster(s) didn't match the placement cluster affinity constraint.
    reason: Unschedulable
    status: "False"
    type: BindingScheduled
  1. we can't assign replicas due to not enough resources.
    message: failed to assignReplicas: xxxxxx
    reason: Unschedulable
    status: "False"
    type: BindingScheduled

This two types will cover the unschedulable condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NoCluster is for the scenarios without clusters. As discussed before, we need to distinguish between scenarios with and without clusters. For scenarios without clusters, retry scheduling actually does not make any sense, but from our requirements, we need to regard it as a successful scheduling. If we introduce Unschedulable, how to define it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you enlighten me about what is without clusters? Do you mean no clusters pass the filter plugin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, and no member cluster

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For scenarios without clusters, retry scheduling actually does not make any sense

What if a new cluster is added, or a taint is removed, should we retry?

I don't think this is a successful scheduling. For kubernetes, a pod is not scheduled successfully unless it is truely assigned to a node. We can easily distinguish no-cluster availablity and scheduler internal error by reason. Unschedulable is a typical reason for no more clusters fit or no more enough resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. But I guess no suitable cluster is a Unschedulable one and internal error is a SchedulerError one, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course you are right. can we seperate the Unschedulable reason into more precise ones? maybe we don't need Unschedulable but seperate it into specific reasons, like NoClusterAvailable, NoClusterFit and InsufficientResources etc. And the internal error is still SchedulerError. how do you feel about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with it. But I'm not sure the difference between NoClusterAvailable and NoClusterFit. Can we merge them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can name it NoClusterFit. but for NoClusterFit, I think we do not need to return error to retry scheduling, cause it makes no sense and resources will be scheduled when new member cluster is joined or scheduling requirements are met.
Let's summarize, it looks like this:

  1. Unschedulable: no cluster fit
message: no clusters available to schedule
reason: NoClusterFit
status: "False"
type: Scheduled
  1. Unschedulable: insufficient resources
message: cluster xxx has no sufficient memory
reason: InsufficientResources
status: "False"
type: Scheduled
  1. Internal error: eg. can not access api-server...
message: xxx timeout
reason: SchedulerError
status: "False"
type: Scheduled
  1. schedule successfully
message: Binding has been scheduled
reason: BindingScheduled
status: "True"
type: Scheduled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Now we have reached an agreement.

@karmada-bot karmada-bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2023
@XiShanYongYe-Chang
Copy link
Member

Ask an again review from @Garrybest

@whitewindmills
Copy link
Member Author

I slightly adjusted the results of our previous discussions. I think NoClusterFit should be separated from Unschedulable due to its particularity. For NoClusterFit, we do not need to return error to retry scheduling. Now it looks like the following.

    message: Binding has been scheduled successfully.
    reason: Success
    status: "True"
    type: Scheduled
    message: 0/3 clusters are available: 3 cluster(s) didn't match the placement
      cluster affinity constraint.
    reason: NoClusterFit
    status: "False"
    type: Scheduled
    message: Clusters available replicas 3 are not enough to schedule.
    reason: Unschedulable
    status: "False"
    type: Scheduled
    message: failed to select clusters.
    reason: SchedulerError
    status: "False"
    type: Scheduled

/cc @Garrybest @XiShanYongYe-Chang

pkg/apis/work/v1alpha2/binding_types.go Show resolved Hide resolved
pkg/scheduler/scheduler.go Show resolved Hide resolved
pkg/scheduler/core/util.go Outdated Show resolved Hide resolved
pkg/scheduler/helper.go Outdated Show resolved Hide resolved
pkg/scheduler/scheduler.go Show resolved Hide resolved
@whitewindmills
Copy link
Member Author

PTAL~ @Garrybest

@whitewindmills whitewindmills changed the title feat: add no cluster schedule condition reason feat: optimize scheduling condition semantics Jul 17, 2023
@Garrybest
Copy link
Member

Good job.

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2023
@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Garrybest

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023
@karmada-bot karmada-bot merged commit 71de164 into karmada-io:master Jul 18, 2023
11 checks passed
@whitewindmills whitewindmills deleted the schedule-condition branch July 18, 2023 02:48
@XiShanYongYe-Chang
Copy link
Member

Good job!
Thanks~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Confusing schedule result condition
7 participants