[YUNIKORN-2879] [shim] yunikorn unschedulable pods pending forever #929

Open: 1 commit to be merged into base branch master

@zj619 zj619 commented Oct 18, 2024

  • task postfail or rejected reschedule

What is this PR for?

When a task fails, unschedulable pods in YuniKorn stay pending forever. This PR introduces a retry so that such tasks are rescheduled.

What type of PR is it?

  • - Bug Fix
  • [√] - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

Issue associated with this PR: https://issues.apache.org/jira/browse/YUNIKORN-2879

How should this be tested?

test cases

Screenshots (if appropriate)

The task fails and is retried:
[image]
After the retry, the bind succeeds:
[image]

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

craigcondit commented Oct 18, 2024

I'm -1 on this whole approach. It's far too complex and introduces unnecessary per-pod configuration. We're trying to simplify the shim these days and this drastically complicates it.

At the very least this needs a design doc first.

@pbacsko pbacsko self-requested a review October 18, 2024 07:43

@pbacsko pbacsko left a comment

-1 also.

@zhuqi-lucas is already working on something like this, see YUNIKORN-2804.

This change alone is not even enough: we need to notify the core to cancel the allocation first, and only then can the whole allocation cycle restart. We don't want an un-bindable pod occupying resources on the core side.
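
The ordering described above (release the allocation on the core side before starting a new allocation cycle) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Core`, `Binder`, `ScheduleWithRetry`, and `errBindRejected` are hypothetical names invented for the sketch and are not part of the YuniKorn shim or core API.

```go
package main

import (
	"errors"
	"fmt"
)

// errBindRejected is a hypothetical error used to simulate a failed bind.
var errBindRejected = errors.New("bind rejected")

// Core is a hypothetical stand-in for the scheduler core. It only tracks
// which tasks currently hold an allocation (and therefore resources).
type Core struct {
	allocated map[string]bool
}

func NewCore() *Core { return &Core{allocated: make(map[string]bool)} }

func (c *Core) Allocate(taskID string)   { c.allocated[taskID] = true }
func (c *Core) Release(taskID string)    { delete(c.allocated, taskID) }
func (c *Core) Holds(taskID string) bool { return c.allocated[taskID] }

// Binder is a hypothetical callback that tries to bind a task's pod.
type Binder func(taskID string) error

// ScheduleWithRetry runs up to maxAttempts allocation cycles. When a bind
// fails, the allocation is released in the core BEFORE the next cycle, so a
// task that cannot be bound never keeps resources reserved.
// It returns the number of cycles that were run.
func ScheduleWithRetry(core *Core, bind Binder, taskID string, maxAttempts int) (int, error) {
	if maxAttempts < 1 {
		return 0, errors.New("maxAttempts must be at least 1")
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		core.Allocate(taskID) // the core reserves resources for the task
		if lastErr = bind(taskID); lastErr == nil {
			return attempt, nil // bound: the allocation stays in place
		}
		core.Release(taskID) // cancel the allocation before retrying
	}
	return maxAttempts, fmt.Errorf("task %s: bind failed after %d attempts: %w",
		taskID, maxAttempts, lastErr)
}

func main() {
	core := NewCore()
	calls := 0
	// Simulated binder that fails on the first call and succeeds afterwards.
	flaky := func(string) error {
		calls++
		if calls == 1 {
			return errBindRejected
		}
		return nil
	}
	attempts, err := ScheduleWithRetry(core, flaky, "task-1", 3)
	fmt.Println(attempts, err, core.Holds("task-1")) // prints: 2 <nil> true
}
```

In this sketch, a task whose bind keeps failing ends with no allocation held in the core, which is the property the comment above asks for.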

pbacsko commented Oct 18, 2024

Also, next time:

  1. Please add a short description to "What is this PR for?" and remove the template text.
  2. Insert the link to the upstream JIRA ticket properly.

Thanks.
