Let bsub retry on LSF message "Request from non-LSF host rejected" #9195

Merged
merged 7 commits into from
Nov 13, 2024

Conversation

berland
Contributor

@berland berland commented Nov 12, 2024

Issue
Backport to Ert 11.0 for #9185

Approach
🍒

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure unit tests pass locally after every commit (git rebase -i main --exec 'pytest tests/ert/unit_tests -n logical -m "not integration_test"')

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@berland berland self-assigned this Nov 12, 2024
@berland berland added bug release-notes:bug-fix Automatically categorise as bug fix in release notes labels Nov 12, 2024
JHolba and others added 2 commits November 12, 2024 15:09
New version is incompatible with our current code
The LSF driver experiences crashes stemming from bsub returning with the
error message 'Request from non-LSF host rejected'. There are reasons to
believe this is not a permanent error, but some flakiness in the IP
infrastructure, and thus it should be categorized as a retriable
failure.

The reason for believing this is flakiness is mostly the fact that
the same error is also seen on 'bjobs' calls. If it were a permanent
failure scenario, there would be an enormous number of errors from these
bjobs calls, but there is not.
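
As an illustration of the idea in the commit message above, here is a minimal sketch of treating a specific error message as a retriable failure. It is not the code from this PR: the names `RETRIABLE_MESSAGES`, `is_retriable` and `run_with_retries`, the `max_attempts` and `delay` parameters, and the injectable `runner` are all assumptions made for the example. Only the message text is taken from the commit message.

```python
import subprocess
import time

# Only this message comes from the commit message; the tuple itself is hypothetical.
RETRIABLE_MESSAGES = ("Request from non-LSF host rejected",)


def is_retriable(stderr: str) -> bool:
    """Return True if stderr contains a message we consider transient."""
    return any(message in stderr for message in RETRIABLE_MESSAGES)


def run_with_retries(cmd, max_attempts=3, delay=1.0, runner=subprocess.run):
    """Run cmd, retrying only when it fails with a retriable message.

    `runner` defaults to subprocess.run and can be swapped out for testing.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result  # success, nothing to retry
        if not is_retriable(result.stderr):
            return result  # permanent failure, give up immediately
        if attempt < max_attempts:
            time.sleep(delay)  # wait before the next attempt
    return result  # still failing after the last attempt
```

Errors that do not match a known transient message are returned immediately, so a genuinely permanent failure is not masked by repeated attempts.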
Contributor

@andreas-el andreas-el left a comment


🍒

A kill window of 1 second is not enough on real-life test nodes.
And add some explanation for further debugging
@berland berland enabled auto-merge (rebase) November 13, 2024 09:05
@berland berland merged commit 6d859ee into equinor:version-11.0 Nov 13, 2024
35 checks passed
Labels
bug release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
Status: Done

3 participants