Assume non-LSF host error is flaky
The LSF driver experiences crashes stemming from bsub returning with the
error message 'Request from non-LSF host rejected'. There are reasons to
believe this is not a permanent error but some flakiness in the IP
infrastructure, and thus it should be categorized as a retriable
failure.

The main reason for believing this is flakiness is that the same error
is also seen on 'bjobs' calls. If it were a permanent failure scenario,
there would be an enormous number of errors from these bjobs calls, but
there is not.
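
The effect of narrowing BSUB_FAILURE_MESSAGES can be illustrated with a
minimal sketch. It assumes the driver decides by substring matching on the
bsub output, which the diff suggests but does not show. The function
classify_bsub_failure and the sample outputs are hypothetical; only the two
tuples are taken from the diff.

```python
# Old and new tuples as they appear in the diff of lsf_driver.py.
OLD_BSUB_FAILURE_MESSAGES = ("Job not submitted",)
NEW_BSUB_FAILURE_MESSAGES = (
    "Error in rusage section",
    "Expeced number, string",
    "No such queue",
    "Too many processors requested",
    "cannot be used in the resource requirement section",
    "duplicate section",
)


def classify_bsub_failure(output: str, failure_messages: tuple[str, ...]) -> str:
    """Hypothetical helper: label a failed bsub call.

    Assumption: a failure is permanent only if the output contains one of
    the known failure messages; anything else is treated as retriable.
    """
    if any(message in output for message in failure_messages):
        return "permanent"
    return "retriable"


# Illustrative output; the exact wording printed by bsub is an assumption.
non_lsf_host = "Request from non-LSF host rejected. Job not submitted."
bad_rusage = "Error in rusage section. Job not submitted."

old_verdict = classify_bsub_failure(non_lsf_host, OLD_BSUB_FAILURE_MESSAGES)
new_verdict = classify_bsub_failure(non_lsf_host, NEW_BSUB_FAILURE_MESSAGES)
rusage_verdict = classify_bsub_failure(bad_rusage, NEW_BSUB_FAILURE_MESSAGES)
```

Under this assumption the catch-all "Job not submitted" would have marked
every rejected submission as permanent, while the explicit list leaves the
non-LSF host error open for a retry and still rejects genuine user errors.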
berland committed Nov 12, 2024
1 parent 7d6025e commit 404a7ea
Showing 2 changed files with 10 additions and 2 deletions.
9 changes: 8 additions & 1 deletion src/ert/scheduler/lsf_driver.py
@@ -94,7 +94,14 @@ class RunningJob:
 LSF_INFO_JSON_FILENAME = "lsf_info.json"
 FLAKY_SSH_RETURNCODE = 255
 JOB_ALREADY_FINISHED_BKILL_MSG = "Job has already finished"
-BSUB_FAILURE_MESSAGES = ("Job not submitted",)
+BSUB_FAILURE_MESSAGES = (
+    "Error in rusage section",
+    "Expeced number, string",
+    "No such queue",
+    "Too many processors requested",
+    "cannot be used in the resource requirement section",
+    "duplicate section",
+)


 def _parse_jobs_dict(jobs: Mapping[str, JobState]) -> dict[str, AnyJob]:
3 changes: 2 additions & 1 deletion tests/unit_tests/scheduler/test_lsf_driver.py
@@ -578,7 +578,6 @@ async def test_that_bsub_will_retry_and_fail(
 " '&' cannot be used in the resource requirement section. Job not submitted.",
 ),
 (255, "Error in rusage section. Job not submitted."),
-(255, "Job not submitted."),
 ],
 )
 async def test_that_bsub_will_fail_without_retries(
@@ -604,6 +603,8 @@ async def test_that_bsub_will_fail_without_retries(
 [
 (0, "void"),
 (FLAKY_SSH_RETURNCODE, ""),
+(0, "Request from non-LSF host rejected"),
+(FLAKY_SSH_RETURNCODE, "Request from non-LSF host rejected"),
 ],
 )
 async def test_that_bsub_will_retry_and_succeed(
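
The retry-and-succeed behaviour that these test cases exercise can be
sketched roughly as follows. This is not the driver's code:
submit_with_retries, make_fake_bsub, the retry count and the sample outputs
are all hypothetical, and the sketch assumes that return code 0 means
success and that a known failure message stops retrying at once.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Subset of the tuple from the diff, enough for the illustration.
BSUB_FAILURE_MESSAGES = ("Error in rusage section", "No such queue")


async def submit_with_retries(
    run_bsub: Callable[[], Awaitable[tuple[int, str]]],
    retries: int = 3,
    delay: float = 0.0,
) -> tuple[bool, str]:
    """Hypothetical retry loop around one bsub invocation."""
    output = ""
    for _ in range(retries):
        returncode, output = await run_bsub()
        if returncode == 0:
            return True, output
        if any(message in output for message in BSUB_FAILURE_MESSAGES):
            # Known permanent failure: retrying cannot help.
            return False, output
        # Anything else is assumed to be flaky: wait, then try again.
        await asyncio.sleep(delay)
    return False, output


def make_fake_bsub(responses: list[tuple[int, str]]):
    """Build a stand-in for bsub that replays canned (returncode, output) pairs."""
    remaining = iter(responses)

    async def fake_bsub() -> tuple[int, str]:
        return next(remaining)

    return fake_bsub


flaky = make_fake_bsub([(255, "Request from non-LSF host rejected"), (0, "ok")])
flaky_result = asyncio.run(submit_with_retries(flaky))

broken = make_fake_bsub([(255, "No such queue. Job not submitted.")])
broken_result = asyncio.run(submit_with_retries(broken))
```

In this sketch the first submission recovers on its second attempt, while
the second one fails immediately without using up the remaining retries.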