Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: stream_logs_by_id incorrectly handles task retry logic #4250

Open
andylizf opened this issue Nov 2, 2024 · 4 comments · May be fixed by #4407
Open

Bug: stream_logs_by_id incorrectly handles task retry logic #4250

andylizf opened this issue Nov 2, 2024 · 4 comments · May be fixed by #4407

Comments

@andylizf
Copy link
Contributor

andylizf commented Nov 2, 2024

The recent commit 46ea0d8 that adds task retry functionality misplaced the retry logic in stream_logs_by_id. The retry check is currently in the wrong branch of the if-else structure:

Current placement:

if returncode == 0:
    if job_status != job_lib.JobStatus.CANCELLED:
        if task_id < num_tasks - 1 and follow:
            # handle next task
        else:  # <- retry logic incorrectly placed here
            task_specs = managed_job_state.get_task_specs(...)
            # retry logic

Should be:

if returncode == 0:
    if job_status != job_lib.JobStatus.CANCELLED:
        if task_id < num_tasks - 1 and follow:
            # handle next task
        else:
            break
else:  # <- retry logic should be here, handling cluster failures
    # check task_specs and handle retry
# old cluster failed logic

This bug prevents proper handling of task retries as it's checking for retries in the wrong scenario (when task completes normally) instead of when task/cluster fails.

          Seems the retry logic is added in the wrong branch. It's currently in the `else` branch of `if task_id < num_tasks - 1 and follow`, which means it only triggers when we want to terminate. The retry check should be in the outer `else` branch where we handle cluster failures.

Originally posted by @andylizf in #4169 (comment)

@andylizf
Copy link
Contributor Author

andylizf commented Nov 2, 2024

@cblmemo PTAL, thanks!

@andylizf
Copy link
Contributor Author

andylizf commented Nov 2, 2024

@Michaelvll I noticed a potential issue with the retry logic placement in stream_logs_by_id. The current implementation puts it in the else branch of task_id < num_tasks - 1 and follow, which seems incorrect as it only handles the last task or non-follow case.

I think it should log retries when tasks actually fail (i.e., in the outer else branch where we handle cluster failures). However, @cblmemo pointed out that the fix might not be as simple as just moving the code block, since there are two cases that can trigger retry: user program failure and log tailing failure. The outer else branch only handles the latter case.

I might be missing something in my understanding of the task retry mechanism. Could you please take a look and verify if this is indeed an issue? cc @romilbhardwaj

@Michaelvll
Copy link
Collaborator

That's a good catch @andylizf! We may need to move the code path for handling the restarts before the place we handle task_id < num_tasks - 1 with some special handling. Would you be able to help dig into it @andylizf?

@andylizf
Copy link
Contributor Author

Thanks for pointing that out! I'll look into it tomorrow.

@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024
@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants