Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asynchronous job submission and removal #610

Merged
merged 2 commits into from
Aug 30, 2023
Merged

Conversation

jrueb
Copy link
Contributor

@jrueb jrueb commented Jul 12, 2023

This solution is using asyncio.create_subprocess_exec and similar to what was done in PR #473.
However, this other PR lacked an update job removal, which is included here.

The test_jobqueue_job_call required an overhaul to make it compatible with the async code. Its form is basically the same as the other async tests now.

In issue #470 another solution using run_in_executor was proposed. I consider the solution with create_subprocess_exec more rigorous though, because with create_subprocess_exec the code is fully using the async API, instead of wrapping non-async functionality in a run_in_executor call.

This solution is using asyncio.create_subprocess_exec
@jrueb
Copy link
Contributor Author

jrueb commented Jul 12, 2023

I have removed the weakref.finalize call here, which was one way to remove jobs at program exit. This is because finalize interferes with the fact that _close_job is a coroutine now. At program exit, Distributed already takes care of job removal here
https://github.com/dask/distributed/blob/8e0c79bb82b5d9e3f4f9538935520da960588df6/distributed/deploy/spec.py#L692C1-L692C1
In fact, before this PR, jobs were removed twice, which made the program exit unnecessary lengthy.

There is one issue left however: As you can see in the linked Distributed code, there is a timeout of 10 seconds. This is not enough to remove much more than 350 jobs (at least in my tests with HTCondor). One may want to ask to increase this timeout.

@guillaumeeb
Copy link
Member

@jacobtomlinson could you take a quick look at this? I'm not familiar enough with asyncio to give a good advice.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me thanks @jrueb.

I agree that the 10 second timeout may be a little short for HPC jobs. I think we should probably make a change in distirbuted that makes that timeout configurable based on a class attribute. Then we can control that attribute in dask-jobqueue. Do you have any interest in making that PR also?

@guillaumeeb
Copy link
Member

Hi @jrueb, sorry again for the delay... It looks there is formatting problem in the source code, which sounds a little weird considering you didn't touch some of these files...

Could you check flake8 and black output?

@jrueb
Copy link
Contributor Author

jrueb commented Aug 3, 2023

I adjusted the 8 files black was complaining about (removal of blank lines). The PR did not touch the files before though, meaning these blank lines where there before.

@jrueb
Copy link
Contributor Author

jrueb commented Aug 4, 2023

It is unclear what the CI's issue is. The black of pre-commit run --all-files is passing for me and besides core.py this PR does not touch any of these files.

@guillaumeeb
Copy link
Member

The PR did not touch the files before though, meaning these blank lines where there before.

this PR does not touch any of these files.

Yeah, I know, you are totally right. I won't block here because CI fixing is unrelated to your changes. I shouldn't have asked you to fix it in the first place. Sorry it messes up a bit your changes.

Would you prefer to revert your last commit? Or are you OK that I merge as is?

@jrueb
Copy link
Contributor Author

jrueb commented Aug 4, 2023

Would you prefer to revert your last commit? Or are you OK that I merge as is?

Either is fine for me. If it's the easiest to merge as is, I'm okay with that.

@jrueb
Copy link
Contributor Author

jrueb commented Aug 29, 2023

Is anything else needed for this PR?

@guillaumeeb guillaumeeb merged commit ef45c0e into dask:main Aug 30, 2023
7 of 9 checks passed
@guillaumeeb
Copy link
Member

Sorry for the lack of follow-up, I'll merge this one as is, as we opened an issue to fix the CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants