
Make combine_results fail rather than stall upon exception with Dask #39

Merged
@lathanh merged 1 commit into gcp from lathanh/pp-dask-fail on Dec 1, 2023

Conversation

@lathanh (Member) commented on Dec 1, 2023

Background: `dask.compute(map(concat_partial, ...` (in `postprocessing.combine_results()`) can fail[1], and currently such an exception is somehow handled in a way that doesn't kill the script. Instead, the script hangs until the Cloud Run job times out (currently 24h).

Also, this error occurs non-deterministically: with the particular configuration I was testing, rerunning the exact same command/config repeatedly fails about 30% of the time.

Change: Catch the exception and fail fast (i.e., call sys.exit(1)).

[1] With an error like "distributed.scheduler.KilledWorker: Attempted to run task concat_and_normalize-56525f85-b55a-49e9-97bf-bc2bf253757e on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:43341. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html."
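
For reference, a minimal sketch of the fail-fast behavior described above: catch the exception raised by `dask.compute` and call `sys.exit(1)`. The function signature, the `groups` argument, and the surrounding structure are illustrative assumptions, not the actual `combine_results()` implementation in buildstockbatch.

```python
# Sketch only: simplified stand-in for the relevant part of combine_results().
import logging
import sys

import dask

logger = logging.getLogger(__name__)


def combine_results(concat_partial, groups):
    """Combine per-group results, exiting immediately if the Dask computation fails."""
    # e.g. a list of dask.delayed objects built from concat_partial
    tasks = list(map(concat_partial, groups))
    try:
        results = dask.compute(*tasks)
    except Exception:
        # Previously a KilledWorker (or similar) exception left the script hanging
        # until the Cloud Run job timed out (24h); log and fail fast instead.
        logger.exception("Dask computation failed while combining results; exiting.")
        sys.exit(1)
    return results
```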

@lathanh added the bug (Something isn't working) label on Dec 1, 2023
@lathanh requested a review from @nweires on December 1, 2023 00:46
@lathanh self-assigned this on Dec 1, 2023
github-actions (bot) commented on Dec 1, 2023

| File | Coverage |
| --- | --- |
| All files | 86% |
| base.py | 90% |
| eagle.py | 77% |
| exc.py | 57% |
| local.py | 70% |
| postprocessing.py | 83% |
| utils.py | 91% |
| cloud/docker_base.py | 87% |
| sampler/base.py | 79% |
| sampler/downselect.py | 33% |
| sampler/precomputed.py | 93% |
| sampler/residential_quota.py | 61% |
| test/test_docker.py | 33% |
| test/test_validation.py | 97% |
| workflow_generator/base.py | 90% |
| workflow_generator/commercial.py | 53% |
| workflow_generator/residential_hpxml.py | 86% |

Minimum allowed coverage is 33%

Generated by 🐒 cobertura-action against 2c178be

@nweires (Collaborator) left a comment


A bit of poking around makes me wonder if this is a Dask issue. (Something like this, maybe?)

(Review thread on buildstockbatch/postprocessing.py, resolved)
@lathanh (Member, Author) commented on Dec 1, 2023

> A bit of poking around makes me wonder if this is a Dask issue. (Something like this, maybe?)

I think it's both: there's a Dask issue in that it can stall like that, but there's also an issue at a higher level with the exception getting swallowed. I'm not sure how to solve either (yet), but I at least want to get something in that addresses the impact of these issues.

@lathanh merged commit a9675af into gcp on Dec 1, 2023 (6 of 7 checks passed)
@lathanh deleted the lathanh/pp-dask-fail branch on December 1, 2023 19:30