
Make combine_results fail rather than stall upon exception with Dask #39

Merged
@lathanh merged 1 commit into gcp from lathanh/pp-dask-fail on Dec 1, 2023

Conversation

@lathanh (Member) commented on Dec 1, 2023

Background: `dask.compute(map(concat_partial, ...` (in `postprocessing.combine_results()`) can fail[1], and currently such an exception is somehow handled in a way that doesn't kill the script. Instead, the script hangs until the Cloud Run job times out (currently 24h).

Also, this error occurs non-deterministically: with the particular configuration I was testing, rerunning the exact same command/config repeatedly fails about 30% of the time.

Change: Catch the exception and fail fast (i.e., call sys.exit(1)).

[1] With an error like "distributed.scheduler.KilledWorker: Attempted to run task concat_and_normalize-56525f85-b55a-49e9-97bf-bc2bf253757e on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:43341. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html."
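
For reference, a minimal sketch of the fail-fast behavior described above: catch the exception raised by `dask.compute` and call `sys.exit(1)`. The function signature, the `groups` argument, and the surrounding structure are illustrative assumptions, not the actual `combine_results()` implementation in buildstockbatch.

```python
# Sketch only: simplified stand-in for the relevant part of combine_results().
import logging
import sys

import dask

logger = logging.getLogger(__name__)


def combine_results(concat_partial, groups):
    """Combine per-group results, exiting immediately if the Dask computation fails."""
    # e.g. a list of dask.delayed objects built from concat_partial
    tasks = list(map(concat_partial, groups))
    try:
        results = dask.compute(*tasks)
    except Exception:
        # Previously a KilledWorker (or similar) exception left the script hanging
        # until the Cloud Run job timed out (24h); log and fail fast instead.
        logger.exception("Dask computation failed while combining results; exiting.")
        sys.exit(1)
    return results
```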

@lathanh added the bug (Something isn't working) label on Dec 1, 2023
@lathanh requested a review from @nweires on December 1, 2023 00:46
@lathanh self-assigned this on Dec 1, 2023
github-actions (bot) commented on Dec 1, 2023

| File | Coverage |
| --- | --- |
| All files | 86% |
| base.py | 90% |
| eagle.py | 77% |
| exc.py | 57% |
| local.py | 70% |
| postprocessing.py | 83% |
| utils.py | 91% |
| cloud/docker_base.py | 87% |
| sampler/base.py | 79% |
| sampler/downselect.py | 33% |
| sampler/precomputed.py | 93% |
| sampler/residential_quota.py | 61% |
| test/test_docker.py | 33% |
| test/test_validation.py | 97% |
| workflow_generator/base.py | 90% |
| workflow_generator/commercial.py | 53% |
| workflow_generator/residential_hpxml.py | 86% |

Minimum allowed coverage is 33%

Generated by 🐒 cobertura-action against 2c178be

@nweires (Collaborator) left a comment


A bit of poking around makes me wonder if this is a Dask issue. (Something like this, maybe?)

(Review thread on buildstockbatch/postprocessing.py, resolved)
@lathanh (Member, Author) commented on Dec 1, 2023

> A bit of poking around makes me wonder if this is a Dask issue. (Something like this, maybe?)

I think it's both: there's a Dask issue in that it can stall like that, but there's also an issue at a higher level with the exception getting swallowed. I'm not sure how to solve either (yet), but I at least want to get something in that addresses the impact of these issues.

@lathanh merged commit a9675af into gcp on Dec 1, 2023 (6 of 7 checks passed)
@lathanh deleted the lathanh/pp-dask-fail branch on December 1, 2023 19:30