Make combine_results fail rather than stall upon exception with Dask #39
Background:
The call `dask.compute(map(concat_partial, ...))` (in `postprocessing.combine_results()`) can fail[1], and currently such an exception is handled in a way that doesn't kill the script. Instead, the script hangs until the Cloud Run job times out (currently 24h). Also, this error can occur non-deterministically: with the particular configuration I was testing, rerunning the exact same command/config repeatedly fails about 30% of the time.
Change: Catch the exception and fail fast (i.e., call `sys.exit(1)`).

[1] With an error like "distributed.scheduler.KilledWorker: Attempted to run task concat_and_normalize-56525f85-b55a-49e9-97bf-bc2bf253757e on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:43341. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html."
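For reference, a minimal sketch of the change, assuming `combine_results` wraps the `dask.compute` call directly; the argument names (`partitions`) and exact structure in `postprocessing.py` are placeholders, not the real signature:

```python
import logging
import sys

import dask

logger = logging.getLogger(__name__)


def combine_results(partitions, concat_partial):
    """Sketch: fail fast instead of hanging if dask.compute raises."""
    try:
        # The real call maps concat_partial over the partitions and
        # computes the resulting delayed objects.
        results = dask.compute(*map(concat_partial, partitions))
    except Exception:
        # e.g. distributed.scheduler.KilledWorker; log the traceback and
        # exit non-zero so the Cloud Run job fails immediately instead of
        # idling until the 24h timeout.
        logger.exception("Dask computation failed in combine_results; exiting.")
        sys.exit(1)
    return results
```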