Try to minimize time spent in `xarray.open_zarr` after SDAP startup. We should run `open_zarr` for each new dataset in the webapp driver at least once upon discovery to ensure validity. This is an especially prevalent issue with Spark: the Spark workers would open ALL datasets every time they were given a task, which could introduce severe performance penalties.

This PR will:

- Defer `open_zarr`: either make it lazy (open on use), or run it asynchronously (use a future). A sketch of both options follows this list.
### Note 1 - Spark Algs

Despite my efforts, I could not find a way to make this behavior automatic. Some manual steps need to be taken in the Spark algorithm definition. Fortunately, these are fairly simple, and if they are done incorrectly or something goes wrong, the old behavior should be used as a fallback.
1. Call `NexusTileService.save_to_spark`, passing the `SparkContext` object (obtained from the `NexusCalcSparkHandler` or from the SDAP `webservice.nexus_tornado.app_builders.SparkContextBuilder.SparkContextBuilder.get_spark_context()` method) and all the datasets that will be worked with.
2. Create a `NexusTileService` instance from the provided `tile_service_factory` with the kwargs `spark=True, collections=[...]`, where the `collections` kwarg is a list of all the dataset names saved in step 1.
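A sketch of what these two steps might look like inside a Spark algorithm's `calc` method. The argument order of `save_to_spark` and attribute names such as `self._sc` and `self._tile_service_factory` are assumptions inferred from the steps above, not verified SDAP API:

```python
from webservice.algorithms_spark.NexusCalcSparkHandler import NexusCalcSparkHandler
from nexustiles.nexustiles import NexusTileService


class MySparkAlg(NexusCalcSparkHandler):
    def calc(self, compute_options, **args):
        # Every dataset this algorithm will work with (the list for step 1).
        datasets = ['dataset_a', 'dataset_b']

        # Step 1: save the datasets to Spark using the handler's SparkContext
        # (SparkContextBuilder.get_spark_context() could be used instead).
        NexusTileService.save_to_spark(self._sc, datasets)

        # Step 2: create the tile service from the provided factory, naming
        # the collections saved in step 1 so workers reuse the saved datasets
        # instead of reopening every zarr store per task.
        tile_service = self._tile_service_factory(
            spark=True, collections=datasets
        )

        # ... run the Spark job with tile_service as usual ...
```

If step 1 is skipped or the names don't match, the `collections` lookup finds nothing saved and the old open-per-task behavior applies as the fallback.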