
Improve performance of SDAP-517 #11

Open · wants to merge 8 commits into base: SDAP-517-daac-creds

Conversation


@RKuttruff RKuttruff commented Jul 17, 2024

Try to minimize time spent in xarray.open_zarr after SDAP startup. We should still run open_zarr for each new dataset in the webapp driver at least once upon discovery to ensure validity. This is an especially prevalent issue with Spark: the workers would open ALL datasets every time they were given a task, which could introduce severe performance penalties.

This PR will:

  • Share the opened dataset(s) required for Spark algorithms with the workers. (done: note 1)
  • Minimize the time needed for Zarr backend updates
    • Avoid the regular blocking call to open_zarr: either make it lazy (open on use) or run it asynchronously (use a future)
      • Implemented the lazy open, with checks to ensure it is always performed in the driver (see the sketch after this list)
    • Credential rotation can stay as is with respect to blocking behavior
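
The following is a minimal sketch of the lazy-open idea described above: defer xarray.open_zarr until a dataset is first accessed, then cache the handle. The class and attribute names here (LazyZarrDataset, etc.) are hypothetical and not taken from the SDAP codebase.

import threading

import xarray as xr


class LazyZarrDataset:
    """Defers the expensive xarray.open_zarr call until the dataset is first used."""

    def __init__(self, store_path, open_kwargs=None):
        self._store_path = store_path
        self._open_kwargs = open_kwargs or {}
        self._ds = None
        self._lock = threading.Lock()

    @property
    def dataset(self):
        # Open on first access only; later accesses reuse the cached handle.
        if self._ds is None:
            with self._lock:
                if self._ds is None:
                    self._ds = xr.open_zarr(self._store_path, **self._open_kwargs)
        return self._ds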

Note 1 - Spark Algs

Despite my efforts, I could not find a way to make this behavior automatic; there are some manual steps that need to be taken in the Spark algorithm definition. They are fortunately fairly simple and, if done incorrectly or if something goes wrong, the old behavior should be used as a fallback.

  1. The Spark driver code should invoke NexusTileService.save_to_spark, passing the SparkContext object (obtained from the NexusCalcSparkHandler or from the SDAP webservice.nexus_tornado.app_builders.SparkContextBuilder.SparkContextBuilder.get_spark_context() method) and all the datasets that will be worked with.
  2. The executor code should get its NexusTileService instance from the provided tile_service_factory with the kwargs spark=True, collections=[...], where the collections kwarg is a list of all the dataset names saved in step 1.
import itertools
from functools import partial

def spark_driver(tiles, ds1, ds2, tile_service_factory, sc, spark_nparts=1):
  # Step 1: share the opened datasets with the Spark workers.
  NexusTileService.save_to_spark(sc, ds1, ds2)

  # Pair each tile with the dataset names so executors know which collections to request.
  tiles_spark = [(tile, ds1, ds2) for tile in tiles]

  rdd = sc.parallelize(tiles_spark, spark_nparts)
  results = rdd.flatMap(partial(calc_executor, tile_service_factory)).collect()
  results = list(itertools.chain.from_iterable(results))
  return results

def calc_executor(tile_service_factory, spark_tile):
  tile, ds1, ds2 = spark_tile

  # Step 2: get a tile service that reuses the datasets saved in the driver.
  tile_service = tile_service_factory(spark=True, collections=[ds1, ds2])

  # Do work

  return result
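
For context on how driver-opened datasets can be shared with Spark workers (the first bullet of this PR), the sketch below shows the standard PySpark broadcast-variable pattern. It is only an illustration: the function names and the {name: dataset} mapping are assumptions, not the actual NexusTileService.save_to_spark implementation.

from functools import partial

def run_with_shared_datasets(sc, opened_datasets, tiles, work_fn, nparts=1):
    """Broadcast a {name: dataset} mapping once, then run work_fn over the tiles."""
    shared = sc.broadcast(opened_datasets)   # shipped to each executor once
    rdd = sc.parallelize(tiles, nparts)
    return rdd.map(partial(work_fn, shared)).collect()

def example_work_fn(shared, tile):
    """Runs on an executor; reuses the broadcast datasets instead of reopening them."""
    datasets = shared.value                  # deserialized lazily on the executor
    # ... compute something for `tile` using `datasets` ...
    return tile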

rileykk added 4 commits July 17, 2024 07:47
We may end up being able to walk back some of these. The dask update did what I wanted, but I updated a bunch of other deps while trying to figure that out. Xarray's dependencies are somewhat complicated, so it may be best to leave the deps as-is unless something is breaking.
For Spark, ensure the dataset is opened before saving it to HDFS
@RKuttruff RKuttruff marked this pull request as ready for review July 22, 2024 18:25