xr.Dataset as core data structure for FieldSet #1796

VeckoTheGecko · 2024-12-13T16:03:07Z

The first commit to Parcels was in September 2015 (almost 10 years ago now). Parcels has implemented FieldSet, Field, and Grid to manage the GCM data (in arrays) that is needed for computation.

With the addition of features such as deferred loading and chunking of General Circulation Model (GCM) data, Parcels has accumulated technical debt through this bespoke code. This is particularly evident in fieldfilebuffer.py, grid.py, and the .computeTimeChunk() methods (as well as in the general length of the methods in the field.py and fieldset.py classes). Our current data structures also make the code difficult to adapt (#1772 (comment)).

There has previously been some discussion from @rabernat about relying more on xarray for Parcels internals. It's important to read that discussion to fully understand this issue.

We have already implemented (1) and added a FieldSet.from_xarray() method; however, because of the structure of Parcels, this is computed eagerly, which limits its usage to in-memory datasets. We haven't yet implemented (2) and (3), but I do think that it is possible. That is, we can refactor Parcels internals such that xr.Dataset objects are the core data structure that we work with, operating on them lazily until the arrays themselves are actually needed. I also think it would be possible to do this without changing our approach to JIT mode. This issue is limited in scope to FieldSet loading, though I do also think that we can make better use of xarray for incremental writing of particle data.

As this is a major change in Parcels internals with several (positive) side-effects, this will be a version 4 change. This is also great timing as v4 will look at providing unstructured grid support using uxarray.

I propose that we use xarray and xgcm as follows:

- Rely on xr.Dataset for time-varying field data
- Rely on xgcm.Grid (which uses xr.Dataset under the hood) for GCM grid data and geometry

likely resulting in the following change to the public API:

ds = xr.Dataset(...)
grid = xgcm.Grid(...)

FieldSet(ds, grid, use_fields = ["U", "V", ...]) # That's it! No other kwargs
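To make the proposed constructor more concrete, here is a minimal sketch of what a thin wrapper around an xr.Dataset could look like. Everything in it is an assumption for illustration: the class name FieldSetSketch, its sample() method, and the synthetic dataset are hypothetical and not part of Parcels, and the grid argument is kept as an opaque placeholder rather than a real xgcm.Grid.

```python
from dataclasses import dataclass
from typing import Any

import numpy as np
import xarray as xr


@dataclass
class FieldSetSketch:
    """Hypothetical stand-in for the proposed FieldSet(ds, grid, use_fields=...)."""

    ds: xr.Dataset
    grid: Any  # would be an xgcm.Grid in the proposal; kept opaque in this sketch
    use_fields: list[str]

    def __post_init__(self):
        # Fail early if a requested field is not in the dataset
        missing = [name for name in self.use_fields if name not in self.ds.data_vars]
        if missing:
            raise KeyError(f"Fields not found in dataset: {missing}")
        # Keep only the requested variables; this selects, it does not compute
        self.ds = self.ds[self.use_fields]

    def sample(self, name, **coords):
        """Nearest-neighbour lookup; array values are only read at this point."""
        return float(self.ds[name].sel(coords, method="nearest").values)


# Synthetic in-memory dataset standing in for GCM output
time = np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[ns]")
lat = np.linspace(-1.0, 1.0, 3)
lon = np.linspace(0.0, 3.0, 4)
shape = (time.size, lat.size, lon.size)
dims = ("time", "lat", "lon")
ds = xr.Dataset(
    {
        "U": (dims, np.full(shape, 0.5)),
        "V": (dims, np.zeros(shape)),
        "T": (dims, np.full(shape, 10.0)),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

fieldset = FieldSetSketch(ds, grid=None, use_fields=["U", "V"])
u = fieldset.sample("U", time=time[0], lat=0.1, lon=1.2)
```

Any subsetting, renaming, or unit conversion would happen on ds before it is handed to the wrapper, which is what keeps the constructor free of extra kwargs.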

This approach has the following benefits:

- For free from xarray:
  - support for all datasets or objects that xarray supports (see Xarray: Reading and writing files), including NetCDF (3 and 4), Zarr, OPeNDAP, and HDF5, as well as non-local datasets (e.g., stored in a bucket)
  - free indexing (we would need to check its performance)
- For free from xgcm:
  - support for different grid types
  - grid-aware interpolation
  - discovery of grid type from metadata (not yet released; I'm not sure of its status since xgcm hasn't had a release in 2 years)
- Clearer separation of responsibility:
  - field or grid manipulation should be handled by the user before creating the FieldSet, by operating directly on the xr.Dataset objects
- Potential for better integration with CF conventions:
  - U and V fields can be discovered automatically by inspecting metadata (assuming the GCM output has metadata in line with CF conventions)
- Cleaner test suite:
  - we can work on providing several fixtures for the different datasets that we want to support
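As a rough illustration of the CF-based discovery mentioned above, the sketch below looks up velocity variables by their standard_name attribute. The function discover_velocity_fields and the variable names uo, vo, and thetao are hypothetical; the two velocity standard names come from the CF standard name table, and real GCM output may not carry them.

```python
import numpy as np
import xarray as xr

# CF standard names for the horizontal ocean velocity components,
# mapped to the role each would play in a FieldSet
CF_VELOCITY_NAMES = {
    "eastward_sea_water_velocity": "U",
    "northward_sea_water_velocity": "V",
}


def discover_velocity_fields(ds):
    """Map 'U'/'V' to variable names in ds using the standard_name attribute."""
    found = {}
    for var_name, da in ds.data_vars.items():
        role = CF_VELOCITY_NAMES.get(da.attrs.get("standard_name"))
        if role is not None:
            found[role] = var_name
    return found


# Synthetic dataset with CF-style metadata (variable names are made up)
dims = ("lat", "lon")
ds = xr.Dataset(
    {
        "uo": (dims, np.zeros((2, 3)), {"standard_name": "eastward_sea_water_velocity"}),
        "vo": (dims, np.zeros((2, 3)), {"standard_name": "northward_sea_water_velocity"}),
        "thetao": (dims, np.zeros((2, 3)), {"standard_name": "sea_water_potential_temperature"}),
    }
)

velocity_vars = discover_velocity_fields(ds)
```

Variables without a matching standard_name are simply ignored, so a user could still pass explicit field names when the metadata is incomplete.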

This will result in a smaller codebase and test suite to maintain as we would:

- rely on stable data structures provided by xarray and xgcm
- not need code for deferred loading
- not need code for file path handling/discovery
- not need configuration dictionaries and sanitization
- simplify the FieldSet API
  - Many of the kwargs in FieldSet() and its classmethods would no longer be relevant. allow_time_extrapolation and time_periodic can be moved to the execution of the particleset, where I feel they would be more at home.

These changes will likely result in the outright deletion of field.py, grid.py, gridset.py, and fieldfilebuffer.py.

These changes introduce an increased reliance on xarray and on xgcm (a newly added dependency). I am not concerned about depending on xgcm despite it being a new package. It looks to be actively maintained, with a large overlap of users with xarray. It also looks to be quite a small package that defers a lot of responsibility to xarray, which is great. As it is an actively maintained package that is in v0 at the moment, with API changes, I think it would be worth pinning its version until its API settles. With our increased reliance on xarray and xgcm, core Parcels developers would need to stay up to date with their releases so that we can make sure that we're making the most of them.

With these changes, it would be good to get an accurate picture of their impact on performance (which is a subject for #1761).

I will start experimenting with this focussing on Scipy mode. Since there is a lot of coupling between FieldSet, Field, Grid and the rest of the codebase, I'll just start my experiments in a notebook for now.


erikvansebille · 2024-12-16T08:06:30Z

Sounds like a very good plan, @VeckoTheGecko; I'm excited to see if we can make this major refactor work for v4. It would definitely reduce the maintenance burden if we shift a lot of code/responsibility to xarray and xgcm, and also make it simpler for new users to get started!

VeckoTheGecko added this to the Parcels 4 milestone Dec 13, 2024

github-project-automation bot added this to Parcels development Dec 13, 2024

github-project-automation bot moved this to Backlog in Parcels development Dec 13, 2024

VeckoTheGecko added coding/Python enhancement cleanup Cleaning up legacy code labels Dec 13, 2024
