xr.Dataset as core data structure for FieldSet #1796

Open
VeckoTheGecko opened this issue Dec 13, 2024 · 1 comment

VeckoTheGecko commented Dec 13, 2024

The first commit to Parcels was in September 2015 (almost 10 years ago now). Parcels has implemented FieldSet, Field, and Grid to manage the General Circulation Model (GCM) data (in arrays) that is needed for computation.

With the addition of features such as deferred loading and chunking of GCM data, Parcels has accumulated technical debt through this bespoke code. This is particularly evident in fieldfilebuffer.py, grid.py, and the .computeTimeChunk() methods, as well as in the general length of the methods in field.py and fieldset.py. Our current data structures also make it difficult to adapt (#1772 (comment)).

There has previously been some discussion from @rabernat about relying more on xarray for Parcels internals. It's important to read that discussion to fully understand this issue.

We have already implemented (1) and added a FieldSet.from_xarray() method; however, this is eagerly computed due to the structure of Parcels, which limits its usage to in-memory datasets. We haven't yet implemented (2) and (3), but I do think that it is possible: that is, to refactor Parcels internals such that xr.Dataset objects are the core data structure that we work with, operating on these lazily until the arrays themselves are actually needed. I also think it would be possible to do this without changing our approach to JIT mode. This issue is limited in scope to FieldSet loading, though I do also think that we can make better use of xarray for incremental writing of particle data.

As this is a major change in Parcels internals with several (positive) side-effects, this will be a version 4 change. This is also great timing as v4 will look at providing unstructured grid support using uxarray.


I propose that we use xarray and xgcm as follows:

  • Rely on xr.Dataset for time-varying field data
  • Rely on xgcm.Grid (which uses xr.Dataset under the hood) for GCM grid data and geometry

likely resulting in the following change to the public API:

ds = xr.Dataset(...)
grid = xgcm.Grid(...)

FieldSet(ds, grid, use_fields=["U", "V", ...])  # That's it! No other kwargs
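
To make the shape of that constructor concrete, here is a minimal sketch of what such a wrapper could look like. It is not the Parcels API: `FieldSetSketch` is a hypothetical class, the dataset is synthetic, and `grid` is left as an opaque placeholder rather than a real `xgcm.Grid`.

```python
from dataclasses import dataclass
from typing import Any, Sequence

import numpy as np
import xarray as xr


@dataclass
class FieldSetSketch:
    """Hypothetical stand-in for the proposed FieldSet; not the Parcels API."""

    ds: xr.Dataset
    grid: Any  # would be an xgcm.Grid in the proposal; left opaque here
    use_fields: Sequence[str]

    def __post_init__(self) -> None:
        # Fail early if a requested field is not a variable of the dataset.
        missing = [name for name in self.use_fields if name not in self.ds.data_vars]
        if missing:
            raise KeyError(f"fields not found in dataset: {missing}")
        # Keep only the requested variables.
        self.ds = self.ds[list(self.use_fields)]

    def __getitem__(self, name: str) -> xr.DataArray:
        return self.ds[name]


# Synthetic data standing in for GCM output.
time = np.arange(3)
lat = np.linspace(-1.0, 1.0, 4)
lon = np.linspace(0.0, 2.0, 5)
shape = (time.size, lat.size, lon.size)
ds = xr.Dataset(
    {
        "U": (("time", "lat", "lon"), np.ones(shape)),
        "V": (("time", "lat", "lon"), np.zeros(shape)),
        "T": (("time", "lat", "lon"), np.full(shape, 10.0)),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

fieldset = FieldSetSketch(ds, grid=None, use_fields=["U", "V"])
```

Any subsetting, renaming, or unit conversion would happen on `ds` before it is handed over, which is why the sketch needs no further keyword arguments.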

This approach has the following benefits:

  • for free from xarray
    • support for all datasets or objects that xarray supports (see Xarray: Reading and writing files), including NetCDF (3 and 4), Zarr, OPeNDAP, and HDF5, as well as non-local datasets (e.g., stored in a bucket).
    • free indexing (would need to see about its performance)
  • for free from xgcm
    • support for different grid types
    • grid aware interpolation
    • discovery of grid type from metadata (not yet released - not sure what the status on that is since they haven't had a release in 2 years)
  • clearer separation of responsibility
    • field or grid manipulation should be handled by the user before creating the FieldSet by operating directly on the xr.Dataset objects
  • potential for better integration with CF conventions
    • U and V fields can be automatically discovered by inspecting metadata (assuming GCM output has metadata in line with CF conventions)
  • Cleaner test suite
    • We can work on providing several fixtures for the different datasets that we want to support

This will result in a smaller codebase and test suite to maintain as we would:

  • rely on stable data structures provided by xarray and xgcm
  • not need code for deferred loading
  • not need code for file path handling/discovery
  • not need configuration dictionaries and sanitization
  • simplify FieldSet API
    • A lot of the kwargs in FieldSet() and its classmethods would no longer be relevant. allow_time_extrapolation and time_periodic can be moved to the execution of the ParticleSet, where I feel they would be more at home.
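
One way to picture moving `allow_time_extrapolation` out of the FieldSet is to pass it at sampling time instead. The sketch below is only an assumption about how that could look: `sample_at_time` is a hypothetical helper, and nearest-neighbour selection stands in for whatever temporal interpolation Parcels would actually use.

```python
import numpy as np
import xarray as xr


def sample_at_time(
    da: xr.DataArray, t: float, allow_time_extrapolation: bool = False
) -> xr.DataArray:
    """Return the field at the time step nearest to t (hypothetical helper)."""
    t_min = float(da["time"].min())
    t_max = float(da["time"].max())
    if not (t_min <= t <= t_max) and not allow_time_extrapolation:
        raise ValueError(f"time {t} is outside the field's range [{t_min}, {t_max}]")
    # Nearest-neighbour selection clamps to the first/last step outside the range.
    return da.sel(time=t, method="nearest")


# Synthetic field whose value equals the index of the time step (0, 1, 2).
u = xr.DataArray(
    np.arange(3.0).reshape(3, 1, 1) * np.ones((3, 2, 2)),
    dims=("time", "lat", "lon"),
    coords={"time": [0.0, 10.0, 20.0]},
)
inside = sample_at_time(u, 12.0)
clamped = sample_at_time(u, 35.0, allow_time_extrapolation=True)
```

With the flag living at the call site, the dataset itself stays a plain `xr.Dataset` with no Parcels-specific configuration attached.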

These changes will likely result in the outright deletion of field.py, grid.py, gridset.py, and fieldfilebuffer.py.

These changes introduce an increased reliance on xarray and (newly added dependency) xgcm. I am not concerned about depending on xgcm despite it being a new package. It looks to be actively maintained, with a large overlap of users with xarray. It also looks to be quite a small package, deferring a lot of responsibility to xarray, which is great. As it is an actively maintained package and is in v0 at the moment with API changes, I think it would be worth pinning its version until its API settles. With our increased reliance on xarray and xgcm, core Parcels developers would need to stay up to date with their releases so that we make the most of them.

With these changes, it would be good to get an accurate picture of their impact on performance (which is a subject for #1761).


I will start experimenting with this, focussing on Scipy mode. Since there is a lot of coupling between FieldSet, Field, Grid and the rest of the codebase, I'll just start my experiments in a notebook for now.

@erikvansebille
Member

Sounds like a very good plan, @VeckoTheGecko; I'm excited to see if we can make this major refactor work for v4. It would definitely reduce the maintenance burden if we shift a lot of code/responsibility to xarray and xgcm, and also make it simpler for new users to get started!
