The first commit to Parcels was in September 2015 (almost 10 years ago now). Parcels has implemented `FieldSet`, `Field`, and `Grid` to manage the General Circulation Model (GCM) data (in arrays) that is needed for computation.
With the addition of features such as deferred loading and chunking of GCM data, Parcels has accumulated technical debt through this bespoke code. This is particularly evident in `fieldfilebuffer.py`, `grid.py`, and the `.computeTimeChunk()` methods (and the general length of the methods in the `field.py` and `fieldset.py` classes). Our current data structures also make it difficult to adapt (#1772 (comment)).
There has previously been some discussion from @rabernat about relying more on xarray for Parcels internals. It's important to read that discussion to fully understand this issue.
We have already implemented (1) and added a `FieldSet.from_xarray()` method; however, due to the structure of Parcels this is eagerly computed, which limits its usage to in-memory datasets. We haven't yet implemented (2) and (3), but I do think it is possible: that is, refactoring Parcels internals such that `xr.Dataset` objects are the core data structure we work with, operating on them lazily until the arrays themselves are actually needed. I also think this could be done without changing our approach to JIT mode. This issue is limited in scope to FieldSet loading, though I also think we can make better use of xarray for incremental writing of particle data (a rough sketch of what that could look like is below).
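As an aside on that last point, here is a minimal, non-authoritative sketch of incremental output writing via xarray's Zarr backend. The store path, variable names, and the `trajectory`/`obs` layout are assumptions for illustration, not the current ParticleFile format (and it requires the `zarr` package).

```python
import numpy as np
import xarray as xr

store = "particle_output.zarr"  # illustrative path, not a real Parcels default

# First batch of particle observations: writing with mode="w" creates the store.
first = xr.Dataset(
    {
        "lon": (("trajectory", "obs"), np.zeros((10, 1))),
        "lat": (("trajectory", "obs"), np.zeros((10, 1))),
    }
)
first.to_zarr(store, mode="w")

# Later batches are appended along the observation dimension, so output can grow
# incrementally during execution without rewriting what is already on disk.
later = xr.Dataset(
    {
        "lon": (("trajectory", "obs"), np.ones((10, 1))),
        "lat": (("trajectory", "obs"), np.ones((10, 1))),
    }
)
later.to_zarr(store, mode="a", append_dim="obs")
```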
As this is a major change to Parcels internals with several (positive) side effects, this will be a version 4 change. This is also great timing, as v4 will look at providing unstructured grid support using `uxarray`.
I propose that we rely on the following from `xarray` and `xgcm`:
- `xr.Dataset` for time-varying field data
- `xgcm.Grid` (which uses `xr.Dataset` under the hood) for GCM grid data and geometry
likely resulting in the following change to the public API:
ds = xr.Dataset(...)
grid = xgcm.Grid(...)
FieldSet(ds, grid, use_fields=["U", "V", ...])  # That's it! No other kwargs
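To make the intent concrete, here is a minimal sketch of how this could look from the user's side, assuming the proposed (not yet existing) `FieldSet(ds, grid, use_fields=...)` signature; the file pattern, coordinate names, and axis setup are illustrative and depend on the actual GCM output.

```python
import xarray as xr
import xgcm

from parcels import FieldSet  # proposed v4 signature, used here as a sketch only

# Open the GCM output lazily; no array data is read into memory yet.
ds = xr.open_mfdataset("model_output_*.nc", chunks={"time": 1})

# Describe the staggered grid to xgcm (axis and position names are assumptions).
grid = xgcm.Grid(
    ds,
    coords={
        "X": {"center": "lon_c", "left": "lon_f"},
        "Y": {"center": "lat_c", "left": "lat_f"},
    },
    periodic=["X"],
)

# Hand both over to Parcels; selecting which fields to use is the only extra argument.
fieldset = FieldSet(ds, grid, use_fields=["U", "V"])
```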
This approach has the following benefits:
- for free from xarray:
  - support for all file formats and data sources that xarray supports (see Xarray: Reading and writing files), including NetCDF (3 and 4), Zarr, OPeNDAP, and HDF5, as well as non-local datasets (e.g., stored in a bucket); see the sketch after this list
  - indexing for free (we would need to check its performance)
- for free from xgcm:
  - support for different grid types
  - grid-aware interpolation
  - discovery of grid type from metadata (not yet released; I'm not sure of its status, since xgcm hasn't had a release in 2 years)
- clearer separation of responsibility:
  - field or grid manipulation should be handled by the user before creating the FieldSet, by operating directly on the `xr.Dataset` objects
- potential for better integration with CF conventions:
  - U and V fields can be automatically discovered by inspecting metadata (assuming the GCM output has metadata in line with CF conventions); see the sketch after this list
- cleaner test suite:
  - we can work on providing several fixtures for the different datasets that we want to support
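To illustrate the two points above that reference this sketch, here is a rough, non-authoritative example of opening a remote dataset lazily, doing any field manipulation on the `xr.Dataset` before Parcels sees it, and discovering velocity components from CF metadata. The bucket URL, variable names, and standard-name mapping are assumptions (and reading from a bucket requires the matching filesystem backend, e.g. gcsfs or s3fs).

```python
import xarray as xr

# Lazily open a Zarr store in a bucket; nothing is downloaded until values are needed.
ds = xr.open_zarr("gs://some-bucket/gcm-output.zarr")

# Any field or grid manipulation happens on the xr.Dataset *before* creating the
# FieldSet, e.g. subsetting a time range (still lazy).
ds = ds.sel(time=slice("2020-01-01", "2020-12-31"))

# A rough idea of CF-based discovery: map CF standard names to Parcels field roles.
CF_VELOCITY_NAMES = {
    "sea_water_x_velocity": "U",
    "sea_water_y_velocity": "V",
}
discovered = {}
for var, da in ds.data_vars.items():
    std_name = da.attrs.get("standard_name")
    if std_name in CF_VELOCITY_NAMES:
        discovered[CF_VELOCITY_NAMES[std_name]] = var

print(discovered)  # e.g. {"U": "uo", "V": "vo"}, depending on the dataset's metadata
```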
This will result in a smaller codebase and test suite to maintain as we would:
- rely on stable data structures provided by xarray and xgcm
- not need code for deferred loading
- not need code for file path handling/discovery
- not need configuration dictionaries and sanitization
- simplify the FieldSet API
  - a lot of the kwargs in `FieldSet()` and its classmethods would no longer be relevant; `allow_time_extrapolation` and `time_periodic` can be moved to the execution of the particleset, where I feel they would be more at home (see the sketch below)
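For illustration only, a hypothetical sketch of what moving those two kwargs to execution could look like; the exact placement and names are an open design question, not a settled API (v3-style class names are used here, and `fieldset` refers to the earlier sketch).

```python
from datetime import timedelta

from parcels import AdvectionRK4, ParticleSet, ScipyParticle  # v3-style names, for illustration

pset = ParticleSet(fieldset=fieldset, pclass=ScipyParticle, lon=[0.0], lat=[0.0])

# Hypothetical v4 usage: time-handling options live on execution, not on FieldSet.
pset.execute(
    AdvectionRK4,
    runtime=timedelta(days=10),
    dt=timedelta(minutes=5),
    allow_time_extrapolation=False,     # moved here from FieldSet(...)
    time_periodic=timedelta(days=365),  # moved here from FieldSet(...)
)
```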
These changes will likely result in the outright deletion of `field.py`, `grid.py`, `gridset.py`, and `fieldfilebuffer.py`.
These changes introduce an increased reliance on `xarray` and the newly added dependency `xgcm`. I am not concerned about depending on `xgcm` despite it being a new package: it appears to be actively maintained, with a large overlap of users with `xarray`, and it is quite a small package that defers a lot of responsibility to `xarray`, which is great. Since it is still at v0 and its API is changing, I think it would be worth pinning its version until the API settles. With our increased reliance on `xarray` and `xgcm`, core Parcels developers would need to stay up to date with their releases so that we can make sure we're making the most of them.
With these changes, it would be good to get an accurate picture of their impact on performance (which is a subject for #1761).
I will start experimenting with this, focusing on Scipy mode. Since there is a lot of coupling between FieldSet, Field, Grid, and the rest of the codebase, I'll start my experiments in a notebook for now.
Sounds like a very good plan, @VeckoTheGecko; I'm excited to see if we can make this major refactor work for v4. It would definitely reduce the maintenance burden if we shift a lot of code/responsibility to xarray and xgcm, and also make it simpler for new users to get started!