use xarray.open_zarr and make aiohttp and s3fs optional #1016
Conversation
This could still work with kerchunk references, do you know? If not I will try to test it. It may not matter if we go completely with icechunk stores (which should have the same access API as vanilla zarr stores) but it would be good to know if we were losing support for kerchunk references.
@abarciauskas-bgse good point, I haven't tested but I will 🙏
Force-pushed from 07e4beb to 943f5eb
"fo": src_path.replace("reference://", ""), | ||
"remote_options": {"anon": anon}, | ||
} | ||
return fsspec.filesystem("reference", **reference_args).get_mapper("") |
check if we have a reference:// url or a .json file
@abarciauskas-bgse I've added (and tested) back the Kerchunk reference support but I would love a second pair of 👀 to check 🙏
elif protocol == "reference" or src_path.lower().endswith(".json"):
    reference_args = {
        "fo": src_path.replace("reference://", ""),
        "remote_options": {"anon": anon},
    }
    return fsspec.filesystem("reference", **reference_args).get_mapper("")
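For context (not part of the diff), a minimal sketch of how such a mapper ends up being consumed, assuming a hypothetical local kerchunk JSON reference file and anonymous access:

import fsspec
import xarray

# Hypothetical local kerchunk reference file (JSON following the kerchunk spec).
reference_args = {
    "fo": "reference.json",
    "remote_options": {"anon": True},  # assumed anonymous access to the referenced chunks
}

# Build a mapper over the virtual filesystem described by the references,
# then hand it to xarray as if it were a plain zarr store.
mapper = fsspec.filesystem("reference", **reference_args).get_mapper("")
ds = xarray.open_zarr(mapper, consolidated=False)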
there are at least three ways to store virtual references:
- json following kerchunk spec
- parquet following kerchunk spec
- icechunk spec (currently mostly messagepack)
Do you think the "reference" protocol will be used for all of these in the future, with some automatic differentiation between the storage options? Another option would be to specifically have "kerchunk" and "icechunk" as protocol options, with automatic differentiation between json and parquet based on the file extension.
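Purely as a sketch of that second option (explicit protocols with extension-based differentiation); the kerchunk/icechunk protocol names below are hypothetical and not something this PR implements:

from urllib.parse import urlparse

def guess_reference_format(src_path: str) -> str:
    """Hypothetical dispatch between virtual-reference flavours."""
    protocol = urlparse(src_path).scheme
    if protocol == "icechunk":
        return "icechunk"
    if protocol in ("reference", "kerchunk"):
        # differentiate kerchunk json vs parquet by file extension
        if src_path.lower().endswith((".parq", ".parquet")):
            return "kerchunk-parquet"
        return "kerchunk-json"
    raise ValueError(f"not a recognized virtual reference path: {src_path}")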
I guess we could just support the reference:// prefix instead of checking for .json.
I'm also not sure how to support s3:// referenced files; I'm reading that you can use "remote_protocol": "s3", but we don't have a way to know this!
Now we only support references with the reference:// prefix, and I've added support for parquet files (but the fastparquet dependency must be installed).
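For reference, a rough sketch of what reading a parquet-backed kerchunk store could look like, assuming fsspec can load the references when fo points at the kerchunk parquet directory (paths and options are hypothetical, and fastparquet must be installed):

import fsspec
import xarray

fs = fsspec.filesystem(
    "reference",
    fo="combined.parq",             # hypothetical directory of parquet references
    remote_protocol="s3",           # protocol of the referenced chunks (assumed)
    remote_options={"anon": True},  # assumed anonymous access
)
ds = xarray.open_zarr(fs.get_mapper(""), consolidated=False)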
I'm going to try and figure out how to test this locally, but in the meantime here is what I am wondering:
- json following kerchunk spec: the current implementation should work for local json references (reference://file.json) but would it work for remotely stored references, e.g. s3://file.json or http://file.json?
- parquet following kerchunk spec: files will typically end in .parq or .parquet. Again, I think this would work for local parquet stores, but I'm not sure about remote parquet stores, which will be using the s3 or https protocol.
- @maxrjones have you tested opening an icechunk store with fsspec and passing that as an argument to xarray? I haven't, so I am not sure if it will work, but I think we can address icechunk support later.

If we have src_path.lower().endswith(".json"), I feel like we should also have a .parquet conditional here, but I think this is too coupled with what formats kerchunk is implemented in and we should rely on something else... like a path parameter 😬
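On the remote-reference question above, fsspec's reference filesystem does accept separate options for fetching the reference file itself versus the data it points to, so a remotely stored JSON reference could in principle be wired like this (bucket and file names are hypothetical):

import fsspec
import xarray

fs = fsspec.filesystem(
    "reference",
    fo="s3://example-bucket/reference.json",  # the reference file itself is remote
    target_protocol="s3",                     # how to fetch the reference file
    target_options={"anon": True},            # assumed anonymous access
    remote_options={"anon": True},            # options for the referenced chunks
)
ds = xarray.open_zarr(fs.get_mapper(""), consolidated=False)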
In the latest commit I've removed the .json test to only accept the reference:// prefix.
I guess we could accept things like reference+s3:// 🤷
@abarciauskas-bgse 🤔 I'm not sure we can really support non-local reference files!
FYI, I've tried reference://s3://ds-satellite/netcdf/reference.json and it correctly reads the reference file but fails at reading the referenced netcdf (stored on S3 as well). I guess I need to set remote_protocol="s3" somewhere.
@abarciauskas-bgse are we using kerchunk in any production project? To me it seems this needs more input/definition.
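If it helps, the missing piece is likely just a remote_protocol entry so the referenced netcdf chunks (not only the reference file) are fetched over s3; a sketch using the path from the comment above (how titiler would know the right protocol remains the open question, and anonymous access is assumed):

import fsspec
import xarray

src_path = "reference://s3://ds-satellite/netcdf/reference.json"  # path from the comment above

reference_args = {
    "fo": src_path.replace("reference://", ""),
    "target_protocol": "s3",           # fetch the reference json itself over s3
    "target_options": {"anon": True},  # assumed anonymous access
    "remote_protocol": "s3",           # fetch the referenced netcdf chunks over s3
    "remote_options": {"anon": True},
}
mapper = fsspec.filesystem("reference", **reference_args).get_mapper("")
ds = xarray.open_zarr(mapper, consolidated=False)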
We are using it here: https://www.earthdata.nasa.gov/dashboard/data-catalog/combined_CMIP6_daily_GISS-E2-1-G_tas_kerchunk_DEMO (I believe it uses the s3 protocol for the reference file, which is in a private bucket, while the netcdfs it points to are in a public s3 bucket), so I know it should work. I can take a closer look later today...
by the way, I'm on the fence about whether we support kerchunk-style references in this new xarray integration to core titiler, since I think we will be able to use icechunk for reference-style (aka virtual) access in the future. Anyways, kerchunk-style references are still supported by titiler-xarray so we wouldn't have to deprecate https://www.earthdata.nasa.gov/dashboard/data-catalog/combined_CMIP6_daily_GISS-E2-1-G_tas_kerchunk_DEMO right away.
@maxrjones thoughts here?
I can see the motivation for dropping support, since it seems that icechunk is committed to avoiding the fsspec method of using keyword arguments for configuration which is the source of a lot of the complexity here.
Ya, ok, @vincentsarago sorry for the delay in the decision here, but let's just drop reference support for now and we can add it back if we decide we need it.
Ideally we could have some documentation about how to easily add it, if people want kerchunk-style reference support with titiler, but I'm not sure at this point how we would do that.
nice!
"zarr", | ||
"h5netcdf", |
IMO Zarr and h5netcdf should be optional dependencies so people could deploy titiler-xarray with only one option to keep image size down. This may also make it easier to support other HDF5 readers that work better for remote files in the future (e.g., https://github.com/gauteh/hidefix).
"zarr", | |
"h5netcdf", |
all = [
    "s3fs",
    "aiohttp",
]
Suggested change, replacing

all = [
    "s3fs",
    "aiohttp",
]

with

zarr = [
    "zarr",
]
hdf5 = [
    "h5netcdf",
]
all = [
    "s3fs",
    "aiohttp",
    "zarr",
    "hdf5",
]
Suggestion to align with above comment about optional dependencies
@maxrjones I'll take care of this either in the main PR or in another one later
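For what it's worth, a generic sketch (not this PR's code) of how an optional backend is usually guarded at runtime once zarr and h5netcdf become extras:

def open_dataset(src_path: str):
    """Generic sketch: pick an optional backend based on the source path."""
    import xarray

    if src_path.lower().endswith((".nc", ".nc4")):
        try:
            import h5netcdf  # noqa: F401  (optional dependency)
        except ImportError as e:
            raise ImportError("h5netcdf is required for NetCDF support") from e
        return xarray.open_dataset(src_path, engine="h5netcdf")

    try:
        import zarr  # noqa: F401  (optional dependency)
    except ImportError as e:
        raise ImportError("zarr is required for Zarr support") from e
    return xarray.open_zarr(src_path, consolidated=False)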
Force-pushed from 8e10da4 to f45ebe4
Force-pushed from f45ebe4 to ea4709c
* sketch
* add tests
* add pyramid tests
* remove multiscale option
* Update src/titiler/xarray/tests/test_factory.py (Co-authored-by: Henry Rodman <[email protected]>)
* use xarray.open_zarr and make aiohttp and s3fs optional (#1016)
* use xarray.open_zarr and make aiohttp and s3fs optional
* add support for references
* tests prefixed protocol
* use tmp_dir for reference
* add parquet support
* remove kerchunk support
* create variable extension
* add aiohttp
* remove cache layer (#1019)
* remove cache layer
* Update src/titiler/xarray/README.md (Co-authored-by: Aimee Barciauskas <[email protected]>)
* add tile example
---------
Co-authored-by: Aimee Barciauskas <[email protected]>
* Update src/titiler/xarray/titiler/xarray/io.py (Co-authored-by: Jonas <[email protected]>)
* Update src/titiler/xarray/titiler/xarray/io.py (Co-authored-by: Jonas <[email protected]>)
* Update src/titiler/xarray/titiler/xarray/io.py (Co-authored-by: Jonas <[email protected]>)
* lint
* fix zarr pyramid tests
* Update src/titiler/xarray/pyproject.toml (Co-authored-by: Max Jones <[email protected]>)
* refactor dependencies
* update docs
---------
Co-authored-by: Henry Rodman <[email protected]>
Co-authored-by: Aimee Barciauskas <[email protected]>
Co-authored-by: Jonas <[email protected]>
Co-authored-by: Max Jones <[email protected]>
This PR does:
- remove the reference and consolidated options
- use open_zarr when not dealing with netcdf files
- make aiohttp and s3fs optional dependencies

Note: h5netcdf and zarr could also be made optional, but for this first version I think it's better to at least have something that supports both formats.

It also adds an opener method (Callable[..., xarray.Dataset]), making all the custom xarray dataset openers optional.
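As a rough illustration of the opener idea, a hypothetical custom opener matching the Callable[..., xarray.Dataset] contract (the reader wiring in the final comment is made up, not the exact titiler API):

from typing import Any
import xarray

def open_my_store(src_path: str, **kwargs: Any) -> xarray.Dataset:
    """Hypothetical custom opener: any logic (auth, caching, references) can live here."""
    return xarray.open_zarr(src_path, consolidated=False, **kwargs)

# A factory/reader would then receive this callable instead of the built-in
# zarr/h5netcdf openers, e.g. (hypothetical): XarrayReader(src_path, opener=open_my_store)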