Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compatibility for zarr-python 3.x #9552

Open
wants to merge 86 commits into
base: main
Choose a base branch
from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Sep 27, 2024

This PR begins the process of adding compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in #9515.

All of the zarr test cases should be parameterized by zarr_format=[2, 3] with zarr-python 3.x to exercise reading and writing both formats.

This is currently passing with zarr-python==2.18.3. zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.

I'll also note that #5475 is going to become a larger issue once people start writing Zarr-V3 datasets.

@TomAugspurger TomAugspurger force-pushed the fix/zarr-v3 branch 2 times, most recently from 1ed4ef1 to bb2bb6c Compare September 30, 2024 14:04
Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set of changes should be backwards compatible and work with zarr-python 2.x (so reading and writing zarr v2 data).

I'll work through zarr-python 3.x now. I think we might want to parametrize most of these tests by zarr_version=[2, 3] to confirm that we can read / write zarr v2 data with zarr-python 3.x

xarray/backends/zarr.py Show resolved Hide resolved
xarray/backends/zarr.py Show resolved Hide resolved

if _zarr_v3() and zarr_array.metadata.zarr_format == 3:
encoding["codec_pipeline"] = [
x.to_dict() for x in zarr_array.metadata.codecs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this instead?

Suggested change
x.to_dict() for x in zarr_array.metadata.codecs
zarr_array.metadata.to_dict()["codecs"]

A bit wasteful since everything has to be serialized, but presumably zarr knows better how to serialize the codec pipeline than we do here?

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress here @TomAugspurger. I'm impressed by how little you've changed in the backend itself and I'm noting the pain around testing (I felt some of that w/ dask as well).

if consolidated is None:
try:
zarr_group = zarr.open_consolidated(store, **open_kwargs)
except KeyError:
except (ValueError, KeyError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the Zarr side, it may be nice to raise a a custom exception when consolidated metadata is not found. Something like:

class ConsolidatedMetadataNotFound(FileNotFoundError):
    pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xarray/backends/zarr.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

xarray/tests/test_backends.py::test_zarr_storage_options is tricky to test. We should be able to just pass through storage_options to the call to zarr.open_group.

Currently, the only zarr store that supports storage options is RemoteStores with an async filesystem. The current test uses the memory filesystem, which doesn't support async operations. Zarr-python sets up a moto server to use the S3 filesystem.

@jhamman
Copy link
Member

jhamman commented Oct 14, 2024

xarray/tests/test_backends.py::test_zarr_storage_options is tricky to test. We should be able to just pass through storage_options to the call to zarr.open_group.

Let's skip this test with v3.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving this the 👍 ! Thanks @TomAugspurger for all the work here (and in Zarr-Python)! A major step forward for Xarray and Zarr.

@jhamman
Copy link
Member

jhamman commented Oct 14, 2024

@pydata/xarray - are there others that are hoping to review this before it goes in? Noting that most of the surface area here is in the tests, the actual code changes are quite minimal.

I'll also note that the remaining failing CI checks are, as far as I can tell, unrelated to this PR.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally seems okay to me, as long as we're confident about changing those values in some of the tests...

My comments are just pointing out ToDos and commented-out lines that have maybe been forgotten about.

doc/user-guide/io.rst Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
@@ -1307,6 +1346,8 @@ def test_explicitly_omit_fill_value(self) -> None:
with self.roundtrip(ds) as actual:
assert "_FillValue" not in actual.x.encoding

# TODO: decide if this test is really necessary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: decide if this test is really necessary

xarray/tests/test_backends.py Outdated Show resolved Hide resolved
Comment on lines +3350 to +3351
"get": 16, # TODO: fixme upstream (should be 8)
"list_dir": 3, # TODO: fixme upstream (should be 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to specific issues?

@TomAugspurger
Copy link
Contributor Author

I just pushed a commit reverting the changes to avoid values equal to the fill_value in the test cases, which aren't needed after Ryan's changes to fill value handling.

I think this is ready to go once CI finishes. I expect upstream-ci to fail on the xarray/tests/test_groupby.py::test_gappy_resample_reductions tests, but those should be unrelated.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 15, 2024

There's one typing failure we might want to address:

xarray/tests/test_backends_datatree.py: note: In member "test_to_zarr_zip_store" of class "TestZarrDatatreeIO":
xarray/tests/test_backends_datatree.py:289: error: Argument 1 to "open_datatree" has incompatible type "ZipStore"; expected "str | PathLike[Any] | BufferedIOBase | AbstractDataStore"  [arg-type]

I'll do some reading about how best to handle type annotations when the proper type depends on the version of a dependency.

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature can of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

@TomNicholas
Copy link
Member

TomNicholas commented Oct 16, 2024

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature can of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

I don't see why the typing of open_datatree(<zarr-store>) would need to be any different to that of open_dataset(<zarr-store>). But I think it's fine to ignore this for now as we know we need to come back to it in another PR anyway.

@TomAugspurger
Copy link
Contributor Author

Good catch, this affects both.

I was hoping something like this would work:

from pathlib import Path

try:
    from zarr.storage import StoreLike as _StoreLike

except ImportError:
    _StoreLike = str | Path

StoreLike = type[_StoreLike]


def f(x: StoreLike) -> StoreLike:
    return x

but mypy doesn't like that

test.py:7: error: Cannot assign multiple types to name "_StoreLike" without an explicit "Type[...]" annotation  [misc]
Found 1 error in 1 file (checked 1 source file)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-upstream Run upstream CI topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants