Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

ghidalgo3
Copy link
Contributor

@ghidalgo3 ghidalgo3 commented Jul 3, 2024

Before going further with these changes, I'd like to discuss if this direction is correct or not. Based on #17, I tried using Zarr-V3's Array definition instead of VirtualiZarr's ZArray definition. Keeping the names straight is hard, ZArray is the current code and Array is zarr's type.

The problems I ran into were:

  1. The set of allowed kwargs for Array are different between v2 and v3 for zarr, but VirtualiZarr.ZArray I believe is first created with the v2 properties and then transformed to v3 format when zarr_v3_array_metadata is called. After the replacement, anywhere VirtualiZarr creates a Array, one must choose v2 or v3 kwargs, or again introduce something that abstracts over the versions. Unless VirtualiZarr chooses to only support Zarr V3 only, which I believe is the intention.
  2. During concatenation, VirtualiZarr checks if the codecs between ZArrays all match, which was easier when ZArray declared the codecs. Now, if you support v2 and v3 then the code checking logic has to branch based on the type of zarr.array.metadata because the internal zarr property names have changed. Probably there will be other code-level incompatibilities from metadata property name differences, I think Is the compressor type right?  #94 ran into this too.

This is just from a few hours of trying this out.

If VirtualiZarr needs to support both Zarr V2 and V3 stores, then this change needs to handle all the little differences in Zarr's API between 2 and 3. But if only V3 is supported, this becomes easier and replacing ZArray with Array should not be too difficult.

Thoughts?

Incompatibilities:

@rabernat
Copy link
Collaborator

rabernat commented Jul 3, 2024

Just wanted to say that the Zarr V3 class behaviors are not set in stone yet. If you have feedback on the V3 API and potential incompatibilities between V2, please report them issues on Zarr-python: https://github.com/zarr-developers/zarr-python/issues

@jsignell
Copy link
Contributor

jsignell commented Jul 3, 2024

My sense is that in terms of file format both zarr v2 and zarr v3 should be supported but in terms of library dependencies it would be fine to mandate that only zarr-python >= 3 is supported.

@TomNicholas
Copy link
Member

TomNicholas commented Jul 3, 2024

Thanks for taking the initiative here @ghidalgo3 !

I tried using Zarr-V3's Array definition instead of VirtualiZarr's ZArray definition

I was actually hoping we could use some class from zarr-python that wasn't a fully-featured zarr.Array, instead using something that just represented the metadata for one array. This ArrayV3Metadata class might be what we need? Is that public API?

If VirtualiZarr needs to support both Zarr V2 and V3 stores, then this change needs to handle all the little differences in Zarr's API between 2 and 3. But if only V3 is supported, this becomes easier and replacing ZArray with Array should not be too difficult.

I see two ways to do this:

  1. Make VirtualiZarr internally agnostic to v2 vs v3. Then a ManifestArray can represent either a v2 array or a v3 array, distinguished via its metadata. For example the attributes of a ManifestArray might become
    from zarr.metadata import ArrayMetadata
    
    class ManifestArray:
        _chunkmanifest: ChunkManifest
        _zarray: ArrayMetadata
    where ArrayMetadata is an ABC that needs to be a concrete ArrayV2Metadata or ArrayV3Metadata.
  2. VirtualiZarr internally only uses v3 arrays, and whilst we support reading zarr and writing zarr v2 arrays, they are coerced at reading / writing time.

I'm not sure which of these two is the best approach.

I also completely agree with @jsignell - all of this should be supported through only a dependency on zarr>=3.0.0, as this PR adds.

@ghidalgo3
Copy link
Contributor Author

To minimize the code changes needed in VirtualiZarr, I'll attempt the first option.

@ghidalgo3
Copy link
Contributor Author

Adding a dependency on zarr==3 causes a problem for Kerchunk. Kerchunk expects to be able to import zarr.meta but in ZarrV3, that code was moved to the v2 directory such that Kerchunk has to do from zarr.v2.meta import encode_fill_value instead of from zarr.meta import encode_fill_value. Then this causes unit tests to fail in VirtualiZarr.

See the reorg that happened here about 3 months ago for ZarrV3: zarr-developers/zarr-python#1809 and some attempts in Kerchunk to support ZarrV3:

  1. Kerchunk and Zarr V3 fsspec/kerchunk#235
  2. Zarr v3 support fsspec/kerchunk#292

Maybe the v2 module name is temporary in Zarr? But until then, any tests that call into Kerchunk are broken :(.

@rabernat
Copy link
Collaborator

rabernat commented Jul 9, 2024

@ghidalgo3 please please open a Zarr issue to track any API incompatibilities between V2 and V3. 🙏

https://github.com/zarr-developers/zarr-python/issues

@TomNicholas
Copy link
Member

@ghidalgo3 it might be a good idea to have a separate first PR that makes virtualizarr pass every test using zarr v3, and then coming back to this PR once that compatibility can be relied upon. Otherwise this might become a rabbit hole.

@d-v-b
Copy link

d-v-b commented Jul 9, 2024

I was actually hoping we could use some class from zarr-python that wasn't a fully-featured zarr.Array, instead using something that just represented the metadata for one array. This ArrayV3Metadata class might be what we need? Is that public API?

That class is public API, but it might be changing soon, and while it is intended for this use case it also has some extra stuff that goes beyond "representing metadata". So it should do what you want, eventually :)

@TomNicholas TomNicholas added the zarr-python Relevant to zarr-python upstream label Jul 9, 2024
@ghidalgo3 ghidalgo3 mentioned this pull request Jul 10, 2024
9 tasks
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work so far @ghidalgo3 !

"""
Individual chunk size by number of elements.
"""
if isinstance(self._zarray.chunk_grid, RegularChunkGrid):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice check, but I think we could actually perform this check at ManifestArray construction time, because right now a lot of other things will break if the chunk grid is not regular.

Comment on lines +62 to +68
match zarray:
case ArrayV2Metadata(compressor=compressor, filters=filters):
return Codec(compressor=compressor, filters=filters)
case ArrayV3Metadata(codecs=codecs):
return codecs
case _:
raise ValueError("Unknown ArrayMetadata type")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer this match...case over a standard if isinstance()...else syntax?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly a style preference coming from languages with discriminated unions and compilers, but if you'd like the whole code base to be consistent I can rewrite it with isinstance.

chunks=chunks,
compressor="zlib",
compressor={"id": "zlib"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the presence of "id" a v2 vs v3 difference? because you don't seem to use it in the other tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In V3, the codecs array is made up of (name, configuration) objects, but in V2 the id field serves as the name. In this particular test, I just chose to convert the ZArray to an ArrayV2Metadata

Actually please ignore any of the changes in tests for now because I will need to edit almost every test to excise ZArray :) The few changes I added were done up until I hit the kerchunk module import issue

Comment on lines 27 to 40
class Codec(BaseModel):
compressor: str | None = None
"""
ZarrayV2 codec definition.
"""

compressor: str | dict[str, Any] | None = None
"""
If it's a string, it's the compressor ID.
If it's a dict, it's the full compressor configuration.
"""
filters: list[dict] | None = None

def __repr__(self) -> str:
return f"Codec(compressor={self.compressor}, filters={self.filters})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this class be moved upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, because ZarrV3 now has a notion of codecs which is different from VirtualiZarr's notion of codecs. A VirtualiZarr codec is really the combination of filters and compressor from V2 metadata, which are easily transformable to V3 codecs.

Maybe a zarr.ArrayV2Metadata can have a property that projects the compressor and filters into V3 Codecs, @d-v-b thoughts on that? In that case, Codec could even be a property of the ABC ArrayMetadata.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think that codecs: ListWithInternalStructure[Codec] (i.e., what zarr v3 metadata uses) is not the best interface. See a writeup on that point of view here.

But if we ignore that concern for a moment, I'm curious about @TomNicholas's question: what could we change in zarr-python so that you could just import a functionally equivalent class from there (or import a dataclass and wrap with pydantic)? I have a selfish interest in this question since I'm likely going to follow in your footsteps when I implement better codec support over in pydantic-zarr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for forcing me to read the Zarr code more closely! I found VirtualiZarr could probably just use create_codec_pipeline to turn V2 or V3 metadata into a BatchedCodecPipeline and then VirtualiZarr can just deal with CodecPipeline as the abstraction over V2 and V3 . Additionally, VirtualiZarr could change its Codec to use V2Filters and V2Compressor from Zarr, or I might get rid of it entirely.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does Virtualizarr do with the codecs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All virtualizarr currently needs to do with the codecs is:

  1. Check that any two arrays you ask it to concatenate have the same codecs (see Virtual concatenation of arrays with different codecs or dtypes #5)
  2. Make sure the codecs are recorded in the kerchunk references when they get written out to disk.

@TomNicholas TomNicholas changed the title Replace VirtualiZarr.ZArray with zarr.array.Array Replace VirtualiZarr.ZArray with zarr ArrayMetadata Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
zarr-python Relevant to zarr-python upstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants