Creating sharded arrays easily #2170
curious to hear if anyone has thoughts on a function like this:

```python
async def create_array(
    store: str | StoreLike,
    *,
    path: PathLike | None = None,
    shape: ChunkCoords,
    shard_shape: ChunkCoords,
    chunk_shape: ChunkCoords,
    dtype: npt.DTypeLike | None = None,
    filters: Iterable[dict[str, JSON] | Codec] = (),
    compressors: Iterable[dict[str, JSON] | Codec] = (),
    fill_value: Any | None = 0,
    order: MemoryOrder | None = None,
    overwrite: bool = False,
    zarr_format: ZarrFormat | None = None,
    attributes: dict[str, JSON] | None = None,
    chunk_key_encoding: (
        ChunkKeyEncoding
        | tuple[Literal["default"], Literal[".", "/"]]
        | tuple[Literal["v2"], Literal[".", "/"]]
        | None
    ) = None,
    dimension_names: Iterable[str] | None = None,
    storage_options: dict[str, Any] | None = None,
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
    """Create an array.

    Parameters
    ----------

    Returns
    -------
    z : array
        The array.
    """
    if zarr_format is None:
        zarr_format = _default_zarr_version()

    # TODO: figure out why putting these imports at top-level causes circular import errors
    from zarr.codecs.bytes import BytesCodec
    from zarr.codecs.sharding import ShardingCodec

    mode: Literal["a"] = "a"
    store_path = await make_store_path(store, path=path, mode=mode, storage_options=storage_options)
    sub_codecs = (*filters, BytesCodec(), *compressors)
    if zarr_format == 2:
        if shard_shape != chunk_shape:
            raise ValueError("shard_shape must be equal to chunk_shape when zarr_format is 2")
        codecs = sub_codecs
    else:
        # the sharding codec carries the inner chunk shape and the sub-codecs
        codecs = (ShardingCodec(chunk_shape=chunk_shape, codecs=sub_codecs),)
    return await AsyncArray.create(
        store=store_path,
        shape=shape,
        dtype=dtype,
        zarr_format=zarr_format,
        fill_value=fill_value,
        attributes=attributes,
        # the array's chunks are the shards; inner chunks live in the sharding codec
        chunk_shape=shard_shape,
        chunk_key_encoding=chunk_key_encoding,
        codecs=codecs,
        dimension_names=dimension_names,
        order=order,
        exists_ok=overwrite,
    )
```
For v3 arrays, this only creates arrays with a sharding codec; the size of the "outer chunks" is set by the `shard_shape` parameter.
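One constraint worth noting for a function like this: in zarr v3 sharding, a shard is a regular grid of inner chunks, so each dimension of the shard shape must be an integer multiple of the corresponding chunk dimension. A minimal sketch of that check in plain Python (a hypothetical helper, not part of zarr-python):

```python
def validate_shard_shape(
    shard_shape: tuple[int, ...], chunk_shape: tuple[int, ...]
) -> tuple[int, ...]:
    """Check that each shard dimension is a whole multiple of the chunk
    dimension, and return the number of chunks per shard along each axis."""
    if len(shard_shape) != len(chunk_shape):
        raise ValueError("shard_shape and chunk_shape must have the same rank")
    for dim, (s, c) in enumerate(zip(shard_shape, chunk_shape)):
        if s % c != 0:
            raise ValueError(
                f"shard_shape[{dim}]={s} is not a multiple of chunk_shape[{dim}]={c}"
            )
    return tuple(s // c for s, c in zip(shard_shape, chunk_shape))
```

For example, a `(20, 20)` shard of `(10, 10)` chunks packs a 2×2 grid of inner chunks per shard.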
Notes from the dev meeting:
Zarr v3 introduces support for sharding, i.e. saving data in chunks that are themselves divided into subchunks which can be read individually. We anticipate that sharding is one of the key features to motivate users to switch from zarr v2 to zarr v3, and so I think there's good reason to ensure that sharded arrays can be made easily in zarr-python.

status quo

In the v3 branch today, you create a sharded array by including a special codec in the `codecs` kwarg in the array constructor. The sharding codec contains codecs of its own. See this example from the test suite: zarr-python/tests/v3/test_codecs/test_sharding.py
Lines 43 to 60 in ac6c6a3
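For readers without the test file at hand, the nesting can be sketched in the JSON codec configuration form from the zarr v3 sharding spec (`sharding_indexed`); the shapes and codec choices here are illustrative, not taken from the test:

```python
# The outer array metadata lists a single codec: the sharding codec.
# Per-chunk codecs (compression, byte serialization) live *inside* its
# configuration, not at the array level.
sharding_codec_config = {
    "name": "sharding_indexed",
    "configuration": {
        "chunk_shape": [10, 10],  # shape of the inner (sub-)chunks
        "codecs": [               # applied to each inner chunk
            {"name": "bytes", "configuration": {"endian": "little"}},
        ],
        "index_codecs": [         # applied to the shard's chunk index
            {"name": "bytes", "configuration": {"endian": "little"}},
        ],
    },
}
```

This nesting is exactly what the complaints below are about: the array's `codecs` kwarg no longer holds the compression pipeline directly once sharding is in play.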
Complaints:

- In zarr v2, making a chunked array was simple: pass `chunks=(10,10)` and you are done. V3 requires much more work.
- Making a sharded array requires importing a special class (`ShardingCodec`), or using the dictionary configuration of that codec, which we don't have autocomplete for right now (e.g., via a typeddict).
- Compression settings are tangled up with the `codecs` kwarg of the array. If you are using sharding, then the `codecs` kwarg should only be the sharding codec, with the array compression config sent to the `codecs` kwarg of the sharding codec itself. This may not be intuitive to people, and it's a rather indirect implementation of something that was simple in zarr v2.

solutions
I have 1 concrete idea for making this easier, which I have prototyped in #2169. That PR introduces a way to specify both sharding and regular chunking with a single keyword argument to `Array.create`. I think specifying how the chunks of the array are organized with a single data structure is a promising approach, but I am curious to see if anyone has alternative ideas (or if everyone thinks the status quo is fine).