
Creating sharded arrays easily #2170

Open
d-v-b opened this issue Sep 10, 2024 · 3 comments

d-v-b commented Sep 10, 2024

Zarr v3 introduces support for sharding, i.e. saving data in chunks that are themselves divided into subchunks which can be read individually. We anticipate that sharding is one of the key features to motivate users to switch from zarr v2 to zarr v3, and so I think there's good reason to ensure that sharded arrays can be made easily in zarr-python.

status quo

In the v3 branch today, you create a sharded array by including a special codec in the codecs kwarg in the array constructor. The sharding codec contains codecs of its own. See this example from the test suite:

arr = Array.create(
    spath,
    shape=tuple(s + offset for s in data.shape),
    chunk_shape=(64,) * data.ndim,
    dtype=data.dtype,
    fill_value=6,
    codecs=[
        ShardingCodec(
            chunk_shape=(32,) * data.ndim,
            codecs=[
                TransposeCodec(order=order_from_dim("F", data.ndim)),
                BytesCodec(),
                BloscCodec(cname="lz4"),
            ],
            index_location=index_location,
        )
    ],
)

Complaints:

  • This is a lot of code / specification. In zarr v2 you could just say chunks=(10, 10) and be done; v3 requires much more work.
  • It requires importing a specific class (ShardingCodec), or using the dictionary configuration of that codec, which we don't have autocomplete for right now (e.g. via a TypedDict).
  • The array compression configuration (the other codecs) goes in two completely different places depending on whether you are using sharding. Without sharding, all the codecs go in the codecs kwarg of the array; with sharding, the codecs kwarg should contain only the sharding codec, and the array compression config moves to the codecs kwarg of the sharding codec itself. This may not be intuitive to people, and it's a rather indirect implementation of something that was simple in zarr v2.
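To make that last complaint concrete, here is a sketch of the two layouts side by side. Plain dicts approximate v3 array metadata here; the codec names follow the v3 spec, but this is an illustration, not literal zarr-python output:

```python
# without sharding: compression codecs sit directly in the array's "codecs" list
unsharded = {
    "chunk_shape": [64, 64],
    "codecs": [{"name": "bytes"}, {"name": "blosc"}],
}

# with sharding: the outer "codecs" list holds only the sharding codec, and the
# same compression codecs move into its nested configuration
sharded = {
    "chunk_shape": [64, 64],  # outer (shard) shape
    "codecs": [
        {
            "name": "sharding_indexed",
            "configuration": {
                "chunk_shape": [32, 32],  # inner chunk shape
                "codecs": [{"name": "bytes"}, {"name": "blosc"}],
            },
        }
    ],
}

# identical compression config, living at two different depths
assert unsharded["codecs"] == sharded["codecs"][0]["configuration"]["codecs"]
```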

solutions

I have one concrete idea for making this easier, which I have prototyped in #2169. That PR introduces a way to specify both sharding and regular chunking with a single keyword argument to Array.create. I think specifying how the chunks of the array are organized with a single data structure is a promising approach, but I am curious to see if anyone has alternative ideas (or if everyone thinks the status quo is fine).


d-v-b commented Nov 29, 2024

curious to hear if anyone has thoughts on a function like this:

async def create_array(
    store: str | StoreLike,
    *,
    path: PathLike | None = None,
    shape: ChunkCoords,
    shard_shape: ChunkCoords,
    chunk_shape: ChunkCoords,
    dtype: npt.DTypeLike | None = None,
    filters: Iterable[dict[str, JSON] | Codec] = (),
    compressors: Iterable[dict[str, JSON] | Codec] = (),
    fill_value: Any | None = 0,
    order: MemoryOrder | None = None,
    overwrite: bool = False,
    zarr_format: ZarrFormat | None = None,
    attributes: dict[str, JSON] | None = None,
    chunk_key_encoding: (
        ChunkKeyEncoding
        | tuple[Literal["default"], Literal[".", "/"]]
        | tuple[Literal["v2"], Literal[".", "/"]]
        | None
    ) = None,
    dimension_names: Iterable[str] | None = None,
    storage_options: dict[str, Any] | None = None,
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
    """Create an array.

    Parameters
    ----------


    Returns
    -------
    z : array
        The array.
    """

    if zarr_format is None:
        zarr_format = _default_zarr_version()

    # TODO: figure out why putting these imports at top-level causes circular import errors
    from zarr.codecs.bytes import BytesCodec
    from zarr.codecs.sharding import ShardingCodec

    mode: Literal["a"] = "a"

    store_path = await make_store_path(store, path=path, mode=mode, storage_options=storage_options)
    # filters -> ArrayArrayCodecs, BytesCodec -> ArrayBytesCodec, compressors -> BytesBytesCodecs
    sub_codecs = (*filters, BytesCodec(), *compressors)

    if zarr_format == 2:
        if shard_shape != chunk_shape:
            raise ValueError("shard_shape must be equal to chunk_shape when zarr_format is 2")
        codecs = sub_codecs
        outer_chunk_shape = chunk_shape
    else:
        # the array's (outer) chunks are shards; the sharding codec subdivides
        # them into inner chunks of size chunk_shape
        codecs = (ShardingCodec(chunk_shape=chunk_shape, codecs=sub_codecs),)
        outer_chunk_shape = shard_shape

    return await AsyncArray.create(
        store=store_path,
        shape=shape,
        dtype=dtype,
        zarr_format=zarr_format,
        fill_value=fill_value,
        attributes=attributes,
        chunk_shape=outer_chunk_shape,
        chunk_key_encoding=chunk_key_encoding,
        codecs=codecs,
        dimension_names=dimension_names,
        order=order,
        exists_ok=overwrite,
    )


d-v-b commented Nov 29, 2024

For v3 arrays, this only creates arrays with a sharding codec. The size of the "outer chunks" is set by the shard_shape keyword argument, while the size of the "inner chunks" is set by the chunk_shape keyword argument. Because it's not sensible to do sharding inside of sharding, the ArrayBytesCodec is fixed to BytesCodec; I am semi-arbitrarily mapping the filters keyword argument to the ArrayArrayCodecs, and the compressors keyword argument to the BytesBytesCodecs.


d-v-b commented Nov 29, 2024

notes from the dev meeting:

  • wire up order to ensure that it works for v2 (changes the on-disk representation) and v3 (inserts a transpose codec)
  • ensure that v2 order plays well with the global order config value
  • make shard_shape nullable, which disables sharding
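A minimal sketch of how a nullable shard shape could fold into the codec-building logic from the proposal above. Plain dicts stand in for codec instances; this is an illustration of the branching, not the actual implementation:

```python
def build_codecs(chunk_shape, shard_shape=None, filters=(), compressors=()):
    """Return (codecs, outer_chunk_shape) for the array constructor.

    shard_shape=None disables sharding: the compression pipeline goes
    directly into the array's codecs and chunk_shape is the write unit.
    """
    inner = (*filters, {"name": "bytes"}, *compressors)
    if shard_shape is None:
        # no sharding codec; read chunks == write chunks
        return inner, tuple(chunk_shape)
    # sharding: the array's chunks are shards, subdivided into inner chunks
    sharding = {
        "name": "sharding_indexed",
        "configuration": {"chunk_shape": list(chunk_shape), "codecs": list(inner)},
    }
    return (sharding,), tuple(shard_shape)
```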

@dstansby dstansby removed the V3 label Dec 12, 2024