Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: change array creation signature to allow sharding specification [do not merge] #2169

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Sep 10, 2024

The goal of this PR is to demonstrate one strategy to simplify the creation of arrays that use sharding. Don't consider merging this until we get a good look at some alternatives.

This PR alters the Array.create routine, removing the chunk_shape kwarg and instead beefing up the semantics of the chunks kwarg. Specifically, the chunks kwarg supports a new variant, ChunkSpec, which aims to compactly specify both the chunk shape of an array as well as the (optional) sub-chunk shape.

ChunkSpec is a typed dictionary with two keys: read_shape and write_shape. write_shape specifies the shape of array chunks that can be written concurrently, i.e. the shape in array coordinates of the chunk files. read_shape specifies the shape of array chunks that can be read concurrently, i.e. the shape in array coordinates of the sub-chunks contained in a chunk constructed with a sharding codec.

  • passing chunks = None or chunks = {} (we support the latter case because of how non-total typeddicts work) to Array.create will automatically specify chunks using old v2 logic.
  • passing chunks = {'write_shape': (20, 20)} OR chunks = {'read_shape': (20, 20)} to Array.create will configure that array with no sharding and a chunk size of (20,20).
  • passing chunks = {'write_shape': (20, 20), 'read_shape': (10,10)} to Array.create will configure that array with sharding, with a sub-chunk size of (10,10), and a chunk size of (20,20). This will also route all the of the user-specified codecs, if any, to the sharding codec.

Note that this PR does not change the signature of the array class itself. That would be a separate effort.

addresses #2170

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b changed the title feat: change array creation signature to allow sharing specification [do not merge] feat: change array creation signature to allow sharding specification [do not merge] Sep 10, 2024
_codecs = tuple(codecs) if codecs is not None else (BytesCodec(),)

if shard_shape is not None:
_codecs = (ShardingCodec(chunk_shape=shard_shape, codecs=_codecs),)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively hard-codes sharding into the spec, something like a sharded=True flag that might have existed on the CHunkSpec object. How do you expect this to extend to variable chunking or other schemes that might be created in the future?

Copy link
Contributor Author

@d-v-b d-v-b Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal can use whatever specification for variable length chunks we come up with, e.g. tuples of tuples of ints. You could specify variable length chunking with no sharding via something like chunks = {'write_shape': ((10,5), (1,2,3)}, and variable length chunking with sharding via something like chunks = {'write_shape: ((10,5), (1,2,3)), 'read_shape': (1,1)}. The read shape would have to checked for consistency with all the unique chunk shapes in this case. We would of course need to widen the type of ChunkSpec for this to accept tuple[tuple[int, ...]] for the write_shape keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like a sharded=True flag that might have existed on the CHunkSpec object.

If we had such a flag on the chunkspec object, then it would semantically collide with read_shape: {'write_shape': (10,10), 'read_shape': (2,2), sharding: False} would not be valid, because there's no way to have read_shape and write_shape differ without sharding. BTW when I say "sharding" i don't mean "the sharding codec", I mean the general concept of packing multiple subchunks into a single file. If a non-codec implementation of sharding emerges, then I would like to imagine that this API could wrap that.

@jhamman jhamman added the V3 label Sep 13, 2024
@jhamman jhamman changed the base branch from v3 to main October 14, 2024 20:59
@jhamman jhamman added this to the After 3.0.0 milestone Oct 17, 2024
@dstansby dstansby removed the V3 label Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

4 participants