-
-
Notifications
You must be signed in to change notification settings - Fork 278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed consolidated Group getitem with multi-part key #2363
base: main
Are you sure you want to change the base?
Fixed consolidated Group getitem with multi-part key #2363
Conversation
This fixes `Group.__getitem__` when indexing with a key like 'subgroup/array'. The basic idea is to rewrite the indexing operation as `group['subgroup']['array']` by splitting the key and doing each operation independently. This is fine for consolidated metadata which doesn't need to do IO. There's a complication around unconsolidated metadata, though. What if we encounter a node where `Group.getitem` returns a sub Group without consolidated metadata. Then we need to fall back to non-consolidated metadata. We've written _getitem_consolidated as a regular (non-async) function so we need to pop back up to the async caller and have *it* fall back. Closes zarr-developers#2358
src/zarr/core/group.py
Outdated
@@ -97,6 +97,24 @@ def _parse_async_node( | |||
raise TypeError(f"Unknown node type, got {type(node)}") | |||
|
|||
|
|||
class _MixedConsolidatedMetadataException(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if this MixConsolidatedMetadataException
stuff makes sense. The crux of the issue is that Group.__getitem__
on a Group with consolidated metadata can be IO-free. To make that explicit, we define it as a plain (non-async) function.
But this group['subgroup']['subarray']
presents a new challenge. What if group
has consolidated metadata, but subgroup
doesn't? Its consolidated metadata might be None
, meaning we need to fall back to the non-consolidated metadata. We discover this fact in AsyncGroup._getitem_consolidated
, which isn't an async function, so it can't call the async non-consolidated getitem.
To break through that tangle, I've added this custom exception. The expectation is that users never see it; we require (through docs & being careful?) that all the callers of _getitem_consolidated
handle this case by catching the error and falling back if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to mention: this "mixed" case isn't going to be common. Under normal operation, where users call consolidate_metadata
and don't mutate the group afterwards we consolidate everything and so won't hit this. I believe the only time this can happen through the public API is when users add a new group to group with consolidated metadata, and don't then re-consolidate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the only time this can happen through the public API is when users add a new group to group with consolidated metadata, and don't then re-consolidate.
Do we need to support this case? I would think that users could expect that, when using consolidated metadata, then metadata lookups (beyond the first) will never trigger IO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some discussion on this here: #2363 (comment). I think there isn't really a clear answer. I'm sympathetic to users who want to know for certain that they're done with I/O after loading a Group with consolidated metadata.
If we decide not to support this, we do still have a decision to make: when we try to index into a nested Group and reach a node with non-consolidated metadata, do we raise a KeyError
(saying we know this key isn't there) or some public version of this MixedConsolidatedMetadataError (saying we don't know whether or not this key is there)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a user of consolidated metadata, so take this with a grain of salt, but I would default to being strict: if people open a group in consolidated metadata mode, then IMO the contents of the consolidated metadata should purport to provide the complete model of the hierarchy, even if that model happens to be wrong :) But I'd prefer to temper my strict default with whatever expectations consolidated metadata users have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I would default to being strict
Just to be clear: what does "strict" mean to you here?
I actually am having trouble constructing a hierarchy that hits this using the public API:
In [1]: import zarr
In [2]: store = zarr.storage.MemoryStore(mode="w")
In [3]: group = zarr.open_group(store=store, path="a/b/c/d")
In [4]: g = zarr.consolidate_metadata(store)
In [5]: g['a/b/c/d'].create_group("x")
Out[5]: <Group memory://4480609344/a/b/c/d/x>
In [6]: g['a/b/c/d'].metadata.consolidated_metadata
Out[6]: ConsolidatedMetadata(metadata={}, kind='inline', must_understand=False)
I think the semantics around what exact state in after mutating a consolidated hierarchy, but before reconsolidating, that we can do whatever we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I would default to being strict
Just to be clear: what does "strict" mean to you here?
for me, "strict" would mean that if a user opens a group with use_consolidated=True
, then the consolidated metadata for that group will be the single source of truth about the hierarchy, and so attempting to access a sub-node that actually exists, but isn't in that document, would raise a KeyError
. Not sure if this is actually a reasonable position
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work here @TomAugspurger - The (somewhat messy) handling of mixed consolidated / non-consolidated metadata makes me wonder if we want to revisit supporting it at all. It would be pretty reasonable to ask callers to switch to consolidated=False
if they go looking for a key outside of the range of the consolidated metadata.
But we can have that conversation another day.
Makes sense. I think it depends on exactly how transparent a cache we want consolidated metadata to be. If it's really supposed to be transparent, then falling back is probably the right thing to do. Really, the complexity comes from my stubbornness around keeping But overall, I think the guarantee we get that An alternative to using Exceptions to signal an issue with mixed consolidated metadata, we could return a |
indexers.reverse() | ||
metadata: ArrayV2Metadata | ArrayV3Metadata | GroupMetadata = self.metadata | ||
|
||
while indexers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't the flattened form of the consolidated metadata make this a lot simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the problem with that is that it requires that the consolidated metadata be complete? whereas the iterative approach can handle a group with 'live' metadata.
…-getitem' into tom/fix/cm-nested-getitem
This fixes
Group.__getitem__
when indexing with a keylike 'subgroup/array'. The basic idea is to rewrite the indexing
operation as
group['subgroup']['array']
by splitting the keyand doing each operation independently. This is fine for consolidated
metadata which doesn't need to do IO.
There's a complication around unconsolidated metadata, though. What
if we encounter a node where
Group.getitem
returns a sub Groupwithout consolidated metadata. Then we need to fall back to
non-consolidated metadata. We've written _getitem_consolidated
as a regular (non-async) function so we need to pop back up to
the async caller and have it fall back.
Closes #2358