Constrained chunk shape estimation #997

Draft · wants to merge 1 commit into base: dev

Conversation

@CodyCBakerPhD (Contributor) commented Nov 8, 2023

Motivation

fix #995

Potential replacement of #996

This is actually really good timing; the latest stage of Zarr integration in NeuroConv (catalystneuro/neuroconv#569) benefits greatly from having these types of methods become static methods of the iterator (and its children) for broader usage, much as @oruebel suggests in #995 (comment)

@bendichter Anyway, I did some mathematical reduction and refactoring of the suggested approach from #996; it's actually quite similar to the original axis ratio method in that one could imagine an altogether more general algorithm that places 'weight factors' on each axis to determine how strongly to bias the distribution of bytes throughout the process...

In the recently suggested update, these weight factors would all be 1; in the original, they are proportional to the relative length of each axis.
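
To make the idea concrete, here is a minimal sketch of what such a weighted fill could look like; the function name and the exact weighting scheme are illustrative, not the code in this PR:

import math

def weighted_fill(maxshape, weights, target_elements):
    # Distribute a target number of elements across axes, biased by per-axis weights.
    # With equal weights this reduces to a uniform (hypercube-like) fill; weights
    # proportional to axis length reproduce an axis-ratio style bias.
    total_weight = sum(weights)
    return tuple(
        min(axis_max, math.floor(target_elements ** (weight / total_weight)))
        for axis_max, weight in zip(maxshape, weights)
    )

weighted_fill(maxshape=(1_000_000, 384), weights=(1, 1), target_elements=10_000_000)
# equal weights -> roughly cubic chunks: (3162, 384)
weighted_fill(maxshape=(1_000_000, 384), weights=(1_000_000, 384), target_elements=10_000_000)
# length-proportional weights -> elongated along the long (time) axis: (1_000_000, 1)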

This PR also addresses @oruebel's comments by no longer requiring secondary methods

Two other changes:

I strongly suggest keeping the previous method accessible 'just in case' something unexpected arises in the new implementation; a lesson learned from previous bugs in this iterator. The other fallback would be to force installation of a previous HDMF version, but that would be less desirable and could cause other problems (such as incompatibilities) on top of it.

Another lesson learned from the past: always use math.prod for these operations. Eventually these methods will be used on shapes with very large values, and np.prod overflows easily and silently (occasionally a warning is thrown, but not always).
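
As a quick illustration of that failure mode (the shape values here are made up):

import math
import numpy as np

shape = (100_000, 100_000, 100_000, 100_000)  # product is 10**20, beyond the int64 range

math.prod(shape)  # 100000000000000000000 -- exact, Python ints never overflow
np.prod(shape)    # silently wraps around the platform integer dtype and returns a wrong value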

How to test the behavior?

Existing tests may require updating to the newly predicted default values

A few new tests should be added for the static usage and to showcase the expected results of each strategy

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@CodyCBakerPhD self-assigned this Nov 8, 2023
Comment on lines +371 to +387
estimated_chunk_shape = np.array(chunk_shape_constraints)
capped_axes = none_axes
while any(capped_axes):
number_of_free_axes = len(none_axes)

# Estimate the amount to fill uniformly across the unconstrained axes
# Note that math.prod is 1 if all axes are None
constrained_bytes = math.prod(estimated_chunk_shape[constrained_axes]) * itemsize
estimated_fill_factor = math.floor((chunk_bytes / constrained_bytes) ** (1 / number_of_free_axes))

# Update axes and constraints in the event that the fill factor pushed some axes beyond their maximum
capped_axes = none_axes[np.where(maxshape[none_axes] <= estimated_fill_factor)[0]]
estimated_chunk_shape[capped_axes] = maxshape[capped_axes] # Cap the estimate at the max
chunk_shape_constraints[capped_axes] = maxshape[capped_axes] # Consider capped axis a constraint
none_axes = np.where([x is None for x in chunk_shape_constraints])[0]
constrained_axes = np.where([x is not None for x in chunk_shape_constraints])[0]
estimated_chunk_shape[none_axes] = estimated_fill_factor
@bendichter (Contributor) commented Nov 9, 2023


OK, I think this seems like a good alternative solution, which should achieve the same answer in far fewer loops, but I agree with @oruebel that it could go in a separate function. I liked the way I broke it apart in my PR, where the math lived in a separate function that simply tried to produce a vector with a given product (without knowing it was related to the size and shape of an array).

Also, a minor point: I would remove the np.wheres. If you are going this way, you can just index with the boolean NumPy arrays, which will make the syntax a bit cleaner.
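
For reference, the suggestion amounts to something like this (the variable names are only illustrative):

import numpy as np

chunk_shape_constraints = [None, 64, None]
values = np.array([10, 20, 30])

is_unconstrained = np.array([x is None for x in chunk_shape_constraints])

values[np.where(is_unconstrained)[0]]  # array([10, 30]) via integer indices
values[is_unconstrained]               # same result, indexing directly with the boolean mask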

Comment on lines +439 to +450
maxshape, chunk_shape, itemsize, buffer_gb = getargs(
    "maxshape", "chunk_shape", "itemsize", "buffer_gb", kwargs
)
assert buffer_gb > 0, f"buffer_gb ({buffer_gb}) must be greater than zero!"
assert all(
    chunk_axis > 0 for chunk_axis in chunk_shape
), f"Some dimensions of chunk_shape ({chunk_shape}) are less than zero!"

k = math.floor(
    (buffer_gb * 1e9 / (math.prod(chunk_shape) * itemsize)) ** (1 / len(chunk_shape))
)
return tuple([min(max(k * x, chunk_shape[j]), maxshape[j]) for j, x in enumerate(chunk_shape)])
Contributor


Nice, pulling these attributes out made the code much more readable.
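
For reference, a worked example of the buffer-shape arithmetic above, using hypothetical values:

import math

maxshape = (1_000_000, 10_000)  # hypothetical dataset shape
chunk_shape = (1_000, 100)      # hypothetical chunk shape
itemsize = 8                    # e.g. float64
buffer_gb = 0.8

k = math.floor(
    (buffer_gb * 1e9 / (math.prod(chunk_shape) * itemsize)) ** (1 / len(chunk_shape))
)  # k == 31
buffer_shape = tuple(
    min(max(k * x, chunk_shape[j]), maxshape[j]) for j, x in enumerate(chunk_shape)
)
# buffer_shape == (31_000, 3_100), about 0.77 GB: the largest integer scaling of the
# chunk shape that stays under the requested buffer size and within maxshape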

@bendichter (Contributor) commented Nov 9, 2023

I think you probably want the ratio to be informed by the overall size of the dataset. I.e., for (time, electrodes), where time is much longer than electrodes, you'd also want the chunks to be larger in the time dimension. (@oruebel, #996 (comment))

@oruebel, this was our original guess, and it was conveniently also the approach h5py uses, but our experience with @magland has shown that this can cause big problems for certain common applications. At this point I am leaning towards the hypercube approach, as I don't expect it to totally blow up in most cases. The (time, electrodes) case is the perfect example: it's actually much more common for users to want many channels over a short time than one channel over a long time, particularly in streaming applications.
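
To illustrate with made-up numbers: for a (time, electrodes) = (30,000,000, 384) int16 dataset with ~10 MB chunks, time-elongated chunks of shape (5,000,000, 1) force a read of one second across all channels (30,000 x 384 samples, ~23 MB of data) to touch all 384 chunks, roughly 3.8 GB of I/O, whereas hypercube-like chunks of shape (13,000, 384) serve the same request from about 3 chunks, roughly 30 MB.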

This PR also addresses @oruebel's comments by no longer requiring secondary methods

See the way I separated the function out in my PR. I think the function as-is is a bit too long and should be broken up, but that's more of a style thing.

I'm fine with keeping the old method for now. Let's see how often we or anyone else tends to use it.

And yes, using math.prod is a good idea. I do remember that being a particularly sneaky bug we definitely do not want to reintroduce.

Development

Successfully merging this pull request may close these issues.

[Feature]: Define partial chunk shape for GenericDataChunkIterator