Constrained chunk shape estimation #997
base: dev
Conversation
```python
estimated_chunk_shape = np.array(chunk_shape_constraints)
capped_axes = none_axes
while any(capped_axes):
    number_of_free_axes = len(none_axes)

    # Estimate the amount to fill uniformly across the unconstrained axes
    # Note that math.prod is 1 if all axes are None
    constrained_bytes = math.prod(estimated_chunk_shape[constrained_axes]) * itemsize
    estimated_fill_factor = math.floor((chunk_bytes / constrained_bytes) ** (1 / number_of_free_axes))

    # Update axes and constraints in the event that the fill factor pushed some axes beyond their maximum
    capped_axes = none_axes[np.where(maxshape[none_axes] <= estimated_fill_factor)[0]]
    estimated_chunk_shape[capped_axes] = maxshape[capped_axes]  # Cap the estimate at the max
    chunk_shape_constraints[capped_axes] = maxshape[capped_axes]  # Consider capped axis a constraint
    none_axes = np.where([x is None for x in chunk_shape_constraints])[0]
    constrained_axes = np.where([x is not None for x in chunk_shape_constraints])[0]
    estimated_chunk_shape[none_axes] = estimated_fill_factor
```
OK, I think this seems like a good alternative solution, which should achieve the same answer in far fewer loops, but I agree with @oruebel that it could go in a separate function. I liked the way I broke it apart, where the math was in a separate function that was just trying to achieve a vector with a certain product (not knowing that it was related to the size and shape of an array).

Also, a minor point: I would remove the `np.where`s. If you go this way, you can just index with the boolean NumPy arrays, which will make the syntax a bit cleaner.
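As an illustration of that suggestion, here is a hypothetical standalone version of the loop that indexes with boolean masks instead of `np.where` (the function name and signature are made up for this sketch; `constraints` holds an int per axis, or `None` for a free axis):

```python
import math

import numpy as np


def estimate_constrained_chunk_shape(maxshape, itemsize, chunk_bytes, constraints):
    """Sketch: fill unconstrained axes uniformly within a byte budget."""
    maxshape = np.asarray(maxshape)
    free = np.array([c is None for c in constraints])  # boolean mask of free axes
    estimate = np.array([1 if c is None else c for c in constraints])

    capped = free.copy()
    while capped.any() and free.any():
        # Bytes already committed to constrained axes (1 if none are constrained)
        constrained_bytes = math.prod(estimate[~free]) * itemsize
        # Uniform per-axis fill factor across the remaining free axes
        fill = math.floor((chunk_bytes / constrained_bytes) ** (1 / free.sum()))

        # Axes the fill factor would push past their maximum become constraints
        capped = free & (maxshape <= fill)
        estimate[capped] = maxshape[capped]
        free &= ~capped
        estimate[free] = fill
    return tuple(int(x) for x in estimate)
```

With one axis pinned to 4 and an 800-byte budget at itemsize 8, the free axis gets `800 / (4 * 8) = 25`; the boolean masks replace every `np.where` call in the diff without changing the result.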
```python
maxshape, itemsize, chunk_shape, buffer_gb = getargs(
    "maxshape", "itemsize", "chunk_shape", "buffer_gb", kwargs
)
assert buffer_gb > 0, f"buffer_gb ({buffer_gb}) must be greater than zero!"
assert all(
    chunk_axis > 0 for chunk_axis in chunk_shape
), f"Some dimensions of chunk_shape ({chunk_shape}) are not greater than zero!"

k = math.floor(
    (buffer_gb * 1e9 / (math.prod(chunk_shape) * itemsize)) ** (1 / len(chunk_shape))
)
return tuple([min(max(k * x, chunk_shape[j]), maxshape[j]) for j, x in enumerate(chunk_shape)])
```
Nice, pulling these attributes out made the code much more readable.

@oruebel, this was our original guess, and it was conveniently also the approach h5py uses, but our experience with @magland has shown that this can cause big problems for certain common applications, and at this point I am leaning towards the hypercube approach, as I don't expect it to totally blow up in most cases. The (time, electrode) case is the perfect example: it's actually much more common for users to want many channels over a short time than one channel over a long time, particularly in streaming applications.

See the way I separated the function out in my PR. I think the function as-is is a bit too long and should be broken up, but that's more of a style thing. I'm fine with keeping the old method for now. Let's see how often we or anyone else tends to use it. And yes, using
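For reference, a minimal standalone sketch of the hypercube strategy discussed above (the function name is hypothetical; `buffer_gb * 1e9` converts the budget to bytes as in the snippet):

```python
import math


def expand_chunks_to_buffer(chunk_shape, maxshape, itemsize, buffer_gb):
    """Sketch: scale a chunk shape up to a buffer by one uniform factor."""
    # Uniform "hypercube" multiplier: grow every axis by the same factor k
    k = math.floor(
        (buffer_gb * 1e9 / (math.prod(chunk_shape) * itemsize)) ** (1 / len(chunk_shape))
    )
    # Clip each axis so the buffer never shrinks below a chunk or exceeds maxshape
    return tuple(
        min(max(k * c, chunk_shape[j]), maxshape[j]) for j, c in enumerate(chunk_shape)
    )
```

Because every axis scales by the same `k`, short axes (e.g. electrode) keep pace with long ones (e.g. time), which is the behavior being argued for here.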
Motivation
fix #995
Potential replacement of #996
This is actually really good timing; the latest stage of Zarr integration in NeuroConv (catalystneuro/neuroconv#569) benefits greatly from having these types of methods become static methods of the iterator (and its children) for broader usage, much as @oruebel suggests in #995 (comment)
@bendichter Anyway, I did some mathematical reduction and refactoring of the suggested approach from #996; it's actually quite similar to the original axis ratio method in that one could imagine an altogether more general algorithm that places 'weight factors' on each axis to determine how strongly to bias the distribution of bytes throughout the process...
In the recently suggested update these weight factors would all be `1`; in the original, they are proportional to the relative length of each axis. This PR also addresses @oruebel's comments by no longer requiring secondary methods.
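A rough sketch of what such a weight-factor scheme could look like (entirely hypothetical, not part of this PR): give axis `i` the share `num_elements ** (w_i / sum(w))` of the element budget, so the product over all axes recovers the full budget.

```python
import math


def weighted_fill(num_elements, weights):
    """Hypothetical sketch: distribute an element budget across axes by bias weights.

    Equal weights reduce to the uniform hypercube fill; weights proportional
    to axis length mimic the original axis-ratio method.
    """
    total = sum(weights)
    return tuple(math.floor(num_elements ** (w / total)) for w in weights)
```

Flooring each axis keeps the realized product at or below the requested budget.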
Two other changes:
I strongly suggest keeping the previous method accessible 'just in case' something unexpected arises in the new implementation; a lesson learned from previous bugs in this iterator. The other fallback would be to force installation of a previous HDMF version, but that could be less desirable and cause other problems (such as incompatibilities) on top of that.
Another lesson learned from the past: always use `math.prod` for these operations. Eventually these methods will be used on shapes with very large values, and `np.prod` overflows easily and silently (well, occasionally a warning might be thrown, but not always).

How to test the behavior?
Existing tests may require updating to the newly predicted default values
A few new tests should be added for the static usage, showcasing the expected results from each strategy
Checklist
Run `ruff` from the source directory.